Cursors can look like shortcuts to a developer. When you have a complex job to perform and you need to manipulate the rows in a table, the quickest way may seem to iterate through the rows one by one using a Transact-SQL cursor. After all, since you have to iterate through data structures in your own code on the client side, you may be tempted to do the same when you’re dealing with SQL Server data. But iterating through data using Transact-SQL cursors often does not scale well, and I hope to convince you that it’s also not a good design or architectural practice.
A Cursor Experience
I bring this up because a few months ago, I had to deal with a vendor’s Transact-SQL script that upgraded their database component to a new version of the vendor’s application. The script pivoted a very large detail table and stored the relevant data in a new, smaller table horizontally: one comma-delimited string of detail values per parent id. The vendor wanted to improve performance by shrinking the table, and the client application could query the resulting comma-delimited strings faster than it could fetch the same values as individual rows. In that context, the change made sense and did improve the application’s performance.
However, the vendor’s Transact-SQL script to pivot the data during the upgrade took 16 hours to run on a test machine, and the customer could not afford more than a few hours of downtime for the upgrade. When we examined the vendor’s script, we saw that the developer had coded the pivoting process in two steps: a cursor to iterate through all of the parent table ids to build a blank pre-formatted table, and then another script to concatenate the strings, again using a cursor.
By using a set-based approach, we were able to reduce the processing time from more than 16 hours to less than five minutes. We kept the developer’s original two-step strategy, but built the blank table using set-based SELECT statements, reducing that step to less than two minutes. We then concatenated the strings using an UPDATE statement, executed once per parent id, iterating through the parent ids with a WHILE loop; that step finished in less than three minutes.
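The vendor’s actual schema isn’t shown in this article, so here is a minimal sketch of that set-based strategy using hypothetical Detail and Pivoted tables (the table and column names are assumptions for illustration only):

```sql
-- Hypothetical schema:
--   Detail(ParentID int, Item varchar(50))
--   Pivoted(ParentID int PRIMARY KEY, Items varchar(max))
DECLARE @ParentID int, @csv varchar(max)

-- Step 1: build the blank pre-formatted table with a set-based INSERT...SELECT
INSERT INTO Pivoted (ParentID, Items)
SELECT DISTINCT ParentID, '' FROM Detail

-- Step 2: concatenate the detail values for one parent id at a time,
-- iterating with a WHILE loop instead of a cursor
SET @ParentID = (SELECT MIN(ParentID) FROM Pivoted)
WHILE @ParentID IS NOT NULL
BEGIN
    -- variable-concatenation trick: each qualifying row appends to @csv
    SET @csv = NULL
    SELECT @csv = COALESCE(@csv + ',', '') + Item
    FROM Detail
    WHERE ParentID = @ParentID

    UPDATE Pivoted SET Items = @csv WHERE ParentID = @ParentID

    -- seek to the next highest parent id
    SET @ParentID = (SELECT MIN(ParentID) FROM Pivoted
                     WHERE ParentID > @ParentID)
END
```

The per-parent UPDATE is still executed many times, but each execution is a set-based statement that the engine can satisfy with an index seek, which is why this approach runs orders of magnitude faster than a row-by-row cursor.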
The Inevitability of Iteration
Many accesses to database data must be iterative in some fashion in order to prepare the data for further manipulation. Even the SQL Server engine iterates through data when it scans or joins data using the various types of joins available to it. You can see this when you examine the SQL Server query plan for a query that returns many rows from a large data set. For a join, you’ll most commonly see a nested loop, but sometimes also a merge or hash join. For simpler queries, you may see a clustered or non-clustered index scan. It’s only in the cases where SQL Server can return a single row or small set of rows, and the table has an appropriate index, that you’ll see a seek using an index.
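You can observe these operators yourself by asking SQL Server to display its plan as text. A quick sketch against the AdventureWorks table used later in this article:

```sql
-- Show the estimated plan as text instead of executing the queries
SET SHOWPLAN_TEXT ON
GO
-- Returns many rows: expect a Clustered Index Scan in the plan
SELECT TransactionID, TransactionType, Quantity
FROM Production.TransactionHistory
GO
-- Single-row lookup on the clustered key: expect a Clustered Index Seek
SELECT TransactionID, TransactionType, Quantity
FROM Production.TransactionHistory
WHERE TransactionID = 100000
GO
SET SHOWPLAN_TEXT OFF
GO
```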
Think about it: Microsoft has optimized and tuned the SQL Server engine for years to iterate through its available data as efficiently as possible. Imagine, if you had the time and were willing to spend the energy, you could probably write low-level accesses to database data files that would be pretty efficient. However, it would be efficient only for the individual task in front of you, and you’d have to debug it and might have to completely rewrite it if the scope of your data access were to change. It would probably take you years to really get the code fully optimized and generalized, and even then you wouldn’t be close to the efficiency of the code inside the SQL Server storage engine.
So where’s the gain in re-inventing the wheel? There isn’t one. Precisely because the SQL Server engine is so well optimized and debugged, it’s better to let it do the iterating for you and take advantage of the extensive development and testing that’s already embedded in the database.
If you look at your data processing tasks more closely, I think you’ll find that there are really very few occasions where cursors are required. First of all, often you can accomplish your goal by relying on the set-based SQL commands in Transact-SQL, and ignoring the order of a table’s rows. Second, Transact-SQL cursors are just one way to iterate through a table row by row. If you can uniquely identify every row of a table that you must iterate, you can use a WHILE loop rather than a cursor, and potentially gain better performance. Let me walk you through an example to show you why.
Comparing Iteration Strategies
Assume you can uniquely identify each row of a table because the table has a unique key or unique group of columns. In a WHILE loop, all you need to do is find the lowest value of the unique key, and then find the next highest value each time you iterate. Here’s an example using the SQL Server 2005 AdventureWorks sample database’s Production.TransactionHistory table. It has a clustered index on the primary key, so the WHILE loop can seek to each row in turn.
USE AdventureWorks
GO

DECLARE @TransactionID int
      , @TransactionType nchar(1)
      , @Quantity int

SET @TransactionID = (SELECT MIN(TransactionID)
                      FROM Production.TransactionHistory)

WHILE @TransactionID IS NOT NULL
BEGIN
    -- Seek into the current row and read its values
    SELECT @TransactionType = TransactionType
         , @Quantity = Quantity
    FROM Production.TransactionHistory
    WHERE TransactionID = @TransactionID

    -- Seek to the next highest key value
    SET @TransactionID = (SELECT MIN(TransactionID)
                          FROM Production.TransactionHistory
                          WHERE TransactionID > @TransactionID)
END
Here’s the same loop using a FAST FORWARD cursor, which is the most efficient type of Transact-SQL cursor for just reading data:
DECLARE @TransactionID int
      , @TransactionType nchar(1)
      , @Quantity int

DECLARE AW_Cursor CURSOR FAST_FORWARD
FOR
SELECT TransactionID, TransactionType, Quantity
FROM Production.TransactionHistory

OPEN AW_Cursor

FETCH NEXT FROM AW_Cursor
INTO @TransactionID, @TransactionType, @Quantity

WHILE @@FETCH_STATUS = 0
BEGIN
    FETCH NEXT FROM AW_Cursor
    INTO @TransactionID, @TransactionType, @Quantity
END

CLOSE AW_Cursor
DEALLOCATE AW_Cursor
On my laptop, after I ran each version a few times to make sure the data was all in cache, the WHILE loop takes nine seconds and the cursor takes 17 seconds. Your own durations may vary. Note that even though the example does nothing with the data, the WHILE loop is speedier: the cursor evidently adds overhead of its own.
The cursor also requires additional commands, which make the code look cluttered. Without getting into the details of how cursors work, which Microsoft explains fully in Microsoft SQL Server 2005 Books Online, notice that when you use a WHILE loop, there is no requirement to declare, open, close, and deallocate anything. The logic is simpler, and you can even update rows freely along the way. To update the rows using the cursor, you will have to change the cursor type.
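For contrast, here is a sketch of what an updatable cursor looks like. It uses the FOR UPDATE OF clause and UPDATE ... WHERE CURRENT OF; the no-op assignment is purely for illustration, and FAST_FORWARD can no longer be used because that cursor type is read-only:

```sql
DECLARE @Quantity int

DECLARE Upd_Cursor CURSOR
FOR
SELECT Quantity
FROM Production.TransactionHistory
FOR UPDATE OF Quantity

OPEN Upd_Cursor
FETCH NEXT FROM Upd_Cursor INTO @Quantity

WHILE @@FETCH_STATUS = 0
BEGIN
    -- Update the row the cursor is currently positioned on
    -- (a no-op assignment here, just to show the syntax)
    UPDATE Production.TransactionHistory
    SET Quantity = Quantity
    WHERE CURRENT OF Upd_Cursor

    FETCH NEXT FROM Upd_Cursor INTO @Quantity
END

CLOSE Upd_Cursor
DEALLOCATE Upd_Cursor
```

Compare this with the WHILE loop, where an ordinary UPDATE with a WHERE clause on the key does the same job with no cursor bookkeeping at all.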
Even a WHILE loop adds the overhead of iteration. You may be able to replace it with a set-based SELECT command, or replace any updates you wanted to do in your loop with a set-based UPDATE command, and leave the iterating to the SQL Server engine. A simple SELECT statement that gets the same data as our cursor and WHILE loop above takes less than three seconds, and it returns the rows to the client, which is more work than either of the two previous loops does.
SELECT TransactionID, TransactionType, Quantity FROM Production.TransactionHistory
This SELECT relies on SQL Server to iterate through the data, and is by far the fastest of the three methods of data access we've looked at.
From Bags to Sets
Sometimes cursors might seem to be necessary: when you simply must iterate through database data, row by row, in physical order, it can appear that only a cursor will work. This most commonly happens when a table contains duplicate rows and there is no way to uniquely identify a given row. Such tables are bags, not sets, of data, because a bag, unlike a set, does not eliminate duplicate values.
Such bags of data usually occur when you import data from an external source and you cannot completely trust the data. For example, if our AdventureWorks transaction history table had no group of columns that you could call unique, and/or had duplicate rows, you might think that you must use a cursor.
However, you can always turn a bag of rows into a normalized table. Even if a table has duplicate rows, or no set of columns you can rely on for uniqueness, you can add an identity column and seed it to start numbering at 1. This gives the table a unique key, enabling you to use a WHILE loop instead of a cursor. Once you have a unique key, you can remove the duplicates with a set-based Transact-SQL DELETE command.
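Here is a minimal sketch of that technique. The staging table ImportedRows and its columns ColA and ColB are hypothetical names standing in for whatever untrusted data you imported:

```sql
-- Give every row a unique key, numbering from 1
ALTER TABLE ImportedRows ADD RowID int IDENTITY(1, 1)

-- Now each row is uniquely identifiable, so a set-based DELETE
-- can keep the lowest RowID in each duplicate group and remove the rest
DELETE t
FROM ImportedRows AS t
WHERE t.RowID > (SELECT MIN(d.RowID)
                 FROM ImportedRows AS d
                 WHERE d.ColA = t.ColA
                   AND d.ColB = t.ColB)
```

After the duplicates are gone, RowID also serves as the key for a WHILE loop if any per-row processing remains.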
The Logical API to Database Data
Using set-based operations is better than iterating through the data yourself in at least two ways.
First, set-based SQL commands are more efficient because you’re using SQL Server’s highly optimized engine to do your iteration. If you iterate through data yourself, you’re not using the SQL Server storage engine optimally. Instead, you are peppering it with commands to retrieve a single row at a time, and each of those commands must go through the SQL Server optimizer before it can reach the storage engine, so you never benefit from the storage engine’s optimized iteration code. When you iterate yourself, you also rely on extraneous physical information about the table, namely the order of its rows, to process the data. The set-based Transact-SQL SELECT, UPDATE, and DELETE commands give you a way to ignore the order of the rows and affect them based purely on the characteristics of the data, and they’re faster.
Second, set-based commands are more logical because thinking about data in sets abstracts you away from extraneous details that are more concerned with how the data is actually ordered. In fact, set-based commands like SELECT, UPDATE, and DELETE, when applied to tables directly and not in a cursor or WHILE loop, bring you closer logically to your data, precisely because you can ignore the order of the data.
Here’s another way to think about this second point: just as stored procedures are the most natural API for applications to interface with SQL Server programmatically, so set-based SQL commands are the appropriate API for accessing relational data. Stored procedures decouple your application from database internals, and they are more efficient than ad hoc queries. Similarly, the set-based SQL commands inside Transact-SQL give you a logical interface to your relational data, and they are more efficient because they rely on the SQL Server storage engine to iterate through the data.
The bottom line is not that iterating through data is bad; often it’s unavoidable. Rather, the point is to let the storage engine do the iterating for you, and rely instead on the logical interface of the set-based Transact-SQL commands. I think you’ll find few, if any, situations where you must actually use a Transact-SQL cursor.