Many T-SQL query problems involve matching rows based on their location in a sequence—typically, a temporal sequence. A classic example of such a problem is matching each customer’s current and preceding orders to compare measures and analyze trends. There are several possible approaches to a solution. Here, I compare the performance and I/O costs of a cursor solution and two set-based solutions for matching current and preceding rows and share some query optimizer tips along the way.
Exploring Customer Order Trends
To demonstrate the three solutions, I use the SalesOrderHeader table in the AdventureWorks database. The query in Listing 1 returns the columns of interest for our problem, as Web Table 1 shows in abbreviated form.
To determine trends in customer orders, suppose the sales manager asks you to return the following information for each order:
- current customer ID
- current order date
- current sales order ID
- current total due
- the elapsed time in days between the order dates for the customer’s preceding and current orders (call it Elapsed)
- the difference in total due for the current and preceding orders (call it Diff)
- the percentage difference in total due for the current and preceding orders (call it Pct)
To determine the preceding order, your code should consider OrderDate as the first sort column and SalesOrderID as the second sort column. A customer can have more than one order with the same OrderDate, and SalesOrderID sorting might not reflect OrderDate sorting. Table 1 shows the desired result (in abbreviated form) for this task.
Before you start developing your solution, create the following covering index to speed up your code:
CREATE UNIQUE INDEX idx_cid_od_oid_i_td ON Sales.SalesOrderHeader (CustomerID, OrderDate, SalesOrderID) INCLUDE(TotalDue);
Note that I defined the TotalDue column as an included non-key column in the index because it doesn’t serve any sorting or filtering purpose—it’s included just for covering purposes.
You can use the SalesOrderHeader table in the AdventureWorks database to check the logic and accuracy of your solution by comparing the result you get with the desired result shown in Table 1. However, this table is too small for performance tests because it contains only 31,465 rows. For performance testing, run the code in Web Listing 1 to create in tempdb a SalesOrderHeader table that contains 1 million rows. Make sure you set the database context to AdventureWorks when checking the validity of your solution and to tempdb when testing performance.
All the performance measures I present are based on the larger table in tempdb. My performance measures also focus on server processing time by excluding the time it takes to generate the output in SQL Server Management Studio (SSMS). To exclude SSMS output time, navigate to Tools, Options, Query Results, SQL Server, Results to Grid (or Results to Text), and enable the Discard results after execution option.
Listing 2 shows a straightforward cursorbased solution for the task. The cursor is based on a query that sorts orders by CustomerID, OrderDate, and SalesOrderID. The code scans the cursor rows in this order and calculates Elapsed, Diff, and Pct values by comparing measures from the current row with those from the preceding row (except when the order is the customer’s first). For each order, the code inserts a row with the current order’s attributes and calculated values into a table variable. The code inserts NULL for the calculated column values when the order is the first for the customer (i.e., when the current customer ID is different than the previous customer ID or when the current customer ID is NULL).
This solution scans the leaf level of the covering index idx_cid_od_oid_i_td once in an ordered, forward fashion. The index’s leaf level consists of 3733 pages. In I/O cost, the solution is very efficient because the code must scan all orders at least once anyway. However, cursors have a high performance overhead because of their record-by-record manipulation. For example, this solution ran in about 58 seconds on my system (against the table with 1 million rows and without considering the time it takes SSMS to generate the output). Two main elements in this cursor solution contribute to the lengthy runtime: the overhead of the record-by-record manipulation and the need to spool the calculated values in a table variable. You probably want to start exploring set-based approaches to find a faster solution.
Set-based solutions don’t incur the overhead of the cursor solution’s record-by-record manipulation, and they don’t require you to store the intermediate calculations in a temporary table. In addition, set-based solutions let the query optimizer generate multiple execution plans, from which it can select the most efficient plan; with cursor code, you force a very rigid plan. Still, cursor-based solutions have one important advantage over set-based solutions: They can rely on sorted data. Consequently, cursor-based solutions typically incur less I/O than set-based solutions for the same task.
In my experience, set-based solutions tend to be faster than cursor-based solutions. The first set-based solution I come up with for a given problem often isn’t the fastest solution-sometimes it’s even slower than the cursor-based solution. But with some tweaking, I usually can develop a set-based solution that’s dramatically faster than the fastest cursor solution I can devise. Such is the case here.
Subquery Using TOP with OR Logic
Listing 3 shows my first set-based solution to the customer-order task.The query joins two instances of SalesOrderHeader: one representing the current row (aliased as C) and one representing the preceding row (aliased as P). The join condition matches the Sales-OrderID from C with the SalesOrderID from P. To obtain the preceding order’s Sales-OrderID, the join condition uses the subquery at callout A in Listing 3.
The code issues this subquery against another instance of SalesOrderHeader, aliased as PS.The subquery filters the orders that have the same CustomerID as the current order and either a smaller OrderDate or the same OrderDate and a smaller Sales-OrderID than in the current order. It’s as if you were logically concatenating OrderDate and OrderID and checking that the concatenated values in PS are smaller than those in C.
All orders that the customer made before the current order in C meet the subquery conditions. The subquery returns TOP (1) SalesOrderID based on the ORDER BY list: PS.OrderDate DESC, PS.SalesOrderID DESC. This order ensures that the subquery returns the SalesOrderID value of the most recent order out of all orders that qualify.
As I mentioned earlier, the first set-based solution I come up with isn’t always the fastest solution. This solution ran in about 114 seconds on my system—nearly twice as long as the cursor solution. And the I/O cost for this query was 6,362,250 logical reads.
Now, in SSMS, examine the execution plan for this query, which should look like Figure 1. The Index Scan operator scans all rows in the covering index idx_cid_od_oid _i_td to get all current orders. For each row, a Nested Loops operator initiates an Index Seek operator against idx_cid_od_oid_i_td to fetch the SalesOrderID of the preceding order. Because the subquery uses TOP (1), this seek operator returns only one row for each invocation. Then,another Nested Loops operator initiates a Clustered Index Seek operator to fetch the preceding order’s attributes from the clustered index on SalesOrderID.
In query tuning, you typically focus on the parts of the plan estimated to be the most expensive. According to the different operators’ cost ratios in this query plan, most of the execution cost is associated with the Clustered Index Seek operator (91 percent). In this case, however, the Clustered Index Seek operation is merely a lookup of the data row from the clustered index based on the SalesOrderID of the preceding order, and you can’t avoid this operation unless you forgo this solution and develop one that uses a completely different approach. However, the costing formulas use estimates, and sometimes parts of the query processing that are expensive in practice aren’t apparent in the plan, as in this case.
Subquery Using TOP with AND Logic
Something else to notice in this execution plan is the Index Seek operation against idx_cid _od_oid_i_td, even though the optimizer estimates its cost at only 8 percent. This operation’s purpose is essentially to fulfill the subquery’s task, fetching the SalesOrderID of the preceding order for each current order. To explore this operation a little more, in the execution plan in SSMS, place your mouse pointer over the Seek operator against idx_cid _od_oid_i_td and examine the yellow tooltip box. (You can get the same information through the operator’s Properties window.) The tooltip box shows the following Seek Predicates:
Prefix: \[tempdb\].\[Sales\]. \[SalesOrderHeader\]. CustomerID = \[tempdb\].\[Sales\]. \[SalesOrderHeader\]. \[CustomerID\] as \[C\].\[CustomerID\]
The tooltip also shows the Predicate (aka WHERE Predicate) you see in Figure 2. SQL Server 2005 Books Online (BOL) explains the difference between the Seek Predicates and the WHERE Predicate: “The storage engine uses the index to process only those rows that satisfy the SEEK:() predicate. It optionally may include a WHERE:() predicate, which the storage engine will evaluate against all rows that satisfy the SEEK:() predicate (it does not use the indexes to do this).” In other words, SQL Server uses the Seek Predicates to traverse the index B-Tree from root to leaf level and to perform the partial ordered scan at the leaf level to find matches, then evaluates the WHERE Predicate against all matching rows that the Seek Predicates processed.
In our example, only the CustomerID match between C and PS appears in the Seek Predicates. But the solution invokes this Index Seek operation for each current order—that is, 1 million times. In addition, the Seek Predicates will find about 200 matching rows in each invocation. (I populated the large SalesOrderHeader table with 5000 customers averaging 200 orders apiece.) Therefore, if you cause the optimizer to evaluate parts of the logical expression in the Seek Predicates instead of in the WHERE Predicate, you might be able to use the index to evaluate the WHERE Predicate against fewer rows.
With this in mind, let’s rewrite the subquery using a logically equivalent expression that’s more convenient for the optimizer to execute and that causes the Seek Predicates to do more logical processing. You might already know that an expression in the form
You can transform this expression to a logically equivalent one that uses an AND operator:
Listing 4 shows the query that includes this revised subquery filter. This query ran in about 17 seconds on my system. That’s about six times as fast as the preceding query and more than three times the speed of the cursor-based solution. The total I/O cost of this plan is 5,992,708 logical reads—lower than the I/O cost of Listing 3’s query (6,362,250 logical reads), but not substantially so.
Now, look at the execution plan that Figure 3, shows. As you can see, the plans for Listing 3’s and Listing 4’s solutions look similar on the surface—the optimizer deems both to have almost the same cost.
But when you look at the Seek Predicates and WHERE Predicate of the Index Seek operator in the plan for Listing 4’s query, you’ll see a difference. The Seek Predicates are
Prefix: \[tempdb\].\[Sales\]. \[SalesOrderHeader\]. CustomerID = \[tempdb\].\[Sales\]. \[SalesOrderHeader\]. \[CustomerID\] as \[C\].\[CustomerID\], End Range: \[tempdb\].\[Sales\]. \[SalesOrderHeader\]. OrderDate
The plan for Listing 3’s query returned more rows for the WHERE Predicate to evaluate. Those were the rows that satisfied the Seek Predicates PS.CustomerID = C.CustomerID presented here in abbreviated form). The plan for Listing 4’s query left far fewer rows for the WHERE Predicate to evaluate after satisfying the Seek Predicates PS.CustomerID = C.CustomerID, End Range:PS.OrderDate
More to Come
Query tuning and optimization is often more art than science. A small detail can dramatically influence your queries’ performance. The more you know about how the query optimizer works and how to analyze and interpret the information SQL Server’s tools provide—including not being misled by estimation information—the better you can optimize queries.
For this article’s task, I presented a cursor-based solution that ran in 58 seconds, a first attempt at a set-based solution that ran in about 2 minutes, and an optimized version of the set-based query that ran in 17 seconds. But I’m not done yet. There are faster solutions. Next month, I’ll cover even speedier solutions, one of which runs in only 6 seconds. In the meantime, can you think of faster solutions to this T-SQL problem?