Sequential to Set-Based

Editor's Note: Send your experts-only T-SQL tips to Itzik Ben-Gan at [email protected]. If we use your tip in the magazine, you'll receive $100 and an exclusive T-SQL Black Belt shirt.

Programmers who have experience in sequential file processing, for example with COBOL, can encounter difficulty when they start programming in a relational environment such as SQL Server. Such programmers often try to solve T-SQL problems by using cursors, similar to the way they might program in their previous environments—working with one record at a time in sequential order. But this approach doesn't exploit the full power of a relational database engine, which is optimized for nonsequential set-based queries (i.e., queries that manipulate sets of rows nonsequentially). Furthermore, cursor-based solutions usually require considerably more code than their equivalent set-based solutions. If a cursor is the first thing that you think of when you face a T-SQL problem, you have some unlearning to do—you need to start thinking "set-based."

Let's look at a problem that seems to require a cursor-based solution but in fact has several set-based solutions. As I discuss the various set-based solutions (most of which SQL Server MVP Umachandar Jayachandran provided), I compare their performance. The example company I use tracks customer contract information in a table called Contracts. This information includes the customer ID, the date the contract was signed, and the contract's monetary amount. You can run the script that Listing 1 shows to create and populate the Contracts table.

Say you need to write a query that returns all the details of the latest contract (the contract with the most recent date) for each customer and includes the amount of the previous contract in the same result row. Usually, a good approach to solving a T-SQL problem is to break it into steps. For example, the first step here might be to return all the contracts and, for each contract, return the amount of the previous contract. The second step might be to filter the results of the first step so that only the latest contract for each customer remains.

When you're dealing with set-based queries, the term "previous" isn't always well defined. So you need to define a logical expression that determines which is the "previous" contract for a customer. For a given contract, you can define "the previous contract" as "the most recent contract for the same customer, having a contract date earlier than the current contract date." Listing 2 shows a first attempt at implementing this first step.

The code in Listing 2 retrieves all contracts from an instance of the Contracts table called C1. The tricky part here is obtaining the amount of the previous contract for a customer. To obtain this amount, the code in Listing 2 uses a correlated subquery in the SELECT list. The correlated subquery is two levels deep. The first subquery retrieves the amount from another instance of the Contracts table, C2, using the same customer ID as in the outer query (C1). The code then filters the result of the subquery by requesting only those rows in which the contract date is equal to "the previous contract date," a result I obtained by nesting another level with another subquery. The deepest subquery refers to another instance of the Contracts table, C3. This subquery returns the maximum (most recent) contract date for the same customer as in the first subquery, along with a contract date that's earlier than the contract date in the outer query (C1). If the outer query's contract is the first contract for that customer, the subquery returns NULL because no previous contract exists.

Table 1 shows the output of the query from Listing 2. Note that the row order in the output isn't guaranteed. If you want to guarantee that your output is sorted by custid and cdate, you need to add an appropriate ORDER BY clause.

Before moving on to the next step, let's take another look at the innermost subquery in Listing 2. Currently, that subquery finds the appropriate customer ID by using the following filter:

WHERE C3.custid = C2.custid

If you revise this filter to

WHERE C3.custid = C1.custid

the query returns the same results but with considerably less I/O cost. In many cases, SQL Server's optimizer generates the same execution plan for queries that are written differently but that logically perform the same task. Here, the optimizer produced different execution plans for the original query in Listing 2 and the query with the above revision, with the latter performing much better. As a test, issue the following command:

SET STATISTICS IO ON

then run both the code in Listing 2 and the code in Listing 3, which contains the revised query. With SQL Server 2000 Service Pack 1 (SP1), STATISTICS IO shows 112 logical reads for the query in Listing 2 but only 28 logical reads for the query in Listing 3. Even a small change can drastically affect a query's performance. The conclusion from this test should be that when you aim for better performing code, you should always test variations of queries that produce the same results and compare their performance.

The second step is to filter out everything except the rows containing the latest contract for each customer. You can achieve this goal by adding a WHERE clause containing another subquery that checks whether the contract date is equal to the maximum contract date for the customer. Listing 4 shows the code that implements Step 2. Note that the query in Listing 4 is ANSI-compliant.

Table 2 shows the output of Listing 4's query. STATISTICS IO shows 10 logical reads for the query in Listing 4. You can try implementing other approaches to solve the same problem and compare the performance of each. For example, instead of writing subqueries that calculate aggregations, you can write simpler subqueries that use the T-SQL—specific TOP clause, as the query in Listing 5 shows. The subquery in the SELECT list retrieves the first row that has the same customer ID as in the outer query and a contract date earlier than the one in the outer query, then sorts the rows by contract date in descending order. The sorting guarantees that the query will return the amount of the contract with the maximum (latest) date. The filter uses a similar technique to ensure that the query returns only the latest contract for each customer.

STATISTICS IO shows 18 logical reads for the query in Listing 5. So in terms of I/O, the query in Listing 4 performs better than the one in Listing 5 and also has the advantage of being ANSI-compliant. In some cases, a TOP query performs better than the alternatives, so testing different versions of your code can really pay off. A graphical execution plan for both queries in Query Analyzer shows that the total cost for Listing 4's query is slightly less than for the query in Listing 5. However, the difference isn't significant; in a close case like that, I usually stick with the ANSI-compliant code even when it's not the most efficient.

Alternatively, you could implement a totally different approach, such as the query that Listing 6 shows. This query uses joins and derived tables instead of subqueries. In some cases, a join query performs better than its subquery alternative. Listing 6 contains a GROUP BY query that returns the maximum contract date (mx_date) and the contract date previous to it (prev_mx_date) for each customer. This GROUP BY query returns a derived table that's aliased as G. Note that the LEFT OUTER JOIN statement between the derived table G and the Contracts table ensures that if a customer has only one contract, the query won't filter out that contract. The outer query joins the derived table to the Contracts table based on the same customer ID and a contract date that's equal to mx_date. The query then joins the result with another instance of the Contracts table, based on the same customer ID and a contract date equal to prev_mx_date. Then, the code simply displays the appropriate columns.

STATISTICS IO for the query in Listing 6 shows 42 logical reads, which is considerably more than the previous two queries. In this case, the approach that uses joins and derived tables has the worst performance, and the query in Listing 4, which uses subqueries, performs best.

What's Next?

I can't stress enough the importance of switching from a "sequential files" mindset to "set-based" thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code. In upcoming articles, I'll discuss some more-complex examples that seem to lend themselves to cursor-based solutions but are better off with a set-based solution.

Comments

Plain text