
T-SQL Deep Dives: Create Efficient Queries

Tips and techniques for common querying tasks

T-SQL is a language with endless depths. When faced with tasks that involve querying, sometimes you need to dive deep to deal with both the logical aspects and the performance aspects of the possible solutions. Coming up with a solution that works isn't always a trivial task; coming up with a solution that not only works but also performs well is an even bigger challenge. There are so many variables involved that mastering queries can take a long time. To make that time a bit shorter for you, I'll share some tips and techniques on how to efficiently perform some common querying tasks, including:

  • Calculating a MIN or MAX aggregate for very dense groups
  • Using dynamic filters in applications
  • Grouping data and filtering out those groups with more than one distinct occurrence of an attribute
  • Creating bushy plans
  • Removing extra spaces
  • Using search arguments

To follow along with the examples I'll be discussing, you can download the 125389.zip file by clicking the Download the Code Here button near the top of the page. The 125389.zip file includes several scripts as well as the T-SQL Tips and Techniques.sql file, which includes all the code, even the inline queries.

Before running each solution, you should execute the code

CHECKPOINT;
DBCC DROPCLEANBUFFERS;

to clear the data cache. That way, all the solutions will start with the same cold cache environment.

You should also run the code

SET STATISTICS TIME ON;
SET STATISTICS IO ON;

to turn on the performance measuring options in your session.

TOP vs. MIN or MAX

The first task I want to show you involves calculating a MIN or MAX aggregate for very dense groups. For example, consider a shippers-orders scenario in which there are a very small number of shippers in the system, each handling a large number of orders. The task is to efficiently calculate the last shipping date for each shipper.

To start, run the CreateGetNumsFunction.sql script in the 125389.zip file to create a helper function named GetNums. This function returns a sequence of integers of a requested size. Then, run the CreateShippersAndOrdersTables.sql script to create and populate the Shippers and Orders tables. The Shippers table has three rows, whereas the Orders table has about one million orders per shipper. CreateShippersAndOrdersTables.sql also creates an index on Orders(shipperid, shipdate) to support the solution.
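
In case you don't have the scripts handy, here's a minimal sketch of what they might contain so you can follow along; the exact definitions in 125389.zip may differ. The GetNums implementation shown is one common pattern based on cross-joined CTEs.

CREATE FUNCTION dbo.GetNums(@n AS BIGINT) RETURNS TABLE
AS
RETURN
  WITH
  L0   AS (SELECT 1 AS c UNION ALL SELECT 1),
  L1   AS (SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),
  L2   AS (SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),
  L3   AS (SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),
  L4   AS (SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),
  L5   AS (SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B),
  Nums AS (SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS n FROM L5)
  SELECT TOP (@n) n FROM Nums ORDER BY n;
GO

-- Sketch of the tables, the supporting index, and the data distribution
CREATE TABLE dbo.Shippers
(
  shipperid INT NOT NULL PRIMARY KEY
);

CREATE TABLE dbo.Orders
(
  orderid   INT      NOT NULL IDENTITY PRIMARY KEY,
  shipperid INT      NOT NULL REFERENCES dbo.Shippers(shipperid),
  shipdate  DATETIME NOT NULL
);

CREATE INDEX idx_sid_sd ON dbo.Orders(shipperid, shipdate);

-- Three shippers, roughly one million orders each
INSERT INTO dbo.Shippers(shipperid) VALUES(1), (2), (3);

INSERT INTO dbo.Orders(shipperid, shipdate)
  SELECT S.shipperid,
     DATEADD(day, CAST(N.n % 1000 AS INT), '20060101')
  FROM dbo.Shippers AS S
    CROSS JOIN dbo.GetNums(1000000) AS N;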

When asked to write a query that returns the last ship date for each shipper, many people will come up with a simple GROUP BY query such as

SELECT shipperid, MAX(shipdate) AS lastshipdate
FROM dbo.Orders
GROUP BY shipperid;

However, when optimizing this query, which I'll call Query 1, SQL Server chooses to perform a full ordered scan of the leaf of the index created on the Orders table for this task, as shown in Figure 1 under Query 1. With about three million rows in the leaf of the index, this plan ends up performing 7,136 logical reads, with a CPU time of 1,045ms and an elapsed time of 2,956ms on my system.

Figure 1: Plans for MAX vs. TOP

Because the three shippers are very dense groups, it would make more sense to scan the Shippers table and, for each shipper, perform a seek operation in the index on Orders. To implement this plan, you could rewrite the solution to

SELECT shipperid,
  ( SELECT MAX(shipdate)
    FROM dbo.Orders AS O
    WHERE O.shipperid = S.shipperid )
    AS lastshipdate
FROM dbo.Shippers AS S;

However, if you examine the plan for this query (see Query 2 in Figure 1), you'll see that the leaf of the index on Orders was fully scanned again. This occurred because SQL Server tried to optimize the query instead of just interpreting it literally. SQL Server's optimizer unnested the scalar subquery and internally rearranged it into a join form. When the query is expressed as a join, the optimizer can explore it from both directions, not just the one literal form. But the join form opened up such a huge number of optimization possibilities that the optimizer never made its way back to the literal written form. In short, the optimizer got too sophisticated for its own good and missed the plan that was right under its nose.

Unnesting usually leads to better plans, but not in this situation. SQL Server ended up performing two logical reads against the Shippers table and 7,136 logical reads against the Orders table. As for CPU and elapsed time, those were 936ms and 2,990ms, respectively.

So, how can you tell SQL Server not to unnest the scalar aggregate subquery? You can't do this directly, but there's a way to achieve it indirectly. When using TOP with ORDER BY, the optimizer doesn't unnest because doing so could change the result in certain cases. Even when the subquery can be logically converted to a different form that can be unnested without adverse results (as is the case here), the optimizer won't unnest. That's probably because SQL Server's developers know that some people rely on this behavior as a kind of optimization hint. Whether this behavior will continue to work the same way in the future remains to be seen, so whether you want to rely on it is up to you. But for now, that's how it works.

To prevent the optimizer from unnesting the scalar aggregate subquery, you need to replace the MAX aggregate with the logically equivalent TOP (1) like so

SELECT shipperid,
  ( SELECT TOP (1) shipdate
    FROM dbo.Orders AS O
    WHERE O.shipperid = S.shipperid
    ORDER BY shipdate DESC )
    AS lastshipdate
FROM dbo.Shippers AS S;

This time you get the plan that you were hoping to see, as Query 3 in Figure 1 shows. SQL Server first scans (only two reads) the clustered index of the Shippers table. Then, for each of the three shipper rows, SQL Server performs a seek operation in the index on Orders to retrieve the last shipdate value for the current shipper from the end of that shipper's segment in the index leaf. This work requires only three reads per shipper, amounting to a total of only nine reads against the Orders table. STATISTICS TIME reports a CPU time of 0ms and an elapsed time of 238ms.

In the T-SQL Tips and Techniques.sql file, you'll find these three queries if you'd like to try them. After you're done, execute the clean-up code

DROP TABLE dbo.Orders, dbo.Shippers;

to remove the tables you created.

Dynamic Filters

Having to use dynamic filters (also known as dynamic search conditions) is a common need in applications. The applications provide an interface for the user to filter data, and the user decides which attributes to filter the data by. There are many ways to address this need. You can find an excellent and extensive discussion on this topic in SQL Server MVP Erland Sommarskog's paper "Dynamic Search Conditions in T-SQL".

For our purposes, I'll discuss one of the most common implementations of dynamic filters, which the code in Listing 1 demonstrates.

Listing 1: The GetOrders Procedure with the RECOMPILE Procedure Option

SET NOCOUNT ON;
USE AdventureWorks;
  
CREATE INDEX idx_OrderDate ON Sales.SalesOrderHeader(OrderDate);
  
IF OBJECT_ID('dbo.GetOrders') IS NOT NULL DROP PROC dbo.GetOrders;
GO
  
CREATE PROC dbo.GetOrders
  @SalesOrderID AS INT = NULL,
  @SalesOrderNumber AS NVARCHAR(50) = NULL,
  @CustomerID AS INT = NULL,
  @OrderDate AS DATETIME = NULL
WITH RECOMPILE
AS
  
SELECT *
FROM Sales.SalesOrderHeader
WHERE (SalesOrderID = @SalesOrderID OR @SalesOrderID IS NULL)
  AND (SalesOrderNumber = @SalesOrderNumber OR @SalesOrderNumber IS NULL)
  AND (CustomerID = @CustomerID OR @CustomerID IS NULL)
  AND (OrderDate = @OrderDate OR @OrderDate IS NULL);
GO

In this example, an application allows the user to filter orders from the Sales.SalesOrderHeader table in the AdventureWorks database by four attributes: sales order ID, sales order number, customer ID, and order date. When the user specifies the filters of interest and submits the data request, the application executes the GetOrders procedure in SQL Server, passing values only in the arguments corresponding to the filtered attributes.

In GetOrders, each argument @p is addressed in the static query's WHERE clause with the predicate

col = @p OR @p IS NULL

When a parameter value isn't specified, this predicate is always true and therefore won't filter out any rows. When a parameter value is specified, the @p IS NULL part is always false, so only the col = @p part will dictate whether the row should be filtered out. So, in terms of logical behavior, this solution will give you the correct result. As for performance, things are a bit trickier.
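
To see this logic in isolation, here's a quick standalone check you can run against AdventureWorks (the customer ID is just a sample value):

-- With a NULL parameter, the predicate is TRUE for every row,
-- so nothing is filtered out
DECLARE @CustomerID AS INT = NULL;

SELECT COUNT(*) AS matching_rows
FROM Sales.SalesOrderHeader
WHERE (CustomerID = @CustomerID OR @CustomerID IS NULL);

-- With a specified parameter, only CustomerID = @CustomerID2 matters
DECLARE @CustomerID2 AS INT = 676;

SELECT COUNT(*) AS matching_rows
FROM Sales.SalesOrderHeader
WHERE (CustomerID = @CustomerID2 OR @CustomerID2 IS NULL);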

As Listing 1 shows, I defined the procedure with the RECOMPILE procedure option. I hoped that the optimizer would realize that the execution plan of the procedure's query isn't going to be reused and therefore it's safe to generate a plan that's suitable for the specific execution. Unfortunately, though, when the RECOMPILE option is specified at the procedure level, SQL Server still tries to come up with a plan that's reusable—never mind that it's not going to be reused. So, the optimizer doesn't ignore the parts of the WHERE clause that are inapplicable for the current execution. As a result, in simple terms, you get suboptimal plans.

To see for yourself, request the execution plans for these invocations of the GetOrders procedure

EXEC dbo.GetOrders
  @SalesOrderID = 43659;
EXEC dbo.GetOrders
  @SalesOrderNumber = N'SO43659';
EXEC dbo.GetOrders
  @CustomerID = 676;
EXEC dbo.GetOrders
  @OrderDate = '20030212';

Figure 2 shows the plans I received. Let's look at a couple of the results. For the first invocation (Query 1), an optimal plan would be a seek operation in the clustered index created on SalesOrderID, but I received a far less efficient plan. Similarly, for the second invocation (Query 2), an optimal plan would be a seek operation in the nonclustered index on SalesOrderNumber followed by a lookup, but again I got a far less efficient plan.

Figure 2: Suboptimal plans for dynamic filters

In SQL Server 2005, you still get suboptimal plans for this implementation, even if you use the RECOMPILE statement option in Listing 2 instead of the RECOMPILE procedure option.

Listing 2: The GetOrders Procedure with the RECOMPILE Statement Option

ALTER PROC dbo.GetOrders
  @SalesOrderID AS INT = NULL,
  @SalesOrderNumber AS NVARCHAR(50) = NULL,
  @CustomerID AS INT = NULL,
  @OrderDate AS DATETIME = NULL
AS
  
SELECT *
FROM Sales.SalesOrderHeader
WHERE (SalesOrderID = @SalesOrderID OR @SalesOrderID IS NULL)
  AND (SalesOrderNumber = @SalesOrderNumber OR @SalesOrderNumber IS NULL)
  AND (CustomerID = @CustomerID OR @CustomerID IS NULL)
  AND (OrderDate = @OrderDate OR @OrderDate IS NULL)
OPTION (RECOMPILE);
GO

However, the release to manufacturing (RTM) version of SQL Server 2008 changes the optimization of the code when the RECOMPILE statement option is specified. When a certain part of the filter is irrelevant for the current execution (e.g., when the predicate is always true), it's ignored. As a result, you get optimal plans when you run the code in Listing 2 on SQL Server 2008 RTM. Figure 3 shows the plans I received.

Figure 3: Optimal plans for dynamic filters

Unfortunately, though, the changes in the internal implementation of the RECOMPILE statement option introduced a serious bug. When two sessions concurrently run a procedure whose query uses the RECOMPILE statement hint, one session can end up using the other's arguments. As you can imagine, SQL Server's developers couldn't leave this bug in. But apparently it was too complicated to fix in the timeframe they had for releasing SP1, so they ended up reverting the behavior to that of SQL Server 2005. So, if you're running SQL Server 2008 SP1, you won't get the bug, but you also won't get the improved plans. Since the release of SP1, the developers have managed to fix the bug while providing the improved optimization. To get the improved, nonbuggy version, you need to be running SQL Server 2008 SP1 with cumulative update 5 (CU5) or later.

In terms of bug fixes, SQL Server 2008 R2 RTM's code base was finished earlier than SQL Server 2008 SP1 CU5, so unfortunately, you'll get suboptimal plans when you use the RECOMPILE statement option on SQL Server 2008 R2 RTM. A fix allowing the optimal plans was released in SQL Server 2008 R2 RTM CU1, so you will need to run this version (or a later one) to get good plans.

I know it can be confusing, so here's a summary of the behavior you should expect for the implementation of the procedure in Listing 2:

  • SQL Server 2005—suboptimal plans
  • SQL Server 2008 RTM—optimal plans but with bug
  • SQL Server 2008 SP1—suboptimal plans
  • SQL Server 2008 SP1 + CU5 and later—optimal plans with no bug
  • SQL Server 2008 R2 RTM—suboptimal plans
  • SQL Server 2008 R2 RTM + CU1 and later—optimal plans with no bug
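
If you're not sure which build and servicing level an instance is running, you can query it directly with the built-in SERVERPROPERTY function:

SELECT
  SERVERPROPERTY('ProductVersion') AS product_version,
  SERVERPROPERTY('ProductLevel')   AS product_level; -- e.g., RTM, SP1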

When you're done running Listing 2, execute the code

DROP INDEX
  Sales.SalesOrderHeader.idx_OrderDate;
DROP PROC dbo.GetOrders;

to remove the index and procedure you created.

Distinct-Count Filters

SQL Server MVP Peter Larsson has come up with a nice optimization technique. The scenario is this: You need to group data and filter out those groups with more than one distinct occurrence of some attribute. For example, suppose you need to query the Sales.SalesOrderHeader table in the AdventureWorks database, returning customers who were handled by more than one salesperson. Most people will intuitively address this task with code such as

USE AdventureWorks;
SELECT CustomerID
FROM Sales.SalesOrderHeader
GROUP BY CustomerID
HAVING COUNT(DISTINCT SalesPersonID)
  > 1;

However, calculating distinct aggregates such as COUNT(DISTINCT SalesPersonID) often requires more resources and takes longer than calculating nondistinct aggregates such as MIN(SalesPersonID) and MAX(SalesPersonID). With this in mind, instead of using the predicate

COUNT(DISTINCT <col>) > 1

you can use

MIN(<col>) <> MAX(<col>)

In this example, the query would look like

SELECT CustomerID
FROM Sales.SalesOrderHeader
GROUP BY CustomerID
HAVING MIN(SalesPersonID)
  <> MAX(SalesPersonID);

Figure 4 shows the plans for both queries. The query with the COUNT(DISTINCT …) option appears as Query 1 and the other as Query 2. You can see that the second query is more efficient; in this example, it's about twice as efficient as the first. Sometimes the difference can be even more dramatic.

Figure 4: Suboptimal and optimal plans for a distinct-count filter
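
If you want to confirm that the rewrite is logically equivalent on your data, one simple sanity check is EXCEPT; if no rows come back (run it in both directions to be thorough), the two forms agree:

SELECT CustomerID
FROM Sales.SalesOrderHeader
GROUP BY CustomerID
HAVING COUNT(DISTINCT SalesPersonID) > 1

EXCEPT

SELECT CustomerID
FROM Sales.SalesOrderHeader
GROUP BY CustomerID
HAVING MIN(SalesPersonID) <> MAX(SalesPersonID);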

Bushy Plans

Here's a tip that I learned from Lubor Kollar. With multi-join queries, the optimizer by default considers only plan tree layouts in which every join has at least one base input (that is, an input that isn't itself the result of a join). So each join is either between two base inputs or between a base input and the result of another join. You can still have a lot of variety in the plan tree layouts (e.g., left-deep plans, right-deep plans, mixed layouts), but for every join that you see in the plan, at least one of the inputs is not the result of a join.

On occasion, you might get a better plan when both inputs of a join are themselves the results of joins. Such a plan tree layout is known as a bushy plan because it resembles a bush. One example in which this strategy can be the best choice is when the join on each side does very significant filtering.

To demonstrate how to get a bushy plan, first run CreateTablesT1-T4.sql in the 125389.zip file to create four tables named T1, T2, T3, and T4. Suppose you need to optimize the query

SELECT *
FROM dbo.T1
  JOIN dbo.T2
    ON T2.b = T1.b
  JOIN dbo.T3
    ON T3.c = T1.c
  JOIN dbo.T4
    ON T4.c = T3.c;

Currently, you don't get a bushy plan, as Query 1 in Figure 5 shows.

Figure 5: Nonbushy vs. bushy plans

You suspect that the join ordering could be handled more efficiently with a bushy plan. To get one, parenthesize the two base joins separately, then use an ON clause to relate their results. You also need to add the FORCE ORDER query hint to force the join processing order to follow the written form. The revised query looks like

SELECT *
FROM
  (dbo.T1 JOIN dbo.T2 ON T2.b = T1.b)
  JOIN
  (dbo.T3 JOIN dbo.T4 ON T4.c = T3.c)
  ON T3.c = T1.c
OPTION(FORCE ORDER);

Query 2 in Figure 5 shows the bushy plan that you get for this query.

To clean up, run the code

DROP TABLE
  dbo.T1, dbo.T2, dbo.T3, dbo.T4;

to remove the tables you created.

Extra Spaces

Here's another nifty trick courtesy of Peter Larsson. The task at hand is as follows: given a character string (@s), replace each section of multiple consecutive spaces with a single space. For example, suppose you need to turn the string 'abc          def          ghi' into 'abc def ghi'. I'll explain the steps in Larsson's technique, then present the complete solution expression.

The first step is to replace each space in the string with a space followed by an unused character (space + ~). It's important to choose a character that you know for sure isn't used in the data. In this case, let's say that the tilde character (~) doesn't appear in the data, so the first step would be

SELECT REPLACE(@s, ' ', ' ~');

The result is 'abc ~ ~ ~ ~ ~ ~ ~ ~ ~ ~def ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ghi'.

The next step is to replace each occurrence of ~ + space with an empty string, as in

SELECT REPLACE(REPLACE(@s, ' ', ' ~'), '~ ', '');

At this point, the result is 'abc ~def ~ghi'.

Finally, you need to replace each occurrence of space + ~ with a single space using the code

SELECT REPLACE(REPLACE(REPLACE(@s, ' ', ' ~'), '~ ', ''), ' ~', ' ');

The desired result—'abc def ghi'—is then achieved. Listing 3 contains the complete solution, including sample input, for you to test.

Listing 3: Code That Removes Spaces

DECLARE @s AS VARCHAR(1000) = 'abc      def      ghi';

SELECT REPLACE(REPLACE(REPLACE(@s, ' ', ' ~'), '~ ', ''), ' ~', ' ');
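
The same expression works just as well over a column. Here's a quick sketch against a hypothetical table (dbo.Documents and its columns are made-up names, for illustration only):

-- Hypothetical table and sample row, for illustration only
CREATE TABLE dbo.Documents
(
  id  INT           NOT NULL PRIMARY KEY,
  txt VARCHAR(1000) NOT NULL
);

INSERT INTO dbo.Documents(id, txt)
  VALUES(1, 'abc      def      ghi');

-- Collapse runs of spaces in every row
SELECT id,
  REPLACE(REPLACE(REPLACE(txt, ' ', ' ~'), '~ ', ''), ' ~', ' ') AS cleaned
FROM dbo.Documents;

DROP TABLE dbo.Documents;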

Search Arguments

Using search arguments is another technique you can use to optimize queries. A search argument is a filter expression that the optimizer can use to perform an index access method that relies on index ordering, like an index seek operation. For instance, the expression

WHERE col <= @p

is a search argument. The optimizer will evaluate the filter expression and determine whether or not it's efficient to use the index in a manner that relies on index ordering (e.g., an index seek), based on such factors as index availability, selectivity of the filter, and index coverage. However, in most cases, when you manipulate the filtered column, the expression isn't a search argument anymore. For example, the expression

WHERE col - 1 <= @p

isn't a search argument. Therefore, even if an index exists on col, the optimizer won't rely on index ordering. There are a few exceptions, but that's the general rule. Sometimes there's not much you can do about the fact that the expression isn't a search argument, but other times you can turn it into a search argument by making minor revisions. In this example, you can change col - 1 <= @p to col <= @p + 1 to make the expression a search argument.
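
To see the general rule in action before turning to the AdventureWorks examples, here's a minimal sketch with a hypothetical table (dbo.Demo, its column, and the index name are mine, for illustration only):

-- Hypothetical setup, for illustration only
CREATE TABLE dbo.Demo(col INT NOT NULL);
CREATE INDEX idx_col ON dbo.Demo(col);

DECLARE @p AS INT = 10;

-- Not a search argument: the filtered column is manipulated,
-- so the optimizer won't rely on index ordering (expect a scan)
SELECT col FROM dbo.Demo WHERE col - 1 <= @p;

-- Search argument: the same logical filter, rearranged so the
-- column stands alone (an index seek becomes possible)
SELECT col FROM dbo.Demo WHERE col <= @p + 1;

DROP TABLE dbo.Demo;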

Here are a couple of tangible examples. Suppose that you need to query the Person.Contact table in the AdventureWorks database. You need to return those contacts whose email address starts with nigel. A query in which the filter expression isn't a search argument might look like

USE AdventureWorks;

SELECT *
FROM Person.Contact
WHERE LEFT(EmailAddress, 5) = 'nigel';

By using LIKE with a constant as the start of the pattern, you can turn the expression into a search argument like so

SELECT *
FROM Person.Contact
WHERE EmailAddress LIKE 'nigel%';

Figure 6 shows the plans in both cases. Observe that only the second query relied on index ordering and was optimized more efficiently.

Figure 6: Plans for queries with and without a search argument

Now suppose you need to query the Sales.SalesOrderHeader table in the AdventureWorks database, returning orders placed the day before the input date. To aid the performance of the solution, this example uses an index, which can be created with the code

CREATE INDEX idx_OrderDate
  ON Sales.SalesOrderHeader(OrderDate);

A query in which the filter expression isn't a search argument might look like

DECLARE @dt AS DATETIME = '20030212';

SELECT *
FROM Sales.SalesOrderHeader
WHERE DATEADD(day, 1, OrderDate) = @dt
OPTION(RECOMPILE);

(Note that the RECOMPILE query option is used to allow the optimizer to sniff the variable value.) By applying a little math, you can turn the expression into a search argument like this

DECLARE @dt AS DATETIME = '20030212';

SELECT *
FROM Sales.SalesOrderHeader
WHERE OrderDate = DATEADD(day, -1, @dt)
OPTION(RECOMPILE);

If you run both queries, you'll see how the query with the search argument relied on index ordering and was optimized more efficiently. After you run the queries, execute the code

DROP INDEX Sales.SalesOrderHeader.idx_OrderDate;

for clean-up purposes.

A Lifetime Venture

There are so many variables involved in the performance of T-SQL code that mastering query tuning is a lifetime venture. You need to patiently examine endless cases, learn from each case, and constantly gather tips and techniques. Eventually, when faced with new tasks, all the knowledge that you accumulated should flow naturally and help you address those tasks both elegantly and efficiently. I presented a few common tasks and provided tips and techniques on how to solve them efficiently. I hope that you found the information useful and that you'll be able to put what you've learned here into practice.
