Supporting your query filters with the correct indexes is essential for optimal query performance. When your query filter has a single predicate that's a search argument (SARG), things are straightforward: you create an index with the filtered column as the key. However, with multiple conjunctive predicates (separated by AND operators), things become trickier. For example, you need to determine the ideal order of the index keys. Plus, when more than one range predicate is involved, there's no perfect index that you can create. Still, you need to figure out the best possible indexing strategy and the optimal query writing techniques. In addition, you sometimes need to redesign your existing solution to achieve better performance.
This article is the first in a two-part series on the topic of optimizing for multiple range predicates. In this article, I introduce the problem and provide the first optimization tip. In my next article, I'll cover several practical examples of the problem and different tips for optimizing your solutions.
To help generate the sample data for this article, you need a function called GetNums. Run the code in Listing 1 to create this function. The function accepts a low value and a high value as inputs and returns a sequence of integers in the input range.
SET NOCOUNT ON;
USE tempdb;
IF OBJECT_ID(N'dbo.GetNums', N'IF') IS NOT NULL DROP FUNCTION dbo.GetNums;
GO
CREATE FUNCTION dbo.GetNums(@low AS BIGINT, @high AS BIGINT) RETURNS TABLE
AS
RETURN
  WITH
    L0   AS (SELECT c FROM (SELECT 1 UNION ALL SELECT 1) AS D(c)),
    L1   AS (SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),
    L2   AS (SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),
    L3   AS (SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),
    L4   AS (SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),
    L5   AS (SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B),
    Nums AS (SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS rownum
             FROM L5)
  SELECT TOP(@high - @low + 1) @low + rownum - 1 AS n
  FROM Nums
  ORDER BY rownum;
GO
Use the code in Listing 2 to create the tables TStrNum and TNumStr, as well as supporting indexes and statistics. The two tables are structured the same and populated with the same sample data; only the indexes defined on them are different.
IF OBJECT_ID(N'dbo.TStrNum', N'U') IS NOT NULL DROP TABLE dbo.TStrNum;
IF OBJECT_ID(N'dbo.TNumStr', N'U') IS NOT NULL DROP TABLE dbo.TNumStr;

CREATE TABLE dbo.TStrNum
(
  keycol INT         NOT NULL IDENTITY,
  string VARCHAR(10) NOT NULL,
  number INT         NOT NULL
);

CREATE TABLE dbo.TNumStr
(
  keycol INT         NOT NULL IDENTITY,
  string VARCHAR(10) NOT NULL,
  number INT         NOT NULL
);

INSERT INTO dbo.TStrNum WITH (TABLOCK) (string, number)
  SELECT
    CHAR(ASCII('A') + S.n) AS string,
    N.n AS number
  FROM dbo.GetNums(0, 3) AS S
    CROSS JOIN dbo.GetNums(1, 10) AS N
    CROSS JOIN dbo.GetNums(1, 25000) AS M;

INSERT INTO dbo.TNumStr WITH (TABLOCK) (string, number)
  SELECT
    CHAR(ASCII('A') + S.n) AS string,
    N.n AS number
  FROM dbo.GetNums(0, 3) AS S
    CROSS JOIN dbo.GetNums(1, 10) AS N
    CROSS JOIN dbo.GetNums(1, 25000) AS M;

-- Indexes
CREATE UNIQUE CLUSTERED INDEX idx_str_num ON dbo.TStrNum(string, number, keycol);
CREATE UNIQUE CLUSTERED INDEX idx_num_str ON dbo.TNumStr(number, string, keycol);

-- Trigger creation of histogram on second index key
SELECT keycol, string, number FROM dbo.TStrNum WHERE number = 9 AND number % 2 = 0;
SELECT keycol, string, number FROM dbo.TNumStr WHERE string = 'C' AND LEFT(string, 1) = 'A';

-- Update statistics with full scan
UPDATE STATISTICS dbo.TStrNum WITH FULLSCAN;
UPDATE STATISTICS dbo.TNumStr WITH FULLSCAN;
GO
My original intent was to use only one table and show optimization aspects with two different indexing strategies. Instead, I decided to use two tables, each with a different index, to simplify my examples. This way, I didn't need to apply indexing changes before each example or use index hints. I could simply query the table with the index that I wanted to use in the example. I'll describe the index definitions shortly, after I describe the data itself.
The tables have a character string column called string and an integer column called number. The code populates each of the tables with 1,000,000 rows, with four distinct values in string (A, B, C, and D) and ten distinct values in number (1 through 10). There are 40 distinct combinations of values in string, number, which are evenly distributed (25,000 occurrences of each). In reality, you rarely get such clean, even distribution—but for our purposes, such distribution is sufficient.
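If you want to verify that distribution yourself after running Listing 2, a simple aggregate query does the trick (this query is my addition, not part of the article's listings):

-- Verify the even distribution of (string, number) combinations;
-- expect 40 rows, each with a count of 25,000
SELECT string, number, COUNT(*) AS cnt
FROM dbo.TStrNum
GROUP BY string, number
ORDER BY string, number;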
The code creates the index idx_str_num on TStrNum(string, number, keycol), and the index idx_num_str on TNumStr(number, string, keycol). I'll discuss different types of query filters that involve two predicates—one based on string and the other based on number. When I want to discuss optimization with the column string as the leading key, I'll query the table TStrNum. Similarly, when I want to discuss optimization with the column number as the leading key, I'll query the table TNumStr.
The index rows are very short; approximately 366 rows fit in a leaf page on average. In my case, the index idx_str_num had 2,730 pages in the leaf, and the index idx_num_str had 2,732 pages.
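If you'd like to check the page counts in your own environment (the numbers will vary slightly with your build and fill), you can query the sys.dm_db_index_physical_stats function; this is a quick sketch, not part of the original listings:

-- Report page and record counts per index level; level 0 is the leaf
SELECT index_level, page_count, record_count
FROM sys.dm_db_index_physical_stats
  (DB_ID(N'tempdb'), OBJECT_ID(N'dbo.TStrNum'), NULL, NULL, 'DETAILED');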
Multiple Equality Predicates
The focus of this two-part series is understanding a fundamental optimization problem that exists when your query filter has multiple range predicates. But I'll get there gradually. I'll start with a discussion about multiple equality predicates, then continue with equality and range, and finally get to multiple range predicates.
To start the discussion about multiple equality predicates, consider the following queries:
SET STATISTICS IO ON;

SELECT keycol, string, number
FROM dbo.TStrNum
WHERE string = 'C'
  AND number = 9;

SELECT keycol, string, number
FROM dbo.TNumStr
WHERE string = 'C'
  AND number = 9;
What's the optimal indexing strategy here? With all predicates being equality predicates, irrespective of the order of the columns in the index key-list, qualifying rows appear in a consecutive range in the index leaf. So from this perspective, key-list order shouldn't matter. But you've probably heard a common recommendation suggesting that the leading key should be the most selective column—number in our case. Allegedly (according to the recommendation), SQL Server creates a histogram on the leading key but not on the other columns; so if you follow the recommendation, the optimizer can produce better cardinality estimates, which tend to lead to more optimal choices.
It's true that SQL Server will create a histogram on the leading key when you create the index. But it will also create single-column histograms on other columns if needed when you query the data, as long as the AUTO_CREATE_STATISTICS database property is on (which is the default).
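If you're curious which statistics objects exist on a table, including auto-created column statistics (whose names typically start with _WA_Sys), the following catalog query lists them; it's my addition, not one of the article's listings:

-- List statistics on TStrNum; auto_created = 1 marks auto-created column statistics
SELECT s.name AS stats_name, s.auto_created, c.name AS column_name
FROM sys.stats AS s
  INNER JOIN sys.stats_columns AS sc
    ON s.object_id = sc.object_id
   AND s.stats_id = sc.stats_id
  INNER JOIN sys.columns AS c
    ON sc.object_id = c.object_id
   AND sc.column_id = c.column_id
WHERE s.object_id = OBJECT_ID(N'dbo.TStrNum');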
Furthermore, cardinality estimation for multiple equality predicates can be performed based on a variety of methods. Which method the optimizer uses depends on the version of SQL Server, heuristics, and other factors. You'll find excellent coverage of the topic by Paul White and Benjamin Nevarez. For example, in "Cardinality Estimation for Multiple Predicates," Paul explains how cardinality estimates are performed for multiple predicates.
When you start exploring the different methods the optimizer uses, you find that there's little relevance to having the most selective column as the leading key in the index (again, thinking of equality predicates for now). I won't go into a lot of detail regarding the different cardinality estimation methods here since it's not my focus, and Paul and Ben have such excellent coverage of the topics. But I'd like to demonstrate that the estimates in our examples are similar irrespective of index key order.
Figure 1 shows the plans that I obtained for the previously mentioned queries when I ran them on SQL Server 2014. Notice that in both cases the estimates are the same (albeit not completely accurate).
The cardinality estimation method SQL Server 2014 uses in our examples (as long as the database compatibility level is 120, or if you force the new cardinality estimator with trace flag 2312) is called exponential backoff. The formula is:
S1 * SQRT(S2) * ... * SQRT(SQRT(Sn)) ... * input_cardinality
S1 is the most selective predicate, S2 the second most selective, and so on. In our case S1 is 0.1 (number = 9) and S2 is 0.25 (string = 'C'). So our formula becomes:
0.1 * SQRT(0.25) * 1000000 = 50000
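If you want to double-check the arithmetic, you can evaluate the exponential backoff formula directly; this is a throwaway calculation of mine, not one of the article's listings:

-- Exponential backoff: S1 * SQRT(S2) * input_cardinality
DECLARE @S1 AS FLOAT = 0.1, @S2 AS FLOAT = 0.25, @card AS FLOAT = 1000000;
SELECT @S1 * SQRT(@S2) * @card AS backoff_estimate; -- 50000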
The order of the columns in the index doesn't affect the selectivity of the predicates; therefore you get the same estimate in both cases. Recall that in our case the distribution of the values is completely even and the columns are uncorrelated, so the actual number of qualifying rows is 25,000; exponential backoff assumes some correlation between the predicates, which is why there's a significant difference between the estimate (50,000) and the actual value. But again, I don't want to digress here. My point is that the estimates are the same in both cases.
Even when using the pre-2014 cardinality estimator (database compatibility level lower than 120, or if using trace flag 9481), I got the same estimates for both queries: 25,000. You would get such an estimate with our evenly distributed data either when using a density vector (average) or an independence assumption (data in columns isn't correlated). With a density vector, you get the same density of 1 / 40 = 0.025 (1 divided by the number of distinct combinations) whether the vector is (string, number) or (number, string). Then 0.025 * 1000000 = 25000. With an independence assumption, the final estimate is the product of the selectivity estimates of the different predicates:
S1 * S2 * ... * Sn * input_cardinality
which in our case is 0.1 * 0.25 * 1000000 = 25000. Clearly, irrespective of the order of the columns in the index, the estimate is the same.
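Here too, you can verify both legacy estimates with a quick calculation (again my addition, not part of the original listings):

-- Independence assumption and density vector both yield 25,000 here
DECLARE @S1 AS FLOAT = 0.1, @S2 AS FLOAT = 0.25, @card AS FLOAT = 1000000;
SELECT
  @S1 * @S2 * @card  AS independence_estimate,   -- 25000
  (1.0 / 40) * @card AS density_vector_estimate; -- 25000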
The conclusion is that with multiple equality predicates, you should create an index with all filtered columns in the key-list, but the key order isn't important as long as the query filter contains conjunctive equality predicates based on all columns. With such an index, the optimizer can perform an index seek, as shown in both plans in Figure 1. Both predicates are used as the Prefix for Seek Predicates. This means that the range scan in the index leaf scans only the pages that contain qualifying rows. Both queries perform 72 logical reads.
Obviously, for the optimizer to be able to perform an efficient index seek, you must be filtering by leading index keys. So an index seek can be performed in the index idx_str_num even when filtering by only the column string, but not in the index idx_num_str. For example, consider the following queries:
SELECT keycol, string, number
FROM dbo.TStrNum
WHERE string = 'C';

SELECT keycol, string, number
FROM dbo.TNumStr
WHERE string = 'C';
Figure 2 shows the plans for these queries.
The plan for the first query performs an efficient index seek in the index idx_str_num, resulting in 690 logical reads. The plan for the second query can't apply an index seek in the index idx_num_str because the filtered column isn't a leading key. So the optimizer ends up choosing a full index scan, resulting in 2,732 reads.
The conclusion is that if you always filter by one of the columns and sometimes by both, make sure you place the one you always filter by as the leading key in the index.
Equality and Range
When you have one or more equality predicates and one range predicate, you can still support them with an optimal index. The columns that appear in the equality predicates should appear first in the index key-list, and the range column last. This way, you'll get qualifying rows in a consecutive range in the index leaf. Again, it has nothing to do with the selectivity of the predicates.
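To make the recommendation concrete, here's a sketch with hypothetical names (dbo.T, col_eq1, col_eq2, and col_range are placeholders of mine, not objects from our sample tables):

-- For a filter like: WHERE col_eq1 = @a AND col_eq2 = @b AND col_range >= @c
-- place the equality columns first and the range column last in the key-list
CREATE INDEX idx_eq_eq_range ON dbo.T(col_eq1, col_eq2, col_range);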
If the range column appears first in the key-list, qualifying rows don't appear in a consecutive range in the index leaf, resulting in more pages that need to be scanned. Consider the following two queries:
SELECT keycol, string, number
FROM dbo.TStrNum
WHERE string = 'C'
  AND number >= 9;

SELECT keycol, string, number
FROM dbo.TNumStr
WHERE string = 'C'
  AND number >= 9;
The first query is supported by an index that follows the recommendation, and the second query isn't. Figure 3 shows the plans for these queries.
The first plan is optimal. Both predicates appear as Seek Predicates. The start of the range scan in the index leaf is based on the combination of the properties Prefix (string = @1) and Start (number >= @2), and the end of the range scan is based just on Prefix. Only leaf pages containing qualifying rows need to be scanned. The execution of this plan involved 142 logical reads.
Table 1 shows the rows in the order that they appear in the index idx_str_num, marking scanned rows with the letter S and returned rows with the letter R. Observe that only returned rows need to be scanned.
string     number
---------- -----------
...
C          6
C          7
C          8
C          9           -- SR
C          10          -- SR
D          1
D          2
D          3
D          4
D          5
...
The second plan isn't optimal because the qualifying rows don't appear in a consecutive range in the index. The range predicate number >= @2 appears as the Start property of the Seek Predicates property, meaning that all rows in the leaf of the index idx_num_str that satisfy this predicate need to be scanned. Then the equality predicate string = @1 appears as the Predicate property. You can think of this predicate as a residual one—it's applied to scanned rows to determine whether to return them.
This plan results in 413 reads—three times more than with the optimal indexing strategy. Table 2 illustrates which rows in the index leaf are scanned and which are returned. The conclusion is that when you have one or more equality predicates and one range predicate, you should place the column from the range predicate last in the index key-list.
string     number
---------- -----------
...
C          8
D          8
A          9
B          9
C          9           -- SR
D          9           -- S
A          10          -- S
B          10          -- S
C          10          -- SR
D          10          -- S
Multiple Range Predicates
So far you've seen that it's easy to create an optimal indexing strategy for conjunctive predicates as long as there's at most one range predicate involved; you simply place all equality columns first, and the range one last (if a range predicate exists). However, when you have multiple range predicates involved based on different columns, there's no perfect indexing strategy. If you think about it, there's simply no index that will arrange all qualifying rows in a consecutive range in the index leaf. Now the selectivity of the predicates becomes important to minimize the work. You want to place the column involved in the most selective predicate first.
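As a sketch of this guideline, again with hypothetical names (dbo.T, col_sel, and col_other are placeholders; assume col_sel is the column in the more selective range predicate):

-- For a filter like: WHERE col_sel >= @a AND col_other >= @b
-- no key order arranges all qualifying rows consecutively in the leaf;
-- placing the more selective range column first minimizes the scanned range
CREATE INDEX idx_ranges ON dbo.T(col_sel, col_other);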
Consider the following two queries:
SELECT keycol, string, number
FROM dbo.TStrNum
WHERE string >= 'C'
  AND number >= 9;

SELECT keycol, string, number
FROM dbo.TNumStr
WHERE string >= 'C'
  AND number >= 9;
Figure 4 shows the plans for these queries.
Observe that the Start property of the Seek Predicates property in both cases is based on both range predicates, but all remaining rows to the right of the starting point in the index must be scanned. The predicate involving the nonleading key is applied as a residual one against the scanned rows to determine which rows to return.
The predicate number >= 9 is clearly more selective than string >= 'C'. Therefore, the plan for the second query, which uses the index idx_num_str, is more efficient (413 reads) than the plan for the first query, which uses the index idx_str_num (822 reads).
string     number
---------- -----------
...
C          6
C          7
C          8
C          9           -- SR
C          10          -- SR
D          1           -- S
D          2           -- S
D          3           -- S
D          4           -- S
D          5           -- S
D          6           -- S
D          7           -- S
D          8           -- S
D          9           -- SR
D          10          -- SR
Table 3 shows the data access (rows scanned and rows returned) for the first query, and Table 4 shows the data access for the second query.
string     number
---------- -----------
...
A          8
B          8
C          8
D          8
A          9
B          9
C          9           -- SR
D          9           -- SR
A          10          -- S
B          10          -- S
C          10          -- SR
D          10          -- SR
You can see that in both cases the returned rows are a subset of the scanned rows. However, it's clear that with the most selective column first in the index key-list, fewer rows need to be scanned.
Multiple range predicates can be the source of query performance problems. In this article I explained why that's the case. It's easy to come up with an optimal indexing strategy with one or more equality predicates and at most one range predicate, but there's simply no perfect index to support multiple range predicates. You can only try to minimize the work by defining the index with the most selective column first. Next month I'll cover three practical examples of query performance problems resulting from the use of multiple range predicates, and I'll provide tips for optimizing the solutions.