 # Optimization Tips for Multiple Range Predicates, Part 2

Three practical examples, as well as tips to improve performance In "Optimization Tips for Multiple Range Predicates, Part 1," I describe the optimization of multiple conjunctive predicates (i.e., predicates separated by an AND operator). I explain that it's easy to arrange an efficient index when there's at most one column with a range predicate, with all remaining predicates being equality predicates. You place the column with the range predicate last in the index key list. However, when you have multiple range predicates involving different columns, there's simply no perfect index. For example, consider the following classic interval intersection test:

`WHERE startdate <= @end AND enddate >= @start`

Whether you create an index on (startdate, enddate) or (enddate, startdate), you won't get the qualifying rows in a consecutive range in the index leaf. Only the predicate against the leading index key will be used as a seek predicate. The predicate against the nonleading key will be applied as a residual predicate. This means that all rows satisfying the seek predicate will be scanned and will be tested for the residual predicate to determine whether they need to be returned.

In "Optimization Tips for Multiple Range Predicates, Part 1," I explain the problem of multiple range predicates. In this article I cover three practical examples that are affected by this problem, and I provide optimization tips to improve their performance. The examples I cover include interval intersection with recent periods, interval intersection with short periods, and improving subtree/descendants queries in the nested sets graph model.

### Interval Intersection: Recent Period

One of the special cases of multiple range predicates that can be optimized reasonably is that of interval intersection tests involving the recent period. To demonstrate the case I'll use a table called Intervals, which you can create and populate by running the code in Listing 1.

```SET NOCOUNT ON;
USE tempdb;
GO

-- Create helper function dbo.GetNums
-- Purpose: return a sequence of integers in the range @low through @high
IF OBJECT_ID(N'dbo.GetNums', N'IF') IS NOT NULL DROP FUNCTION dbo.GetNums;
GO
CREATE FUNCTION dbo.GetNums(@low AS BIGINT, @high AS BIGINT) RETURNS TABLE
AS
RETURN
WITH
L0   AS (SELECT c FROM (SELECT 1 UNION ALL SELECT 1) AS D(c)),
L1   AS (SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),
L2   AS (SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),
L3   AS (SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),
L4   AS (SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),
L5   AS (SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B),
Nums AS (SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS rownum
FROM L5)
SELECT TOP(@high - @low + 1) @low + rownum - 1 AS n
FROM Nums
ORDER BY rownum;
GO

-- Create and populate table Intervals
IF OBJECT_ID(N'dbo.Intervals', N'U') IS NOT NULL DROP TABLE dbo.Intervals;
GO
CREATE TABLE dbo.Intervals
(
id    INT  NOT NULL,
startdate DATE NOT NULL,
enddate   DATE NOT NULL,
CONSTRAINT CHK_Intervals_end_gteq_start CHECK(enddate >= startdate)
);
GO

DECLARE
@numintervals AS INT  = 10000000,
@minstartdate AS DATE = '20050101',
@maxenddate   AS DATE = '20141231',
@maxdiff      AS INT  = 30;

WITH C AS
(
SELECT
n AS id,
(ABS(CHECKSUM(NEWID())) % (DATEDIFF(day, @minstartdate, @maxenddate) - @maxdiff + 1)),
@minstartdate) AS startdate,
ABS(CHECKSUM(NEWID())) % (@maxdiff + 1) AS diff
FROM dbo.GetNums(1, @numintervals) AS Nums
)
INSERT INTO dbo.Intervals WITH(TABLOCK) (id, startdate, enddate)
SELECT id, startdate, DATEADD(day, diff, startdate) AS enddate
FROM C;

ALTER TABLE dbo.Intervals ADD CONSTRAINT PK_Intervals PRIMARY KEY(id);```

Suppose that the Intervals table holds closed-closed intervals (i.e., inclusive of both delimiters) representing car rental contract periods. The query you're supposed to optimize is one looking for contracts that were active during an input period represented by the inputs @start and @end.

Suppose that what's special about our scenario is that most queries are looking for intersections with a recent input period. The sample data spans 10 years (2005 through 2014), but suppose that most queries are concerned with a recent time period such as the past month or the past week.

The tricky thing is that for most people, the intuitive index to create to support intersection queries is on startdate as the leading key and enddate as the second key, like so:

`CREATE INDEX idx_sart_end ON dbo.Intervals(startdate, enddate);`

The problem is that this is a very inefficient index for recent period queries, as I'll demonstrate. First, turn on reporting performance statistics:

`SET STATISTICS IO, TIME ON;`

Next, run the following query looking for active intervals during the past month of activity:

```DECLARE
@start AS DATE = '20141201',
@end   AS DATE = '20141231';

SELECT id
FROM dbo.Intervals
WHERE startdate <= @end AND enddate >= @start;```

Figure 1 shows the plan for this query.

Observe that the seek predicate is based on the leading index key: startdate <= @end, and the residual predicate is based on the second key: enddate >= @start. Now think about the seek predicate. You realize that most of the rows in the table (about 99 percent) satisfy this predicate, resulting in a range scan of almost the entire index leaf level. SQL Server reported the following performance info for this query: logical reads 20012, CPU time 1156 ms, elapsed time 527 ms.

As I explain in "Optimization Tips for Multiple Range Predicates, Part 1," a far more efficient indexing strategy in a multi-range predicate situation is to place the more selective column first in the key list. With queries such as the query above, involving mainly the recent period, there are about 99 percent matches for the predicate startdate <= @end but only about 1 percent matches for the predicate enddate >= @start. So you're much better off defining the index on enddate as the leading key, like so:

```DROP INDEX idx_sart_end ON dbo.Intervals;
CREATE INDEX idx_end_start ON dbo.Intervals(enddate, startdate);```

Rerun the intersection query again after creating the new index:

```DECLARE
@start AS DATE = '20141201',
@end   AS DATE = '20141231';

SELECT id
FROM dbo.Intervals
WHERE startdate <= @end AND enddate >= @start;```

Figure 2 shows the plan for this query.

Observe that this time the more selective predicate is used as the seek predicate, resulting in the following performance statistics: logical reads 93, CPU time 47 ms, elapsed time 302 ms.

### Interval Intersection: Sarka's Optimization for Short Intervals

You saw how you can optimize recent period intersection queries well by creating an index on the enddate column as the leading key. But what if the intersection queries could ask about any period?

There's another special case that you can optimize well based on a solution by Dejan Sarka. This case is when both the stored intervals and the input interval are short. As an example, suppose that the stored contracts are limited to no more than a month, meaning that the difference between startdate and enddate is at most 30 days. Also, the input interval is short—perhaps a week.

Remember that the traditional solution won't perform well when the input interval can be any period. For example, here's an intersection query where the input period represents a week roughly in the middle of the entire period covered by the data:

```DECLARE
@start AS DATE = '20090101',
@end   AS DATE = '20090107';

SELECT id
FROM dbo.Intervals
WHERE startdate <= @end AND enddate >= @start;```

Whether your index is defined on (enddate, startdate) or the other way around, about half the data in the index leaf will need to be scanned. At the moment, the index you have is defined on (enddate, startdate), so you get a plan similar to the one shown earlier in Figure 2. Only this time about half the index leaf is scanned, so the performance statistics reported are as follows: logical reads 12036, CPU time 655 ms, elapsed time 581 ms.

Sarka's optimization adds a predicate against the leading index key, restricting it to a closed range—and by doing so substantially reduces the number of rows that need to be scanned. For example, you currently have an index on (enddate, startdate). Remember that the original intersection test uses the predicate enddate >= @start. This predicate already restricts the starting point for enddate. You know that the maximum difference between the interval delimiters (call that difference maxdiff) is 30 days; perhaps you even enforce this with a constraint. You compute the difference between the input interval's delimiters (call that difference inputdiff). The farthest endpoint for an interval that intersects with the input interval is @start + inputdiff + maxdiff, so you can add the predicate enddate <= DATEADD(day, DATEDIFF(day, @start, @end) + @max, @start) to limit the endpoint for the range scan. You can use a BETWEEN predicate with the original start point and the new endpoint as delimiters, like so:

```DECLARE
@start AS DATE = '20090101',
@end   AS DATE = '20090107',
@max   AS INT  = 30;

SELECT id
FROM dbo.Intervals
WHERE enddate BETWEEN @start
AND DATEADD(day, DATEDIFF(day, @start, @end) + @max, @start)
AND startdate <= @end;```

Figure 3 shows the plan for this query.

Observe that now there's an endpoint for the seek predicate. The performance information reported for this query execution is as follows: logical reads 200, CPU time 47 ms, elapsed time 370 ms.

If the index was defined on (startdate, enddate) you would apply the inverse calculation, subtracting inputdiff and maxdiff from @end to set the minimum start point for startdate, like so:

```DECLARE
@start AS DATE = '20090101',
@end   AS DATE = '20090107',
@max   AS INT  = 30;

SELECT id
FROM dbo.Intervals
WHERE startdate BETWEEN
DATEADD(day, -DATEDIFF(day, @start, @end) - @max, @end)
AND @end
AND enddate >= @start;```

Things get a bit trickier when the intervals are short but the maximum length isn't restricted, so you don't know ahead of time what the maximum length is. You need to compute the maximum length by querying the data. For such a query to be efficient, you can add a computed column to the table computing the difference between startdate and enddate (call it ilen for interval length) and index it, like so:

```ALTER TABLE dbo.Intervals ADD ilen AS DATEDIFF(day, startdate, enddate);
CREATE INDEX idx_ilen ON dbo.Intervals(ilen);```

Then instead of assigning a constant to the variable @max, you query the maximum ilen value from the data. Here's the solution code, assuming there's an index on (enddate, startdate):

```DECLARE
@start AS DATE = '20090101',
@end   AS DATE = '20090107',
@max   AS INT  = (SELECT MAX(ilen) FROM dbo.Intervals);

SELECT id
FROM dbo.Intervals
WHERE enddate BETWEEN @start
AND DATEADD(day, DATEDIFF(day, @start, @end) + @max, @start)
AND startdate <= @end;```

Figure 4 shows the plans for the queries.

The extra work compared with the plan in Figure 3 is negligible—that's the first plan in Figure 4 computing the maximum ilen by using the index on the ilen column.

Sarka's optimization works well when all stored intervals are short. Its weakness is that if there's even one long interval in the data, the solution becomes inefficient.

### Subtree/Descendants in Nested Sets Graph Model

Another example for the performance problem with multiple range predicates is a subtree/descendants query in the nested sets graph model. Figure 5 illustrates an employee tree with integer left and right values based on the nested sets model.

The idea is to represent the relationships between the nodes as set containment relationships. A child node has a left value that's greater than its parent's left value and a right value that's less than its parent's right value.

To return the subtree of a given node (call it @root), most people use a query with two range predicates (P represents the parent instance and C the child instance):

`P.nodeid = @root AND C.lft >= P.lft AND C.rgt <= P.rgt`

Run the code in Listing 2 to create the Tree table and populate it with 1,000,000 rows. Observe that the index idx_lft_rgt has the key list (lft, rgt).

```CREATE TABLE #Tree
(
nodeid   INT NOT NULL,
parentid INT NULL,
val      NUMERIC(12, 2)
);

INSERT INTO #Tree(nodeid, parentid, val)
SELECT
n AS nodeid,
NULLIF((n+8) / 10, 0) AS parentid,
CAST(1 + ABS(CHECKSUM(NEWID())) % 100 AS NUMERIC(12, 2)) AS val
FROM dbo.GetNums(1, 1000000) AS Nums;

CREATE UNIQUE CLUSTERED INDEX idx_pid_nid ON #Tree(parentid, nodeid);
ALTER TABLE #Tree ADD PRIMARY KEY NONCLUSTERED(nodeid);
GO

-- Materializing Nested Sets Relationships in a Table
IF OBJECT_ID(N'dbo.Tree', N'U') IS NOT NULL DROP TABLE dbo.Tree;

WITH TwoNums AS
(
SELECT n FROM (VALUES(1),(2)) AS D(n)
),
SortPath AS
(
SELECT nodeid, 0 AS lvl, n, CAST(CAST(n AS BINARY(4)) AS VARBINARY(MAX)) AS sort_path
FROM #Tree CROSS JOIN TwoNums
WHERE nodeid = 1

UNION ALL

SELECT C.nodeid, P.lvl + 1, TN.n,
P.sort_path + CAST(
(-1 + ROW_NUMBER() OVER(PARTITION BY C.parentid
ORDER BY C.nodeid)) / 2 * 2 + TN.n
AS BINARY(4))
FROM SortPath AS P
INNER JOIN #Tree AS C
ON P.n = 1
AND C.parentid = P.nodeid
CROSS JOIN TwoNums AS TN
),
Sort AS
(
SELECT nodeid, lvl, ROW_NUMBER() OVER(ORDER BY sort_path) AS sortval
FROM SortPath
),
NestedSets AS
(
SELECT nodeid, lvl, MIN(sortval) AS lft, MAX(sortval) AS rgt
FROM Sort
GROUP BY nodeid, lvl
)
SELECT E.nodeid, E.val, NS.lvl, NS.lft, NS.rgt
INTO dbo.Tree
FROM NestedSets AS NS
INNER JOIN #Tree AS E
ON E.nodeid = NS.nodeid;

CREATE UNIQUE CLUSTERED INDEX idx_uc_lft_rgt ON dbo.Tree(lft, rgt);
ALTER TABLE dbo.Tree ADD CONSTRAINT PK_Tree PRIMARY KEY NONCLUSTERED(nodeid);
GO

DROP TABLE #Tree;```

Run the following query to obtain the descendants of node 11112:

```DECLARE @root AS INT = 11112;

SELECT C.*
FROM dbo.Tree AS P
INNER JOIN dbo.Tree AS C
ON P.nodeid = @root
AND C.lft >= P.lft AND C.rgt <= P.rgt;```

Figure 6 shows the plan for this query.

Observe that only one of the range predicates is used as a seek predicate: C.lft >= P.lft. The other predicate is used as a residual predicate: C.rgt <= P.rgt. Even though there are only 11 matching rows, the execution of the query applied 5,241 logical reads.

Fortunately, a simple workaround exists. Think about the left and right values as delimiters of an interval. Using terminology from Allen's interval algebra, an ancestor's interval contains the intervals of all of its descendants. So if node P's interval contains node C's left delimiter, P is necessarily an ancestor of C, also containing its right delimiter. Therefore it's sufficient to test whether node C's left value is between node P's left and right values to know that C is a descendant of P. Based on this insight, you can rewrite the descendants query like so:

```DECLARE @root AS INT = 11112;

SELECT C.*
FROM dbo.Tree AS P
INNER JOIN dbo.Tree AS C
ON P.nodeid = @root
AND C.lft BETWEEN P.lft AND P.rgt;```

Now the query has only one range predicate against the leading index key (lft), resulting in a highly efficient plan as Figure 7 shows.

Observe that both the equality predicate and the single range predicate are used as seek predicates and that there's no residual predicate this time. Here you optimally exploit the fact that the left values give you topological sort order. The execution of this query applies only nine logical reads.

### More To Come

Unfortunately, the multiple range predicates problem is a common cause of poorly performing queries. I demonstrated the problem through interval intersection queries as well as queries requesting a subtree in the nested sets graph model. Fortunately, there are special cases in which you can apply optimization techniques to improve the performance of your solutions, such as in the three examples that I covered in this article. In my next article, I'll continue exploring the optimization of multiple predicates, providing recommendations for vector comparisons.

TAGS: SQL 