Solving Gaps and Islands with Enhanced Window Functions

Gaps and islands are classic problems in SQL that involve identifying ranges of missing values and ranges of existing values in a sequence. Real-life examples of gap and island problems include nonavailability and availability reports, periods of inactivity and periods of activity reports, and so on. For extensive coverage of gap and island problems and solutions prior to SQL Server 2012, see "SQL Server MVP Deep Dives."

In this article, I cover new solutions to gap and island problems, using enhanced window functions that were introduced in SQL Server 2012. For details about the enhanced window functions in SQL Server 2012, see:

Sample Data and Testing

In my tests, I used a table called T1 with an integer column called col1 holding the sequence values. The code in Listing 1 (below) creates the table T1 and fills it with a small set of sample data -- allowing you to check the correctness of the solutions.

For performance testing, you need much larger sets of sample data. The code in Listing 2 (below) creates a helper function called GetNums that accepts two inputs called @low and @high and returns a sequence of integers in that range.

The code in Listing 3 (below) uses the GetNums function to populate T1 with values in the range 1 through 10,000,000 with only a few gaps (about 100). The code in Listing 4 populates T1 with values in the range 1 through 10,000,000 with many gaps (about 1,000,000).

When I tested the solutions covered in this article, I measured the performance of a single execution of the query, as well as a stress test with two iterations of 20 concurrent executions of the query. For the stress test I used Adam Machanic's SQLQueryStress tool. I'll provide performance results that I obtained for both the few gaps and the many gaps cases, and for both the single execution of the query and the average of the multiple executions in the stress test.

Gaps

The gaps problem involves identifying the ranges of missing values in the sequence. Figure 1 shows the desired result for the small set of sample data in T1.

Figure 1: Desired Result for Gaps Problem

First, I'll cover a solution using subqueries that's supported in all versions of SQL Server. Then, I'll cover a solution that uses features that are new in SQL Server 2012.

Solution to gaps problem, using subqueries. The following query is a classic solution to the gaps problem, using subqueries:

SELECT col1 + 1 AS rangestart,

(SELECT MIN(B.col1)

FROM dbo.T1 AS B

WHERE B.col1 > A.col1) - 1 AS rangeend

FROM dbo.T1 AS A

WHERE NOT EXISTS

(SELECT *

FROM dbo.T1 AS B

WHERE B.col1 = A.col1 + 1)

AND col1 < (SELECT MAX(col1) FROM dbo.T1);

This query filters only col1 values that appear before a gap (the value plus 1 doesn't exist). It also filters only the col1 values that are less than the maximum value in the table. Then in the SELECT clause the query matches to each value before a gap the value that appears right after the gap (the minimum value that's greater than the current value). What's left is to add 1 to the value before the gap and subtract 1 from the value after the gap to obtain the ranges of missing values.

This solution is quite efficient. I obtained the following performance results for the query:

Single query, few gaps: 3 seconds, 32577 logical reads
Single query, many gaps: 6 seconds, 3029043 logical reads
Stress test, few gaps: 13 seconds, 32577 logical reads
Stress test, many gaps: 27 seconds, 3029043 logical reads

Solution to gaps problem, using LEAD. SQL Server 2012 provides a very simple and elegant solution to the gaps problem, using the new LEAD function. The LEAD function returns the requested element from the next row based on the indicated ordering. Here's the new solution:

WITH C AS

(

SELECT col1 AS cur, LEAD(col1) OVER(ORDER BY col1) AS nxt

FROM dbo.T1

)

SELECT cur + 1 AS rangestart, nxt - 1 AS rangeend

FROM C

WHERE nxt - cur > 1;

The query that creates the CTE C returns for each current col1 value (aliased as cur) the next col1 value (aliased as nxt). Then the outer query against the CTE returns only pairs of cur and nxt values that are more than one apart. As you can guess, those pairs represent the values surrounding the gaps. The query then adds 1 to the cur value and subtracts 1 from the nxt value to return the actual ranges of missing values. Beautifully simple!

I obtained the following performance results for the query:

Single query, few gaps: 11 seconds, 16138 logical reads
Single query, many gaps: 9 seconds, 13718 logical reads
Stress test, few gaps: 45 seconds, 16138 logical reads
Stress test, many gaps: 45 seconds, 13718 logical reads

It's a bit disappointing to see that in both the single execution test and the stress test, the old solution was faster than the new one. However, the new solution requires significantly fewer I/O operations. In addition, it's much much simple and therefore easier to maintain.

Normal Islands

The normal islands problem involves identifying the ranges of existing values. Figure 2 shows the desired result for the small set of sample data in T1.

Figure 2: Desired Result for Normal Islands Problem

The following query is a classic solution to the normal islands problem, using features that are supported by SQL Server 2005 and later:

WITH C AS

(

SELECT col1, col1 - DENSE_RANK() OVER(ORDER BY col1) AS grp

FROM dbo.T1

)

SELECT MIN(col1) AS rangestart, MAX(col1) AS rangeend

FROM C

GROUP BY grp;

The inner query that defines the CTE C uses an expression that uniquely identifies each island. This is achieved by using the DENSE_RANK function. Within each island, both the col1 values and the DENSE_RANK values keep incrementing by an interval of 1 integer. Therefore the difference between the two is constant within each island. Also, when you jump from one island to the next, the col1 value increases by more than 1, whereas the DENSE_RANK value increases by only 1. Therefore the difference for the next island will always be greater than the difference for the preceding island. In other words, the difference between col1 and the DENSE_RANK value is constant per island and unique per island. This means that it identifies the island. You can compute the starting and ending values per island by grouping the rows by the difference and returning the minimum and maximum col1 values per group.

This solution is very efficient and very elegant, but unfortunately it works only for the normal islands problem, in which there's a fixed interval between the entries. Any case in which one entry is greater than the previous entry by more than the fixed interval is considered a gap. To address special island problems, in which gaps of up to a certain size must be ignored, you need to find other solutions.

Special Islands

There are special island problems in which you need to ignore gaps of up to a certain size. For example, suppose that you need to identify islands in the col1 values in T1 but ignore gaps of up to a size of 2. For example, the values 7, 8, 9, and 11 are considered part of the same island because no pair has a difference of more than 2. However, the values 3 and 7 belong to different islands because the difference between the values is greater than 2. Figure 3 shows the desired result for this special case of the islands problem, for the small set of sample data in T1.

Figure 3: Desired Result for Special Islands Problem

First, I'll cover a solution that's supported by SQL Server 2005 and later. Then, I'll cover a couple of solutions that rely on new features in SQL Server 2012.

Solution to special islands problem, using subqueries. The following query is a solution to the special islands problem. This solution is supported by SQL Server 2005 and later:

WITH S AS

(

SELECT col1,

ROW_NUMBER() OVER(ORDER BY col1) AS n

FROM dbo.T1 AS A

WHERE NOT EXISTS

(SELECT *

FROM dbo.T1 AS B

WHERE B.col1 >= A.col1 - 2

AND B.col1 < A.col1)

),

E AS

(

SELECT col1,

ROW_NUMBER() OVER(ORDER BY col1) AS n

FROM dbo.T1 AS A

WHERE NOT EXISTS

(SELECT *

FROM dbo.T1 AS B

WHERE B.col1 > A.col1

AND B.col1 <= A.col1 + 2)

)

SELECT S.col1 AS rangestart, E.col1 AS rangeend

FROM S JOIN E

ON S.n = E.n;

The query that defines the CTE S filters only the col1 values from T1 that start islands (a value greater than the current value minus 2 and less than the current value can't be found). The query assigns row numbers based on col1 ordering to the remaining values.

Similarly, the query that defines the CTE E filters only the col1 values from T1 that end islands (a value greater than the current value and less than the current value plus 2 can't be found). The query assigns row numbers to the remaining values based on col1 ordering. The outer query then relates to each start of an island the respective end of the island by matching the row numbers of the start and end values.

I obtained the following performance results for the query:

Single query, few gaps: 8 seconds, 66298061 logical reads
Single query, many gaps: 14 seconds, 56356747 logical reads
Stress test, few gaps: 142 seconds, 63798069 logical reads
Stress test, many gaps: 188 seconds, 56356747 logical reads

As you can see, the solution involves a significant number of logical reads.

Solution to special islands problem, using LAG and LEAD. Recall from one of the previous solutions that the LEAD function lets you return an element from the next row based on the indicated ordering. Similarly, the LAG function lets you return an element from the previous row based on the indicated ordering. Both functions are new in SQL Server 2012. Here's a new solution to the special islands problem, using the LAG and LEAD functions:

WITH C1 AS

(

SELECT col1,

CASE WHEN col1 - LAG(col1) OVER(ORDER BY col1) <= 2 THEN 0 ELSE 1 END AS isstart,

CASE WHEN LEAD(col1) OVER(ORDER BY col1) - col1 <= 2 THEN 0 ELSE 1 END AS isend

FROM dbo.T1

),

C2 AS

(

SELECT col1 AS rangestart,

CASE

WHEN isend = 1 THEN col1

ELSE LEAD(col1, 1) OVER(ORDER BY col1)

END AS rangeend, isstart

FROM C1

WHERE isstart = 1 OR isend = 1

)

SELECT rangestart, rangeend

FROM C2

WHERE isstart = 1;

The query that defines the CTE C1 uses the LAG and LEAD functions to compute the attributes isstart and isend. The former returns a 0 if the difference between the current and previous col1 values is less than or equal to 2; otherwise it returns a 1. The latter returns a 0 if the difference between the next and current col1 values is less than or equal to 2; otherwise it returns a 1. As you probably figured out, a 1 in the isstart flag means that the value is the start of an island, and a 1 in the isend flag means that the value is the end of an island.

The query that defines the CTE C2 filters only values that are starts or ends of islands and then uses a CASE expression to compute for each start value the respective end value. If there's only one value in the island (both isstart and isend are equal to 1), the CASE expression returns the current value as the end value; otherwise it uses the LEAD function to return the next value as the end value.

I obtained the following performance results for the query:

Single query, few gaps: 26 seconds, 16138 logical reads
Single query, many gaps: 24 seconds, 13718 logical reads
Stress test, few gaps: 117 seconds, 16138 logical reads
Stress test, many gaps: 120 seconds, 13718 logical reads

What's interesting is that the number of logical reads is significantly lower than in the previous solution. Observe that in the single query test, this solution is slower than the previous solution, but in the stress test it performs better -- likely due to less contention related to I/O resources.

Solution to special islands problem, using LAG and COUNT. Here's another new solution to the special islands problem. This solution relies on new windowing capabilities in SQL Server 2012:

WITH C1 AS

(

SELECT col1,

CASE WHEN col1 - LAG(col1) OVER(ORDER BY col1) <= 2 THEN NULL ELSE 1 END AS isstart

FROM dbo.T1

),

C2 AS

(

SELECT col1, COUNT(isstart) OVER(ORDER BY col1 ROWS UNBOUNDED PRECEDING) AS grp

FROM C1

)

SELECT MIN(col1) AS rangestart, MAX(col1) AS rangeend

FROM C2

GROUP BY grp;

The query that defines the CTE C1 uses the LAG function to generate an isstart flag similar to the one in the previous solution. When the value is the start of an island, the flag is set to 1. However, unlike in the previous solution, when the value isn't the start of an island, the flag is set to NULL.

The query that defines the CTE C2 then uses the enhanced COUNT window function to generate the attribute grp to identify each island. The COUNT function computes the running count of the isstart flag based on col1 ordering from the first row in the sequence and until the current row (ROWS UNBOUNDED PRECEDING). Because the COUNT function ignores NULLs, it generates the value 1 for all members of the first island, the value 2 for all members of the second island, and so on. Finally, the outer query groups the rows by the grp value and returns the minimum and maximum col1 values in each group as the start and end values of the range.

I obtained the following performance results for the query:

Single query, few gaps: 28 seconds, 16138 logical reads
Single query, many gaps: 31 seconds, 16186 logical reads
Stress test, few gaps: 146 seconds, 16138 logical reads
Stress test, many gaps: 154 seconds, 13718 logical reads

The run time results for this solution are a bit slower than in the previous solution. However, this solution also generates a significantly smaller number of logical reads compared with the SQL Server 2005–compatible solution.

Comments

Plain text

Solving Gaps and Islands with Enhanced Window Functions

Sample Data and Testing

Gaps

Normal Islands

Special Islands

More Options Lead to Better Solutions

Comments

Plain text