Suppose you have a table that contains a sequence of values. The values were generated as consecutive values, but after some deletions you now have gaps. Your task is to identify the minimum missing value after the minimum existing value.
As an example, the following code creates a table called T1 with a column called col1 and populates it with close to 10,000,000 rows with some gaps (note the link to download the source code for the helper function GetNums):
SET NOCOUNT ON; USE tempdb; IF OBJECT_ID(N'dbo.T1', N'U') IS NOT NULL DROP TABLE dbo.T1; GO SELECT ISNULL(CAST(n AS INT), 0) AS col1 INTO dbo.T1 FROM dbo.GetNums(1, 10000000) AS Nums -- http://tsql.solidq.com/getnums.txt WHERE n % 1000 <> 0 OPTION (MAXDOP 1); ALTER TABLE dbo.T1 ADD CONSTRAINT PK_T1 PRIMARY KEY(col1);
The primary key constraint causes a clustered index to be created with col1 as the key.
As I mentioned, your task is to identify the minimum missing key after the minimum existing key. Of course, you want the solution to be as efficient as possible. Try to come up with a solution of your own before looking at mine.
Run the following code to enable I/O and time statistics:
SET STATISTICS IO, TIME ON;
My first attempt at a solution was using the following query:
SELECT MIN(A.col1) + 1 AS val FROM dbo.T1 AS A WHERE NOT EXISTS (SELECT * FROM dbo.T1 AS B WHERE B.col1 = A.col1 + 1);
This query uses a NOT EXISTS predicate to filter values in T1 (as A) if the A value plus 1 doesn't exist in T1 (as B). Then the outer query returns the minimum of the filtered values in A plus 1. Examine the execution plan for this query, which Figure 1 shows.
The index on col1 is scanned twice in order, once for B and once for A, to support a Merge Join (Right Anti Semi Join) operator. Because the rows are streaming out of the Merge operator based on col1 order, theoretically the plan could have applied a short circuit as soon as the first row was received. That first received value by the Aggregate operator is already the qualifying value, but the operator seems to be unaware of this. Unfortunately, both instances of the index are scanned to completion, the Merge operator processes all rows, and the Aggregate operator processes all rows. This query took 5 seconds to run on my system, and it performed 32,194 logical reads.
In attempt to get a short circuit, I replaced the use of the MIN aggregate with a TOP (1) A.col1 + 1, ordered by A.col1, like so:
SELECT TOP (1) A.col1 + 1 AS val FROM dbo.T1 AS A WHERE NOT EXISTS (SELECT * FROM dbo.T1 AS B WHERE B.col1 = A.col1 + 1) ORDER BY A.col1;
Figure 2 shows the plan for this query.
I expected the plan to use a Top operator that short-circuits the activity after one row is received, but again, it scans the inputs to completion and uses a Sort (Top N Sort) operator to find the minimum value. This solution also took 5 seconds to complete, performing 32,194 reads.
The curious thing to observe in the plan is that the Merge operator's Where property compares A.col1 + 1 (Expr1003) with B.col1. So the rows are assumed to be returned from the Merge operator sorted by A.col1 + 1, but the query's ORDER BY clause is based on A.col1 (without the + 1). A simple fix is to change the ORDER BY clause to A.col1 + 1, like so:
SELECT TOP (1) A.col1 + 1 AS val FROM dbo.T1 AS A WHERE NOT EXISTS (SELECT * FROM dbo.T1 AS B WHERE B.col1 = A.col1 + 1) ORDER BY A.col1 + 1; -- or ORDER BY val
Figure 3 shows the plan for this query is shown.
This time, there's a Top operator that short-circuits the work after one row is received. Therefore, the scans don't continue beyond the first point where a match was found. This query ran for only 4 milliseconds on my system, performing only 26 reads.
Another solution that I found to be quite efficient is based on using the LEAD function, like so:
WITH C AS ( SELECT col1, LEAD(col1) OVER(ORDER BY col1) AS nxt FROM dbo.T1 ) SELECT TOP (1) col1 + 1 FROM C WHERE nxt - col1 > 1 OR nxt IS NULL ORDER BY col1;
You use the LEAD function to return for each current value (call it cur) the next value (call it nxt). You enclose the query in a CTE (call it C). Then the outer query filters only the rows where the difference between cur and nxt is greater than 1, or nxt is null, and with a TOP (1) filter ordered by col1 returns the first match. Figure 4 shows the plan for this query.
The rows are scanned from the index on col1 in order, and a Top operator short-circuits the work after the first match is found. This plan ran for 40 milliseconds on my system, performing only 4 reads.
When you're done, run the following code to disable reporting statistics and to drop the table:
SET STATISTICS IO, TIME ON; IF OBJECT_ID(N'dbo.T1', N'U') IS NOT NULL DROP TABLE dbo.T1;