One of my fellow SQL Server MVPs, Simon Sabin, recently introduced a puzzle in a private group. Initially, the task seemed to be straightforward; however, as it turned out, solving the puzzle efficiently requires some creativity.
I asked Simon's permission to share this puzzle with SQL Server Pro readers. The need seems to be common enough that I think readers will be interested in a discussion of the topic, as well as finding efficient solutions.
The puzzle is pretty simple. (Aren't simple puzzles the best ones?) Given a table T1, with a key column called id and a NULLable value column called col1, return the last non NULL col1 value based on id order.
Run the code in Listing 1 to create the table T1 and fill it with a small set of sample data.
-- DDL for T1 SET NOCOUNT ON; USE tempdb; IF OBJECT_ID(N'dbo.T1', N'U') IS NOT NULL DROP TABLE dbo.T1; GO CREATE TABLE dbo.T1 ( id INT NOT NULL CONSTRAINT PK_T1 PRIMARY KEY, col1 INT NULL ); -- Small set of sample data TRUNCATE TABLE dbo.T1; INSERT INTO dbo.T1(id, col1) VALUES ( 2, NULL), ( 3, 10), ( 5, -1), ( 7, NULL), (11, NULL), (13, -12), (17, NULL), (19, NULL), (23, 1759);
Figure 1 shows the desired result with the small set of sample data.
id col1 lastval ----------- ----------- ----------- 2 NULL NULL 3 10 10 5 -1 -1 7 NULL -1 11 NULL -1 13 -12 -12 17 NULL -12 19 NULL -12 23 1759 1759
You might have faced a similar need yourself—but in case you haven't and are looking for a practical scenario, consider the following.
T1 stores a sequence of events, with the id column representing their order. An event has a number of attributes, one of which is represented by col1. When an attribute value changes, the event row shows the new value; for attributes whose values are unchanged, the event row shows NULLs.
Your task is to show with every event the applicable values (last non NULL) of all attributes. This is a simplified example of what Simon had in mind when he presented the puzzle. However, the simplified case is adequate in representing the fundamental need for an efficient solution for finding the last non NULL.
As always, I urge you to try to solve the puzzle yourself before looking at my solutions. It's an excellent challenge and can help sharpen your skills in using window functions, irrespective of the practicality of this specific case.
Use the small set of sample data that Listing 1 generates to check the validity of your solution. Compare the result that you get with the desired result shown in Figure 1.
Use the code in Listing 2 to populate the table with 10,000,000 rows so that you can test the performance of your solution.
-- Definition of GetNums: http://tsql.solidq.com/GetNums.txt TRUNCATE TABLE dbo.T1; INSERT INTO dbo.T1 WITH(TABLOCK) SELECT n AS id, CHECKSUM(NEWID()) AS col1 FROM TSQLV3.dbo.GetNums(1, 10000000) AS Nums OPTION(MAXDOP 1);
The performance numbers that I present were measured against the large set of sample data, with results discarded.
Solution 1 Using Only Window Functions
The first solution that I discuss uses two layers of the MAX window aggregate function. There are two main steps in the solution. The following code implements the first step:
SELECT id, col1, relevantid, MAX(relevantid) OVER( ORDER BY id ROWS UNBOUNDED PRECEDING ) AS grp FROM dbo.T1 CROSS APPLY ( VALUES( CASE WHEN col1 IS NOT NULL THEN id END ) ) AS A(relevantid);
To make it easy to understand what this step does, follow my explanations while examining the output of this step, which Figure 2 shows.
id col1 relevantid grp ----------- ----------- ----------- ----------- 2 NULL NULL NULL 3 10 3 3 5 -1 5 5 7 NULL NULL 5 11 NULL NULL 5 13 -12 13 13 17 NULL NULL 13 19 NULL NULL 13 23 1759 23 23
The code uses a CASE expression to compute a result column called relevantid. This column shows the value of id when the value of col1 changes and a NULL when it doesn't. The critical thing about id is that by definition its values keep increasing; that's not necessarily the case with the values in col1. The MAX window aggregate function returns the maximum relevantid value so far (call the result column grp).
Observe in Figure 2 that the group value (grp) changes to the current id whenever the col1 value changes, and it remains the same as long as the col1 value doesn't change. So, the maximum col1 value within each group of events is the applicable col1 value for those events.
The second step in the solution implements the last calculation, like so:
WITH C AS ( SELECT id, col1, relevantid, MAX(relevantid) OVER( ORDER BY id ROWS UNBOUNDED PRECEDING ) AS grp FROM dbo.T1 CROSS APPLY ( VALUES( CASE WHEN col1 IS NOT NULL THEN id END ) ) AS A(relevantid) ) SELECT *, MAX(col1) OVER( PARTITION BY grp ORDER BY id ROWS UNBOUNDED PRECEDING ) AS maxcol1ingrp FROM C;
Figure 3 shows the output of this step.
id col1 relevantid grp maxcol1ingrp ----------- ----------- ----------- ----------- ------------ 2 NULL NULL NULL NULL 3 10 3 3 10 5 -1 5 5 -1 7 NULL NULL 5 -1 11 NULL NULL 5 -1 13 -12 13 13 -12 17 NULL NULL 13 -12 19 NULL NULL 13 -12 23 1759 23 23 1759
The result column maxcol1ingrp is the last non NULL you were looking for. Here's the complete solution after inlining the computation of relevantid:
WITH C AS ( SELECT id, col1, MAX( CASE WHEN col1 IS NOT NULL THEN id END ) OVER( ORDER BY id ROWS UNBOUNDED PRECEDING ) AS grp FROM dbo.T1 ) SELECT id, col1, MAX(col1) OVER( PARTITION BY grp ORDER BY id ROWS UNBOUNDED PRECEDING ) AS lastval FROM C;
Figure 4 shows the plan for this solution against the large set of sample data.
The first window aggregate function relies on index order. However, because the second window aggregate function uses a different window specification than the first, it requires an explicit sort. That's the most expensive part of the plan. It took this solution 66 seconds to complete on my system.
The second solution that I present eliminates the need for explicit sorting.
Solution 2 Using Concatenation and a Window Function
The second solution can also be split into two steps. The following code implements the first step:
SELECT id, col1, binstr, MAX( binstr ) OVER( ORDER BY id ROWS UNBOUNDED PRECEDING ) AS mx FROM dbo.T1 CROSS APPLY ( VALUES( CAST(id AS BINARY(4)) + CAST(col1 AS BINARY(4)) ) ) AS A(binstr);
Again, follow my explanations while observing the code and its output, which Figure 5 shows.
id col1 binstr mx ----------- ----------- ------------------ ------------------ 2 NULL NULL NULL 3 10 0x000000030000000A 0x000000030000000A 5 -1 0x00000005FFFFFFFF 0x00000005FFFFFFFF 7 NULL NULL 0x00000005FFFFFFFF 11 NULL NULL 0x00000005FFFFFFFF 13 -12 0x0000000DFFFFFFF4 0x0000000DFFFFFFF4 17 NULL NULL 0x0000000DFFFFFFF4 19 NULL NULL 0x0000000DFFFFFFF4 23 1759 0x00000017000006DF 0x00000017000006DF
This step converts the id value and the col1 value to binary strings and concatenates them. The code names the result column binstr. Observe in Figure 5 that when col1 changes its value, binstr shows the concatenated strings; when col1 doesn't change its value, binstr is NULL. The MAX window aggregate function computes the maximum binstr value so far. The code calls the result column mx.
The right four bytes within mx represent the binary form of the last non NULL value of col1. The second step in the solution extracts those right four bytes and converts them to INT to return the desired result with the proper type. Here's the code that implements step 2, which is also the complete solution:
SELECT id, col1, CAST( SUBSTRING( MAX( CAST(id AS BINARY(4)) + CAST(col1 AS BINARY(4)) ) OVER( ORDER BY id ROWS UNBOUNDED PRECEDING ), 5, 4) AS INT) AS lastval FROM dbo.T1;
Figure 6 shows the plan for this solution.
The beauty of this solution is that it involves only one window aggregate function—and that function relies on index order, as you can see in the query plan. No explicit sorting is required. This solution took 33 seconds to complete on my system.
The last non NULL task is a beautiful puzzle for several reasons. It's a common and simple need, but apparently there's no straightforward solution. Creating an efficient solution requires a good understanding of window functions and how they are optimized, as well as some creativity.
I demonstrated two different solutions. One solution uses only window functions but involves an explicit sort in the plan. The other solution uses a combination of concatenation and a window function and doesn't involve an explicit sort in the plan.