missing puzzle piece

The Last non NULL Puzzle

Two solutions to a common problem

Download the Code iconOne of my fellow SQL Server MVPs, Simon Sabin, recently introduced a puzzle in a private group. Initially, the task seemed to be straightforward; however, as it turned out, solving the puzzle efficiently requires some creativity.

I asked Simon's permission to share this puzzle with SQL Server Pro readers. The need seems to be common enough that I think readers will be interested in a discussion of the topic, as well as finding efficient solutions.

The Puzzle

The puzzle is pretty simple. (Aren't simple puzzles the best ones?) Given a table T1, with a key column called id and a NULLable value column called col1, return the last non NULL col1 value based on id order.

Run the code in Listing 1 to create the table T1 and fill it with a small set of sample data.

-- DDL for T1
SET NOCOUNT ON;
USE tempdb;
IF OBJECT_ID(N'dbo.T1', N'U') IS NOT NULL DROP TABLE dbo.T1;
GO
CREATE TABLE dbo.T1
(
  id INT NOT NULL CONSTRAINT PK_T1 PRIMARY KEY,
  col1 INT NULL
);

-- Small set of sample data
TRUNCATE TABLE dbo.T1;

INSERT INTO dbo.T1(id, col1) VALUES
  ( 2, NULL),
  ( 3,   10),
  ( 5,   -1),
  ( 7, NULL),
  (11, NULL),
  (13,  -12),
  (17, NULL),
  (19, NULL),
  (23, 1759);

Figure 1 shows the desired result with the small set of sample data.

id      col1    lastval
----------- ----------- -----------
2       NULL    NULL
3       10      10
5       -1      -1
7       NULL    -1
11      NULL    -1
13      -12     -12
17      NULL    -12
19      NULL    -12
23      1759    1759

You might have faced a similar need yourself—but in case you haven't and are looking for a practical scenario, consider the following.

T1 stores a sequence of events, with the id column representing their order. An event has a number of attributes, one of which is represented by col1. When an attribute value changes, the event row shows the new value; for attributes whose values are unchanged, the event row shows NULLs.

Your task is to show with every event the applicable values (last non NULL) of all attributes. This is a simplified example of what Simon had in mind when he presented the puzzle. However, the simplified case is adequate in representing the fundamental need for an efficient solution for finding the last non NULL.

As always, I urge you to try to solve the puzzle yourself before looking at my solutions. It's an excellent challenge and can help sharpen your skills in using window functions, irrespective of the practicality of this specific case.

Use the small set of sample data that Listing 1 generates to check the validity of your solution. Compare the result that you get with the desired result shown in Figure 1.

Use the code in Listing 2 to populate the table with 10,000,000 rows so that you can test the performance of your solution.

-- Definition of GetNums: http://tsql.solidq.com/GetNums.txt
TRUNCATE TABLE dbo.T1;

INSERT INTO dbo.T1 WITH(TABLOCK)
  SELECT n AS id, CHECKSUM(NEWID()) AS col1
  FROM TSQLV3.dbo.GetNums(1, 10000000) AS Nums
OPTION(MAXDOP 1);

The performance numbers that I present were measured against the large set of sample data, with results discarded.

Solution 1 Using Only Window Functions

The first solution that I discuss uses two layers of the MAX window aggregate function. There are two main steps in the solution. The following code implements the first step:

SELECT id, col1, relevantid,
  MAX(relevantid) OVER( ORDER BY id
            ROWS UNBOUNDED PRECEDING ) AS grp
FROM dbo.T1
  CROSS APPLY ( VALUES( CASE WHEN col1 IS NOT NULL THEN id END ) )
    AS A(relevantid);

To make it easy to understand what this step does, follow my explanations while examining the output of this step, which Figure 2 shows.

id      col1    relevantid  grp
----------- ----------- ----------- -----------
2       NULL    NULL    NULL
3       10      3       3
5       -1      5       5
7       NULL    NULL    5
11      NULL    NULL    5
13      -12     13      13
17      NULL    NULL    13
19      NULL    NULL    13
23      1759    23      23

The code uses a CASE expression to compute a result column called relevantid. This column shows the value of id when the value of col1 changes and a NULL when it doesn't. The critical thing about id is that by definition its values keep increasing; that's not necessarily the case with the values in col1. The MAX window aggregate function returns the maximum relevantid value so far (call the result column grp).

Observe in Figure 2 that the group value (grp) changes to the current id whenever the col1 value changes, and it remains the same as long as the col1 value doesn't change. So, the maximum col1 value within each group of events is the applicable col1 value for those events.

The second step in the solution implements the last calculation, like so:

WITH C AS
(
  SELECT id, col1, relevantid,
    MAX(relevantid) OVER( ORDER BY id
              ROWS UNBOUNDED PRECEDING ) AS grp
  FROM dbo.T1
    CROSS APPLY ( VALUES( CASE WHEN col1 IS NOT NULL THEN id END ) )
  AS A(relevantid)
)
SELECT *,
  MAX(col1) OVER( PARTITION BY grp
          ORDER BY id
          ROWS UNBOUNDED PRECEDING ) AS maxcol1ingrp
FROM C;

Figure 3 shows the output of this step.

id      col1    relevantid  grp     maxcol1ingrp
----------- ----------- ----------- ----------- ------------
2       NULL    NULL    NULL    NULL
3       10      3       3       10
5       -1      5       5       -1
7       NULL    NULL    5       -1
11      NULL    NULL    5       -1
13      -12     13      13      -12
17      NULL    NULL    13      -12
19      NULL    NULL    13      -12
23      1759    23      23      1759

The result column maxcol1ingrp is the last non NULL you were looking for. Here's the complete solution after inlining the computation of relevantid:

WITH C AS
(
  SELECT id, col1,
    MAX( CASE WHEN col1 IS NOT NULL THEN id END )
  OVER( ORDER BY id
        ROWS UNBOUNDED PRECEDING ) AS grp
  FROM dbo.T1
)
SELECT id, col1,
  MAX(col1) OVER( PARTITION BY grp
          ORDER BY id
          ROWS UNBOUNDED PRECEDING ) AS lastval
FROM C;

Figure 4 shows the plan for this solution against the large set of sample data.

Plan for Solution 1
Figure 4: Plan for Solution 1

The first window aggregate function relies on index order. However, because the second window aggregate function uses a different window specification than the first, it requires an explicit sort. That's the most expensive part of the plan. It took this solution 66 seconds to complete on my system.

The second solution that I present eliminates the need for explicit sorting.

Solution 2 Using Concatenation and a Window Function

The second solution can also be split into two steps. The following code implements the first step:

SELECT id, col1, binstr,
  MAX( binstr ) OVER( ORDER BY id
          ROWS UNBOUNDED PRECEDING ) AS mx
FROM dbo.T1
  CROSS APPLY ( VALUES( CAST(id AS BINARY(4)) + CAST(col1 AS BINARY(4)) ) )
    AS A(binstr);

Again, follow my explanations while observing the code and its output, which Figure 5 shows.

id      col1    binstr         mx
----------- ----------- ------------------ ------------------
2       NULL    NULL           NULL
3       10      0x000000030000000A 0x000000030000000A
5       -1      0x00000005FFFFFFFF 0x00000005FFFFFFFF
7       NULL    NULL           0x00000005FFFFFFFF
11      NULL    NULL           0x00000005FFFFFFFF
13      -12     0x0000000DFFFFFFF4 0x0000000DFFFFFFF4
17      NULL    NULL           0x0000000DFFFFFFF4
19      NULL    NULL           0x0000000DFFFFFFF4
23      1759    0x00000017000006DF 0x00000017000006DF

This step converts the id value and the col1 value to binary strings and concatenates them. The code names the result column binstr. Observe in Figure 5 that when col1 changes its value, binstr shows the concatenated strings; when col1 doesn't change its value, binstr is NULL. The MAX window aggregate function computes the maximum binstr value so far. The code calls the result column mx.

The right four bytes within mx represent the binary form of the last non NULL value of col1. The second step in the solution extracts those right four bytes and converts them to INT to return the desired result with the proper type. Here's the code that implements step 2, which is also the complete solution:

SELECT id, col1,
  CAST(
    SUBSTRING(
  MAX( CAST(id AS BINARY(4)) + CAST(col1 AS BINARY(4)) )
    OVER( ORDER BY id
      ROWS UNBOUNDED PRECEDING ),
  5, 4)
    AS INT) AS lastval
FROM dbo.T1;

Figure 6 shows the plan for this solution.

Plan for Solution 2
Figure 6: Plan for Solution 2

The beauty of this solution is that it involves only one window aggregate function—and that function relies on index order, as you can see in the query plan. No explicit sorting is required. This solution took 33 seconds to complete on my system.

Conclusion

The last non NULL task is a beautiful puzzle for several reasons. It's a common and simple need, but apparently there's no straightforward solution. Creating an efficient solution requires a good understanding of window functions and how they are optimized, as well as some creativity.

I demonstrated two different solutions. One solution uses only window functions but involves an explicit sort in the plan. The other solution uses a combination of concatenation and a window function and doesn't involve an explicit sort in the plan.

Hide comments

Comments

  • Allowed HTML tags: <em> <strong> <blockquote> <br> <p>

Plain text

  • No HTML tags allowed.
  • Web page addresses and e-mail addresses turn into links automatically.
  • Lines and paragraphs break automatically.
Publish