T-SQL Feature Request: Add RESET WHEN Clause to Reset Window Partition

Window functions allow you to solve a wide variety of T-SQL querying tasks elegantly and efficiently. Still, as of SQL Server 2017, there are some tasks that are difficult to solve with efficient set-based solutions that could be handled easily and efficiently if T-SQL added support for a windowing clause called RESET WHEN. This clause resets the window partition when a certain condition—possibly one that is based on a window function—is met. This feature isn’t standard, but is currently supported by Teradata (see documentation article on the feature here). Thanks to Microsoft Data Platform MVP Alejandro Mesa for introducing the feature to me. In this article, I present a couple of classic, representative T-SQL querying tasks, provide the solutions that are used currently, and explain how you would handle them with improved solutions using the RESET WHEN clause. Hopefully, by the end of this article you will be convinced of the importance of this feature and will add your vote to the connect item requesting it.

Depleting Quantities

I’ll start with a challenge involving the computation of running totals that need to be depleted when a certain condition is met. Microsoft Data Platform MVP Geri Reshef presented the original challenge; later, Sharon Rimer of Naya Technologies presented a variation of the challenge.

The sample data for this challenge involves a table called Transactions, which you create and populate using the following code:

SET NOCOUNT ON;
USE tempdb;

DROP TABLE IF EXISTS dbo.Transactions;
GO
CREATE TABLE dbo.Transactions
(
  txid INT NOT NULL CONSTRAINT PK_Transactions PRIMARY KEY,
  qty  INT NOT NULL
);
GO

TRUNCATE TABLE dbo.Transactions;

INSERT INTO dbo.Transactions(txid, qty)
  VALUES(1,2),(2,5),(3,4),(4,1),(5,10),(6,3),
        (7,1),(8,2),(9,1),(10,2),(11,1),(12,9);

The transactions keep adding quantities of some item to a container based on txid ordering. As soon as the cumulative quantity exceeds the container’s capacity (given as input), the container needs to be depleted. Your solution needs to show the state of the container (total quantity) after every transaction, with 0 as the total quantity after depletion, as well as the depletion quantity when relevant. Here’s the desired result for the given sample data and an input container capacity of 5:

txid        qty         totalqty    depletionqty
----------- ----------- ----------- ------------
1           2           2           0
2           5           0           7
3           4           4           0
4           1           5           0
5           10          0           15
6           3           3           0
7           1           4           0
8           2           0           6
9           1           1           0
10          2           3           0
11          1           4           0
12          9           0           13

Solution Based on Cursor

It’s common for people to try and solve such tasks using recursive queries. Such solutions can be elegant, but aren’t very efficient. People also solve such tasks using a method known as “quirky update,” which is very efficient, but not guaranteed since it relies on physical processing order. So far, I haven’t found efficient, guaranteed, set-based solutions for this task, and instead have been reluctantly using iterative solutions (T-SQL- or CLR-based).

Here’s an example for a simple iterative T-SQL solution using a cursor:

SET NOCOUNT ON;

DECLARE @maxallowedqty AS INT = 5;

DECLARE @C AS CURSOR, @txid AS INT, @qty AS INT, @totalqty AS INT = 0, @depletionqty AS INT = 0;

DECLARE @Result AS TABLE
(
  txid         INT NOT NULL PRIMARY KEY,
  qty          INT NOT NULL,
  totalqty     INT NOT NULL,
  depletionqty INT NOT NULL
);

SET @C = CURSOR FORWARD_ONLY STATIC READ_ONLY FOR
  SELECT txid, qty
  FROM dbo.Transactions
  ORDER BY txid;

OPEN @C;

FETCH NEXT FROM @C INTO @txid, @qty;

WHILE @@FETCH_STATUS = 0
BEGIN
  SELECT @totalqty += @qty, @depletionqty = 0;

  IF @totalqty > @maxallowedqty
  BEGIN
    SET @depletionqty = @totalqty;
    SET @totalqty = 0;
  END;

  INSERT INTO @Result(txid, qty, totalqty, depletionqty)
    VALUES(@txid, @qty, @totalqty, @depletionqty);

  FETCH NEXT FROM @C INTO @txid, @qty;
END;

SELECT txid, qty, totalqty, depletionqty
FROM @Result
ORDER BY txid;

The code defines a cursor variable and uses it to fetch the transactions one at a time in chronological order. It accumulates the quantities in a variable called @totalqty. After every row fetched, the code checks if the cumulative quantity exceeds the container capacity, in which case it sets a variable called @depletionqty to the current @totalqty value, and then zeroes the current @totalqty value. The code then writes the current transaction info (txid and qty) along with the current @totalqty and @depletionqty values to a table variable. Once done iterating through all transactions, the code queries the table variable to produce the desired result.

Solution Using RESET WHEN (Not Supported as of SQL Server 2017)

The downsides of using iterative solutions are well known. The question is always whether a descent set-based alternative exists. So far, I haven’t found one myself for the task at hand with the existing tools that we have in T-SQL, and would love to see one created some day. As mentioned, Teradata supports a windowing clause called RESET WHEN that resets the window partition when a certain condition is met. The beauty in this clause is that the condition can use a window function, so you can ask about what accumulated until the previous row. With our task you would reset the window partition when the sum of the quantities from the beginning of the partition and until the previous row exceeds the input container limit, like so:

(Remember, this code currently isn’t supported in SQL Server.)

DECLARE @maxallowedqty AS INT = 5;

SELECT txid, qty,
  SUM(qty) OVER(ORDER BY txid
                RESET WHEN
                  -- reset window partition when 
                  -- running sum until previous row > @maxallowedqty
                  SUM(qty) OVER(ORDER BY txid
                                ROWS BETWEEN UNBOUNDED PRECEDING
                                         AND 1 PRECEDING)
                    > @maxallowedqty
                ROWS UNBOUNDED PRECEDING) AS runsum
FROM dbo.Transactions;

Had this code been supported, it would have produced the following output:


txid        qty         runsum   
----------- ----------- -----------
1           2           2          
2           5           7          
-----------------------------------
3           4           4          
4           1           5          
5           10          15          
-----------------------------------
6           3           3          
7           1           4          
8           2           6          
-----------------------------------
9           1           1          
10          2           3          
11          1           4          
12          9           13

As you can see, the window partition is reset after transaction IDs 2, 5 and 8.

To produce the final desired result, you set the total container quantity (call it totalqty) to zero when the running sum exceeds the container limit, and to the running sum otherwise. You then compute the depletion quantity (call it depletionqty) as the running sum minus the total quantity. Here’s the complete solution’s code:

DECLARE @maxallowedqty AS INT = 5;

WITH C AS
(
  SELECT *,
    SUM(qty) OVER(ORDER BY txid
                  RESET WHEN
                    SUM(qty) OVER(ORDER BY txid
                                  ROWS BETWEEN UNBOUNDED PRECEDING
                                           AND 1 PRECEDING)
                      > @maxallowedqty
                  ROWS UNBOUNDED PRECEDING) AS runsum
  FROM dbo.Transactions
)
SELECT txid, qty, totalqty, runsum - totalqty AS depletionqty
FROM C
  CROSS APPLY
    ( VALUES( CASE WHEN runsum > @maxallowedqty THEN 0 ELSE runsum END ) )
      AS A(totalqty);

As you can see, the solution is simple, concise and elegant!

Recently, Sharon Rimer of Naya Technologies introduced a variation of this challenge based on one of the company's customer cases where the desired result was to compute how many times the container exceeded the input limit. To handle this need, you would keep the same definition for the CTE C, and use the following outer query to compute the count:

SELECT COUNT( CASE WHEN runsum > @maxallowedqty THEN 1 END ) AS timesexceeded
FROM C;

The complete solution would look like this:

WITH C AS
(
  SELECT *,
    SUM(qty) OVER(ORDER BY txid
                  RESET WHEN
                    SUM(qty) OVER(ORDER BY txid
                                  ROWS BETWEEN UNBOUNDED PRECEDING
                                           AND 1 PRECEDING)
                      > @maxallowedqty
                  ROWS UNBOUNDED PRECEDING) AS runsum
  FROM dbo.Transactions
)
SELECT COUNT( CASE WHEN runsum > @maxallowedqty THEN 1 END ) AS timesexceeded
FROM C;

Again, short and elegant.

Complex Islands

There are many other challenges that perhaps do currently have a reasonable T-SQL set-based solution, but that can be solved more easily and elegantly with the RESET WHEN clause. A good example is handling complex islands tasks where you need to define a new island whenever a condition that compares something from the current row and something from the previous row is met. To demonstrate such a challenge, I’ll use a table called Stocks, which you create and populate by running the following code:

SET NOCOUNT ON;
USE tempdb;

DROP TABLE IF EXISTS dbo.StockRates;
GO
CREATE TABLE dbo.StockRates
(
  stockid  INT  NOT NULL,
  dt       DATE NOT NULL,
  val      INT  NOT NULL,
  CONSTRAINT PK_StockRates PRIMARY KEY(stockid, dt)
);
GO

INSERT INTO dbo.StockRates VALUES
  (1, '2017-08-01', 13),
  (1, '2017-08-02', 14),
  (1, '2017-08-03', 17),
  (1, '2017-08-04', 40),
  (1, '2017-08-05', 45),
  (1, '2017-08-06', 52),
  (1, '2017-08-07', 56),
  (1, '2017-08-08', 60),
  (1, '2017-08-09', 70),
  (1, '2017-08-10', 30),
  (1, '2017-08-11', 29),
  (1, '2017-08-12', 35),
  (1, '2017-08-13', 40),
  (1, '2017-08-14', 45),
  (1, '2017-08-15', 60),
  (1, '2017-08-16', 60),
  (1, '2017-08-17', 55),
  (1, '2017-08-18', 60),
  (1, '2017-08-19', 20),
  (1, '2017-08-20', 15),
  (1, '2017-08-21', 20),
  (1, '2017-08-22', 30),
  (1, '2017-08-23', 40),
  (1, '2017-08-24', 20),
  (1, '2017-08-25', 60),
  (1, '2017-08-26', 80),
  (1, '2017-08-27', 70),
  (1, '2017-08-28', 70),
  (1, '2017-08-29', 40),
  (1, '2017-08-30', 30),
  (1, '2017-08-31', 10),
  (2, '2017-08-01', 3),
  (2, '2017-08-02', 4),
  (2, '2017-08-03', 7),
  (2, '2017-08-04', 30),
  (2, '2017-08-05', 35),
  (2, '2017-08-06', 42),
  (2, '2017-08-07', 46),
  (2, '2017-08-08', 50),
  (2, '2017-08-09', 60),
  (2, '2017-08-10', 20),
  (2, '2017-08-11', 19),
  (2, '2017-08-12', 25),
  (2, '2017-08-13', 30),
  (2, '2017-08-14', 35),
  (2, '2017-08-15', 50),
  (2, '2017-08-16', 50),
  (2, '2017-08-17', 45),
  (2, '2017-08-18', 50),
  (2, '2017-08-19', 10),
  (2, '2017-08-20', 5),
  (2, '2017-08-21', 10),
  (2, '2017-08-22', 20),
  (2, '2017-08-23', 30),
  (2, '2017-08-24', 10),
  (2, '2017-08-25', 50),
  (2, '2017-08-26', 70),
  (2, '2017-08-27', 60),
  (2, '2017-08-28', 60),
  (2, '2017-08-29', 30),
  (2, '2017-08-30', 20),
  (2, '2017-08-31', 1);

In this table, you keep track of daily closing stock rates. Say you need to identify periods (islands) where the stock value is greater than or equal to 50, and for each period you need to show when it started and ended, as well as what the maximum stock rate was during that period. The catch is that you need to tolerate (ignore) gaps of up to 6 days. Here’s the desired result for the given sample data:

stockid     startdate  enddate    maxvalue
----------- ---------- ---------- -----------
1           2017-08-06 2017-08-18 70
1           2017-08-25 2017-08-28 80
2           2017-08-08 2017-08-18 60
2           2017-08-25 2017-08-28 70

Observe, for example, that for the first island for stock ID 1 we ignored the gap between Aug. 9, 2017, and Aug. 15, 2017, since it wasn’t longer than 6 days, but we didn’t ignore the gap between Aug. 18, 2017, and Aug. 25, 2017, since it was 7-days long.

Currently Supported T-SQL Solution

As mentioned, there are currently supported set-based solutions for islands challenges like this one, but they are longer and more complex than how you would handle such tasks with the RESET WHEN clause. With one of the currently supported solutions, as the first step you compute a flag (call it isstart) that you set to 0 when it’s not the beginning of an island by comparing something from the current row and with something from the previous row (obtained with the LAG function). Otherwise, you set the flag to 1. In our case, after filtering only rows where the stock value is greater than or equal to 50, you set the flag to 0 when the difference between the previous date and the current date is less than 7 days; otherwise, you set the flag to 1. Here’s the code that implements this step:

SELECT stockid, dt, val,
  CASE
    WHEN DATEDIFF(day, LAG(dt) OVER(PARTITION BY stockid ORDER BY dt), dt)
           < 7
      THEN 0
    ELSE 1
  END AS isstart
FROM dbo.StockRates
WHERE val >= 50;

This code generates the following output:

stockid     dt         val         isstart
----------- ---------- ----------- -----------
1           2017-08-06 52          1
1           2017-08-07 56          0
1           2017-08-08 60          0
1           2017-08-09 70          0
1           2017-08-15 60          0
1           2017-08-16 60          0
1           2017-08-17 55          0
1           2017-08-18 60          0
1           2017-08-25 60          1
1           2017-08-26 80          0
1           2017-08-27 70          0
1           2017-08-28 70          0
2           2017-08-08 50          1
2           2017-08-09 60          0
2           2017-08-15 50          0
2           2017-08-16 50          0
2           2017-08-18 50          0
2           2017-08-25 50          1
2           2017-08-26 70          0
2           2017-08-27 60          0
2           2017-08-28 60          0

As a second step, you produce an island identifier by computing the running sum of the isstart flag. Finally, you group the data by stockid and isstart, and return the start and end dates of the island, plus the maximum stock rate during the period. Here’s the complete solution’s code:

WITH C1 AS
(
  SELECT *,
    CASE
      WHEN DATEDIFF(day, LAG(dt) OVER(PARTITION BY stockid ORDER BY dt), dt)
             < 7
        THEN 0
      ELSE 1
    END AS isstart
  FROM dbo.StockRates
  WHERE val >= 50
),
C2 AS
(
  SELECT *,
    SUM(isstart) OVER(PARTITION BY stockid ORDER BY dt
                      ROWS UNBOUNDED PRECEDING) AS grp
  FROM C1
)
SELECT stockid,
  MIN(dt) AS startdate,
  MAX(dt) AS enddate,
  MAX(val) as maxvalue
FROM C2
GROUP BY stockid, grp
ORDER BY stockid, startdate;

Solution Using RESET WHEN (Not Supported as of SQL Server 2017)

Using the RESET WHEN clause, it would have been easier to solve the task since you can simply reset the window partition when the condition for starting a new island is met. You could then use the minimum date in the partition is the island identifier.

Here’s an example of how you would compute the island identifier (call it grp):

SELECT stockid, dt, val,
  MIN(dt) OVER(PARTITION BY stockid
               ORDER BY dt
               RESET WHEN DATEDIFF(day, 
                 MIN(dt) OVER(
                   PARTITION BY stockid ORDER BY dt
                   ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING),
                 dt) >= 7
               ROWS UNBOUNDED PRECEDING) AS grp
FROM dbo.Stocks
WHERE val >= 50;

Here the code uses the MIN window function to obtain the date from the previous row by using the window frame extent ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING. Alternatively, you could get the date from the previous row with the LAG function, like so (showing only the alternative RESET WHEN clause):

RESET WHEN DATEDIFF(day, 
  LAG(dt) OVER(PARTITION BY stockid ORDER BY dt),
  dt) >= 7

Here’s what the output of this step would be, showing the computed group identifier for each island:


stockid     dt         val         grp
----------- ---------- ----------- -----------
1           2017-08-06 52          2017-08-06
1           2017-08-07 56          2017-08-06
1           2017-08-08 60          2017-08-06
1           2017-08-09 70          2017-08-06
1           2017-08-15 60          2017-08-06
1           2017-08-16 60          2017-08-06
1           2017-08-17 55          2017-08-06
1           2017-08-18 60          2017-08-06
1           2017-08-25 60          2017-08-25
1           2017-08-26 80          2017-08-25
1           2017-08-27 70          2017-08-25
1           2017-08-28 70          2017-08-25
2           2017-08-08 50          2017-08-08
2           2017-08-09 60          2017-08-08
2           2017-08-15 50          2017-08-08
2           2017-08-16 50          2017-08-08
2           2017-08-18 50          2017-08-08
2           2017-08-25 50          2017-08-25
2           2017-08-26 70          2017-08-25
2           2017-08-27 60          2017-08-25
2           2017-08-28 60          2017-08-25

Then you would need just one more step to group this data by stockid and grp, and return the beginning and end of each period, plus the maximum stock rate within the period, like so:

WITH C AS
(
  SELECT *,
    MIN(dt) OVER(PARTITION BY stockid
                 ORDER BY dt
                 RESET WHEN DATEDIFF(day, 
                   LAG(dt) OVER(PARTITION BY stockid ORDER BY dt),
                   dt) >= 7
                 ROWS UNBOUNDED PRECEDING) AS grp
  FROM dbo.Stocks
  WHERE val >= 50
)
SELECT stockid,
  grp AS startdate,
  MAX(dt) AS enddate,
  MAX(val) as maxvalue
FROM C
GROUP BY stockid, grp
ORDER BY stockid, grp;

As you can see, this solution is shorter and simpler than the currently supported one.

Conclusion

Window functions are profound; there’s no question about it. Wouldn’t you like to see them becoming even more powerful? The RESET WHEN clause, if implemented in T-SQL, will doubtless increase its utility and enable us to replace existing iterative solutions with elegant and more efficient set-based ones. If you find that the addition of this feature would be beneficial to you, be sure to add your vote to the connect item requesting it.

Comments

Plain text