T-SQL Interval Graphs Challenge, Part 2

Complete solution to the drug prescription interval packing problem

In "T-SQL Interval Graphs Challenge, Part 1," I introduced a puzzle that was originally given to me by John Paul Cook, who is a registered nurse and a fellow SQL Server MVP. The puzzle involves counting drug exposure based on data modeled as an interval graph. You can find a full description of the puzzle in "Counting Drug Exposure in SASĀ® with Interval Graph Modeling" by Xiaoqiang Wang. In this paper, Wang explains the puzzle and how to use SAS to solve it. John's challenge to me was to use a set-based T-SQL solution to solve the puzzle.

The puzzle consists of two parts. I covered the first part of the puzzle in "T-SQL Interval Graphs Challenge, Part 1"; I'll cover the second part of the puzzle in this article. In addition to John, I'd also like to thank Alejandro Mesa, Paul White, Adam Machanic, Peter Larsson, and Geri Reshef for their input.

Last Month's Challenge

The puzzle involves data representing drug prescriptions. Run the code in Listing 1 to create the Prescriptions and Drugs tables and fill them with small sets of sample data to validate your solutions.

```SET NOCOUNT ON;
USE tempdb;

IF OBJECT_ID(N'dbo.Prescriptions', N'U') IS NOT NULL DROP TABLE dbo.Prescriptions;
IF OBJECT_ID(N'dbo.Patients', N'U') IS NOT NULL DROP TABLE dbo.Patients;

CREATE TABLE dbo.Patients
(
patientid INT NOT NULL,
CONSTRAINT PK_Patients PRIMARY KEY(patientid)
);

CREATE TABLE dbo.Prescriptions
(
prescriptionid INT  NOT NULL IDENTITY,
patientid      INT  NOT NULL,
drugid     INT  NOT NULL,
startdate      DATE NOT NULL,
numdays    INT  NOT NULL,
CONSTRAINT CHK_Prescriptions_ed_sd CHECK(numdays > 0)
);

CREATE UNIQUE CLUSTERED INDEX idx_start
ON dbo.Prescriptions(patientid, drugid, startdate, prescriptionid);

ALTER TABLE dbo.Prescriptions
ADD CONSTRAINT PK_Prescriptions PRIMARY KEY NONCLUSTERED(prescriptionid);

-- code to fill tables with small set of sample data
TRUNCATE TABLE dbo.Prescriptions;
TRUNCATE TABLE dbo.Patients;

INSERT INTO dbo.Patients(patientid) VALUES(1);

INSERT INTO dbo.Prescriptions(patientid, drugid, startdate, numdays) VALUES
(1,  1, '20140101', 3),
(1,  2, '20140101', 5),
(1,  3, '20140102', 4),
(1,  4, '20140102', 5),
(1,  4, '20140103', 2),
(1,  2, '20140105', 5),
(1,  3, '20140106', 4),
(1,  1, '20140107', 3),
(1,  4, '20140108', 1),
(1,  4, '20140120', 4),
(1,  4, '20140122', 1),
(1,  5, '20140212', 3),
(1,  5, '20140215', 3),
(1,  5, '20140220', 1),
(1,  5, '20140223', 1),
(1,  5, '20140226', 1);```

Run the code in Listing 2 to create a helper function called GetNums and populate the tables with large sets of sample data to check the performance of your solutions.

```IF OBJECT_ID(N'dbo.GetNums', N'IF') IS NOT NULL DROP FUNCTION dbo.GetNums;
GO

CREATE FUNCTION dbo.GetNums(@low AS BIGINT, @high AS BIGINT) RETURNS TABLE
AS
RETURN
WITH
L0   AS (SELECT c FROM (SELECT 1 UNION ALL SELECT 1) AS D(c)),
L1   AS (SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),
L2   AS (SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),
L3   AS (SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),
L4   AS (SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),
L5   AS (SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B),
Nums AS (SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS rownum
FROM L5)
SELECT TOP(@high - @low + 1) @low + rownum - 1 AS n
FROM Nums
ORDER BY rownum;
GO

DECLARE
@numpatients      AS INT  = 10000,
@numdrugs     AS INT  = 10,
@numprescriptions AS INT  = 10,
@firststartdate   AS DATE = '20140101',
@laststartdate    AS DATE = '20141231',
@maxnumdays       AS INT  = 30;

TRUNCATE TABLE dbo.Prescriptions;
TRUNCATE TABLE dbo.Patients;

INSERT INTO dbo.Patients(patientid)
SELECT PT.n AS patientid
FROM dbo.GetNums(1, @numpatients) AS PT;

INSERT INTO dbo.Prescriptions(patientid, drugid, startdate, numdays)
SELECT
PT.n AS patientid,
D.n AS drugid,
ABS(CHECKSUM(NEWID())) % DATEDIFF(day, @firststartdate, @laststartdate),
@firststartdate ) AS startdate,
ABS(CHECKSUM(NEWID())) % @maxnumdays + 1 AS numdays
FROM dbo.GetNums(1, @numpatients) AS PT
CROSS JOIN dbo.GetNums(1, @numdrugs) AS D
CROSS JOIN dbo.GetNums(1, @numprescriptions) AS PR;```

The first part of the puzzle (which I covered in "T-SQL Interval Graphs Challenge, Part 1") was to pack the prescriptions of the same patient and drug in a rollover manner. Figure 1 graphically depicts the input and desired results.

The bars represent the input prescription periods and the arrows represent the resulting packed prescriptions.

Figure 2 shows the packed subscriptions as query output.

```patientid    drugid       startdate     enddate
-----------  -----------  ----------    ----------
1        1        2014-01-01    2014-01-04
1        1        2014-01-07    2014-01-10
1        2        2014-01-01    2014-01-11
1        3        2014-01-02    2014-01-10
1        4        2014-01-02    2014-01-10
1        4        2014-01-20    2014-01-25
1        5        2014-02-12    2014-02-18
1        5        2014-02-20    2014-02-21
1        5        2014-02-23    2014-02-24
1        5        2014-02-26    2014-02-27```

The set representing the packed prescriptions is used as the input to the second part of the puzzle, which is the focus of this month's article. Run the code in Listing 3 to create a view called PackedPrescriptions, which is based on the solution query in "T-SQL Interval Graphs Challenge, Part 1."

```IF OBJECT_ID(N'dbo.PackedPrescriptions', N'V') IS NOT NULL
DROP VIEW dbo.PackedPrescriptions;
GO

CREATE VIEW dbo.PackedPrescriptions
AS

WITH C1 AS
(
SELECT prescriptionid, patientid, drugid, startdate, numdays,
- SUM(numdays) OVER(PARTITION BY patientid, drugid
ORDER BY startdate, prescriptionid
ROWS UNBOUNDED PRECEDING) + numdays,
startdate) AS grphelper
FROM dbo.Prescriptions
),
C2 AS
(
SELECT patientid, drugid, startdate, numdays,
MAX(grphelper) OVER(PARTITION BY patientid, drugid
ORDER BY startdate, prescriptionid
ROWS UNBOUNDED PRECEDING) AS grp
FROM C1
)
SELECT
patientid, drugid,
MIN(startdate) AS startdate,
FROM C2
GROUP BY patientid, drugid, grp;
GO```

Recall the way the information about the packed prescriptions is provided. The start date is the first date in each packed period when the patient consumes a drug, and the end date is the date following the last consumption date. For example, patient 1 consumed 10 daily doses of drug 2 starting on January 1, 2014, ending on January 11, 2014. This is what's called a left-closed, right-open interval in mathematics, because it's inclusive of the start but exclusive of the end. The standard notation used for such an interval is [startdate, enddate).

This Month's Challenge

This month's puzzle works with the PackedPrescriptions view as the input data. The task is to identify periods in which the patient is at risk due to exposure to a certain minimum number of drugs (provided as input @cnt) of a certain class. For example, suppose that @cnt = 3. Figure 3 graphically depicts the input prescription periods as arrows and the desired output risk periods as red rectangles.

Figure 4 shows the desired output that your solution should produce.

```patientid     startdate    enddate
-----------   ----------   ----------
1         2014-01-02   2014-01-10```

In a similar manner, Figure 5 and Figure 6 show the information assuming @cnt = 4.

```patientid     startdate    enddate
-----------   ----------   ----------
1         2014-01-02   2014-01-04
1         2014-01-07   2014-01-10```

Next, I'll present my solution in steps. As usual, I recommend that you take a stab at the problem before looking at my solution.

The first step in the solution is to unpivot prescription period start and end dates to separate rows. This is achieved with the following query (the ORDER BY clause is used here just to ensure presentation ordering for clarity):

```SELECT patientid, drugid, type, dt
FROM dbo.PackedPrescriptions
CROSS JOIN(VALUES(1),(-1)) AS ET(type)
CROSS APPLY(VALUES(CASE type WHEN 1 THEN startdate WHEN -1 THEN enddate END)) AS A(dt)
ORDER BY patientid, drugid, dt;```

The query uses a cross join to generate two copies out of each source row from the view. One copy represents start events (marked with type 1), and the other copy represents end events (marked with type -1). Then, a CROSS APPLY operator is used to return the relevant date for the current row (start or end) as the new dt column. Figure 7 shows the output of this query in abbreviated form.

```patientid    drugid       type     dt
-----------  -----------  -----------  ----------
1        1        1        2014-01-01
1        1        -1       2014-01-04
1        1        1        2014-01-07
1        1        -1       2014-01-10
1        2        1        2014-01-01
1        2        -1       2014-01-11
...```

The second step is to compute the count of drugs consumed concurrently per patient and date. This is achieved with the following query:

```SELECT patientid, dt, SUM(type) AS datecnt,
SUM(SUM(type)) OVER(PARTITION BY patientid
ORDER BY dt
ROWS UNBOUNDED PRECEDING) AS cnt
FROM dbo.PackedPrescriptions
CROSS JOIN(VALUES(1),(-1)) AS ET(type)
CROSS APPLY(VALUES(CASE type WHEN 1 THEN startdate WHEN -1 THEN enddate END)) AS A(dt)
GROUP BY patientid, dt
ORDER BY patientid, dt;```

The starting point is the query from the previous step. The unpivoted events are grouped by patient and dt. The grouped SUM aggregate is applied to the type column (remember, 1 is a start event and -1 is an end event) to compute the daily delta. For example, if two periods started and three ended, in total there's a delta of -1 that day. Then the windowed SUM aggregate computes a running total of the grouped SUM to get the balance to date (concurrent drugs consumed that day). Figure 8 shows the output of this step.

```patientid    dt      datecnt      cnt
-----------  ----------  -----------  -----------
1        2014-01-01  2        2
1        2014-01-02  2        4
1        2014-01-04  -1       3
1        2014-01-07  1        4
1        2014-01-10  -3       1
1        2014-01-11  -1       0
1        2014-01-20  1        1
1        2014-01-25  -1       0
1        2014-02-12  1        1
1        2014-02-18  -1       0
1        2014-02-20  1        1
1        2014-02-21  -1       0
1        2014-02-23  1        1
1        2014-02-24  -1       0
1        2014-02-26  1        1
1        2014-02-27  -1       0```

The third step's purpose is to compute two columns called isstart and isend, marking whether a date is a start or end of a period where the patient is at risk. Here's the query implementing the third step:

```DECLARE @cnt AS INT = 3;

WITH C1 AS
(
SELECT patientid, dt, SUM(type) AS datecnt,
SUM(SUM(type)) OVER(PARTITION BY patientid
ORDER BY dt
ROWS UNBOUNDED PRECEDING) AS cnt
FROM dbo.PackedPrescriptions
CROSS JOIN(VALUES(1),(-1)) AS ET(type)
CROSS APPLY(VALUES(CASE type WHEN 1 THEN startdate WHEN -1 THEN enddate END)) AS A(dt)
GROUP BY patientid, dt
)
SELECT *
--  ,(ROW_NUMBER() OVER(PARTITION BY patientid ORDER BY dt) - 1) / 2 + 1 AS grp
FROM C1
CROSS APPLY(VALUES(CASE WHEN cnt >= @cnt AND cnt - datecnt < @cnt THEN 1 ELSE 0 END,
CASE WHEN cnt < @cnt AND cnt - datecnt >= @cnt THEN 1 ELSE 0 END )) AS A(isstart, isend)
--WHERE isstart = 1 OR isend = 1
ORDER BY patientid, dt;```

This query defines a CTE called C1 based on the query implementing the second step. The outer query computes isstart as 1 if both of the following conditions are true:

1. Cnt (count of concurrent drugs consumed that day) is greater than or equal to @cnt (the input count).
2. Cnt in the previous day (cnt - datecnt) was less than @cnt.

Otherwise, the code computes isstart as 0.

Similarly, the code computes isend as 1 if both of the following conditions are true:

1. Cnt is less than or equal to @cnt.
2. Cnt in the next day is greater than or equal to @cnt.

Otherwise, the code computes isend as 0. Figure 9 shows the output of the third step.

```patientid  dt      datecnt  cnt  isstart  isend
---------  ----------  -------  ---  -------  -----
1      2014-01-01  2    2    0    0
1      2014-01-02  2    4    1    0
1      2014-01-04  -1       3    0    0
1      2014-01-07  1    4    0    0
1      2014-01-10  -3       1    0    1
1      2014-01-11  -1       0    0    0
1      2014-01-20  1    1    0    0
1      2014-01-25  -1       0    0    0
1      2014-02-12  1    1    0    0
1      2014-02-18  -1       0    0    0
1      2014-02-20  1    1    0    0
1      2014-02-21  -1       0    0    0
1      2014-02-23  1    1    0    0
1      2014-02-24  -1       0    0    0
1      2014-02-26  1    1    0    0
1      2014-02-27  -1       0    0    0```

Next, step 4 is to filter only start and end events, and compute a target period identifier (grp) for each pair of consecutive start and end events. You might have noticed that the code implementing step 3 already contains the elements implementing step 4 commented out. Uncomment both the filter and the calculation of grp and run the code again, once with @cnt = 3 and again with @cnt = 4. Figure 10 shows the result of this code with @cnt = 3.

```patientid  dt      datecnt  cnt  isstart  isend  grp
---------  ----------  -------  ---  -------  -----  ---
1      2014-01-02  2    4    1    0      1
1      2014-01-10  -3       1    0    1      1```

Figure 11 shows the result of this code with @cnt = 4.

```patientid  dt      datecnt  cnt      isstart  isend  grp
---------  ----------  -------  -----------  -------  -----  ---
1      2014-01-02  2    4        1    0      1
1      2014-01-04  -1       3        0    1      1
1      2014-01-07  1    4        1    0      2
1      2014-01-10  -3       1        0    1      2```

Step 5 completes the solution by pivoting the filtered events to return each period where the patient is at risk in one row. To achieve this you define a CTE called C2 based on the outer query from step 4. Then in the outer query of this step you group the rows from C2 by patientid and grp, and return MIN(dt) as the start date and MAX(dt) as the end date. Listing 4 contains the complete solution code.

```DECLARE @cnt AS INT = 3;

WITH C1 AS
(
SELECT patientid, dt, SUM(type) AS datecnt,
SUM(SUM(type)) OVER(PARTITION BY patientid
ORDER BY dt
ROWS UNBOUNDED PRECEDING) AS cnt
FROM dbo.PackedPrescriptions
CROSS JOIN(VALUES(1),(-1)) AS ET(type)
CROSS APPLY(VALUES(CASE type WHEN 1 THEN startdate WHEN -1 THEN enddate END)) AS A(dt)
GROUP BY patientid, dt
),
C2 AS
(
SELECT *,
(ROW_NUMBER() OVER(PARTITION BY patientid ORDER BY dt) - 1) / 2 + 1 AS grp
FROM C1
CROSS APPLY(VALUES(CASE WHEN cnt >= @cnt AND cnt - datecnt < @cnt THEN 1 ELSE 0 END,
CASE WHEN cnt < @cnt AND cnt - datecnt >= @cnt THEN 1 ELSE 0 END )) AS A(isstart, isend)
WHERE isstart = 1 OR isend = 1
)
SELECT patientid, MIN(dt) AS startdate, MAX(dt) AS enddate
FROM C2
GROUP BY patientid, grp;
GO```

Complete Solution

The complete solution ran for 9 seconds on my system against the large set of sample data. This is pretty good performance, considering the complexity of the task. About 3 seconds can be attributed to the first part of solution (which I covered in "T-SQL Interval Graphs Challenge, Part 1") handling the packing logic. The remaining 6 seconds can be attributed to the second part of the solution (which I covered in this article) identifying the periods at risk. Of course, there's always room for improvement. Perhaps you managed to come up with an even faster solution!