T-SQL Interval Graphs Challenge, Part 2

Complete solution to the drug prescription interval packing problem

Itzik Ben-Gan

April 18, 2014

12 Min Read
prescription drugs

In "T-SQL Interval Graphs Challenge, Part 1," I introduced a puzzle that was originally given to me by John Paul Cook, who is a registered nurse and a fellow SQL Server MVP. The puzzle involves counting drug exposure based on data modeled as an interval graph. You can find a full description of the puzzle in "Counting Drug Exposure in SASĀ® with Interval Graph Modeling" by Xiaoqiang Wang. In this paper, Wang explains the puzzle and how to use SAS to solve it. John's challenge to me was to use a set-based T-SQL solution to solve the puzzle.

The puzzle consists of two parts. I covered the first part of the puzzle in "T-SQL Interval Graphs Challenge, Part 1"; I'll cover the second part of the puzzle in this article. In addition to John, I'd also like to thank Alejandro Mesa, Paul White, Adam Machanic, Peter Larsson, and Geri Reshef for their input.

Last Month's Challenge

The puzzle involves data representing drug prescriptions. Run the code in Listing 1 to create the Prescriptions and Drugs tables and fill them with small sets of sample data to validate your solutions.

SET NOCOUNT ON;USE tempdb;IF OBJECT_ID(N'dbo.Prescriptions', N'U') IS NOT NULL DROP TABLE dbo.Prescriptions;IF OBJECT_ID(N'dbo.Patients', N'U') IS NOT NULL DROP TABLE dbo.Patients;CREATE TABLE dbo.Patients(  patientid INT NOT NULL,  CONSTRAINT PK_Patients PRIMARY KEY(patientid));CREATE TABLE dbo.Prescriptions(  prescriptionid INT  NOT NULL IDENTITY,  patientid      INT  NOT NULL,  drugid     INT  NOT NULL,  startdate      DATE NOT NULL,  numdays    INT  NOT NULL,  enddate AS DATEADD(day, numdays, startdate),  CONSTRAINT CHK_Prescriptions_ed_sd CHECK(numdays > 0));CREATE UNIQUE CLUSTERED INDEX idx_start  ON dbo.Prescriptions(patientid, drugid, startdate, prescriptionid);ALTER TABLE dbo.Prescriptions  ADD CONSTRAINT PK_Prescriptions PRIMARY KEY NONCLUSTERED(prescriptionid);-- code to fill tables with small set of sample dataTRUNCATE TABLE dbo.Prescriptions;TRUNCATE TABLE dbo.Patients;INSERT INTO dbo.Patients(patientid) VALUES(1);INSERT INTO dbo.Prescriptions(patientid, drugid, startdate, numdays) VALUES  (1,  1, '20140101', 3),  (1,  2, '20140101', 5),  (1,  3, '20140102', 4),  (1,  4, '20140102', 5),  (1,  4, '20140103', 2),  (1,  2, '20140105', 5),  (1,  3, '20140106', 4),  (1,  1, '20140107', 3),  (1,  4, '20140108', 1),  (1,  4, '20140120', 4),  (1,  4, '20140122', 1),  (1,  5, '20140212', 3),  (1,  5, '20140215', 3),  (1,  5, '20140220', 1),  (1,  5, '20140223', 1),  (1,  5, '20140226', 1);

Run the code in Listing 2 to create a helper function called GetNums and populate the tables with large sets of sample data to check the performance of your solutions.

IF OBJECT_ID(N'dbo.GetNums', N'IF') IS NOT NULL DROP FUNCTION dbo.GetNums;GOCREATE FUNCTION dbo.GetNums(@low AS BIGINT, @high AS BIGINT) RETURNS TABLEASRETURN  WITH    L0   AS (SELECT c FROM (SELECT 1 UNION ALL SELECT 1) AS D(c)),    L1   AS (SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),    L2   AS (SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),    L3   AS (SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),    L4   AS (SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),    L5   AS (SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B),    Nums AS (SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS rownum        FROM L5)  SELECT TOP(@high - @low + 1) @low + rownum - 1 AS n  FROM Nums  ORDER BY rownum;GODECLARE  @numpatients      AS INT  = 10000,  @numdrugs     AS INT  = 10,  @numprescriptions AS INT  = 10,  @firststartdate   AS DATE = '20140101',  @laststartdate    AS DATE = '20141231',  @maxnumdays       AS INT  = 30;TRUNCATE TABLE dbo.Prescriptions;TRUNCATE TABLE dbo.Patients;INSERT INTO dbo.Patients(patientid)  SELECT PT.n AS patientid  FROM dbo.GetNums(1, @numpatients) AS PT;INSERT INTO dbo.Prescriptions(patientid, drugid, startdate, numdays)  SELECT    PT.n AS patientid,    D.n AS drugid,    DATEADD(day,  ABS(CHECKSUM(NEWID())) % DATEDIFF(day, @firststartdate, @laststartdate),  @firststartdate ) AS startdate,    ABS(CHECKSUM(NEWID())) % @maxnumdays + 1 AS numdays  FROM dbo.GetNums(1, @numpatients) AS PT    CROSS JOIN dbo.GetNums(1, @numdrugs) AS D    CROSS JOIN dbo.GetNums(1, @numprescriptions) AS PR;

The first part of the puzzle (which I covered in "T-SQL Interval Graphs Challenge, Part 1") was to pack the prescriptions of the same patient and drug in a rollover manner. Figure 1 graphically depicts the input and desired results.

The bars represent the input prescription periods and the arrows represent the resulting packed prescriptions.

Figure 2 shows the packed subscriptions as query output.

patientid    drugid       startdate     enddate-----------  -----------  ----------    ----------1        1        2014-01-01    2014-01-041        1        2014-01-07    2014-01-101        2        2014-01-01    2014-01-111        3        2014-01-02    2014-01-101        4        2014-01-02    2014-01-101        4        2014-01-20    2014-01-251        5        2014-02-12    2014-02-181        5        2014-02-20    2014-02-211        5        2014-02-23    2014-02-241        5        2014-02-26    2014-02-27

The set representing the packed prescriptions is used as the input to the second part of the puzzle, which is the focus of this month's article. Run the code in Listing 3 to create a view called PackedPrescriptions, which is based on the solution query in "T-SQL Interval Graphs Challenge, Part 1."

IF OBJECT_ID(N'dbo.PackedPrescriptions', N'V') IS NOT NULL  DROP VIEW dbo.PackedPrescriptions;GOCREATE VIEW dbo.PackedPrescriptionsASWITH C1 AS(  SELECT prescriptionid, patientid, drugid, startdate, numdays,    DATEADD(day,  - SUM(numdays) OVER(PARTITION BY patientid, drugid              ORDER BY startdate, prescriptionid              ROWS UNBOUNDED PRECEDING) + numdays,  startdate) AS grphelper  FROM dbo.Prescriptions),C2 AS(  SELECT patientid, drugid, startdate, numdays,    MAX(grphelper) OVER(PARTITION BY patientid, drugid            ORDER BY startdate, prescriptionid            ROWS UNBOUNDED PRECEDING) AS grp  FROM C1)SELECT  patientid, drugid,  MIN(startdate) AS startdate,  DATEADD(day, SUM(numdays), MIN(startdate)) AS enddateFROM C2GROUP BY patientid, drugid, grp;GO

Recall the way the information about the packed prescriptions is provided. The start date is the first date in each packed period when the patient consumes a drug, and the end date is the date following the last consumption date. For example, patient 1 consumed 10 daily doses of drug 2 starting on January 1, 2014, ending on January 11, 2014. This is what's called a left-closed, right-open interval in mathematics, because it's inclusive of the start but exclusive of the end. The standard notation used for such an interval is [startdate, enddate).

This Month's Challenge

This month's puzzle works with the PackedPrescriptions view as the input data. The task is to identify periods in which the patient is at risk due to exposure to a certain minimum number of drugs (provided as input @cnt) of a certain class. For example, suppose that @cnt = 3. Figure 3 graphically depicts the input prescription periods as arrows and the desired output risk periods as red rectangles.

Figure 4 shows the desired output that your solution should produce.

patientid     startdate    enddate-----------   ----------   ----------1         2014-01-02   2014-01-10

In a similar manner, Figure 5 and Figure 6 show the information assuming @cnt = 4.

patientid     startdate    enddate-----------   ----------   ----------1         2014-01-02   2014-01-041         2014-01-07   2014-01-10

Next, I'll present my solution in steps. As usual, I recommend that you take a stab at the problem before looking at my solution.

The first step in the solution is to unpivot prescription period start and end dates to separate rows. This is achieved with the following query (the ORDER BY clause is used here just to ensure presentation ordering for clarity):

SELECT patientid, drugid, type, dtFROM dbo.PackedPrescriptions  CROSS JOIN(VALUES(1),(-1)) AS ET(type)  CROSS APPLY(VALUES(CASE type WHEN 1 THEN startdate WHEN -1 THEN enddate END)) AS A(dt)ORDER BY patientid, drugid, dt;

The query uses a cross join to generate two copies out of each source row from the view. One copy represents start events (marked with type 1), and the other copy represents end events (marked with type -1). Then, a CROSS APPLY operator is used to return the relevant date for the current row (start or end) as the new dt column. Figure 7 shows the output of this query in abbreviated form.

patientid    drugid       type     dt-----------  -----------  -----------  ----------1        1        1        2014-01-011        1        -1       2014-01-041        1        1        2014-01-071        1        -1       2014-01-101        2        1        2014-01-011        2        -1       2014-01-11...

The second step is to compute the count of drugs consumed concurrently per patient and date. This is achieved with the following query:

SELECT patientid, dt, SUM(type) AS datecnt,  SUM(SUM(type)) OVER(PARTITION BY patientid          ORDER BY dt          ROWS UNBOUNDED PRECEDING) AS cntFROM dbo.PackedPrescriptions  CROSS JOIN(VALUES(1),(-1)) AS ET(type)  CROSS APPLY(VALUES(CASE type WHEN 1 THEN startdate WHEN -1 THEN enddate END)) AS A(dt)GROUP BY patientid, dtORDER BY patientid, dt;

The starting point is the query from the previous step. The unpivoted events are grouped by patient and dt. The grouped SUM aggregate is applied to the type column (remember, 1 is a start event and -1 is an end event) to compute the daily delta. For example, if two periods started and three ended, in total there's a delta of -1 that day. Then the windowed SUM aggregate computes a running total of the grouped SUM to get the balance to date (concurrent drugs consumed that day). Figure 8 shows the output of this step.

patientid    dt      datecnt      cnt-----------  ----------  -----------  -----------1        2014-01-01  2        21        2014-01-02  2        41        2014-01-04  -1       31        2014-01-07  1        41        2014-01-10  -3       11        2014-01-11  -1       01        2014-01-20  1        11        2014-01-25  -1       01        2014-02-12  1        11        2014-02-18  -1       01        2014-02-20  1        11        2014-02-21  -1       01        2014-02-23  1        11        2014-02-24  -1       01        2014-02-26  1        11        2014-02-27  -1       0

The third step's purpose is to compute two columns called isstart and isend, marking whether a date is a start or end of a period where the patient is at risk. Here's the query implementing the third step:

DECLARE @cnt AS INT = 3;WITH C1 AS(  SELECT patientid, dt, SUM(type) AS datecnt,    SUM(SUM(type)) OVER(PARTITION BY patientid            ORDER BY dt            ROWS UNBOUNDED PRECEDING) AS cnt  FROM dbo.PackedPrescriptions    CROSS JOIN(VALUES(1),(-1)) AS ET(type)    CROSS APPLY(VALUES(CASE type WHEN 1 THEN startdate WHEN -1 THEN enddate END)) AS A(dt)  GROUP BY patientid, dt)SELECT *--  ,(ROW_NUMBER() OVER(PARTITION BY patientid ORDER BY dt) - 1) / 2 + 1 AS grpFROM C1  CROSS APPLY(VALUES(CASE WHEN cnt >= @cnt AND cnt - datecnt < @cnt THEN 1 ELSE 0 END,             CASE WHEN cnt < @cnt AND cnt - datecnt >= @cnt THEN 1 ELSE 0 END )) AS A(isstart, isend)--WHERE isstart = 1 OR isend = 1ORDER BY patientid, dt;

This query defines a CTE called C1 based on the query implementing the second step. The outer query computes isstart as 1 if both of the following conditions are true:

  1. Cnt (count of concurrent drugs consumed that day) is greater than or equal to @cnt (the input count).

  2. Cnt in the previous day (cnt - datecnt) was less than @cnt.

Otherwise, the code computes isstart as 0.

Similarly, the code computes isend as 1 if both of the following conditions are true:

  1. Cnt is less than or equal to @cnt.

  2. Cnt in the next day is greater than or equal to @cnt.

Otherwise, the code computes isend as 0. Figure 9 shows the output of the third step.

patientid  dt      datecnt  cnt  isstart  isend---------  ----------  -------  ---  -------  -----1      2014-01-01  2    2    0    01      2014-01-02  2    4    1    01      2014-01-04  -1       3    0    01      2014-01-07  1    4    0    01      2014-01-10  -3       1    0    11      2014-01-11  -1       0    0    01      2014-01-20  1    1    0    01      2014-01-25  -1       0    0    01      2014-02-12  1    1    0    01      2014-02-18  -1       0    0    01      2014-02-20  1    1    0    01      2014-02-21  -1       0    0    01      2014-02-23  1    1    0    01      2014-02-24  -1       0    0    01      2014-02-26  1    1    0    01      2014-02-27  -1       0    0    0

Next, step 4 is to filter only start and end events, and compute a target period identifier (grp) for each pair of consecutive start and end events. You might have noticed that the code implementing step 3 already contains the elements implementing step 4 commented out. Uncomment both the filter and the calculation of grp and run the code again, once with @cnt = 3 and again with @cnt = 4. Figure 10 shows the result of this code with @cnt = 3.

patientid  dt      datecnt  cnt  isstart  isend  grp---------  ----------  -------  ---  -------  -----  ---1      2014-01-02  2    4    1    0      11      2014-01-10  -3       1    0    1      1

Figure 11 shows the result of this code with @cnt = 4.

patientid  dt      datecnt  cnt      isstart  isend  grp---------  ----------  -------  -----------  -------  -----  ---1      2014-01-02  2    4        1    0      11      2014-01-04  -1       3        0    1      11      2014-01-07  1    4        1    0      21      2014-01-10  -3       1        0    1      2

Step 5 completes the solution by pivoting the filtered events to return each period where the patient is at risk in one row. To achieve this you define a CTE called C2 based on the outer query from step 4. Then in the outer query of this step you group the rows from C2 by patientid and grp, and return MIN(dt) as the start date and MAX(dt) as the end date. Listing 4 contains the complete solution code.

DECLARE @cnt AS INT = 3;WITH C1 AS(  SELECT patientid, dt, SUM(type) AS datecnt,    SUM(SUM(type)) OVER(PARTITION BY patientid            ORDER BY dt            ROWS UNBOUNDED PRECEDING) AS cnt  FROM dbo.PackedPrescriptions    CROSS JOIN(VALUES(1),(-1)) AS ET(type)    CROSS APPLY(VALUES(CASE type WHEN 1 THEN startdate WHEN -1 THEN enddate END)) AS A(dt)  GROUP BY patientid, dt),C2 AS(  SELECT *,    (ROW_NUMBER() OVER(PARTITION BY patientid ORDER BY dt) - 1) / 2 + 1 AS grp  FROM C1    CROSS APPLY(VALUES(CASE WHEN cnt >= @cnt AND cnt - datecnt < @cnt THEN 1 ELSE 0 END,           CASE WHEN cnt < @cnt AND cnt - datecnt >= @cnt THEN 1 ELSE 0 END )) AS A(isstart, isend)  WHERE isstart = 1 OR isend = 1)SELECT patientid, MIN(dt) AS startdate, MAX(dt) AS enddateFROM C2GROUP BY patientid, grp;GO

Complete Solution

The complete solution ran for 9 seconds on my system against the large set of sample data. This is pretty good performance, considering the complexity of the task. About 3 seconds can be attributed to the first part of solution (which I covered in "T-SQL Interval Graphs Challenge, Part 1") handling the packing logic. The remaining 6 seconds can be attributed to the second part of the solution (which I covered in this article) identifying the periods at risk. Of course, there's always room for improvement. Perhaps you managed to come up with an even faster solution!

Sign up for the ITPro Today newsletter
Stay on top of the IT universe with commentary, news analysis, how-to's, and tips delivered to your inbox daily.

You May Also Like