T-SQL Interval Graphs Challenge, Part 1

I recently got a challenge from my friend John Paul Cook, who is a registered nurse and a fellow SQL Server MVP. The challenge involves drug prescriptions modeled as interval graphs. You can find the full details of the challenge in the paper "Counting Drug Exposure in SAS® with Interval Graph Modeling" by Xiaoqiang Wang. In this paper, Wang presents a problem and describes how to solve it with SAS. John's challenge to me was to use T-SQL to find an efficient solution to the problem.

Related: Interval Queries in SQL Server

I get to work on a lot of T-SQL puzzles. I love the process, which often involves many agonizing hours of trying to come up with efficient, elegant solutions. And in some of the cases, when you're really fortunate, you also manage to discover beautiful new patterns that can be used in solving other challenges. Such was the case with this challenge.

I'd like to thank not only John Paul Cook, but also Alejandro Mesa, Paul White, Adam Machanic, Peter Larsson, and Geri Reshef for their insights.

The Challenge

As I mentioned, the challenge involves drug prescriptions modeled as interval graphs. Run the code in Listing 1 to create the Patients and Prescriptions tables and fill them with a small set of sample data. The code in Listing 1 also creates indexes to support the solutions that I'll describe.

SET NOCOUNT ON;
USE tempdb;

IF OBJECT_ID(N'dbo.Prescriptions', N'U') IS NOT NULL
  DROP TABLE dbo.Prescriptions;
IF OBJECT_ID(N'dbo.Patients', N'U') IS NOT NULL
  DROP TABLE dbo.Patients;

CREATE TABLE dbo.Patients
(
  patientid INT NOT NULL,
  CONSTRAINT PK_Patients PRIMARY KEY(patientid)
);

CREATE TABLE dbo.Prescriptions
(
  prescriptionid INT  NOT NULL IDENTITY,
  patientid      INT  NOT NULL,
  drugid     INT  NOT NULL,
  startdate      DATE NOT NULL,
  numdays    INT  NOT NULL,
  enddate AS DATEADD(day, numdays, startdate),
  CONSTRAINT CHK_Prescriptions_ed_sd CHECK(numdays > 0)
);

CREATE UNIQUE CLUSTERED INDEX idx_start
  ON dbo.Prescriptions
   (patientid, drugid, startdate, prescriptionid);

ALTER TABLE dbo.Prescriptions
  ADD CONSTRAINT PK_Prescriptions PRIMARY KEY
    NONCLUSTERED(prescriptionid);

-- code to fill tables with small set of sample data
TRUNCATE TABLE dbo.Prescriptions;
TRUNCATE TABLE dbo.Patients;

INSERT INTO dbo.Patients(patientid) VALUES(1);

INSERT INTO dbo.Prescriptions
  (patientid, drugid, startdate, numdays) VALUES
  (1,  1, '20140101', 3),
  (1,  2, '20140101', 5),
  (1,  3, '20140102', 4),
  (1,  4, '20140102', 5),
  (1,  4, '20140103', 2),
  (1,  2, '20140105', 5),
  (1,  3, '20140106', 4),
  (1,  1, '20140107', 3),
  (1,  4, '20140108', 1),
  (1,  4, '20140120', 4),
  (1,  4, '20140122', 1),
  (1,  5, '20140212', 3),
  (1,  5, '20140215', 3),
  (1,  5, '20140220', 1),
  (1,  5, '20140223', 1),
  (1,  5, '20140226', 1);

You can use the small set of sample data to check the validity of your solutions against desired results that I'll provide. You'll need a larger set of sample data to test the performance of your solutions. The code in Listing 2 creates a helper table called GetNums that returns a sequence of integers in a requested range, then uses this function to populate the tables with a larger set of sample data.

IF OBJECT_ID(N'dbo.GetNums', N'IF') IS NOT NULL
  DROP FUNCTION dbo.GetNums;
GO

CREATE FUNCTION dbo.GetNums(@low AS BIGINT, @high AS BIGINT) RETURNS TABLE
AS
RETURN
  WITH
    L0   AS (SELECT c FROM (SELECT 1 UNION ALL SELECT 1)
       AS D(c)),
    L1   AS (SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),
    L2   AS (SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),
    L3   AS (SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),
    L4   AS (SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),
    L5   AS (SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B),
    Nums AS (SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL))
       AS rownum
        FROM L5)
  SELECT TOP(@high - @low + 1) @low + rownum - 1 AS n
  FROM Nums
  ORDER BY rownum;
GO

DECLARE
  @numpatients      AS INT  = 10000,
  @numdrugs     AS INT  = 10,
  @numprescriptions AS INT  = 10,
  @firststartdate   AS DATE = '20140101',
  @laststartdate    AS DATE = '20141231',
  @maxnumdays       AS INT  = 30;

TRUNCATE TABLE dbo.Prescriptions;
TRUNCATE TABLE dbo.Patients;

INSERT INTO dbo.Patients(patientid)
  SELECT PT.n AS patientid
  FROM dbo.GetNums(1, @numpatients) AS PT;

INSERT INTO dbo.Prescriptions
  (patientid, drugid, startdate, numdays)
  SELECT
    PT.n AS patientid,
    D.n AS drugid,
    DATEADD(day,
  ABS(CHECKSUM(NEWID()))
    % DATEDIFF(day, @firststartdate, @laststartdate),
  @firststartdate ) AS startdate,
    ABS(CHECKSUM(NEWID())) % @maxnumdays + 1 AS numdays
  FROM dbo.GetNums(1, @numpatients) AS PT
    CROSS JOIN dbo.GetNums(1, @numdrugs) AS D
    CROSS JOIN dbo.GetNums(1, @numprescriptions) AS PR;

The Prescriptions table contains information about drug prescriptions for patients. The table contains the following columns: prescriptionid (surrogate key), patientid, drugid, startdate (start of prescription period), and numdays (number of days the prescription covers). The table also defines a computed column called enddate that computes the date on which the prescription runs out (the day following the last consumption day). For example, a 3-day prescription starting on January 1, 2014 (with a number of days of 3) will get an end date of January 4, 2014. The patient consumes the drug on January 1, January 2, and January 3, and then doesn't have a dose for January 4.

The challenge described in "Counting Drug Exposure in SAS® with Interval Graph Modeling" involves two parts:

Packing intersecting prescription periods
Identifying periods in which the patient was exposed to a certain minimum number of drugs

When John sent me the challenge, he said (and I quote), "Researchers are studying drug prescriptions to look for potential adverse outcomes. More bluntly, and I'm really not being overly dramatic, the objective is to keep patients from being killed."

If that's not a motivator to find an efficient solution, I don't know what is!

What's also interesting is that when I started working on this challenge, I thought the packing problem would be the easy part. Little did I know that this part would turn out to be a formidable challenge that would require most of the effort. In this article, I describe the packing problem and provide both a cursor-based solution and a set-based solution. I'll cover the counting problem in my next article.

The Packing Problem

The packing problem involves packing intersecting prescription periods for the same patient and drug. But there's a twist: A patient doesn't consume more than one dose of a drug on the same day even if there are multiple concurrent prescription periods. Instead, the prescription that started later (or, in case of a tie, the prescription with the greater surrogate key value) is pushed forward in time to the date when the prescription that started earlier expires. This is probably easier to comprehend visually.

Figure 1 graphically depicts the input drug prescription periods based on the small set of sample data, as well as arrows representing the packed periods (again, with the end date representing the day following the last consumption day).

Observe the first three prescription periods for Patient 1, Drug 4:

1—startdate: 1/2; numdays: 5
2—startdate: 1/3; numdays: 2
3—startdate: 1/8; numdays: 1

The first prescription starts on January 2 and lasts for 5 days, so it expires on January 7. The second prescription starts before January 7, so it's pushed forward in time to that date. Because the second prescription lasts for 2 days, it expires on January 9. The third prescription starts before January 9, so it's pushed forward to that date. Because the third prescription lasts for 1 day, it expires on January 10. The packed prescription period is represented by the arrow above the individual prescription periods; it starts on January 2 and expires on January 10.

Your task is to write a solution that returns the packed prescription periods for each patient and drug, based on this logic. Figure 2 shows the desired results based on the small set of sample data.

patientid    drugid       startdate     enddate
-----------  -----------  ----------    ----------
1        1        2014-01-01    2014-01-04
1        1        2014-01-07    2014-01-10
1        2        2014-01-01    2014-01-11
1        3        2014-01-02    2014-01-10
1        4        2014-01-02    2014-01-10
1        4        2014-01-20    2014-01-25
1        5        2014-02-12    2014-02-18
1        5        2014-02-20    2014-02-21
1        5        2014-02-23    2014-02-24
1        5        2014-02-26    2014-02-27

I cover both a cursor-based solution and a set-based one. I provide the performance measures I obtained when running the solutions against the large set of sample data, with results discarded in SQL Server Management Studio (SSMS). To discard results after execution, right-click an empty area in the query pane, select Query Options, and select the Discard results after execution and confirm option under Results-Grid. This action will suppress printing the result rows in the grid output.

Cursor-Based Solution

The cursor-based solution is quite straightforward, involving a single pass over the data. Listing 3 contains the complete solution code. The code declares a table variable called @Result where the packed intervals are stored.

SET NOCOUNT ON;

DECLARE @Result AS TABLE
(
  patientid INT  NOT NULL,
  drugid    INT  NOT NULL,
  startdate DATE NOT NULL,
  enddate   DATE NOT NULL
);

DECLARE
  @patientid      AS INT,
  @drugid     AS INT,
  @startdate      AS DATE,
  @numdays    AS INT,
  @sumnumdays     AS INT,
  @prevpatientid  AS INT,
  @prevdrugid     AS INT,
  @firststartdate AS DATE;

DECLARE C CURSOR FAST_FORWARD FOR
  SELECT patientid, drugid, startdate, numdays
  FROM dbo.Prescriptions
  ORDER BY patientid, drugid, startdate, prescriptionid;

OPEN C;

FETCH NEXT FROM C INTO
  @patientid, @drugid, @startdate, @numdays;

SELECT
  @prevpatientid  = @patientid,
  @prevdrugid     = @drugid,
  @firststartdate = @startdate,
  @sumnumdays     = 0;

WHILE @@fetch_status = 0
BEGIN
  IF    @prevpatientid <> @patientid
     OR @prevdrugid    <> @drugid
     OR DATEADD(day, @sumnumdays, @firststartdate)
      < @startdate
  BEGIN
    INSERT INTO @Result(patientid, drugid, startdate, enddate)
  VALUES(@prevpatientid, @prevdrugid, @firststartdate,
    DATEADD(day, @sumnumdays, @firststartdate));

SELECT
  @firststartdate = @startdate,
  @sumnumdays     = 0;
  END

  SELECT
    @sumnumdays    += @numdays,
    @prevpatientid  = @patientid,
    @prevdrugid     = @drugid;

  FETCH NEXT FROM C INTO
    @patientid, @drugid, @startdate, @numdays;
END

IF @sumnumdays > 0
  INSERT INTO @Result(patientid, drugid, startdate, enddate)
    VALUES(@prevpatientid, @prevdrugid, @firststartdate,
  DATEADD(day, @sumnumdays, @firststartdate));

CLOSE C;

DEALLOCATE C;

SELECT * FROM @Result;

The rows from the Prescriptions table are fed to the cursor, sorted first by patientid and drugid (because the calculation is performed for each patient and drug separately), then by startdate and prescriptionid (as a tiebreaker). The query that feeds the data to the cursor is optimized with an ordered scan of the index idx_start, so no sort operation is required.

The code loops through the cursor records one at a time, fetching the current record's values into local variables. The @firststartdate variable holds the start date of the most recent packed interval, and the @sumnumdays variable holds the running total number of prescription days for the most recent packed interval. For each record fetched, the code checks the following condition to know whether to close the current packed interval and start a new one:

IF    @prevpatientid <> @patientid
   OR @prevdrugid    <> @drugid
   OR DATEADD(day, @sumnumdays, @firststartdate) < @startdate

The current packed interval closes when any of the following conditions is met:

Any of the group elements (patientid or drugid) changes.
The start date of the current packed interval (@firststartdate) plus the running total number of prescription days since it started (@sumnumdays) is before the start date of the new prescription interval.

In case of a match, the code closes the new packed interval, writes its info to the table variable, then opens a new packed interval initializing @firststartdate with @startdate and @sumnumdays with 0. If the current packed interval isn't closed (i.e., the new interval should extend it), the code increments @sumnumdays by @numdays.

When the loop is done, the code inserts the last packed interval info into the table variable (if relevant), then closes and deallocates the cursor. The cursor-based solution completed in 31 seconds on my system.

Set-Based Solution

The set-based solution involves computing a group identifier (call it grp) that associates each prescription interval with the packed interval that it belongs to. This is done in two steps, which are illustrated in Figure 3, along with examples that explain why grp is unique per packed interval.

The first step calculates a value called grphelper. This value is computed as the current startdate value minus the running total sum of all numdays values of the intervals so far, exclusive of the current interval. The following query implements this calculation:

SELECT prescriptionid, patientid, drugid, startdate, numdays,
  DATEADD(day,
    - SUM(numdays) OVER(PARTITION BY patientid, drugid
            ORDER BY startdate, prescriptionid
            ROWS UNBOUNDED PRECEDING) + numdays,
    startdate) AS grphelper
FROM dbo.Prescriptions;

Figure 4 shows the output of this query, for the small set of sample data.

prescriptionid  patientid    drugid       startdate    numdays      grphelper
--------------  -----------  -----------  ----------   -----------  ----------
1           1        1        2014-01-01   3        2014-01-01
8           1        1        2014-01-07   3        2014-01-04
2           1        2        2014-01-01   5        2014-01-01
6           1        2        2014-01-05   5        2013-12-31
3           1        3        2014-01-02   4        2014-01-02
7           1        3        2014-01-06   4        2014-01-02
4           1        4        2014-01-02   5        2014-01-02
5           1        4        2014-01-03   2        2013-12-29
9           1        4        2014-01-08   1        2014-01-01
10          1        4        2014-01-20   4        2014-01-12
11          1        4        2014-01-22   1        2014-01-10
12          1        5        2014-02-12   3        2014-02-12
13          1        5        2014-02-15   3        2014-02-12
14          1        5        2014-02-20   1        2014-02-14
15          1        5        2014-02-23   1        2014-02-16
16          1        5        2014-02-26   1        2014-02-18

The second step is calculating grp as the maximum grphelper value so far. The following query adds this computation on top of the previous query represented by the CTE C1:

WITH C1 AS
(
  SELECT prescriptionid, patientid, drugid, startdate, numdays,
    DATEADD(day,
  - SUM(numdays) OVER(PARTITION BY patientid, drugid
              ORDER BY startdate, prescriptionid
              ROWS UNBOUNDED PRECEDING) + numdays,
  startdate) AS grphelper
  FROM dbo.Prescriptions
)
SELECT *,
  MAX(grphelper) OVER(PARTITION BY patientid, drugid
          ORDER BY startdate, prescriptionid
          ROWS UNBOUNDED PRECEDING) AS grp
FROM C1;

Figure 5 shows the output of this query. It contains both the grphelper and grp values.

prescriptionid  patientid  drugid  startdate   numdays  grphelper   grp
--------------  ---------  ------  ----------  -------  ----------  ----------
1           1      1       2014-01-01  3    2014-01-01  2014-01-01
8           1      1       2014-01-07  3    2014-01-04  2014-01-04
2           1      2       2014-01-01  5    2014-01-01  2014-01-01
6           1      2       2014-01-05  5    2013-12-31  2014-01-01
3           1      3       2014-01-02  4    2014-01-02  2014-01-02
7           1      3       2014-01-06  4    2014-01-02  2014-01-02
4           1      4       2014-01-02  5    2014-01-02  2014-01-02
5           1      4       2014-01-03  2    2013-12-29  2014-01-02
9           1      4       2014-01-08  1    2014-01-01  2014-01-02
10          1      4       2014-01-20  4    2014-01-12  2014-01-12
11          1      4       2014-01-22  1    2014-01-10  2014-01-12
12          1      5       2014-02-12  3    2014-02-12  2014-02-12
13          1      5       2014-02-15  3    2014-02-12  2014-02-12
14          1      5       2014-02-20  1    2014-02-14  2014-02-14
15          1      5       2014-02-23  1    2014-02-16  2014-02-16
16          1      5       2014-02-26  1    2014-02-18  2014-02-18

If you examine Figure 3, you'll see four examples, with accompanying explanations leading to the conclusion that grp is unique per packed interval. Notice in the figure that RT represents the running total sum of numdays so far, exclusive of the current interval.

The first two examples deal with the first two intervals for a given patient and drug, with the interval I1 representing the very first interval, and I2 representing the second interval (started at the same time or after I1 started). The last two examples are similar, but with existence of prescriptions before I1 and I2.

The first example assumes that I2 intersects I1. Here, grphelper1 = s1 - RT1 = s1 - 0 = s1.

Because in this example I2 intersects I1, grphelper2 (s2 - d1) <= s1. Therefore, grphelper2 <= grphelper1. And therefore, grp2 = grp1. In other words, when the two intervals intersect, both get the same grp value and therefore belong to the same packed interval.

The second example assumes that I2 starts after I1 expires. Here, grphelper2 (s2 - d1) > grphelper1 (s1). Therefore, grp2 > grp1. In other words, the two intervals will get different grp values and therefore belong to different packed intervals.

The third and fourth examples are similar to the first and second examples, respectively; however, I1 and I2 aren't the first two prescription intervals. What's different in these examples is that the RT values for both I1 and I2 include an extra delta that's equal to the running total of numdays values for the intervals that started before the l1 and l2 intervals.

The third example is when I2 intersects with I1, but there were intervals before I1. You now need to include the delta in your calculation—but since in the calculations you subtract the same delta from both sides, it doesn't change the result. In this example, I2 intersects I1, so grphelper2 (s2 - d1 - delta) <= s1 - delta. Therefore, grphelper2 <= grphelper1. And therefore, grp2 = grp1. In other words, when I2 intersects I1, even if there were previous prescriptions, the grp values of both intervals are still the same, and they therefore belong to the same packed interval.

The fourth example is when I2 is after I1 but with previous existing prescriptions. Here, grphelper2 (s2 - d1 - delta) > grphelper1 (s1 - delta). Therefore, grp2 > grp1. Again, the two intervals get different grp values and therefore belong to different packed intervals.

Now that you have a grp value that uniquely identifies packed intervals within the same patient and drug group, what's left is to group the rows by patientid, drugid, and grp, and compute the start and end of each packed interval. Listing 4 contains the complete solution query.

WITH C1 AS
(
  SELECT prescriptionid, patientid, drugid, startdate, numdays,
    DATEADD(day,
  - SUM(numdays) OVER(PARTITION BY patientid, drugid
              ORDER BY startdate, prescriptionid
              ROWS UNBOUNDED PRECEDING)
  + numdays,
  startdate) AS grphelper
  FROM dbo.Prescriptions
),
C2 AS
(
  SELECT *,
    MAX(grphelper) OVER(PARTITION BY patientid, drugid
            ORDER BY startdate, prescriptionid
            ROWS UNBOUNDED PRECEDING) AS grp
  FROM C1
)
SELECT
  patientid, drugid,
  MIN(startdate) AS startdate,
  DATEADD(day, SUM(numdays), MIN(startdate)) AS enddate
FROM C2
GROUP BY patientid, drugid, grp;

Figure 6 shows the query plan.

What's interesting to observe about the query plan is that the index idx_start is scanned only once in order, and both window function computations rely on this order. None of them require an explicit sort operation. This makes the solution efficient with linear scaling. This solution completed in 4 seconds on my system.

Still, there's potential for improvement in how the optimizer handles the window aggregate functions. At the moment, window aggregate functions with a frame option are optimized with a pair of operators called Window Spool and Stream Aggregate. The Window Spool operator represents the frame of rows, and the Stream Aggregate operator aggregates those rows. When the frame starts with UNBOUNDED PRECEDING, the optimizer identifies the case as a "fast track" case, and instead of writing all applicable rows to the frame and then aggregating them, it takes the previous row's accumulation and adds the current row's value to compute the new running total.

The problem is that SQL Server writes one row to the spool with the aggregate, and another with the detail (remember that window functions aren't supposed to hide the detail like group functions do). There's great potential here to improve the optimization of the special (and very common) case of the frame ROWS UNBOUNDED PRECEDING. This case could theoretically be optimized like the ROW_NUMBER function is: namely, with a Sequence Project operator that computes the result values on the fly, as opposed to the extra work of writing the rows to a Window Spool (a work table) and aggregating them. Such optimization should easily allow a much faster query, probably around 1 second of run time instead of the current 4 seconds. I hope that Microsoft will further enhance the optimization of window functions, to improve the performance of set-based solutions.

What's Next?

In this article I covered the first part of the drug prescriptions puzzle—packing intersecting prescription periods. I presented a slow cursor-based solution that took 31 seconds to run. I then presented a much faster set-based solution that uses window aggregate functions to compute a group identifier for the packed intervals. In my next article, I'll cover the second part of the puzzle—identifying prescription periods during which the same patient was exposed to more than a certain minimum number of drugs.

Comments

Plain text