Intervals and Counts, Part 4

This article is the fourth in a series about intervals and counts. The series focuses on temporal intervals representing entities such as sessions, phone calls, and chats; it describes how to handle tasks related to such intervals. Last month I discussed how to compute the maximum count during every fixed interval of time. This month's focus is computing the count of active intervals at the beginning of every fixed interval.

In this article, I use the same sample data that I used in the previous articles. Use the code in Listing 1 to create the Sessions table and fill it with a small set of sample data to validate the correctness of the solutions.

SET NOCOUNT ON;
USE tempdb;

IF OBJECT_ID(N'dbo.Sessions', N'U') IS NOT NULL DROP TABLE dbo.Sessions;
IF OBJECT_ID(N'dbo.Apps', N'U') IS NOT NULL DROP TABLE dbo.Apps;

CREATE TABLE dbo.Apps
(
  app       VARCHAR(10) NOT NULL,
  CONSTRAINT PK_Apps PRIMARY KEY(app)
);

CREATE TABLE dbo.Sessions
(
  keycol    INT     NOT NULL,
  app       VARCHAR(10) NOT NULL,
  starttime DATETIME2(0)   NOT NULL,
  endtime   DATETIME2(0)   NOT NULL,
  CONSTRAINT PK_Sessions PRIMARY KEY(keycol),
  CONSTRAINT CHK_Sessios_et_st CHECK(endtime > starttime)
);

CREATE UNIQUE INDEX idx_start ON dbo.Sessions(app, starttime, keycol) INCLUDE(endtime);
CREATE UNIQUE INDEX idx_end ON dbo.Sessions(app, endtime, keycol) INCLUDE(starttime);

-- Code to fill Sessions with small set of sample data
TRUNCATE TABLE dbo.Sessions;
TRUNCATE TABLE dbo.Apps;

INSERT INTO dbo.Apps(app) VALUES('app1'),('app2'),('app3');

INSERT INTO dbo.Sessions(keycol, app, starttime, endtime) VALUES
  (2,  'app1', '20130212 08:30:00', '20130212 10:30:00'),
  (3,  'app1', '20130212 08:30:00', '20130212 08:45:00'),
  (5,  'app1', '20130212 09:00:00', '20130212 09:30:00'),
  (7,  'app1', '20130212 09:15:00', '20130212 10:30:00'),
  (11, 'app1', '20130212 09:15:00', '20130212 09:30:00'),
  (13, 'app1', '20130212 10:30:00', '20130212 14:30:00'),
  (17, 'app1', '20130212 10:45:00', '20130212 11:30:00'),
  (19, 'app1', '20130212 11:00:00', '20130212 12:30:00'),
  (23, 'app2', '20130212 08:30:00', '20130212 08:45:00'),
  (29, 'app2', '20130212 09:00:00', '20130212 09:30:00'),
  (31, 'app2', '20130212 11:45:00', '20130212 12:00:00'),
  (37, 'app2', '20130212 12:30:00', '20130212 14:00:00'),
  (41, 'app2', '20130212 12:45:00', '20130212 13:30:00'),
  (43, 'app2', '20130212 13:00:00', '20130212 14:00:00'),
  (47, 'app2', '20130212 14:00:00', '20130212 16:30:00'),
  (53, 'app2', '20130212 15:30:00', '20130212 17:00:00'),
  (61, 'app3', '20130212 08:00:00', '20130212 08:30:00'),
  (62, 'app3', '20130212 08:00:00', '20130212 09:00:00'),
  (63, 'app3', '20130212 09:00:00', '20130212 09:30:00'),
  (64, 'app3', '20130212 09:30:00', '20130212 10:00:00');

As in "Intervals and Counts, Part 3," the indexes idx_start and idx_end are slightly different than in "Intervals and Counts, Part 1" and "Intervals and Counts, Part 2." Idx_start is defined on the keylist (app, starttime, keycol), and include list (endtime), and the index idx_end is defined on the keylist (app, endtime, keycol) and the include list (starttime). In "Intervals and Counts, Part 1" and "Intervals and Counts, Part 2," the indexes had the same keylists but no include list. The solutions in both this article and the previous article ("Intervals and Counts, Part 3") need the include lists for coverage.

Run the code in Listing 2 to fill the Sessions table with a large set of sample data to check the performance of the solutions. Listing 2 also creates a helper function called GetNums, which accepts low and high values as inputs and returns a sequence of integers in the input range.

IF OBJECT_ID(N'dbo.GetNums', N'IF') IS NOT NULL DROP FUNCTION dbo.GetNums;
GO

CREATE FUNCTION dbo.GetNums(@low AS BIGINT, @high AS BIGINT) RETURNS TABLE
AS
RETURN
  WITH
    L0   AS (SELECT c FROM (SELECT 1 UNION ALL SELECT 1) AS D(c)),
    L1   AS (SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),
    L2   AS (SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),
    L3   AS (SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),
    L4   AS (SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),
    L5   AS (SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B),
    Nums AS (SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS rownum
        FROM L5)
  SELECT TOP(@high - @low + 1) @low + rownum - 1 AS n
  FROM Nums
  ORDER BY rownum;
GO

-- Code to fill Sessions with large set of sample data
TRUNCATE TABLE dbo.Sessions;
TRUNCATE TABLE dbo.Apps;

DECLARE
  @numrows AS INT = 2000000, -- total number of rows
  @numapps AS INT = 100;     -- number of applications

INSERT INTO dbo.Apps WITH(TABLOCK) (app)
  SELECT 'app' + CAST(n AS VARCHAR(10)) AS app
  FROM dbo.GetNums(1, @numapps) AS Nums;

INSERT INTO dbo.Sessions WITH(TABLOCK)
    (keycol, app, starttime, endtime)
  SELECT
    ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS keycol,
    D.*,
    DATEADD(
  second,
  1 + ABS(CHECKSUM(NEWID())) % (20*60),
  starttime) AS endtime
  FROM
  (
    SELECT
  'app' + CAST(1 + ABS(CHECKSUM(NEWID())) % @numapps AS VARCHAR(10)) AS app,
  DATEADD(
    second,
    1 + ABS(CHECKSUM(NEWID())) % (30*24*60*60),
    '20130101') AS starttime
    FROM dbo.GetNums(1, @numrows) AS Nums
  ) AS D;

The Challenge

The challenge that's the focus of this article is to compute the count of active intervals at the beginning of every fixed interval in a given period. I got this request from a customer who needed to compute the number of chats that were active at the beginning of every fixed interval of time. In our case, you need to compute the count every hour, on the hour. For our purposes, to avoid any confusion, the count should take into consideration all interval end and start events that happened on the hour in question. Figure 1 depicts graphically the small set of sample data and the desired result for the input period starting with '20130212 08:00:00' and ending before '20130212 18:00:00'.

Figure 1: Graph Showing Counts at the Beginnings of Fixed Intervals

Figure 2 shows the desired result from your solution query.

app		ts				cnt
----------	---------------------------	-----------
app1		2013-02-12 08:00:00		0
app1		2013-02-12 09:00:00		2
app1		2013-02-12 10:00:00		2
app1		2013-02-12 11:00:00		3
app1		2013-02-12 12:00:00		2
app1		2013-02-12 13:00:00		1
app1		2013-02-12 14:00:00		1
app1		2013-02-12 15:00:00		0
app1		2013-02-12 16:00:00		0
app1		2013-02-12 17:00:00		0
app2		2013-02-12 08:00:00		0
app2		2013-02-12 09:00:00		1
app2		2013-02-12 10:00:00		0
app2		2013-02-12 11:00:00		0
app2		2013-02-12 12:00:00		0
app2		2013-02-12 13:00:00		3
app2		2013-02-12 14:00:00		1
app2		2013-02-12 15:00:00		1
app2		2013-02-12 16:00:00		2
app2		2013-02-12 17:00:00		0
app3		2013-02-12 08:00:00		2
app3		2013-02-12 09:00:00		1
app3		2013-02-12 10:00:00		0
app3		2013-02-12 11:00:00		0
app3		2013-02-12 12:00:00		0
app3		2013-02-12 13:00:00		0
app3		2013-02-12 14:00:00		0
app3		2013-02-12 15:00:00		0
app3		2013-02-12 16:00:00		0
app3		2013-02-12 17:00:00		0

I'll describe two solutions: one using a join and another using window functions. Both solutions use a helper table called TimeStamps like the one I used in "Intervals and Counts, Part 3." Use the code in Listing 3 to create the TimeStamps table and populate it with a row for every hour in the year 2013.

-- DDL for TimeStamps table
IF OBJECT_ID(N'dbo.TimeStamps', N'U') IS NOT NULL
  DROP TABLE dbo.TimeStamps;

CREATE TABLE dbo.TimeStamps
(
  ts DATETIME2(0) NOT NULL
    CONSTRAINT PK_TimeStamps PRIMARY KEY
);
GO

-- Populate TimeStamps table
DECLARE
  @s AS DATETIME2(0) = '20130101', -- inclusive
  @e AS DATETIME2(0) = '20140101'; -- exclusive

INSERT INTO dbo.TimeStamps WITH (TABLOCK) (ts)
  SELECT DATEADD(hour, n-1, @s) AS ts
  FROM dbo.GetNums(1, DATEDIFF(hour, @s, @e)) AS Nums;
GO

Solution Using Joins

The first solution I'll present is probably the most straightforward way to solve the task, although it isn't the most efficient one. Listing 4 shows the solution code (assuming you run it against the small set of sample data with the input period starting on '20130212 08:00:00' and ending before '20130212 18:00:00').

DECLARE
  @app AS VARCHAR(10) = 'app1',
  @starttime AS DATETIME2(0) = '20130212 08:00:00', -- inclusive
  @endtime   AS DATETIME2(0) = '20130212 18:00:00'; -- exclusive

SELECT A.app, T.ts, COUNT(S.keycol) AS cnt
FROM dbo.Apps AS A
  CROSS JOIN dbo.TimeStamps AS T
  LEFT OUTER JOIN dbo.Sessions AS S
    ON A.app = S.app
    AND T.ts >= S.starttime
    AND T.ts < S.endtime
WHERE ts >= @starttime AND ts < @endtime
GROUP BY A.app, T.ts
ORDER BY A.app, T.ts; -- for presentation purposes

The solution's code performs a cross join between the Apps table (call it A) and the Timestamps table (call it T) to generate a row for every application and start-of-hour. The code then performs a left outer join with the Sessions table (call it S) based on the predicate A.app = S.app AND T.ts >= S.starttime AND T.ts < S.endtime. Each left row representing a distinct combination of application and hour is matched by all sessions with the same applicationthat was active during that start-of-hour. By "active" I mean that the start-of-hour happened on or after the session's start time and before the session's end time.

The code then filters only the rows in the input period, groups the remaining rows by application and timestamp, and finally counts the number of active sessions in each group. The code in Listing 4, run against the small set of sample data, generates the desired result set shown earlier in Figure 2.

To test the performance of the solution, use the code in Listing 5 against the large set of sample data. The input period now starts on January 1, 2013, and ends before February 1, 2013.

DECLARE
  @app AS VARCHAR(10) = 'app1',
  @starttime AS DATETIME2(0) = '20130101',
  @endtime   AS DATETIME2(0) = '20130201';

SELECT A.app, T.ts, COUNT(S.keycol) AS cnt
FROM dbo.Apps AS A
  CROSS JOIN dbo.TimeStamps AS T
  LEFT OUTER JOIN dbo.Sessions AS S
    ON A.app = S.app
    AND T.ts >= S.starttime
    AND T.ts < S.endtime
WHERE ts >= @starttime AND ts < @endtime
GROUP BY A.app, T.ts;

Figure 3 shows the plan for this query (using SQL Sentry's Plan Explorer). This plan is very expensive. The bulk of the cost is in the Hash Match join implementing the left outer join and in the Hash Match aggregate implementing the local aggregate per thread. The reason the join is so expensive is because there are two range predicates involved against two different columns. From the perspective of the Sessions table, the predicates are S.starttime <= T.ts AND S.endtime > T.ts.

Figure 3: Plan for Solution in Listing 5

The code in Listing 5 took four and a half minutes to complete on my machine.

Solution Using Window Functions

The second solution I'll discuss uses window functions and is significantly more efficient than the first solution. Listing 6 shows the solution's code against the small set of sample data, for just one application (app1 in this case), for the input period starting on '20130212 08:00:00' and ending before '20130212 18:00:00'.

DECLARE
  @app AS VARCHAR(10) = 'app1',
  @starttime AS DATETIME2(0) = '20130212 08:00:00', -- inclusive
  @endtime   AS DATETIME2(0) = '20130212 17:00:00'; -- exclusive

WITH C1 AS
(
  SELECT
    endtime AS ts,
    -1 AS increment,
    1  AS ord
  FROM dbo.Sessions
  WHERE app = @app
    AND starttime < @endtime
    AND endtime >= @starttime

  UNION ALL

  SELECT
    starttime AS ts,
    1 AS increment,
    2 AS ord
  FROM dbo.Sessions
  WHERE app = @app
    AND starttime < @endtime
    AND endtime >= @starttime

  UNION ALL

  SELECT
    ts,
    0 AS increment,
    3 AS ord
  FROM dbo.TimeStamps
  WHERE ts >= @starttime AND ts < @endtime
),
C2 AS
(
  SELECT
    ts,
    increment,
    ord,
    SUM(increment) OVER(ORDER BY ts, ord
            ROWS UNBOUNDED PRECEDING) AS cnt
  FROM C1
),
C3 AS
(
  SELECT *,
    CASE
  WHEN increment = 0 THEN LAG(cnt, 1, 0) OVER(ORDER BY ts, ord)
    END AS prv
  FROM C2
)
SELECT ts, prv AS cnt
FROM C3
WHERE increment = 0
ORDER BY ts; -- for presentation purposes

If you read "Intervals and Counts, Part 3," the CTEs C1 and C2 should look familiar to you. These two CTEs are the same as the ones I used in last month's solution. As a reminder, the code in the CTE C1 unifies three sets of events:

End events marked with a -1 increment and 1 as the ordering position (ord).
Start events marked with a +1 increment and 2 as the ordering position.
The start-of-hour timestamps in the input period marked with a 0 (neutral) increment and 3 as the ordering position, ensuring the neutral events are considered after both end and start events that happened at the same time.

The code in the CTE C2 then computes the count of active intervals after each event as the running total sum of the column increment, ordered by the event timestamp (ts) and ordering position (ord). The code in the CTE C3 uses a CASE expression and the LAG function to compute for the neutral events (increment = 0) the count of active intervals. The applicable count on a neutral event is the count after the previous event if one exists and 0 if one doesn't exist. Finally, the outer query filters only neutral events and returns the previous counts as the current applicable counts.

Just as in the previous articles in the series, you can improve parallelism handling by encapsulating the solution in an inline table-valued function and using the APPLY operator against the Apps table. Run the code in Listing 7 to create the function IntervalCounts. The function encapsulates the logic from Listing 6 for an input application and period.

IF OBJECT_ID(N'dbo.IntervalCounts', N'IF') IS NOT NULL DROP FUNCTION dbo.IntervalCounts;
GO
CREATE FUNCTION dbo.IntervalCounts
(
  @app AS VARCHAR(10),
  @starttime AS DATETIME2(0),
  @endtime   AS DATETIME2(0)
) RETURNS TABLE
AS
RETURN
WITH C1 AS
(
  SELECT
    endtime AS ts,
    -1 AS increment,
    1  AS ord
  FROM dbo.Sessions
  WHERE app = @app
    AND starttime < @endtime
    AND endtime >= @starttime

  UNION ALL

  SELECT
    starttime AS ts,
    1 AS increment,
    2 AS ord
  FROM dbo.Sessions
  WHERE app = @app
    AND starttime < @endtime
    AND endtime >= @starttime

  UNION ALL

  SELECT
    ts,
    0 AS increment,
    3 AS ord
  FROM dbo.TimeStamps
  WHERE ts >= @starttime AND ts < @endtime
),
C2 AS
(
  SELECT
    ts,
    increment,
    ord,
    SUM(increment) OVER(ORDER BY ts, ord
            ROWS UNBOUNDED PRECEDING) AS cnt
  FROM C1
),
C3 AS
(
  SELECT *,
    CASE
  WHEN increment = 0 THEN LAG(cnt, 1, 0) OVER(ORDER BY ts, ord)
    END AS prv
  FROM C2
)
SELECT ts, prv AS cnt
FROM C3
WHERE increment = 0;
GO

Use the following query to test the solution for correctness against the small set of sample data:

SELECT A.app, IC.*
FROM dbo.Apps AS A
  CROSS APPLY dbo.IntervalCounts(A.app, '20130212 08:00:00', '20130212 18:00:00') AS IC;

You'll get the desired result shown earlier in Figure 2, although not necessarily in the same presentation order.

Use the following code to test the solution's performance against the large set of sample data:

SELECT A.app, IC.*
FROM dbo.Apps AS A
  CROSS APPLY dbo.IntervalCounts(A.app, '20130101', '20130201') AS IC;

Figure 4 shows the plan for this query.

Figure 4: Plan for Solution in Listing 7

As you can see, this plan is highly efficient. The plan scans the rows from the Apps clustered index. For each application, the plan performs an index seek and an ordered range scan in each of the indexes: idx_start, idx_end, and PK_TimeStamps, scanning only the relevant range for the current application. Based on the ordered range scans, the plan merges the rows and computes all window functions (ROW_NUMBER and LAG functions) without even one explicit sort, which is amazing! This solution took only 11 seconds to complete on my system against the large set of sample data—compared with four and a half minutes for the solution based on joins.

Window Functions to the Rescue Yet Again

In this article, I presented yet another task involving temporal intervals: computing counts of active intervals at the beginnings of fixed intervals in an input period. I covered two solutions: one based on joins and another based on window functions. The solution based on joins is straightforward and would be the likely intuitive attempt for most people; however, it's very inefficient. The solution based on window functions is highly efficient. Time and again, window functions manage to surprise me with their elegance and efficiency. I suspect we haven't seen the last of them, and that people will keep finding new, creative, efficient, and exciting ways to use them in future solutions.

Comments

Plain text