Calculating Concurrent Sessions, Part 2

A better set-based solution

Last month I presented a task to calculate the maximum number of concurrent sessions for each application. I provided code to create and populate a table called Sessions.

Related: Calculating Concurrent Sessions, Part 1

Web Listing 1 contains the code to create and populate the Sessions table with a small set of rows just to check the correctness of the solutions.

SET NOCOUNT ON;
USE tempdb;

IF OBJECT_ID('dbo.Sessions', 'U') IS NOT NULL DROP TABLE dbo.Sessions;

CREATE TABLE dbo.Sessions
(
  keycol    INT         NOT NULL,
  app       VARCHAR(10) NOT NULL,
  usr       VARCHAR(10) NOT NULL,
  host      VARCHAR(10) NOT NULL,
  starttime DATETIME    NOT NULL,
  endtime   DATETIME    NOT NULL,
  CONSTRAINT PK_Sessions PRIMARY KEY(keycol),
  CHECK(endtime > starttime)
);
GO

CREATE INDEX idx_nc_app_st_et ON dbo.Sessions(app, starttime, endtime);
CREATE INDEX idx_nc_app_et_st ON dbo.Sessions(app, endtime, starttime);

INSERT INTO dbo.Sessions(keycol, app, usr, host, starttime, endtime)
  VALUES(2,  'app1', 'user1', 'host1', '20090212 08:30', '20090212 10:30');
INSERT INTO dbo.Sessions(keycol, app, usr, host, starttime, endtime)
  VALUES(3,  'app1', 'user2', 'host1', '20090212 08:30', '20090212 08:45');
INSERT INTO dbo.Sessions(keycol, app, usr, host, starttime, endtime)
  VALUES(5,  'app1', 'user3', 'host2', '20090212 09:00', '20090212 09:30');
INSERT INTO dbo.Sessions(keycol, app, usr, host, starttime, endtime)
  VALUES(7,  'app1', 'user4', 'host2', '20090212 09:15', '20090212 10:30');
INSERT INTO dbo.Sessions(keycol, app, usr, host, starttime, endtime)
  VALUES(11, 'app1', 'user5', 'host3', '20090212 09:15', '20090212 09:30');
INSERT INTO dbo.Sessions(keycol, app, usr, host, starttime, endtime)
  VALUES(13, 'app1', 'user6', 'host3', '20090212 10:30', '20090212 14:30');
INSERT INTO dbo.Sessions(keycol, app, usr, host, starttime, endtime)
  VALUES(17, 'app1', 'user7', 'host4', '20090212 10:45', '20090212 11:30');
INSERT INTO dbo.Sessions(keycol, app, usr, host, starttime, endtime)
  VALUES(19, 'app1', 'user8', 'host4', '20090212 11:00', '20090212 12:30');
INSERT INTO dbo.Sessions(keycol, app, usr, host, starttime, endtime)
  VALUES(23, 'app2', 'user8', 'host1', '20090212 08:30', '20090212 08:45');
INSERT INTO dbo.Sessions(keycol, app, usr, host, starttime, endtime)
  VALUES(29, 'app2', 'user7', 'host1', '20090212 09:00', '20090212 09:30');
INSERT INTO dbo.Sessions(keycol, app, usr, host, starttime, endtime)
  VALUES(31, 'app2', 'user6', 'host2', '20090212 11:45', '20090212 12:00');
INSERT INTO dbo.Sessions(keycol, app, usr, host, starttime, endtime)
  VALUES(37, 'app2', 'user5', 'host2', '20090212 12:30', '20090212 14:00');
INSERT INTO dbo.Sessions(keycol, app, usr, host, starttime, endtime)
  VALUES(41, 'app2', 'user4', 'host3', '20090212 12:45', '20090212 13:30');
INSERT INTO dbo.Sessions(keycol, app, usr, host, starttime, endtime)
  VALUES(43, 'app2', 'user3', 'host3', '20090212 13:00', '20090212 14:00');
INSERT INTO dbo.Sessions(keycol, app, usr, host, starttime, endtime)
  VALUES(47, 'app2', 'user2', 'host4', '20090212 14:00', '20090212 16:30');
INSERT INTO dbo.Sessions(keycol, app, usr, host, starttime, endtime)
  VALUES(53, 'app2', 'user1', 'host4', '20090212 15:30', '20090212 17:00');
GO

Web Listing 2 contains the code to create a helper table function called GetNums, which returns a table result with a sequence of integers of a requested size.

IF OBJECT_ID('dbo.GetNums', 'IF') IS NOT NULL
  DROP FUNCTION dbo.GetNums;
GO
CREATE FUNCTION dbo.GetNums(@n AS BIGINT) RETURNS TABLE
AS
RETURN
  WITH
  L0   AS(SELECT 1 AS c UNION ALL SELECT 1),
  L1   AS(SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),
  L2   AS(SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),
  L3   AS(SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),
  L4   AS(SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),
  L5   AS(SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B),
  Nums AS(SELECT ROW_NUMBER() OVER(ORDER BY (SELECT 0)) AS n FROM L5)
  SELECT TOP(@n) n FROM Nums ORDER BY n;
GO
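
For example, the following call returns the integers 1 through 5:

-- Quick sanity check of the helper function
SELECT n FROM dbo.GetNums(5);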

Web Listing 3 contains the code to populate the Sessions table with a large set of rows to test the performance of the solutions.

SET NOCOUNT ON;
USE tempdb;

TRUNCATE TABLE dbo.Sessions;

DECLARE @numrows AS INT;
SET @numrows = 10000; -- Change this value according to your needs

INSERT INTO dbo.Sessions WITH(TABLOCK)
    (keycol, app, usr, host, starttime, endtime)
  SELECT
    ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS keycol,
    D.*,
    DATEADD(
      second,
      1 + ABS(CHECKSUM(NEWID())) % (20*60),
      starttime) AS endtime
  FROM
  (
    SELECT
      'app' + CAST(1 + ABS(CHECKSUM(NEWID())) % 10 AS VARCHAR(10)) AS app,
      'user1' AS usr,
      'host1' AS host,
      DATEADD(
        second,
        1 + ABS(CHECKSUM(NEWID())) % (30*24*60*60),
        '20090101') AS starttime
    FROM dbo.GetNums(@numrows) AS Nums
    WHERE n <= @numrows
  ) AS D;
GO

As a reminder, the task at hand involves calculating, for each application, the maximum number of concurrent sessions. That is, for each application, calculate the maximum number of simultaneously active sessions. Recall that in case one session ends at exactly the same point in time that another session starts, you need to implement a rule dictating whether both are considered active at that point. For our purposes, the assumption is that they aren’t. For the small set of sample data given in Web Listing 1, the desired output is shown in Table 1.

Table 1: Desired Output from Solution

app     mx
-----   ---
app1    4
app2    3

Last month I presented two solutions that I’ve been using for years. One is a set-based solution that uses a subquery with a count aggregate. I referred to that solution as the original set-based solution. I explained that the algorithmic complexity of that solution (or rather, the way its execution plan scales) is quadratic. That is, if you increase the number of rows per partition (application) by a factor of f for the same period of time, the run time increases by a factor of f². So beyond very small partitions, the solution doesn’t scale well.
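
To give you a sense of its shape, here’s a reconstruction along those lines (an approximation for illustration; refer to Part 1 for the exact code): for each session, a correlated subquery counts the sessions of the same application that were active at that session’s start time, and the outer query takes the maximum count per application.

-- Reconstruction of the original set-based approach (approximation for
-- illustration; see Part 1 for the exact code). The correlated subquery
-- is what makes the plan scale quadratically with partition size.
SELECT app, MAX(cnt) AS mx
FROM (SELECT S1.app,
        (SELECT COUNT(*)
         FROM dbo.Sessions AS S2
         WHERE S2.app = S1.app
           AND S1.starttime >= S2.starttime
           AND S1.starttime <  S2.endtime) AS cnt
      FROM dbo.Sessions AS S1) AS C
GROUP BY app;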

The second solution that I presented was a cursor-based solution. I explained that the algorithmic complexity of that solution is linear. That is, if the number of rows per partition increases by a factor of f, the run time also increases by a factor of f. The cursor-based solution scales better than the original set-based solution, but it has the obvious disadvantages of cursors related to readability, maintainability, and not being in accord with the relational model.
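
The general shape of such a cursor solution is sketched below (a reconstruction for illustration, not the exact code from Part 1): unify session start and end events as +1 and -1 event types, scan them ordered by application and time, maintain a running count, and track the maximum per application.

-- Sketch of a cursor-based approach (reconstruction for illustration;
-- see Part 1 for the exact code). End events (-1) sort before start
-- events (+1) at the same timestamp, so sessions that meet at a point
-- aren't counted as concurrent.
DECLARE @app AS VARCHAR(10), @prevapp AS VARCHAR(10),
        @event_type AS INT, @cnt AS INT, @mx AS INT;
DECLARE @Result AS TABLE(app VARCHAR(10) NOT NULL, mx INT NOT NULL);

DECLARE C CURSOR FAST_FORWARD FOR
  SELECT app, event_type
  FROM (SELECT app, starttime AS ts, 1 AS event_type FROM dbo.Sessions
        UNION ALL
        SELECT app, endtime, -1 FROM dbo.Sessions) AS E
  ORDER BY app, ts, event_type;

OPEN C;
FETCH NEXT FROM C INTO @app, @event_type;
SELECT @prevapp = @app, @cnt = 0, @mx = 0;

WHILE @@FETCH_STATUS = 0
BEGIN
  IF @app <> @prevapp
  BEGIN
    INSERT INTO @Result(app, mx) VALUES(@prevapp, @mx);
    SELECT @prevapp = @app, @cnt = 0, @mx = 0;
  END;

  SET @cnt = @cnt + @event_type;
  IF @cnt > @mx SET @mx = @cnt;

  FETCH NEXT FROM C INTO @app, @event_type;
END;

IF @prevapp IS NOT NULL
  INSERT INTO @Result(app, mx) VALUES(@prevapp, @mx);

CLOSE C;
DEALLOCATE C;

SELECT app, mx FROM @Result;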

For years I looked for a set-based solution that performs better than the cursor-based solution for all partition sizes—to no avail—until recently. In this article I present a new set-based solution with linear complexity that performs better than the cursor-based solution. I developed the solution based on insights from Darryl Page of the UK, who attended one of my classes in which I gave this problem as an exercise. I also present another set-based solution that’s based on language elements that SQL Server doesn’t yet support (as of SQL Server 2008). Once such support is introduced, the solution is likely to perform better than all others.

New Set-Based Solution

Listing 1 contains the new set-based solution that I recently developed based on Darryl Page’s insights.

-- Part 1
CREATE TABLE #Ends
(
  app       VARCHAR(10) NOT NULL,
  endtime   DATETIME,
  n         BIGINT
);

CREATE CLUSTERED INDEX idx_app_et ON #Ends(app, endtime);

INSERT INTO #Ends(app, endtime, n)
  SELECT app, endtime,
    RANK() OVER(PARTITION BY app ORDER BY endtime) AS n
  FROM dbo.Sessions;

-- Part 2
WITH Counts AS
(
  SELECT S.app, S.starttime,
    ROW_NUMBER() OVER(PARTITION BY S.app ORDER BY S.starttime)
      - A.n + 1 AS cnt
  FROM dbo.Sessions AS S
    CROSS APPLY (SELECT TOP (1) E.n
                 FROM #Ends AS E
                 WHERE E.app = S.app
                   AND E.endtime > S.starttime
                 ORDER BY E.endtime) AS A
)
SELECT app, MAX(cnt) AS mx
FROM Counts
GROUP BY app;

DROP TABLE #Ends;
GO

Figure 1 shows the execution plan for the solution.

Figure 1: Execution plan for new set-based solution

As you can see, the solution consists of two parts. The first part involves creating a temporary table called #Ends with a clustered index on (app, endtime), and populating it with the result of a query against the Sessions table. The query against Sessions retrieves, for each session, the application name (app), session end time (endtime), and a rank partitioned by app and ordered by endtime. The target column holding the rank values is called n. The reason for using the RANK function here rather than ROW_NUMBER has to do with the rule that says if a session ends at the same point in time that another starts, the sessions aren’t considered concurrent. Rank values for multiple sessions with the same end time will be equal to the lowest row number you would get based on the same partitioning and ordering specification. The rank value minus 1 indicates how many sessions against the current application ended before the point in time when the current session ended. This fact is important in the second part of the solution.
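
If you want to see the difference for yourself, the following diagnostic query (for illustration only; it isn’t part of the solution) returns RANK and ROW_NUMBER side by side, and you can observe that sessions sharing an end time all get the lowest row number as their rank:

-- Illustration only: compare RANK and ROW_NUMBER over the same
-- partitioning and ordering; tied end times share the lowest row number
-- under RANK.
SELECT app, endtime,
  RANK()       OVER(PARTITION BY app ORDER BY endtime) AS rnk,
  ROW_NUMBER() OVER(PARTITION BY app ORDER BY endtime) AS rownum
FROM dbo.Sessions
ORDER BY app, endtime;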

The execution plan for the first part of the solution appears in the section called Query 1 in Figure 1. Note that the plan doesn’t involve sorting—neither for the calculation of the rank values nor for populating the target clustered index—which is a key aspect that contributes to the solution’s good performance. An ordered scan of the index created on Sessions(app, endtime, starttime) supports the calculation of the rank values and produces the result in the target clustered index order.

As for the second part of the solution, first consider the query used to define the common table expression (CTE) Counts. The query uses a CROSS APPLY operator to match each current session from the Sessions table with the first session from the #Ends table where #Ends.endtime is greater than Sessions.starttime. You need the rank value from that session (column n). The query then calculates a row number for each session partitioned by app, and ordered by starttime (call it rownum); rownum indicates how many sessions started so far. In case multiple sessions started at the same point in time, the maximum rownum for those sessions will be the correct indication of how many sessions started so far. Now, rownum - (n - 1), which is equal to rownum - n + 1, gives you the number of concurrent sessions at the point in time when the current session started (call it cnt). That is, if you subtract from the row number the rank value of the first session that ended after the current session started and add 1, you get the number of concurrent sessions at the point when the current session started.
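
To follow the arithmetic, you can run the following variation of the CTE query, which exposes rownum and n alongside cnt (for illustration only; it assumes the #Ends table from the first part of Listing 1 still exists):

-- Illustration only: expose the intermediate rownum and n values that
-- combine into cnt (assumes #Ends from Part 1 of Listing 1 exists).
SELECT S.app, S.starttime,
  ROW_NUMBER() OVER(PARTITION BY S.app ORDER BY S.starttime) AS rownum,
  A.n,
  ROW_NUMBER() OVER(PARTITION BY S.app ORDER BY S.starttime) - A.n + 1 AS cnt
FROM dbo.Sessions AS S
  CROSS APPLY (SELECT TOP (1) E.n
               FROM #Ends AS E
               WHERE E.app = S.app
                 AND E.endtime > S.starttime
               ORDER BY E.endtime) AS A
ORDER BY S.app, S.starttime;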

What’s left for the outer query against the CTE Counts to do is simply to group the data by app, and return for each app the max cnt. The execution plan for the second part of the solution appears in the section called Query 2 in Figure 1. First, the index on Sessions(app, starttime, endtime) is scanned in order. For each row (session) returned from this scan, the plan performs an Index Seek operation in the clustered index on the #Ends(app, endtime) table to retrieve the n value from the first session for the current application with an end time greater than the current session’s start time. The cost of this seek operation in terms of I/O is as many page reads as the number of levels in the index (three, in our case). The plan also calculates row numbers (Segment and Sequence Project operators), which are used along with n to calculate cnt. Finally, the plan uses a Stream Aggregate operator that relies on the ordered scan of the index on the Sessions table, to calculate the max cnt per application.

What’s important about the solution is that it has linear complexity due to the constant cost per row. The cost is constant per row in both parts of the solution. In the first part, both the calculation of the rank values and the insertion into the target clustered index are based on the existing ordering of the index on Sessions. In the second part, there’s a constant cost per row, including the portion of the scan of the index against Sessions, the index seek, and the aggregate. The index seek is the most expensive part of the plan: you pay as many reads per row in Sessions as the number of levels in the index (three, in our case). The cursor-based solution is less expensive in terms of I/O cost; however, due to the extra overhead that cursors incur for each record manipulation compared with set-based solutions, the new set-based solution performs better overall than the cursor-based one.

Figure 2 shows benchmark results for the performance of all solutions (the old set-based solution, the cursor-based solution, and the new set-based solution). You can see that the old set-based solution has quadratic complexity, whereas both the cursor-based solution and the new set-based solution have linear complexity. You can also see that the run time for the new set-based solution is about half the run time of the cursor-based one.

Figure 2: Benchmark results

Solution Based on Window Aggregate Functions

Even though you now have a set-based solution that performs better than the cursor-based solution and scales linearly, the I/O cost is still quite high due to the Index Seek operation required per row from the Sessions table. For example, if you have 1,000,000 rows in the table, you pay 3,000,000 reads for the Index Seek operations.
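
If you want to verify the I/O cost on your own system, you can run the second part of Listing 1 with STATISTICS IO enabled (the exact read counts will vary with data volume and index depth):

-- Measure logical reads for the second part of Listing 1 (requires the
-- #Ends table from the first part; read counts will vary).
SET STATISTICS IO ON;

WITH Counts AS
(
  SELECT S.app, S.starttime,
    ROW_NUMBER() OVER(PARTITION BY S.app ORDER BY S.starttime)
      - A.n + 1 AS cnt
  FROM dbo.Sessions AS S
    CROSS APPLY (SELECT TOP (1) E.n
                 FROM #Ends AS E
                 WHERE E.app = S.app
                   AND E.endtime > S.starttime
                 ORDER BY E.endtime) AS A
)
SELECT app, MAX(cnt) AS mx
FROM Counts
GROUP BY app;

SET STATISTICS IO OFF;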

A set-based solution exists that has the potential to outperform all the other solutions; however, this solution relies on standard language elements that aren’t yet implemented in SQL Server (as of SQL Server 2008). Specifically, the solution uses window ordering (the ORDER BY clause) and framing (the ROWS clause) for window aggregate functions.

The logic of the solution that’s based on window aggregate functions is actually the same as the logic for the cursor-based solution from last month. But instead of using a cursor to calculate a running aggregate of the event type (the +1 or -1 representing whether a session starts or ends), you rely on a window aggregate function to calculate the running aggregate. Listing 2 contains the solution (remember that you can’t run it because it isn’t supported yet).

WITH Events AS
(
  SELECT app, starttime AS ts, 1 AS event_type FROM dbo.Sessions
  UNION ALL
  SELECT app, endtime, -1 FROM dbo.Sessions
),
Counts AS
(
  SELECT app,
    SUM(event_type) OVER(PARTITION BY app
                         ORDER BY ts, event_type
                         ROWS BETWEEN UNBOUNDED PRECEDING
                                  AND CURRENT ROW) AS cnt
  FROM Events
)
SELECT app, MAX(cnt) AS mx
FROM Counts
GROUP BY app;

The query used to define the CTE Events is the same as the one used to define the cursor from last month. This query unifies the set of start events of sessions with the set of end events of sessions. A +1 value is assigned as the event type (event_type attribute) for the start of a session, and a -1 is assigned as the event type for the end of a session.

The code then defines a second CTE called Counts that is based on a query against Events. This query uses a SUM OVER aggregate function, calculating the sum of the event_type values, partitioned by app and ordered by ts (timestamp) and event_type. The ROWS clause frames the applicable window for the calculation: within the defined partition and ordering, the frame spans all rows from the beginning of the partition up to and including the current row. The result of this calculation is the number of concurrent sessions at the current timestamp (called cnt). Finally, the outermost query simply groups the rows from the CTE Counts by app and returns the max cnt per application.

We can’t examine the plan for this solution, because SQL Server doesn’t support it yet. If and when such support is introduced, the plan will likely involve an ordered scan of an index on (app, starttime), plus an ordered scan of an index on (app, endtime). In addition, both the calculation of the window aggregate and the grouping will likely rely on that ordering. In other words, it’s a matter of scanning the data twice in index order, with no need for explicit sorting, no expensive cursor overhead, and no expensive Index Seek operations. So this solution is likely to outperform all the other solutions.

Always Seek a Better Solution

Even if you can’t find a good performing set-based solution for a given task, it doesn’t mean that such a solution doesn’t exist. Revisit problems periodically, looking for new ideas and insights that can lead to new, better-performing solutions. Hopefully Microsoft will implement enhancements to the OVER clause, including the missing elements for window aggregate functions. Such enhancements will allow much simpler and more efficient solutions in the future.

Related: Calculating Concurrent Sessions, Part 3
