Solutions to Packing Date and Time Intervals Puzzle

A couple of months ago I posted a puzzle involving packing date and time intervals. You can find the details of the puzzle here, and further clarifications here. I promised to post the solutions to the puzzle by the first day of spring, so here goes…

I provided both a small set of sample data to test the validity of the solution, and a large set to test its performance. Here’s the small set again:

SET NOCOUNT ON;
USE tempdb;

IF OBJECT_ID('dbo.Sessions') IS NOT NULL DROP TABLE dbo.Sessions;

CREATE TABLE dbo.Sessions
(
  id        INT          NOT NULL IDENTITY(1, 1),
  username  VARCHAR(14)  NOT NULL,
  starttime DATETIME2(3) NOT NULL,
  endtime   DATETIME2(3) NOT NULL,
  CONSTRAINT PK_Sessions PRIMARY KEY(id),
  CONSTRAINT CHK_endtime_gteq_starttime
    CHECK (endtime >= starttime)
);

-- indexes
CREATE UNIQUE INDEX idx_user_start_id ON dbo.Sessions(username, starttime, id);
CREATE UNIQUE INDEX idx_user_end_id ON dbo.Sessions(username, endtime, id);

-- sample data (small)
INSERT INTO dbo.Sessions VALUES
  ('User1', '20111201 08:00:00.000', '20111201 08:30:00.000'),
  ('User1', '20111201 08:30:00.000', '20111201 09:00:00.000'),
  ('User1', '20111201 09:00:00.000', '20111201 09:30:00.000'),
  ('User1', '20111201 10:00:00.000', '20111201 11:00:00.000'),
  ('User1', '20111201 10:30:00.000', '20111201 12:00:00.000'),
  ('User1', '20111201 11:30:00.000', '20111201 12:30:00.000'),
  ('User2', '20111201 08:00:00.000', '20111201 10:30:00.000'),
  ('User2', '20111201 08:30:00.000', '20111201 10:00:00.000'),
  ('User2', '20111201 09:00:00.000', '20111201 09:30:00.000'),
  ('User2', '20111201 11:00:00.000', '20111201 11:30:00.000'),
  ('User2', '20111201 11:32:00.000', '20111201 12:00:00.000'),
  ('User2', '20111201 12:04:00.000', '20111201 12:30:00.000'),
  ('User3', '20111201 08:00:00.000', '20111201 09:00:00.000'),
  ('User3', '20111201 08:00:00.000', '20111201 08:30:00.000'),
  ('User3', '20111201 08:30:00.000', '20111201 09:00:00.000'),
  ('User3', '20111201 09:30:00.000', '20111201 09:30:00.000');

You can use the following code to test the portability of your solution on Oracle:

CREATE TABLE dbo.Sessions
(
  id        INT          NOT NULL,
  username  VARCHAR2(14) NOT NULL,
  starttime TIMESTAMP    NOT NULL,
  endtime   TIMESTAMP    NOT NULL,
  CONSTRAINT PK_Sessions PRIMARY KEY(id),
  CONSTRAINT CHK_endtime_gteq_starttime
    CHECK (endtime >= starttime)
);

-- indexes
CREATE UNIQUE INDEX idx_user_start_id ON dbo.Sessions(username, starttime, id);
CREATE UNIQUE INDEX idx_user_end_id ON dbo.Sessions(username, endtime, id);

-- sample data
INSERT INTO dbo.Sessions VALUES(1, 'User1', 
  TO_DATE('20111201 08:00:00',  'YYYYMMDD HH24:MI:SS'), 
  TO_DATE('20111201 08:30:00',  'YYYYMMDD HH24:MI:SS'));
INSERT INTO dbo.Sessions VALUES(2, 'User1', 
  TO_DATE('20111201 08:30:00',  'YYYYMMDD HH24:MI:SS'), 
  TO_DATE('20111201 09:00:00',  'YYYYMMDD HH24:MI:SS'));
INSERT INTO dbo.Sessions VALUES(3, 'User1', 
  TO_DATE('20111201 09:00:00',  'YYYYMMDD HH24:MI:SS'), 
  TO_DATE('20111201 09:30:00',  'YYYYMMDD HH24:MI:SS'));
INSERT INTO dbo.Sessions VALUES(4, 'User1', 
  TO_DATE('20111201 10:00:00',  'YYYYMMDD HH24:MI:SS'), 
  TO_DATE('20111201 11:00:00',  'YYYYMMDD HH24:MI:SS'));
INSERT INTO dbo.Sessions VALUES(5, 'User1', 
  TO_DATE('20111201 10:30:00',  'YYYYMMDD HH24:MI:SS'), 
  TO_DATE('20111201 12:00:00',  'YYYYMMDD HH24:MI:SS'));
INSERT INTO dbo.Sessions VALUES(6, 'User1', 
  TO_DATE('20111201 11:30:00',  'YYYYMMDD HH24:MI:SS'), 
  TO_DATE('20111201 12:30:00',  'YYYYMMDD HH24:MI:SS'));
INSERT INTO dbo.Sessions VALUES(7, 'User2', 
  TO_DATE('20111201 08:00:00',  'YYYYMMDD HH24:MI:SS'), 
  TO_DATE('20111201 10:30:00',  'YYYYMMDD HH24:MI:SS'));
INSERT INTO dbo.Sessions VALUES(8, 'User2', 
  TO_DATE('20111201 08:30:00',  'YYYYMMDD HH24:MI:SS'), 
  TO_DATE('20111201 10:00:00',  'YYYYMMDD HH24:MI:SS'));
INSERT INTO dbo.Sessions VALUES(9, 'User2', 
  TO_DATE('20111201 09:00:00',  'YYYYMMDD HH24:MI:SS'), 
  TO_DATE('20111201 09:30:00',  'YYYYMMDD HH24:MI:SS'));
INSERT INTO dbo.Sessions VALUES(10, 'User2', 
  TO_DATE('20111201 11:00:00',  'YYYYMMDD HH24:MI:SS'), 
  TO_DATE('20111201 11:30:00',  'YYYYMMDD HH24:MI:SS'));
INSERT INTO dbo.Sessions VALUES(11, 'User2', 
  TO_DATE('20111201 11:32:00',  'YYYYMMDD HH24:MI:SS'), 
  TO_DATE('20111201 12:00:00',  'YYYYMMDD HH24:MI:SS'));
INSERT INTO dbo.Sessions VALUES(12, 'User2', 
  TO_DATE('20111201 12:04:00',  'YYYYMMDD HH24:MI:SS'), 
  TO_DATE('20111201 12:30:00',  'YYYYMMDD HH24:MI:SS'));
INSERT INTO dbo.Sessions VALUES(13, 'User3', 
  TO_DATE('20111201 08:00:00',  'YYYYMMDD HH24:MI:SS'), 
  TO_DATE('20111201 09:00:00',  'YYYYMMDD HH24:MI:SS'));
INSERT INTO dbo.Sessions VALUES(14, 'User3', 
  TO_DATE('20111201 08:00:00',  'YYYYMMDD HH24:MI:SS'), 
  TO_DATE('20111201 08:30:00',  'YYYYMMDD HH24:MI:SS'));
INSERT INTO dbo.Sessions VALUES(15, 'User3', 
  TO_DATE('20111201 08:30:00',  'YYYYMMDD HH24:MI:SS'), 
  TO_DATE('20111201 09:00:00',  'YYYYMMDD HH24:MI:SS'));
INSERT INTO dbo.Sessions VALUES(16, 'User3', 
  TO_DATE('20111201 09:30:00',  'YYYYMMDD HH24:MI:SS'), 
  TO_DATE('20111201 09:30:00',  'YYYYMMDD HH24:MI:SS'));

And here’s the large set:

-- helper function GetNums
IF OBJECT_ID('dbo.GetNums', 'IF') IS NOT NULL
  DROP FUNCTION dbo.GetNums;
GO

CREATE FUNCTION dbo.GetNums(@n AS BIGINT) RETURNS TABLE
AS
RETURN
  WITH
    L0   AS(SELECT 1 AS c UNION ALL SELECT 1),
    L1   AS(SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),
    L2   AS(SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),
    L3   AS(SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),
    L4   AS(SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),
    L5   AS(SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B),
    Nums AS(SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS n FROM L5)
  SELECT TOP (@n) n FROM Nums ORDER BY n;
GO

-- code to create and populate the table Users
IF OBJECT_ID('dbo.Users') IS NOT NULL DROP TABLE dbo.Users;

CREATE TABLE dbo.Users
(
  username  VARCHAR(14)  NOT NULL,
  CONSTRAINT PK_Users PRIMARY KEY(username)
);

DECLARE @num_users AS INT = 2000;

INSERT INTO dbo.Users(username)
  SELECT 'User' + RIGHT('000000000' + CAST(U.n AS VARCHAR(10)), 10) AS username
  FROM dbo.GetNums(@num_users) AS U;
GO

-- code to create and populate the table Sessions with 5,000,000 rows
DECLARE 
  @num_users          AS INT          = 1000,
  @intervals_per_user AS INT          = 5000,
  @start_period       AS DATETIME2(3) = '20110101',
  @end_period         AS DATETIME2(3) = '20110107',
  @max_duration_in_ms AS INT  = 3600000; -- 60 nimutes
  
TRUNCATE TABLE dbo.Sessions;

WITH C AS
(
  SELECT 'User' + RIGHT('000000000' + CAST(U.n AS VARCHAR(10)), 10) AS username,
      DATEADD(ms, ABS(CHECKSUM(NEWID())) % 86400000,
        DATEADD(day, ABS(CHECKSUM(NEWID())) % DATEDIFF(day, @start_period, @end_period), @start_period)) AS starttime
  FROM dbo.GetNums(@num_users) AS U
    CROSS JOIN dbo.GetNums(@intervals_per_user) AS I
)
INSERT INTO dbo.Sessions WITH (TABLOCK) (username, starttime, endtime)
  SELECT username, starttime,
    DATEADD(ms, ABS(CHECKSUM(NEWID())) % (@max_duration_in_ms + 1), starttime) AS endtime
  FROM C;

To remind you, your task is to pack each set of intervals that overlap or abut into one contiguous interval. The desired result for the small set of sample data is:

username  starttime               endtime
--------- ----------------------- -----------------------
User1     2011-12-01 08:00:00.000 2011-12-01 09:30:00.000
User1     2011-12-01 10:00:00.000 2011-12-01 12:30:00.000
User2     2011-12-01 08:00:00.000 2011-12-01 10:30:00.000
User2     2011-12-01 11:00:00.000 2011-12-01 11:30:00.000
User2     2011-12-01 11:32:00.000 2011-12-01 12:00:00.000
User2     2011-12-01 12:04:00.000 2011-12-01 12:30:00.000
User3     2011-12-01 08:00:00.000 2011-12-01 09:00:00.000
User3     2011-12-01 09:30:00.000 2011-12-01 09:30:00.000

I’d like to thanks all those who sent solutions. Special thanks go to Alejandro Mesa for helping me verify the validity of my solutions and for their safekeeping. Here I’m going to cover only correct solutions (as far as I could tell). I will include also solutions that didn’t meet all of the puzzle’s requirements, like being standard, set-based solutions. Following are the solutions and their run times as measured on my Alienware M15x laptop with a Core i7 processor (8 logical CPUs), against hot cache.

Itzik Classic: 5,621 Seconds

The first one I want to present is the classic set-based solution that was available for some time:

-- indexes
/*
CREATE INDEX idx_user_start_end ON dbo.Sessions(username, starttime, endtime);
CREATE INDEX idx_user_end_start ON dbo.Sessions(username, endtime, starttime);
*/

-- solution
WITH StartTimes AS
(
  SELECT DISTINCT username, starttime
  FROM dbo.Sessions AS S1
  WHERE NOT EXISTS
    (SELECT * FROM dbo.Sessions AS S2
     WHERE S2.username = S1.username
       AND S2.starttime = S1.starttime)
),
EndTimes AS
(
  SELECT DISTINCT username, endtime
  FROM dbo.Sessions AS S1
  WHERE NOT EXISTS
    (SELECT * FROM dbo.Sessions AS S2
     WHERE S2.username = S1.username
       AND S2.endtime > S1.endtime
       AND S2.starttime = starttime) AS endtime
FROM StartTimes AS S;

The CTE StartTimes isolates packed interval start times using a query that returns all interval start times for which you cannot find any interval by the same user that started before the current interval start and ended on or after the current interval start. The EndTimes CTE isolates packed interval end times using a query that returns all interval end times for which you cannot find any interval by the same user that ended after the current interval end and started on or before the current interval end. The outer query than matches to each packed interval start the nearest packed interval end going forward by returning the minimum end that is greater than or equal to the current start. As you can see, the solution is very slow.

Geri Reshef: 1,690 Seconds

-- indexing
/*
CREATE INDEX idx_user_start_end ON dbo.Sessions(username, starttime, endtime);
CREATE INDEX idx_user_end_start ON dbo.Sessions(username, endtime, starttime);
*/

-- solution
With T1 As
(
  Select UserName, StartTime Time
  From Sessions 

  Union All 

  Select UserName, EndTime 
  From Sessions
), 
T2 As 
(
  Select Row_Number() Over(Partition By UserName Order By Time) Nm, 
		UserName, Time 
  From	T1
), 
T3 As 
(
  Select	A.Nm-Row_Number() Over(Partition By A.UserName Order By A.Time,B.Time) Nm1, 
		A.UserName, 
		A.Time StartTime, 
		B.Time EndTime 
  From	T2 A 
    Inner join T2 B 
	   	 On A.UserName = B.UserName 
	  	And A.Nm=B.Nm - 1 
  Where	Exists 
    ( Select	* 
		  From	Sessions S 
		  Where	S.UserName = A.UserName 
		  	And (S.StartTime  A.Time) ) 
	 	Or A.Time=B.Time
)
Select	UserName, 
		Min(StartTime) StartTime, 
		Max(EndTime) EndTime 
From	T3 
Group By UserName, Nm1 
Order By UserName, StartTime;

Ami Levin: 225 Seconds

-- Indexing
/*
CREATE INDEX idx_user_start_end ON dbo.Sessions(username, starttime, endtime);
*/

-- Solution
SELECT UserName, StartTime, EndTime, 
  ROW_NUMBER() OVER(PARTITION BY UserName ORDER BY StartTime, EndTime) AS Serial
INTO #MySessions
FROM dbo.Sessions;

CREATE INDEX idx_serial_user_start_end ON #MySessions(serial, username, starttime, endtime);

WITH Sessions_Ordered
AS
(
  SELECT UserName, StartTime, EndTime, Serial
  FROM #MySessions
),
Sessions_Groups
AS
(
  SELECT *, 0 AS Grouper
  FROM Sessions_Ordered 
  WHERE Serial = 1
  
  UNION ALL 

  SELECT B.UserName, B.StartTime, 
    CASE WHEN B.EndTime > A.EndTime THEN B.EndTime ELSE A.EndTime END,
    B.Serial, 
    A.Grouper + CASE
                  WHEN A.EndTime

Alejandro Mesa: 130 Seconds

-- indexing
/*
CREATE INDEX idx_user_start_end ON dbo.Sessions(username, starttime, endtime);
*/

-- solution
SELECT
    username,
    starttime,
    endtime,
    ROW_NUMBER() OVER(PARTITION BY username ORDER BY starttime, endtime, id) AS rn
INTO
    #T1
FROM
	dbo.[Sessions]

CREATE UNIQUE CLUSTERED INDEX idx_#T1_id
ON #T1(username ASC, rn ASC);

WITH rs AS (
SELECT
    username,
    starttime,
    endtime,
    rn
FROM
	#T1
WHERE
    rn = 1
    
UNION ALL

SELECT
    PRV.username,
    CASE
    WHEN R.overlaps = 1 THEN PRV.starttime
    ELSE NXT.starttime
    END,
    CASE
    WHEN R.overlaps = 1 AND PRV.endtime >= NXT.endtime THEN PRV.endtime
    ELSE NXT.endtime
    END,
    NXT.rn
FROM
	rs AS PRV
	INNER JOIN
	#T1 AS NXT
	ON NXT.username = PRV.username 
	AND NXT.rn = PRV.rn + 1
	CROSS APPLY
	(
	SELECT
	    CASE 
	    WHEN PRV.starttime = NXT.starttime THEN 1
	    ELSE 0
	    END AS overlaps
	) AS R
)
SELECT
    username,
    starttime AS new_starttime,
    MAX(endtime) AS new_endtime
FROM
	rs
GROUP BY
    username,
    starttime
--ORDER BY
--    username,
--    starttime
OPTION (MAXRECURSION 0);

DROP TABLE #T1;

Alejandro’s description of the solution:

The solution assigns a consecutive number using ROW_NUMBER partitioned by (username) and ordered by (starttime, endtime, id), and dumps it into a table that will be used by a recursive cte. The anchor part select all rows where “rn = 1”, first row for each username, and the recursive part join the anchor with the next row in line “prv.username = nxt.username and nxt.rn = prv.rn + 1”. I also used the apply operator to return a column to identify when previous and next rows overlap, by using (prv.starttime = nxt.starttime). Then in the column list of the recursive part, I push down prv.username, prv.starttime if both rows overlap, and prv.endtime if the rows overlap and prv.endtime is greater than or equal to next.endtime, otherwise nxt.endtime.

The end result will have groups by (username, starttime), so the next step is to group by those columns and pull the maximum endtime.

Stefan: 40 Seconds

-- indexing
/*
CREATE UNIQUE INDEX idx_user_start ON dbo.Sessions(username, starttime, id);
*/

-- solution
if object_id('tempdb..#intervals') is not null drop table #intervals

;with cte1 as 
(
  select *, datediff(hour, 0, starttime)+n-1 as hash
  from sessions
    cross apply dbo.GetNums(datediff(hour, 0, endtime)-datediff(hour, 0, starttime)+1) t
), 
cte2 as 
(
  select distinct username, endtime
  from sessions a
  where not exists(
    select * 
    from cte1 b
    where a.username=b.username 
      and b.starttimea.endtime
      and b.hash=datediff(hour, 0, a.endtime) )
)
select *, row_number() over (partition by username order by endtime) as rn
into #intervals
from cte2

-- add information about the start of each interval by locating the first session after each interval
;with cte1 as 
(
  select a.username, a.endtime, isnull(b.endtime,'19000101') as pe
  from #intervals a
    left join #intervals b
      on a.username=b.username and a.rn=b.rn+1
)
select username, 
  (select min(b.starttime)
   from sessions b 
   where a.username=b.username and b.starttime>a.pe) as starttime,
  endtime
from cte1 a

Stefan’s description of the solution:

Create a table with one row per interval. endtime is the end of each interval. The end of an interval is identified by finding the end of a session that is not overlapped by any other session. To speed up the search for overlapping sessions we only search the sessions that overlap the hour we are searching for.

Peter Larsson (Peso) 1: 32 Seconds

-- indexes
/*
CREATE UNIQUE INDEX idx_user_start ON dbo.Sessions(username, starttime, id);
CREATE UNIQUE INDEX idx_user_end ON dbo.Sessions(username, endtime, id);
*/

-- Create staging table
CREATE TABLE	#Stage
		(
			UserName VARCHAR(14) NOT NULL,
			StartTime DATETIME2(3) NOT NULL,
			EndTime DATETIME2(3) NOT NULL
		)

-- Populate staging table with Gap information
INSERT		#Stage
		(
			UserName,
			StartTime,
			EndTime
		)
SELECT		s.UserName,
		s.StartTime,
		e.EndTime
FROM		(
			SELECT	UserName,
				StartTime,
				DENSE_RANK() OVER (PARTITION BY UserName ORDER BY StartTime) - 1 AS rnStart
			FROM	dbo.[Sessions]
		) AS s
INNER JOIN	(
			SELECT	UserName,
				EndTime,
				DENSE_RANK() OVER (PARTITION BY UserName ORDER BY EndTime) AS rnEnd
			FROM	dbo.[Sessions]
		) AS e ON e.UserName = s.UserName
			AND e.rnEnd = s.rnStart
WHERE		e.EndTime

Peter Larsson (Peso) 2: 23 Seconds

-- indexing
/*
CREATE INDEX idx_user_start_end ON dbo.Sessions(username, starttime, endtime);
*/

-- solution
-- Make sure printed output in each iteration doesn't affect timing
SET NOCOUNT ON

-- Create a control table for holding the Islands found for each UserName
CREATE TABLE	#Control
		(
			UserName VARCHAR(14) NOT NULL,			-- Current UserName
			IslandID INT NOT NULL,				-- Current Island number
			IslandStartTime DATETIME2(3) NOT NULL,		-- Starting time for current Island
			ExtensionStartTime DATETIME2(3) NOT NULL,	-- Latest extension StartTime of current Island
			ExtensionEndTime DATETIME2(3) NOT NULL		-- Latest extension EndTime of current Island
		)

-- Create a helper index to speed up island detection
CREATE UNIQUE NONCLUSTERED INDEX UX_Control ON #Control (UserName, IslandID) INCLUDE (IslandStartTime, ExtensionStartTime, ExtensionEndTime)

-- Insert initial starting Island for each UserName
INSERT		#Control
		(
			UserName,
			IslandID,
			IslandStartTime,
			ExtensionStartTime,
			ExtensionEndTime
		)
SELECT		usr.UserName,
		1 AS IslandID,
		curr.StartTime AS IslandStartTime,
		curr.StartTime AS ExtensionStartTime,
		curr.EndTime AS ExtensionEndTime
FROM		dbo.[Users] AS usr
CROSS APPLY	(
			SELECT TOP(1)	w.StartTime,
					w.EndTime
			FROM		dbo.[Sessions] AS w
			WHERE		w.UserName = usr.UserName
			ORDER BY	w.StartTime
		) AS curr(StartTime, EndTime)

-- Repeat loop when an Island is found or extended, quit when no more Island is found.
WHILE @@ROWCOUNT > 0
	MERGE	#Control AS tgt
	USING	(
			SELECT		ctrl.UserName,
					CASE
						WHEN curr.EndTime IS NULL THEN ctrl.IslandID + 1
						ELSE ctrl.IslandID
					END AS IslandID,
					ctrl.ExtensionEndTime AS ExtensionStartTime,
					curr.EndTime AS ExtensionEndTime,
					nxt.StartTime AS IslandStartTime,
					nxt.EndTime AS IslandEndTime
			FROM		#Control AS ctrl
			-- For every interval starting in current Island extension, get the interval that ends last (beyond extension).
			OUTER APPLY	(
						SELECT	MAX(w.EndTime) AS EndTime
						FROM	dbo.[Sessions] AS w WITH (index = idx_user_start_end)
						WHERE	w.UserName = ctrl.UserName
							AND w.StartTime BETWEEN ctrl.ExtensionStartTime AND ctrl.ExtensionEndTime
							AND w.EndTime > ctrl.ExtensionEndTime
					) AS curr(EndTime)
			-- Check to see if an interval exists that is not an extension to current Island
			OUTER APPLY	(
						SELECT TOP(1)	w.StartTime,
								w.EndTime
						FROM		dbo.[Sessions] AS w
						WHERE		w.UserName = ctrl.UserName
								AND w.StartTime > ctrl.ExtensionEndTime
						ORDER BY	w.StartTime
					) AS nxt(StartTime, EndTime)
			-- Only work with the latest Island for each UserName
			CROSS APPLY	(
						SELECT	MAX(c.IslandID) AS IslandID
						FROM	#Control AS c
						WHERE	c.UserName = ctrl.UserName
					) AS isl(IslandID)
			WHERE		ctrl.IslandID = isl.IslandID
					AND (curr.EndTime IS NOT NULL OR nxt.EndTime IS NOT NULL)
		) AS src ON src.UserName = tgt.UserName
			AND src.IslandID = tgt.IslandID
	WHEN	MATCHED			-- An Island is extended
			THEN	UPDATE
				SET	tgt.ExtensionStartTime = src.ExtensionStartTime,
					tgt.ExtensionEndTime =	src.ExtensionEndTime
	WHEN	NOT MATCHED BY TARGET	-- An new Island is found
			THEN	INSERT	(
						UserName,
						IslandID,
						IslandStartTime,
						ExtensionStartTime,
						ExtensionEndTime
					)
				VALUES	(
						src.UserName,
						src.IslandID,
						src.IslandStartTime,
						src.IslandStartTime,
						src.IslandEndTime
					);

-- Present the result in correct order
SELECT		UserName,
		IslandStartTime AS StartTime,
		ExtensionEndTime AS EndTime
FROM		#Control
ORDER BY	UserName,
		IslandID

-- Clean up
DROP TABLE	#Control

Itzik New 1: 17 Seconds

-- indexes 
/*
CREATE UNIQUE INDEX idx_user_start_id ON dbo.Sessions(username, starttime, id);
CREATE UNIQUE INDEX idx_user_end_id ON dbo.Sessions(username, endtime, id);
*/

WITH C1 AS
-- let e = end ordinals, let s = start ordinals
(
  SELECT id, username, starttime AS ts, +1 AS type, NULL AS e,
    ROW_NUMBER() OVER(PARTITION BY username ORDER BY starttime, id) AS s
  FROM dbo.Sessions

  UNION ALL

  SELECT id, username, endtime AS ts, -1 AS type, 
    ROW_NUMBER() OVER(PARTITION BY username ORDER BY endtime, id) AS e,
    NULL AS s
  FROM dbo.Sessions
),
C2 AS
-- let se = start or end ordinal, namely, how many events (start or end) happened so far
(
  SELECT C1.*, ROW_NUMBER() OVER(PARTITION BY username ORDER BY ts, type DESC, id) AS se
  FROM C1
),
C3 AS
-- For start events, the expression s - (se - s) - 1 represents how many sessions were active
-- just before the current (hence - 1)
--
-- For end events, the expression (se - e) - e represents how many sessions are active
-- right after this one
--
-- The above two expressions are 0 exactly when a group of packed intervals 
-- either starts or ends, respectively
--
-- After filtering only events when a group of packed intervals either starts or ends,
-- group each pair of adjacent start/end events
(
  SELECT username, ts, 
    FLOOR((ROW_NUMBER() OVER(PARTITION BY username ORDER BY ts) - 1) / 2 + 1) AS grpnum
  FROM C2
  WHERE COALESCE(s - (se - s) - 1, (se - e) - e) = 0
)
SELECT username, MIN(ts) AS starttime, max(ts) AS endtime
FROM C3
GROUP BY username, grpnum;

The technique I used in this solution to calculate the number of active sessions was inspired by a similar technique used by Ben Flanaghan to address a problem called Maximum Concurrent Sessions that I covered in the past in the magazine.

Muhammad Al Pasha: 14 Seconds

-- indexes
/*
CREATE INDEX idx_user_start_end ON dbo.Sessions(username, starttime, endtime);
CREATE INDEX idx_user_end_start ON dbo.Sessions(username, endtime, starttime);
*/

IF OBJECT_ID(N'tempdb..#EndTimes') IS NOT NULL
   DROP TABLE #EndTimes;

/*
   Calculate end time of intervals, and generate row numbers per user,
   and save the result to temp table to enhance performance.
*/

CREATE TABLE #EndTimes
(
   rownum INT NOT NULL,
   username VARCHAR(14) NOT NULL,
   endtime DATETIME2(3) NOT NULL
);

INSERT INTO #EndTimes(rownum, username, endtime)
   SELECT ROW_NUMBER() OVER(PARTITION BY S1.username
                                ORDER BY S1.endtime) AS rownum,
          S1.username, S1.endtime
     FROM dbo.Sessions AS S1
          OUTER APPLY
          (SELECT 1 AS is_overlapped
            WHERE EXISTS(SELECT *
                           FROM dbo.Sessions AS S2 WITH (index = idx_user_end_start)
                          WHERE S2.username = S1.username
                            AND S2.starttime  S1.endtime)) AS O2
    WHERE O2.is_overlapped IS NULL
    GROUP BY S1.username, S1.endtime; -- Eliminate duplicates

/*
   Calculate start time of intervals based on the previous interval end time,
   then join the result with end time of intervals based on rownum and username.
*/

WITH StartTimesCTE AS
(
   SELECT ET.rownum, ET.username, ST.starttime
     FROM (-- Use min possible time as the start time of the first interval for each user.
           SELECT 1 AS rownum, username, CAST('00010101 00:00:00.000' AS DATETIME2(3)) AS endtime
             FROM dbo.Users

            UNION ALL

           -- Use the already calculated end time of intervals for the rest of intervals.
           SELECT rownum + 1, username, endtime
             FROM #EndTimes) AS ET
          CROSS APPLY
          (--start time of interval is the min start time greater than the previous interval end time
           SELECT MIN(ST.starttime) AS starttime
             FROM dbo.Sessions AS ST
            WHERE ST.username = ET.username
              AND ST.starttime > ET.endtime) AS ST
)
-- The end result is just matching starttime with endtime of each interval based on row numbers (so it is based on time order)
SELECT ST.username, ST.starttime, ET.endtime
  FROM StartTimesCTE AS ST
       INNER JOIN
       #EndTimes AS ET
       ON ET.rownum = ST.rownum
          AND ET.username = ST.username;

Peter Larsson (Peso) 3: 4 Seconds

-- indexes
/*
CREATE INDEX idx_user_start_end ON dbo.Sessions(username, starttime, endtime);
CREATE INDEX idx_user_end_start ON dbo.Sessions(username, endtime, starttime);
*/

;WITH cteSource(UserName, theTime, theGrp)
AS (
	SELECT		u.UserName,
			u.theTime,
			(DENSE_RANK() OVER (PARTITION BY u.UserName ORDER BY u.theTime) - 1) / 2 AS theGrp	-- Create artificial groups depending
														-- on the times. Omit duplicates.
	FROM		(
				-- Get all usernames and first start time and last endtime
				SELECT		UserName,
						MIN(StartTime) AS minStartTime,
						MAX(EndTime) AS maxEndTIme
				FROM		dbo.Sessions
				GROUP BY	UserName
			) AS usr
			-- This only get the intermediate gaps
	OUTER APPLY	(
				SELECT		s.StartTime,
						e.EndTime
				FROM		(
							-- Get all starttimes sorted
							SELECT	s.StartTime,
								ROW_NUMBER() OVER (ORDER BY s.StartTime) AS SeqID
							FROM	dbo.Sessions AS s
							WHERE	s.UserName = usr.UserName
						) AS s
				INNER JOIN	(
							-- Get all endtimes sorted
							SELECT	s.EndTime,
								ROW_NUMBER() OVER (ORDER BY s.EndTime) + 1 AS SeqID
							FROM	dbo.Sessions AS s
							WHERE	s.UserName = usr.UserName
						) AS e ON e.SeqID = s.SeqID	-- Match previous end time time against this starttime
				WHERE		e.EndTime

Itzik New 2: 3 Seconds

-- indexes 
/*
CREATE UNIQUE INDEX idx_user_start_id ON dbo.Sessions(username, starttime, id);
CREATE UNIQUE INDEX idx_user_end_id ON dbo.Sessions(username, endtime, id);
*/

-- Inline Table Function Encapsulating Logic from Solution Itzik New 1 for Single User
IF OBJECT_ID('dbo.UserIntervals', 'IF') IS NOT NULL DROP FUNCTION dbo.UserIntervals;
GO

CREATE FUNCTION dbo.UserIntervals(@user AS VARCHAR(14)) RETURNS TABLE
AS
RETURN
  WITH C1 AS
  (
    SELECT id, starttime AS ts, +1 AS type, NULL AS e,
      ROW_NUMBER() OVER(ORDER BY starttime, id) AS s
    FROM dbo.Sessions
    WHERE username = @user

    UNION ALL

    SELECT id, endtime AS ts, -1 AS type, 
      ROW_NUMBER() OVER(ORDER BY endtime, id) AS e,
      NULL AS s
    FROM dbo.Sessions
    WHERE username = @user
  ),
  C2 AS
  (
    SELECT C1.*, ROW_NUMBER() OVER(ORDER BY ts, type DESC, id) AS se
    FROM C1
  ),
  C3 AS
  (
    SELECT ts, 
      FLOOR((ROW_NUMBER() OVER(ORDER BY ts) - 1) / 2 + 1) AS grpnum
    FROM C2
    WHERE COALESCE(s - (se - s) - 1, (se - e) - e) = 0
  )
  SELECT MIN(ts) AS starttime, max(ts) AS endtime
  FROM C3
  GROUP BY grpnum;
GO

-- Solution
SELECT U.username, A.starttime, A.endtime
FROM dbo.Users AS U
  CROSS APPLY dbo.UserIntervals(U.username) AS A;

What makes this solution so fast is the efficient use of parallelism. Note though that when testing it in different environments and with different arguments for number of users and intervals, I didn’t always get a parallel plan. If you’re not getting a parallel plan, it could be because the machine you’re using has fewer logical CPUs than 8. Just for test purposes, you can use the SQL Server service startup option -P8, which will cause SQL Server to use 8 schedulers like in an environment with 8 logical CPUs. The -P startup parameter is not an officially documented one, so use it just for test purposes to mimic a machine with a desired number of CPUs, not for production purposes. Also, I noticed that in some machines where I tested this code and didn’t get parallel plans, when changing the sample data to 2,000 users each with 2,500 intervals, instead of 1,000 by 5,000, I got parallel plans in more cases. Either way, this solution is still very fast even when using a serial plan.

For more details, I’m going to cover my new solutions extensively in the March edition of SQJ.

Cheers,

Comments

Plain text