A couple of months ago I posted a puzzle involving packing date and time intervals. You can find the details of the puzzle here, and further clarifications here. I promised to post the solutions to the puzzle by the first day of spring, so here goes…
I provided both a small set of sample data to test the validity of the solution, and a large set to test its performance. Here’s the small set again:
SET NOCOUNT ON; USE tempdb; IF OBJECT_ID('dbo.Sessions') IS NOT NULL DROP TABLE dbo.Sessions; CREATE TABLE dbo.Sessions ( id INT NOT NULL IDENTITY(1, 1), username VARCHAR(14) NOT NULL, starttime DATETIME2(3) NOT NULL, endtime DATETIME2(3) NOT NULL, CONSTRAINT PK_Sessions PRIMARY KEY(id), CONSTRAINT CHK_endtime_gteq_starttime CHECK (endtime >= starttime) ); -- indexes CREATE UNIQUE INDEX idx_user_start_id ON dbo.Sessions(username, starttime, id); CREATE UNIQUE INDEX idx_user_end_id ON dbo.Sessions(username, endtime, id); -- sample data (small) INSERT INTO dbo.Sessions VALUES ('User1', '20111201 08:00:00.000', '20111201 08:30:00.000'), ('User1', '20111201 08:30:00.000', '20111201 09:00:00.000'), ('User1', '20111201 09:00:00.000', '20111201 09:30:00.000'), ('User1', '20111201 10:00:00.000', '20111201 11:00:00.000'), ('User1', '20111201 10:30:00.000', '20111201 12:00:00.000'), ('User1', '20111201 11:30:00.000', '20111201 12:30:00.000'), ('User2', '20111201 08:00:00.000', '20111201 10:30:00.000'), ('User2', '20111201 08:30:00.000', '20111201 10:00:00.000'), ('User2', '20111201 09:00:00.000', '20111201 09:30:00.000'), ('User2', '20111201 11:00:00.000', '20111201 11:30:00.000'), ('User2', '20111201 11:32:00.000', '20111201 12:00:00.000'), ('User2', '20111201 12:04:00.000', '20111201 12:30:00.000'), ('User3', '20111201 08:00:00.000', '20111201 09:00:00.000'), ('User3', '20111201 08:00:00.000', '20111201 08:30:00.000'), ('User3', '20111201 08:30:00.000', '20111201 09:00:00.000'), ('User3', '20111201 09:30:00.000', '20111201 09:30:00.000');
You can use the following code to test the portability of your solution on Oracle:
CREATE TABLE dbo.Sessions ( id INT NOT NULL, username VARCHAR2(14) NOT NULL, starttime TIMESTAMP NOT NULL, endtime TIMESTAMP NOT NULL, CONSTRAINT PK_Sessions PRIMARY KEY(id), CONSTRAINT CHK_endtime_gteq_starttime CHECK (endtime >= starttime) ); -- indexes CREATE UNIQUE INDEX idx_user_start_id ON dbo.Sessions(username, starttime, id); CREATE UNIQUE INDEX idx_user_end_id ON dbo.Sessions(username, endtime, id); -- sample data INSERT INTO dbo.Sessions VALUES(1, 'User1', TO_DATE('20111201 08:00:00', 'YYYYMMDD HH24:MI:SS'), TO_DATE('20111201 08:30:00', 'YYYYMMDD HH24:MI:SS')); INSERT INTO dbo.Sessions VALUES(2, 'User1', TO_DATE('20111201 08:30:00', 'YYYYMMDD HH24:MI:SS'), TO_DATE('20111201 09:00:00', 'YYYYMMDD HH24:MI:SS')); INSERT INTO dbo.Sessions VALUES(3, 'User1', TO_DATE('20111201 09:00:00', 'YYYYMMDD HH24:MI:SS'), TO_DATE('20111201 09:30:00', 'YYYYMMDD HH24:MI:SS')); INSERT INTO dbo.Sessions VALUES(4, 'User1', TO_DATE('20111201 10:00:00', 'YYYYMMDD HH24:MI:SS'), TO_DATE('20111201 11:00:00', 'YYYYMMDD HH24:MI:SS')); INSERT INTO dbo.Sessions VALUES(5, 'User1', TO_DATE('20111201 10:30:00', 'YYYYMMDD HH24:MI:SS'), TO_DATE('20111201 12:00:00', 'YYYYMMDD HH24:MI:SS')); INSERT INTO dbo.Sessions VALUES(6, 'User1', TO_DATE('20111201 11:30:00', 'YYYYMMDD HH24:MI:SS'), TO_DATE('20111201 12:30:00', 'YYYYMMDD HH24:MI:SS')); INSERT INTO dbo.Sessions VALUES(7, 'User2', TO_DATE('20111201 08:00:00', 'YYYYMMDD HH24:MI:SS'), TO_DATE('20111201 10:30:00', 'YYYYMMDD HH24:MI:SS')); INSERT INTO dbo.Sessions VALUES(8, 'User2', TO_DATE('20111201 08:30:00', 'YYYYMMDD HH24:MI:SS'), TO_DATE('20111201 10:00:00', 'YYYYMMDD HH24:MI:SS')); INSERT INTO dbo.Sessions VALUES(9, 'User2', TO_DATE('20111201 09:00:00', 'YYYYMMDD HH24:MI:SS'), TO_DATE('20111201 09:30:00', 'YYYYMMDD HH24:MI:SS')); INSERT INTO dbo.Sessions VALUES(10, 'User2', TO_DATE('20111201 11:00:00', 'YYYYMMDD HH24:MI:SS'), TO_DATE('20111201 11:30:00', 'YYYYMMDD HH24:MI:SS')); INSERT INTO dbo.Sessions VALUES(11, 'User2', TO_DATE('20111201 11:32:00', 'YYYYMMDD HH24:MI:SS'), TO_DATE('20111201 12:00:00', 'YYYYMMDD HH24:MI:SS')); INSERT INTO dbo.Sessions VALUES(12, 'User2', TO_DATE('20111201 12:04:00', 'YYYYMMDD HH24:MI:SS'), TO_DATE('20111201 12:30:00', 'YYYYMMDD HH24:MI:SS')); INSERT INTO dbo.Sessions VALUES(13, 'User3', TO_DATE('20111201 08:00:00', 'YYYYMMDD HH24:MI:SS'), TO_DATE('20111201 09:00:00', 'YYYYMMDD HH24:MI:SS')); INSERT INTO dbo.Sessions VALUES(14, 'User3', TO_DATE('20111201 08:00:00', 'YYYYMMDD HH24:MI:SS'), TO_DATE('20111201 08:30:00', 'YYYYMMDD HH24:MI:SS')); INSERT INTO dbo.Sessions VALUES(15, 'User3', TO_DATE('20111201 08:30:00', 'YYYYMMDD HH24:MI:SS'), TO_DATE('20111201 09:00:00', 'YYYYMMDD HH24:MI:SS')); INSERT INTO dbo.Sessions VALUES(16, 'User3', TO_DATE('20111201 09:30:00', 'YYYYMMDD HH24:MI:SS'), TO_DATE('20111201 09:30:00', 'YYYYMMDD HH24:MI:SS'));
And here’s the large set:
-- helper function GetNums IF OBJECT_ID('dbo.GetNums', 'IF') IS NOT NULL DROP FUNCTION dbo.GetNums; GO CREATE FUNCTION dbo.GetNums(@n AS BIGINT) RETURNS TABLE AS RETURN WITH L0 AS(SELECT 1 AS c UNION ALL SELECT 1), L1 AS(SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B), L2 AS(SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B), L3 AS(SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B), L4 AS(SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B), L5 AS(SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B), Nums AS(SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS n FROM L5) SELECT TOP (@n) n FROM Nums ORDER BY n; GO -- code to create and populate the table Users IF OBJECT_ID('dbo.Users') IS NOT NULL DROP TABLE dbo.Users; CREATE TABLE dbo.Users ( username VARCHAR(14) NOT NULL, CONSTRAINT PK_Users PRIMARY KEY(username) ); DECLARE @num_users AS INT = 2000; INSERT INTO dbo.Users(username) SELECT 'User' + RIGHT('000000000' + CAST(U.n AS VARCHAR(10)), 10) AS username FROM dbo.GetNums(@num_users) AS U; GO -- code to create and populate the table Sessions with 5,000,000 rows DECLARE @num_users AS INT = 1000, @intervals_per_user AS INT = 5000, @start_period AS DATETIME2(3) = '20110101', @end_period AS DATETIME2(3) = '20110107', @max_duration_in_ms AS INT = 3600000; -- 60 nimutes TRUNCATE TABLE dbo.Sessions; WITH C AS ( SELECT 'User' + RIGHT('000000000' + CAST(U.n AS VARCHAR(10)), 10) AS username, DATEADD(ms, ABS(CHECKSUM(NEWID())) % 86400000, DATEADD(day, ABS(CHECKSUM(NEWID())) % DATEDIFF(day, @start_period, @end_period), @start_period)) AS starttime FROM dbo.GetNums(@num_users) AS U CROSS JOIN dbo.GetNums(@intervals_per_user) AS I ) INSERT INTO dbo.Sessions WITH (TABLOCK) (username, starttime, endtime) SELECT username, starttime, DATEADD(ms, ABS(CHECKSUM(NEWID())) % (@max_duration_in_ms + 1), starttime) AS endtime FROM C;
To remind you, your task is to pack each set of intervals that overlap or abut into one contiguous interval. The desired result for the small set of sample data is:
username starttime endtime --------- ----------------------- ----------------------- User1 2011-12-01 08:00:00.000 2011-12-01 09:30:00.000 User1 2011-12-01 10:00:00.000 2011-12-01 12:30:00.000 User2 2011-12-01 08:00:00.000 2011-12-01 10:30:00.000 User2 2011-12-01 11:00:00.000 2011-12-01 11:30:00.000 User2 2011-12-01 11:32:00.000 2011-12-01 12:00:00.000 User2 2011-12-01 12:04:00.000 2011-12-01 12:30:00.000 User3 2011-12-01 08:00:00.000 2011-12-01 09:00:00.000 User3 2011-12-01 09:30:00.000 2011-12-01 09:30:00.000
I’d like to thanks all those who sent solutions. Special thanks go to Alejandro Mesa for helping me verify the validity of my solutions and for their safekeeping. Here I’m going to cover only correct solutions (as far as I could tell). I will include also solutions that didn’t meet all of the puzzle’s requirements, like being standard, set-based solutions. Following are the solutions and their run times as measured on my Alienware M15x laptop with a Core i7 processor (8 logical CPUs), against hot cache.
Itzik Classic: 5,621 Seconds
The first one I want to present is the classic set-based solution that was available for some time:
-- indexes /* CREATE INDEX idx_user_start_end ON dbo.Sessions(username, starttime, endtime); CREATE INDEX idx_user_end_start ON dbo.Sessions(username, endtime, starttime); */ -- solution WITH StartTimes AS ( SELECT DISTINCT username, starttime FROM dbo.Sessions AS S1 WHERE NOT EXISTS (SELECT * FROM dbo.Sessions AS S2 WHERE S2.username = S1.username AND S2.starttime = S1.starttime) ), EndTimes AS ( SELECT DISTINCT username, endtime FROM dbo.Sessions AS S1 WHERE NOT EXISTS (SELECT * FROM dbo.Sessions AS S2 WHERE S2.username = S1.username AND S2.endtime > S1.endtime AND S2.starttime = starttime) AS endtime FROM StartTimes AS S;
The CTE StartTimes isolates packed interval start times using a query that returns all interval start times for which you cannot find any interval by the same user that started before the current interval start and ended on or after the current interval start. The EndTimes CTE isolates packed interval end times using a query that returns all interval end times for which you cannot find any interval by the same user that ended after the current interval end and started on or before the current interval end. The outer query than matches to each packed interval start the nearest packed interval end going forward by returning the minimum end that is greater than or equal to the current start. As you can see, the solution is very slow.
Geri Reshef: 1,690 Seconds
-- indexing /* CREATE INDEX idx_user_start_end ON dbo.Sessions(username, starttime, endtime); CREATE INDEX idx_user_end_start ON dbo.Sessions(username, endtime, starttime); */ -- solution With T1 As ( Select UserName, StartTime Time From Sessions Union All Select UserName, EndTime From Sessions ), T2 As ( Select Row_Number() Over(Partition By UserName Order By Time) Nm, UserName, Time From T1 ), T3 As ( Select A.Nm-Row_Number() Over(Partition By A.UserName Order By A.Time,B.Time) Nm1, A.UserName, A.Time StartTime, B.Time EndTime From T2 A Inner join T2 B On A.UserName = B.UserName And A.Nm=B.Nm - 1 Where Exists ( Select * From Sessions S Where S.UserName = A.UserName And (S.StartTime A.Time) ) Or A.Time=B.Time ) Select UserName, Min(StartTime) StartTime, Max(EndTime) EndTime From T3 Group By UserName, Nm1 Order By UserName, StartTime;
Ami Levin: 225 Seconds
-- Indexing /* CREATE INDEX idx_user_start_end ON dbo.Sessions(username, starttime, endtime); */ -- Solution SELECT UserName, StartTime, EndTime, ROW_NUMBER() OVER(PARTITION BY UserName ORDER BY StartTime, EndTime) AS Serial INTO #MySessions FROM dbo.Sessions; CREATE INDEX idx_serial_user_start_end ON #MySessions(serial, username, starttime, endtime); WITH Sessions_Ordered AS ( SELECT UserName, StartTime, EndTime, Serial FROM #MySessions ), Sessions_Groups AS ( SELECT *, 0 AS Grouper FROM Sessions_Ordered WHERE Serial = 1 UNION ALL SELECT B.UserName, B.StartTime, CASE WHEN B.EndTime > A.EndTime THEN B.EndTime ELSE A.EndTime END, B.Serial, A.Grouper + CASE WHEN A.EndTime
Alejandro Mesa: 130 Seconds
-- indexing /* CREATE INDEX idx_user_start_end ON dbo.Sessions(username, starttime, endtime); */ -- solution SELECT username, starttime, endtime, ROW_NUMBER() OVER(PARTITION BY username ORDER BY starttime, endtime, id) AS rn INTO #T1 FROM dbo.[Sessions] CREATE UNIQUE CLUSTERED INDEX idx_#T1_id ON #T1(username ASC, rn ASC); WITH rs AS ( SELECT username, starttime, endtime, rn FROM #T1 WHERE rn = 1 UNION ALL SELECT PRV.username, CASE WHEN R.overlaps = 1 THEN PRV.starttime ELSE NXT.starttime END, CASE WHEN R.overlaps = 1 AND PRV.endtime >= NXT.endtime THEN PRV.endtime ELSE NXT.endtime END, NXT.rn FROM rs AS PRV INNER JOIN #T1 AS NXT ON NXT.username = PRV.username AND NXT.rn = PRV.rn + 1 CROSS APPLY ( SELECT CASE WHEN PRV.starttime = NXT.starttime THEN 1 ELSE 0 END AS overlaps ) AS R ) SELECT username, starttime AS new_starttime, MAX(endtime) AS new_endtime FROM rs GROUP BY username, starttime --ORDER BY -- username, -- starttime OPTION (MAXRECURSION 0); DROP TABLE #T1;
Alejandro’s description of the solution:
The solution assigns a consecutive number using ROW_NUMBER partitioned by (username) and ordered by (starttime, endtime, id), and dumps it into a table that will be used by a recursive cte. The anchor part select all rows where “rn = 1”, first row for each username, and the recursive part join the anchor with the next row in line “prv.username = nxt.username and nxt.rn = prv.rn + 1”. I also used the apply operator to return a column to identify when previous and next rows overlap, by using (prv.starttime = nxt.starttime). Then in the column list of the recursive part, I push down prv.username, prv.starttime if both rows overlap, and prv.endtime if the rows overlap and prv.endtime is greater than or equal to next.endtime, otherwise nxt.endtime.
The end result will have groups by (username, starttime), so the next step is to group by those columns and pull the maximum endtime.
Stefan: 40 Seconds
-- indexing /* CREATE UNIQUE INDEX idx_user_start ON dbo.Sessions(username, starttime, id); */ -- solution if object_id('tempdb..#intervals') is not null drop table #intervals ;with cte1 as ( select *, datediff(hour, 0, starttime)+n-1 as hash from sessions cross apply dbo.GetNums(datediff(hour, 0, endtime)-datediff(hour, 0, starttime)+1) t ), cte2 as ( select distinct username, endtime from sessions a where not exists( select * from cte1 b where a.username=b.username and b.starttimea.endtime and b.hash=datediff(hour, 0, a.endtime) ) ) select *, row_number() over (partition by username order by endtime) as rn into #intervals from cte2 -- add information about the start of each interval by locating the first session after each interval ;with cte1 as ( select a.username, a.endtime, isnull(b.endtime,'19000101') as pe from #intervals a left join #intervals b on a.username=b.username and a.rn=b.rn+1 ) select username, (select min(b.starttime) from sessions b where a.username=b.username and b.starttime>a.pe) as starttime, endtime from cte1 a
Stefan’s description of the solution:
Create a table with one row per interval. endtime is the end of each interval. The end of an interval is identified by finding the end of a session that is not overlapped by any other session. To speed up the search for overlapping sessions we only search the sessions that overlap the hour we are searching for.
Peter Larsson (Peso) 1: 32 Seconds
-- indexes /* CREATE UNIQUE INDEX idx_user_start ON dbo.Sessions(username, starttime, id); CREATE UNIQUE INDEX idx_user_end ON dbo.Sessions(username, endtime, id); */ -- Create staging table CREATE TABLE #Stage ( UserName VARCHAR(14) NOT NULL, StartTime DATETIME2(3) NOT NULL, EndTime DATETIME2(3) NOT NULL ) -- Populate staging table with Gap information INSERT #Stage ( UserName, StartTime, EndTime ) SELECT s.UserName, s.StartTime, e.EndTime FROM ( SELECT UserName, StartTime, DENSE_RANK() OVER (PARTITION BY UserName ORDER BY StartTime) - 1 AS rnStart FROM dbo.[Sessions] ) AS s INNER JOIN ( SELECT UserName, EndTime, DENSE_RANK() OVER (PARTITION BY UserName ORDER BY EndTime) AS rnEnd FROM dbo.[Sessions] ) AS e ON e.UserName = s.UserName AND e.rnEnd = s.rnStart WHERE e.EndTime
Peter Larsson (Peso) 2: 23 Seconds
-- indexing /* CREATE INDEX idx_user_start_end ON dbo.Sessions(username, starttime, endtime); */ -- solution -- Make sure printed output in each iteration doesn't affect timing SET NOCOUNT ON -- Create a control table for holding the Islands found for each UserName CREATE TABLE #Control ( UserName VARCHAR(14) NOT NULL, -- Current UserName IslandID INT NOT NULL, -- Current Island number IslandStartTime DATETIME2(3) NOT NULL, -- Starting time for current Island ExtensionStartTime DATETIME2(3) NOT NULL, -- Latest extension StartTime of current Island ExtensionEndTime DATETIME2(3) NOT NULL -- Latest extension EndTime of current Island ) -- Create a helper index to speed up island detection CREATE UNIQUE NONCLUSTERED INDEX UX_Control ON #Control (UserName, IslandID) INCLUDE (IslandStartTime, ExtensionStartTime, ExtensionEndTime) -- Insert initial starting Island for each UserName INSERT #Control ( UserName, IslandID, IslandStartTime, ExtensionStartTime, ExtensionEndTime ) SELECT usr.UserName, 1 AS IslandID, curr.StartTime AS IslandStartTime, curr.StartTime AS ExtensionStartTime, curr.EndTime AS ExtensionEndTime FROM dbo.[Users] AS usr CROSS APPLY ( SELECT TOP(1) w.StartTime, w.EndTime FROM dbo.[Sessions] AS w WHERE w.UserName = usr.UserName ORDER BY w.StartTime ) AS curr(StartTime, EndTime) -- Repeat loop when an Island is found or extended, quit when no more Island is found. WHILE @@ROWCOUNT > 0 MERGE #Control AS tgt USING ( SELECT ctrl.UserName, CASE WHEN curr.EndTime IS NULL THEN ctrl.IslandID + 1 ELSE ctrl.IslandID END AS IslandID, ctrl.ExtensionEndTime AS ExtensionStartTime, curr.EndTime AS ExtensionEndTime, nxt.StartTime AS IslandStartTime, nxt.EndTime AS IslandEndTime FROM #Control AS ctrl -- For every interval starting in current Island extension, get the interval that ends last (beyond extension). OUTER APPLY ( SELECT MAX(w.EndTime) AS EndTime FROM dbo.[Sessions] AS w WITH (index = idx_user_start_end) WHERE w.UserName = ctrl.UserName AND w.StartTime BETWEEN ctrl.ExtensionStartTime AND ctrl.ExtensionEndTime AND w.EndTime > ctrl.ExtensionEndTime ) AS curr(EndTime) -- Check to see if an interval exists that is not an extension to current Island OUTER APPLY ( SELECT TOP(1) w.StartTime, w.EndTime FROM dbo.[Sessions] AS w WHERE w.UserName = ctrl.UserName AND w.StartTime > ctrl.ExtensionEndTime ORDER BY w.StartTime ) AS nxt(StartTime, EndTime) -- Only work with the latest Island for each UserName CROSS APPLY ( SELECT MAX(c.IslandID) AS IslandID FROM #Control AS c WHERE c.UserName = ctrl.UserName ) AS isl(IslandID) WHERE ctrl.IslandID = isl.IslandID AND (curr.EndTime IS NOT NULL OR nxt.EndTime IS NOT NULL) ) AS src ON src.UserName = tgt.UserName AND src.IslandID = tgt.IslandID WHEN MATCHED -- An Island is extended THEN UPDATE SET tgt.ExtensionStartTime = src.ExtensionStartTime, tgt.ExtensionEndTime = src.ExtensionEndTime WHEN NOT MATCHED BY TARGET -- An new Island is found THEN INSERT ( UserName, IslandID, IslandStartTime, ExtensionStartTime, ExtensionEndTime ) VALUES ( src.UserName, src.IslandID, src.IslandStartTime, src.IslandStartTime, src.IslandEndTime ); -- Present the result in correct order SELECT UserName, IslandStartTime AS StartTime, ExtensionEndTime AS EndTime FROM #Control ORDER BY UserName, IslandID -- Clean up DROP TABLE #Control
Itzik New 1: 17 Seconds
-- indexes /* CREATE UNIQUE INDEX idx_user_start_id ON dbo.Sessions(username, starttime, id); CREATE UNIQUE INDEX idx_user_end_id ON dbo.Sessions(username, endtime, id); */ WITH C1 AS -- let e = end ordinals, let s = start ordinals ( SELECT id, username, starttime AS ts, +1 AS type, NULL AS e, ROW_NUMBER() OVER(PARTITION BY username ORDER BY starttime, id) AS s FROM dbo.Sessions UNION ALL SELECT id, username, endtime AS ts, -1 AS type, ROW_NUMBER() OVER(PARTITION BY username ORDER BY endtime, id) AS e, NULL AS s FROM dbo.Sessions ), C2 AS -- let se = start or end ordinal, namely, how many events (start or end) happened so far ( SELECT C1.*, ROW_NUMBER() OVER(PARTITION BY username ORDER BY ts, type DESC, id) AS se FROM C1 ), C3 AS -- For start events, the expression s - (se - s) - 1 represents how many sessions were active -- just before the current (hence - 1) -- -- For end events, the expression (se - e) - e represents how many sessions are active -- right after this one -- -- The above two expressions are 0 exactly when a group of packed intervals -- either starts or ends, respectively -- -- After filtering only events when a group of packed intervals either starts or ends, -- group each pair of adjacent start/end events ( SELECT username, ts, FLOOR((ROW_NUMBER() OVER(PARTITION BY username ORDER BY ts) - 1) / 2 + 1) AS grpnum FROM C2 WHERE COALESCE(s - (se - s) - 1, (se - e) - e) = 0 ) SELECT username, MIN(ts) AS starttime, max(ts) AS endtime FROM C3 GROUP BY username, grpnum;
The technique I used in this solution to calculate the number of active sessions was inspired by a similar technique used by Ben Flanaghan to address a problem called Maximum Concurrent Sessions that I covered in the past in the magazine.
Muhammad Al Pasha: 14 Seconds
-- indexes /* CREATE INDEX idx_user_start_end ON dbo.Sessions(username, starttime, endtime); CREATE INDEX idx_user_end_start ON dbo.Sessions(username, endtime, starttime); */ IF OBJECT_ID(N'tempdb..#EndTimes') IS NOT NULL DROP TABLE #EndTimes; /* Calculate end time of intervals, and generate row numbers per user, and save the result to temp table to enhance performance. */ CREATE TABLE #EndTimes ( rownum INT NOT NULL, username VARCHAR(14) NOT NULL, endtime DATETIME2(3) NOT NULL ); INSERT INTO #EndTimes(rownum, username, endtime) SELECT ROW_NUMBER() OVER(PARTITION BY S1.username ORDER BY S1.endtime) AS rownum, S1.username, S1.endtime FROM dbo.Sessions AS S1 OUTER APPLY (SELECT 1 AS is_overlapped WHERE EXISTS(SELECT * FROM dbo.Sessions AS S2 WITH (index = idx_user_end_start) WHERE S2.username = S1.username AND S2.starttime S1.endtime)) AS O2 WHERE O2.is_overlapped IS NULL GROUP BY S1.username, S1.endtime; -- Eliminate duplicates /* Calculate start time of intervals based on the previous interval end time, then join the result with end time of intervals based on rownum and username. */ WITH StartTimesCTE AS ( SELECT ET.rownum, ET.username, ST.starttime FROM (-- Use min possible time as the start time of the first interval for each user. SELECT 1 AS rownum, username, CAST('00010101 00:00:00.000' AS DATETIME2(3)) AS endtime FROM dbo.Users UNION ALL -- Use the already calculated end time of intervals for the rest of intervals. SELECT rownum + 1, username, endtime FROM #EndTimes) AS ET CROSS APPLY (--start time of interval is the min start time greater than the previous interval end time SELECT MIN(ST.starttime) AS starttime FROM dbo.Sessions AS ST WHERE ST.username = ET.username AND ST.starttime > ET.endtime) AS ST ) -- The end result is just matching starttime with endtime of each interval based on row numbers (so it is based on time order) SELECT ST.username, ST.starttime, ET.endtime FROM StartTimesCTE AS ST INNER JOIN #EndTimes AS ET ON ET.rownum = ST.rownum AND ET.username = ST.username;
Peter Larsson (Peso) 3: 4 Seconds
-- indexes /* CREATE INDEX idx_user_start_end ON dbo.Sessions(username, starttime, endtime); CREATE INDEX idx_user_end_start ON dbo.Sessions(username, endtime, starttime); */ ;WITH cteSource(UserName, theTime, theGrp) AS ( SELECT u.UserName, u.theTime, (DENSE_RANK() OVER (PARTITION BY u.UserName ORDER BY u.theTime) - 1) / 2 AS theGrp -- Create artificial groups depending -- on the times. Omit duplicates. FROM ( -- Get all usernames and first start time and last endtime SELECT UserName, MIN(StartTime) AS minStartTime, MAX(EndTime) AS maxEndTIme FROM dbo.Sessions GROUP BY UserName ) AS usr -- This only get the intermediate gaps OUTER APPLY ( SELECT s.StartTime, e.EndTime FROM ( -- Get all starttimes sorted SELECT s.StartTime, ROW_NUMBER() OVER (ORDER BY s.StartTime) AS SeqID FROM dbo.Sessions AS s WHERE s.UserName = usr.UserName ) AS s INNER JOIN ( -- Get all endtimes sorted SELECT s.EndTime, ROW_NUMBER() OVER (ORDER BY s.EndTime) + 1 AS SeqID FROM dbo.Sessions AS s WHERE s.UserName = usr.UserName ) AS e ON e.SeqID = s.SeqID -- Match previous end time time against this starttime WHERE e.EndTime
Itzik New 2: 3 Seconds
-- indexes /* CREATE UNIQUE INDEX idx_user_start_id ON dbo.Sessions(username, starttime, id); CREATE UNIQUE INDEX idx_user_end_id ON dbo.Sessions(username, endtime, id); */ -- Inline Table Function Encapsulating Logic from Solution Itzik New 1 for Single User IF OBJECT_ID('dbo.UserIntervals', 'IF') IS NOT NULL DROP FUNCTION dbo.UserIntervals; GO CREATE FUNCTION dbo.UserIntervals(@user AS VARCHAR(14)) RETURNS TABLE AS RETURN WITH C1 AS ( SELECT id, starttime AS ts, +1 AS type, NULL AS e, ROW_NUMBER() OVER(ORDER BY starttime, id) AS s FROM dbo.Sessions WHERE username = @user UNION ALL SELECT id, endtime AS ts, -1 AS type, ROW_NUMBER() OVER(ORDER BY endtime, id) AS e, NULL AS s FROM dbo.Sessions WHERE username = @user ), C2 AS ( SELECT C1.*, ROW_NUMBER() OVER(ORDER BY ts, type DESC, id) AS se FROM C1 ), C3 AS ( SELECT ts, FLOOR((ROW_NUMBER() OVER(ORDER BY ts) - 1) / 2 + 1) AS grpnum FROM C2 WHERE COALESCE(s - (se - s) - 1, (se - e) - e) = 0 ) SELECT MIN(ts) AS starttime, max(ts) AS endtime FROM C3 GROUP BY grpnum; GO -- Solution SELECT U.username, A.starttime, A.endtime FROM dbo.Users AS U CROSS APPLY dbo.UserIntervals(U.username) AS A;
What makes this solution so fast is the efficient use of parallelism. Note though that when testing it in different environments and with different arguments for number of users and intervals, I didn’t always get a parallel plan. If you’re not getting a parallel plan, it could be because the machine you’re using has fewer logical CPUs than 8. Just for test purposes, you can use the SQL Server service startup option -P8, which will cause SQL Server to use 8 schedulers like in an environment with 8 logical CPUs. The -P startup parameter is not an officially documented one, so use it just for test purposes to mimic a machine with a desired number of CPUs, not for production purposes. Also, I noticed that in some machines where I tested this code and didn’t get parallel plans, when changing the sample data to 2,000 users each with 2,500 intervals, instead of 1,000 by 5,000, I got parallel plans in more cases. Either way, this solution is still very fast even when using a serial plan.
For more details, I’m going to cover my new solutions extensively in the March edition of SQJ.
Cheers,
BG