Solution to the T-SQL Puzzle - Grouping consecutive rows with a common element | ITPro Today: IT News, How-Tos, Trends, Case Studies, Career Tips, More

A couple of weeks ago I posted the T-SQL challenge
“Grouping consecutive rows with a common element”
(http://www.sqlmag.com/article/articleid/93462/sql_server_blog_93462.html).

I got many interesting solutions via e-mail. We were supposed to give away
two prizes: one winner was supposed to be chosen based on the fastest
correct solution, and another winner was supposed to be chosen randomly
out of those who provided correct solutions.

The fastest solutions ran for about 4-5 seconds against the big table, but since
I got several solutions that ran that fast (Maciej Pilecki, Dieter Noeth, Michael
Yang and Sania Chan), I had to randomly choose a winner out of those. So
the winner under this category is Dieter Noeth; you will get one of my
T-SQL books, courtesy of MSPress.
As for the category of correct solutions regardless of performance, I got
correct solutions from: Steve Kass, Maciej Pilecki, Chris Gin, Dieter Noeth,
Tung H. Nguyen, Steve Dassin, Michael Yang, Craig Bennett,
Alejandro Mesa, Rajeev Lahoty and Sania Chan (apologies if I missed
anyone). The winner chosen randomly in this category is Craig Bennett; you
will get a SQL Server Magazine hat. I will send your e-mail addresses to
SQL Server Magazine’s editors, and they will get in touch with you to send
you the prizes.

Here I’ll present two of the approaches used in the fast solutions.

One approach was to calculate for each row from the Attendance table a
grouping element (call it grp) that uniquely identifies a consecutive group of
rows. The expression that calculates grp is:

    ROW_NUMBER() OVER(PARTITION BY student
                      ORDER BY dt, slot) -
    ROW_NUMBER() OVER(PARTITION BY student, attend
                      ORDER BY dt, slot) AS grp

The trick here is that each set of consecutive rows with the same attendance
status gets a unique value that you can later use for grouping. As for the fact
that precedence among rows is based on the combination of columns dt, slot,
the trick is to concatenate the two into one value. This can be achieved with
different techniques. One technique is to use binary concatenation:

    CAST(dt AS BINARY(8)) + CAST(slot AS BINARY(4)) AS dt_slot

Another technique is to add the slot value to dt as a time unit (e.g., second):

    DATEADD(second, slot, DT) AS dt_slot

Once dt and slot are concatenated, of course you can calculate the min and
max in each group, and then break the min and max apart to the two elements.
Here’s the complete solution using binary concatenation:

WITH CTE_Grp
AS
(
  SELECT student, attend,
    CAST(dt AS BINARY(8)) + CAST(slot AS BINARY(4)) AS dt_slot,
    ROW_NUMBER() OVER(PARTITION BY student
                      ORDER BY dt, slot) -
    ROW_NUMBER() OVER(PARTITION BY student, attend
                      ORDER BY dt, slot) AS grp
  FROM dbo.Attendance
),
CTE_RAW
AS
(
  SELECT
    student,
    MIN(dt_slot) AS min_dt_slot,
    MAX(dt_slot) AS max_dt_slot,
    attend,
    COUNT(*) AS cnt
  FROM CTE_Grp
  GROUP BY student, attend, grp
)
SELECT
  student,
  CAST(SUBSTRING(min_dt_slot, 1, 8) AS DATETIME) AS from_dt,
  CAST(SUBSTRING(min_dt_slot, 9, 4) AS INT) AS from_slot,
  CAST(SUBSTRING(max_dt_slot, 1, 8) AS DATETIME) AS to_dt,
  CAST(SUBSTRING(max_dt_slot, 9, 4) AS INT) AS to_slot,
  attend,
  cnt
FROM CTE_RAW;

And here’s the solution that uses datetime based concatenation:

WITH CTE_Grp
AS
(
  SELECT student, attend,
    DATEADD(second, slot, DT) AS dt_slot,
    ROW_NUMBER() OVER(PARTITION BY student
                      ORDER BY dt, slot) -
    ROW_NUMBER() OVER(PARTITION BY student, attend
                      ORDER BY dt, slot) AS grp
  FROM dbo.Attendance
),
CTE_RAW
AS
(
  SELECT
    student,
    MIN(dt_slot) AS min_dt_slot,
    MAX(dt_slot) AS max_dt_slot,
    attend,
    COUNT(*) AS cnt
  FROM CTE_Grp
  GROUP BY student, attend, grp
)
SELECT
  student,
  DATEADD(day, DATEDIFF(day, 0, min_dt_slot), 0) AS from_dt,
  DATEPART(second, min_dt_slot)                  AS from_slot,
  DATEADD(day, DATEDIFF(day, 0, max_dt_slot), 0) AS to_dt,
  DATEPART(second, max_dt_slot)                  AS to_slot,
  attend,
  cnt
FROM CTE_RAW;

Another approach was to:
- Assign row numbers to the rows in the Attendance table based on dt, slot
ordering (partitioned by student)
- Join two instances of the previous result-set to match for each student’s
row the next row (row with the same student and a row number greater by
one)
- Filter rows where the attendance status changes
Here you need to be careful not to lose the first/last row for each student. To
get around this problem some solutions used an outer join. An interesting
technique used by Michael Yang was to add sentinel rows; one before the
first student’s row and another after the last student’s row.
- Assign row numbers to the remaining rows based on dt, slot ordering
(partitioned by student)
- Join two instances of the previous result-set based on a match in student
and an offset of one between the row numbers
- Collect the from and to elements from each row, and calculate the count as
the offset between the originally calculated row numbers

Here’s the complete solution using the sentinel rows approach:

WITH AddIn as -- sentinel rows
(
  SELECT student, dt, 1 AS slot, -1 AS attend,
    CASE dt WHEN '19000101' THEN 0 ELSE cnt + 1 END AS pseudodate
  FROM (SELECT student, COUNT(*) AS cnt
        FROM dbo.Attendance
        GROUP BY student) AS D1,
    (SELECT CAST('19000101' AS DATETIME) AS dt
     UNION ALL
     SELECT CAST('99991231' AS DATETIME)) AS D2
),
Pseudodates AS -- assign row numbers
(
  SELECT student, dt, slot, attend,
    ROW_NUMBER() OVER(PARTITION BY student
                      ORDER BY dt, slot) AS pseudodate
  FROM dbo.Attendance
 
  UNION ALL
  
  SELECT student, dt, slot, attend, pseudodate FROM AddIn
),
-- match current/next rows
-- keep only pairs where status changes
-- assign new row numbers
CurNxt AS
(
  SELECT
    Cur.student     AS student1,
    Cur.dt         AS dt1,
    Cur.slot       AS slot1,
    Cur.attend     AS attend,
    Cur.pseudodate AS pseudo1,
    Nxt.dt         AS dt2,
    Nxt.slot       AS slot2,
    Nxt.pseudodate AS pseudo2,
    ROW_NUMBER() OVER(PARTITION BY Cur.student
                      ORDER BY Cur.pseudodate) AS seq
  FROM Pseudodates AS Cur
    JOIN Pseudodates AS Nxt
      ON Cur.student = Nxt.student
      AND Cur.pseudodate = Nxt.pseudodate - 1
      AND Cur.attend != Nxt.attend
)
-- match rows with offset of one between row numbers
-- collect from/to elements
-- calculate count
SELECT
  F.student1 AS student,
  F.dt2 AS from_dt,
  F.slot2 AS from_slot,
  T.dt1 AS to_dt,
  T.slot1 AS to_slot,
  T.attend AS attend,
  T.pseudo1 - F.pseudo1 AS cnt
FROM CurNxt AS F
  JOIN CurNxt T
    ON F.student1 = T.student1
    AND T.seq - 1 = F.seq;

Cheers,
--
BG

Comments

Plain text