Last week I provided a T-SQL challenge entitled “Efficient Partitioned TOP” involving finding an efficient solution to a given task when you cannot create new indexes on the queried table. You can find the puzzle details here. I gave code to populate the queried table T1 with 10,000,000 rows of sample data. I provided the following solution:
WITH C AS
(
SELECT grp, col1, col2,
ROW_NUMBER() OVER(PARTITION BY grp ORDER BY col1 DESC, col2) AS rownum
FROM dbo.T1
)
SELECT grp, col1, col2
FROM C
WHERE rownum = 1;
When there’s no index on (grp, col1 DESC, col2) the execution plan for this solution involves sorting the 10,000,000 rows, which is very expensive. Here are the performance measures I got for this solution on my test machine:
Elapsed time: 83221, CPU: 120167, Logical Reads: 42472
The data is scanned once, but the sort operation contributes to the majority of the cost of the query. The challenge was to find a more efficient solution assuming creating a new index is not an option.
Thanks to all those who submitted solutions: Peter Larsson, Plamen Ratchev, Steve Kass, MuadDib, Rudolf, Paula North, Rob Farley, Regan Wick, Adriaan Simons, and cwestervelt.
Most people (Peter Larsson, Plamen Ratchev, MuadDib, Rudolf, Paula North, Regan Wick, Adriaan Simons, cwestervelt) submitted a pretty straightforward solution based on grouping, joining, and then grouping again:
SELECT A.grp, A.col1, MIN(B.col2) AS col2
FROM ( SELECT grp,
MAX(col1) AS col1
FROM dbo.T1
GROUP BY grp ) AS A
JOIN dbo.T1 AS B
ON B.grp = A.grp
AND B.col1 = A.col1
GROUP BY A.grp, A.col1;
The performance measures I got for this solution are:
Elapsed time: 30798, CPU: 52961, Logical Reads: 84584
The plan for this solution shows that the data is scanned twice, but still it is substantially faster than the solution based on the ROW_NUMBER function because it doesn’t involve sorting. Both the aggregations and the join are handled with hashing. With such a large number of rows, handling the aggregations with hashing is more efficient than sorting, albeit the extra scan of the data.
Peter Larsson also provided a couple of other solutions based on similar logic. One using the APPLY operator:
SELECT A.grp, A.col1, B.col2
FROM ( SELECT grp,
MAX(col1) AS col1
FROM dbo.T1
GROUP BY grp ) AS A
CROSS APPLY ( SELECT MIN(C.col2) AS col2
FROM dbo.T1 AS C
WHERE C.grp = A.grp