T-SQL Challenge – Efficient Partitioned TOP | ITPro Today: IT News, How-Tos, Trends, Case Studies, Career Tips, More

This challenge involves writing a query against a table called T1 which you create and populate by running the following code:

SET NOCOUNT ON;

USE tempdb;

IF OBJECT_ID('dbo.T1', 'U') IS NOT NULL DROP TABLE dbo.T1;

CREATE TABLE dbo.T1

(

id INT NOT NULL IDENTITY PRIMARY KEY,

grp VARCHAR(10) NOT NULL,

col1 INT NOT NULL,

col2 INT NOT NULL

);

INSERT INTO dbo.T1(grp, col1, col2) VALUES('A', 10, -2147483648);

INSERT INTO dbo.T1(grp, col1, col2) VALUES('A', 10, -1);

INSERT INTO dbo.T1(grp, col1, col2) VALUES('A', 10, 0);

INSERT INTO dbo.T1(grp, col1, col2) VALUES('A', 10, 1);

INSERT INTO dbo.T1(grp, col1, col2) VALUES('A', 10, 2147483647);

INSERT INTO dbo.T1(grp, col1, col2) VALUES('A', 0, 0);

INSERT INTO dbo.T1(grp, col1, col2) VALUES('A', 0, 1);

INSERT INTO dbo.T1(grp, col1, col2) VALUES('A', 0, 2147483647);

INSERT INTO dbo.T1(grp, col1, col2) VALUES('A', -10, -2147483648);

INSERT INTO dbo.T1(grp, col1, col2) VALUES('A', -10, -1);

INSERT INTO dbo.T1(grp, col1, col2) VALUES('A', -10, 0);

INSERT INTO dbo.T1(grp, col1, col2) VALUES('B', 42, 100);

INSERT INTO dbo.T1(grp, col1, col2) VALUES('B', 42, 0);

INSERT INTO dbo.T1(grp, col1, col2) VALUES('B', 42, 200);

INSERT INTO dbo.T1(grp, col1, col2) VALUES('B', 42, 30);

INSERT INTO dbo.T1(grp, col1, col2) VALUES('B', 42, 50);

INSERT INTO dbo.T1(grp, col1, col2) VALUES('B', -42, -100);

INSERT INTO dbo.T1(grp, col1, col2) VALUES('B', -42, 0);

INSERT INTO dbo.T1(grp, col1, col2) VALUES('B', -42, -200);

INSERT INTO dbo.T1(grp, col1, col2) VALUES('B', -42, -30);

INSERT INTO dbo.T1(grp, col1, col2) VALUES('B', -42, -50);

Your task is to write a query that returns for each unique grp value one row, with the highest col1 value, and lowest col2 value among the rows with the highest col1 value. The desired result for the given sample data is:

grp col1 col2

---------- ----------- -----------

A 10 -2147483648

B 42 0

The TOP option doesn’t support an OVER clause, but if it did, the solution might have looked like this:

SELECT TOP (1) OVER(PARTITION BY grp ORDER BY col1 DESC, col2)

grp, col1, col2

FROM dbo.T1;

The above code of course doesn’t work in SQL Server (unfortunately), and is given here as conceptual pseudo code. For now, you can apply similar logic by using the ROW_NUMBER function like so:

WITH C AS

(

SELECT grp, col1, col2,

ROW_NUMBER() OVER(PARTITION BY grp ORDER BY col1 DESC, col2) AS rownum

FROM dbo.T1

)

SELECT grp, col1, col2

FROM C

WHERE rownum = 1;

This solution produces the desired result; however, it doesn’t perform well when T1 has a large number of rows and there’s no index on (grp, col1 DESC, col2). In such a case, the execution plan for this solution involves sorting which is very expensive. You can see for yourself by running this solution against a large set of rows. You can use the following code to populate T1 with 10,000,000 rows:

-- Large set of sample data

-- DDL for GetNums Function

IF OBJECT_ID('dbo.GetNums', 'IF') IS NOT NULL

DROP FUNCTION dbo.GetNums;

CREATE FUNCTION dbo.GetNums(@n AS BIGINT) RETURNS TABLE

RETURN

WITH

L0 AS(SELECT 1 AS c UNION ALL SELECT 1),

L1 AS(SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),

L2 AS(SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),

L3 AS(SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),

L4 AS(SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),

L5 AS(SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B),

Nums AS(SELECT ROW_NUMBER() OVER(ORDER BY (SELECT 0)) AS n FROM L5)

SELECT n FROM Nums WHERE n <= @n;

-- Populate T1 with 10,000,000 rows in 10,000 groups

TRUNCATE TABLE dbo.T1;

INSERT INTO dbo.T1 WITH(TABLOCK) (grp, col1, col2)

SELECT

'Group' + CAST(ABS(CHECKSUM(NEWID()))%10000 + 1 AS VARCHAR(10)) AS grp,

CHECKSUM(NEWID()) % 100 AS col1,

CHECKSUM(NEWID()) AS col2

FROM dbo.GetNums(10000000) AS Nums;

Remember that in reality you won’t always be able to create the ideal indexes to support every possible request in your system, and sometimes you will choose not to create new indexes taking into consideration the negative impact those will have on modifications.

So, the challenge is: can you think of a better performing solution than the one I presented earlier based on the ROW_NUMBER function, with the given large set of sample data, and assuming you cannot create any new indexes?

Note that you are not guaranteed that the values in col1 and col2 are positive; these columns can contain negative, zero, and positive values.

Please post your solution as a comment to this entry plus e-mail it to me ([email protected]).

I’ll post an entry with the solutions next week.

Cheers and good luck!

Comments

Plain text