Assigning Row Numbers for Non-Unique Rows

Downloads
45828.zip

I discussed the concept of row numbers many times in my T-SQL Black Belt articles because row numbers have numerous practical applications. You can find details about the ANSI set-based techniques that let you calculate row numbers in SQL Server 2005 and SQL Server 2000 in the article "Calculating Row Numbers in SQL Server 2005," April 2004, InstantDoc ID 42302.

In all the ANSI set-based techniques I discussed, I always relied on a sort column list that uniquely identifies a row. The sort column list is the attribute list used to determine the order of assignment of the row numbers. For example, if you need to calculate row numbers for orders based on order date and order ID, the sort column list is (orderdate, orderid). This month's T-SQL puzzle adds a twist to calculating row numbers: dealing with non-unique rows. This twist makes the task more complex and requires you to apply more sophisticated logical deduction.

I use a generic form of a table to demonstrate my code so that you concentrate on the technical and logical aspects of the problem and not on a specific scenario. Run the code that Listing 1 shows to create a table called T1. This table has a single column called c1 and is populated with several duplicate rows.

Your first task is to write an ANSI compliant set-based solution that returns the rows from T1 along with row numbers according to c1 order. The solution must be SQL Server 2000 compatible, which means that you can't use SQL Server 2005's new ROW_NUMBER function. You can't use the IDENTITY property or function because they're not ANSI compliant. Table 1 shows the desired result.

After you find a solution, try to enhance it to support multiple sort columns that aren't unique. Use the code that Listing 2 shows to recreate the T1 table with three columns (c1, c2, and c3) and populate it with sample data.

Your second task is to calculate row numbers for the non-unique rows in T1 based on c1, c2, c3 order. Table 2 shows the desired result.

Solution for a Single Non-Unique Sort Column

The toughest part of this T-SQL puzzle is the fact that there are duplicate rows in T1 and there's no logical way to distinguish one from another. The key to the solution is a collapse-expand logic, in which you collapse the rows to generate a unique result set and expand them back to generate a unique identifier for each duplicate row. In my solution, I'm using an auxiliary table of numbers, which you can create and populate by running the code that Listing 3 shows.

The solution has three steps:

1. Use a simple GROUP BY operation to collapse the duplicate rows into unique rows and calculate the number of rows in each group with a COUNT() aggregate to get the number of duplicates of each unique base row. Call that count dups. For each unique row, use a subquery in the SELECT list to count the number of base rows that have a smaller sort value. Call that count smaller. Thus, the query for this step is

SELECT c1, COUNT(*) AS dups,
  (SELECT COUNT(*) FROM T1 AS B
  WHERE B.c1 < A.c1) AS smaller
FROM T1 AS A
GROUP BY c1;

This query generates the result set that Table 3 shows. This result set tells you that

A appears four times in T1 and there are no rows with smaller sort values
B appears five times and there are four smaller values
C appears three times and there are nine smaller values

2. Join the table returned from the GROUP BY operation with an auxiliary table of numbers based on the join condition n <= dups. This join expands and duplicates each unique base row the number of times specified by the dups value and returns a different value in the n column for each duplicate representing a sequential duplicate number (e.g., 1st, 2nd, 3rd). The query for this step looks like

SELECT c1, smaller, n
FROM (SELECT c1, COUNT(*) AS dups,
  (SELECT COUNT(*) FROM T1 AS B
  WHERE B.c1 < A.c1) AS smaller
  FROM T1 AS A
  GROUP BY c1) AS D
JOIN Nums ON n <= dups;

As the result set in Table 4 shows, each unique row is duplicated as many times as the original number of duplicates it had in T1. But now you can distinguish one row from another because each row has a different duplicate number. The significance of the smaller value will be apparent in the next step.

3. Calculate the result row number as smaller + dups by using the query

SELECT c1, smaller + n AS rownum
FROM (SELECT c1, COUNT(*) AS dups,
  (SELECT COUNT(*) FROM T1 AS B
  WHERE B.c1 < A.c1) AS smaller
  FROM T1 AS A
  GROUP BY c1) AS D
JOIN Nums ON n <= dups;

If you think about it, the row number can be logically expressed as number of rows with a smaller sort value + the sequential duplicate number

The most confusing part here is probably the join with the Nums table. After collapsing the rows into groups, the purpose of this join is to expand the rows back to the original number of duplicates each row had originally and generate a unique identifier (the duplicate number) for the different rows within each group. The unique identifier within the group is the n column taken from Nums.

Solution for Multiple Non-Unique Sort Columns

The solution for multiple non-unique sort columns is conceptually the same as the one for a single non-unique sort column. The only difference is in calculating the smaller value. Remember that this value represents the number of rows with smaller sort values, only now the smaller value is based on multiple attributes. In the solution for a single sort column, you use the expression

B.c1 < A.c1

For multiple sort columns you use the expression

(B.c1 < A.c1) OR (B.c1 = A.c1 
AND B.c2 < A.c2)
  OR (B.c1 = A.c1 AND B.c2 = A.c2
  AND B.c3 < A.c3)

The parentheses here are used just for clarity. The default order of precedence of the logical operators allows you to omit them in this case. That's basically all there is to it. So, for multiple sort columns, you revise the solution query for the first step to the following query (don't forget to run the code in Listing 2 to recreate the T1 table with multiple columns):

SELECT c1, c2, c3, COUNT(*) AS dups,
  (SELECT COUNT(*) FROM T1 AS B
  WHERE B.c1 < A.c1
  OR B.c1 = A.c1 AND B.c2 < A.c2
  OR B.c1 = A.c1 AND B.c2 = A.c2 
  AND B.c3 < A.c3) AS smaller
FROM T1 AS A
GROUP BY c1, c2, c3;

This query produces the collapsed result set that Table 5 shows. After you use this query to create the derived table D, you perform a cross join between D and the Nums table and calculate the row numbers the same way as you did previously:

SELECT c1, c2, c3, 
smaller + n AS rownum
FROM (SELECT c1, c2, c3, COUNT(*) AS dups,
  (SELECT COUNT(*)
  FROM T1 AS B
  WHERE B.c1 < A.c1
    OR B.c1 = A.c1
    AND B.c2 < A.c2
    OR B.c1 = A.c1
    AND B.c2 = A.c2
    AND B.c3 < A.c3)
    AS smaller
  FROM T1 AS A
  GROUP BY c1, c2, c3) AS D
JOIN Nums ON n <= dups;

Solve Problems by Finding the Logical Key

Some T-SQL querying problems are tough to solve unless you figure out a logical key to open the door to the solution. In this example, the key is collapse-expand logic. To see another example of a problem that requires a key logical idea to solve it, check out the solution to the Dots on a Cube puzzle in "The Logical Puzzle" sidebar. You'll also find a new logical challenge—Rectangles in a Square puzzle—in the sidebar.

Comments

Plain text