Calculating Row Numbers in SQL Server 2005

SQL Server 2005, formerly code-named Yukon, will introduce an overwhelming set of new features and enhancements to different aspects of the product. One of SQL Server 2005’s major new features is its integration of the .NET framework, which will let you write stored procedures, triggers, and user-defined functions (UDFs) in any supported .NET language, such as C# or Visual Basic .NET (VB.NET). But don’t assume that T-SQL will be obsolete. The ability to develop .NET routines will complement T-SQL development rather than replacing it.

Related: Nondeterministic Row Numbers

In some scenarios, .NET code will be appropriate; in other scenarios, T-SQL code will be appropriate. For example, T-SQL is still the most powerful and efficient language for data access, and .NET is stronger in areas that T-SQL wasn’t designed to handle (e.g., complex algorithms, iterative logic, mathematical computations). Given the wealth of new T-SQL features and enhancements in SQL Server 2005, Microsoft obviously believes in T-SQL’s future. This column is going to focus on T-SQL development in SQL Server 2005, and compare SQL Server 2000’s T-SQL techniques with SQL Server 2005’s techniques. This month I’ll discuss the new ROW_NUMBER() function, which is one of four new analytical ranking functions (as defined by the ANSI SQL-99 OLAP extensions) that SQL Server 2005 implements.

Row Numbers

Most T-SQL programmers want to know how to calculate row numbers, but the need for row numbers is a source of passionate debate among relational database theorists and T-SQL practitioners. Row numbers are consecutive integers that your code assigns to a query’s result rows, signifying the position of a row among the other result rows according to specified order. Relational database theorists say that a table is a logical entity that represents a set, and a set has no order to its rows. So there’s no sense in assigning row numbers that signify a row position among other rows. Row numbers imply sorting and a physical view of the data. I won’t argue my side in the debate here. I’m covering the subject because many programmers need to calculate row numbers for an endless number of practical applications such as paging, TOP scenarios, ranking, and others.

An important principle to remember when writing T-SQL queries is that because a table is a logical entity representing a set, table rows have no inherent order. Calculating row numbers requires a defined order. The order-by list—list of expressions that determine the sort order—can be a single column or a combination of columns. If you want to get a deterministic result, meaning that there’s only one correct result for your query, your order-by list must uniquely identify a row. Here, I’ll cover deterministic order; next time, I’ll delve into non-deterministic order. Let’s start with our first scenario—calculating row numbers based on one unique column. Note that in the following examples, I use an Orders table that the code in Listing 1 creates and populates in both SQL Server 2000 and SQL Server 2005.

Row Numbers Based on One Unique Column

Let’s say you want to calculate row numbers for the orders in the Orders table based on the unique column orderid. Figure 1 shows the desired result. Using the following ANSI-compliant technique to calculate row numbers in SQL Server 2000 produces the desired result, but inefficiently:

SELECT
  (SELECT COUNT(*)
  FROM dbo.Orders AS O2
  WHERE O2.orderid <= O1.orderid) AS rownum,
  orderid, CONVERT(varchar(10), orderdate, 120) AS orderdate,
  empid, custid, qty
FROM dbo.Orders AS O1
ORDER BY orderid

The query retrieves all rows from one occurrence of Orders, called O1, and uses a subquery in the SELECT list to calculate the row numbers. The subquery counts the number of rows in a second occurrence of Orders, O2, that have an orderid less than or equal to the current orderid in O1. The row with the minimum orderid (10001) will have one match; the row with the next orderid (10005) will have two matches; the next will have three matches, and so on. Essentially, the number of matches is the row number. Had the column determining the order not been unique, you would have gotten duplicate values from the subquery for rows with the same value in the orderid column.

The ANSI-compliant technique is inefficient even if you have an index on orderid, because for each row, the number of index rows scanned is at least the result row number. So, for n rows in the table, SQL Server scans at least (1+n)/2*n index rows. For example, my laptop took more than half an hour to run the ANSI-compliant query on a table with 100,000 rows. Fortunately, SQL Server 2005 introduces an elegant and efficient way to calculate row numbers: using the ROW_NUMBER() function. The ORDER BY list in the ROW_NUMBER() function’s OVER clause specifies the sorting criteria for the function:

SELECT ROW_NUMBER() OVER(ORDER BY orderid) AS rownum,
  orderid, CONVERT(varchar(10), orderdate, 120) AS orderdate,
  empid, custid, qty
FROM dbo.Orders AS O1
ORDER BY orderid

Without an index, the ROW_NUMBER() function in SQL Server 2005 scans the table only once, then sorts the rows to calculate all the row numbers. The solution in SQL Server 2000 scans the table once per each base row. I already mentioned that it took the SQL Server 2000 query more than half an hour to run on my laptop with 100,000 rows in the table. In comparison, the performance of the SQL Server 2005 query with the same table is simply astonishing—only 1 second.

Row Numbers Based on a Unique Combination of Columns

Now let’s say you want to calculate row numbers based on the orderdate column, using orderid as a tiebreaker. Figure 2 shows the desired result. The following query generates the desired result in SQL Server 2000:

SELECT
  (SELECT COUNT(*)
  FROM dbo.Orders AS O2
  WHERE O2.orderdate

The subquery counts the number of rows in O2 that have an earlier orderdate than the one in the current row in O1—or that have the same orderdate and a smaller or equal orderid. Effectively, you get the desired row number. But this query has performance problems similar to the previous SQL Server 2000 query. In the SQL Server 2005 solution, you simply specify both the orderdate and the orderid columns in the ROW_NUMBER() function’s ORDER BY list:

SELECT ROW_NUMBER() OVER(ORDER BY orderdate, orderid) AS rownum,
  CONVERT(varchar(10), orderdate, 120) AS orderdate,
  orderid, empid, custid, qty
FROM dbo.Orders AS O1
ORDER BY orderdate, orderid

Row Numbers Within Partitions

Sometimes you want to calculate row numbers within partitions or subsets. For example, you assign row numbers for each customer’s rows based on orderid order, as Figure 3 shows. In SQL Server 2000, you use a technique like the one we used to calculate non-partitioned row numbers based on the order of orderid. In addition, you need to correlate O2 (the inner occurrence of Orders in the subquery) to O1 (the outer occurrence of Orders) based on custid equivalence, by using the following query:

SELECT custid,
  (SELECT COUNT(*)
  FROM dbo.Orders AS O2
  WHERE O2.custid = O1.custid
    AND O2.orderid

To get reasonable performance, you need an index on custid and orderid. If c=number of customers and m=number of orders per customer (assuming even distribution), the number of index rows SQL Server scans is at least c*(1+m)/2*m. This query is much faster than the non-partitioned solution, but the solution is still less efficient than the SQL Server 2005 alternative. SQL Server 2005 provides a PARTITION BY clause in which you specify the partitioning column(s):

SELECT custid,
  ROW_NUMBER() OVER(
    PARTITION BY custid  ORDER BY orderid) AS custrownum,
  orderid, CONVERT(varchar(10), orderdate, 120) AS orderdate,
  empid, qty
FROM dbo.Orders AS O1
ORDER BY custid, orderid

And of course, you can use the techniques I described earlier for multi-column sorting in the non-partitioned solutions if you need multiple columns to determine the sort order within partitions.

Using Listing 2 in SQL Server 2000 assigns row numbers partitioned by custid, with sorting criteria of orderdate, then orderid. Figure 4 shows this query’s result. Here’s SQL Server 2005’s partitioned, multiple sort columns query:

SELECT custid,
  ROW_NUMBER() OVER(
    PARTITION BY custid ORDER BY orderdate, orderid) AS custrownum,
  CONVERT(varchar(10), orderdate, 120) AS orderdate,
  orderid, empid, qty
FROM dbo.Orders AS O1
ORDER BY custid, orderdate, orderid

Less Is More

Solutions for calculating row numbers in SQL Server 2005 are simpler, more efficient, and involve less code than SQL Server 2000 solutions. Next time, I’ll cover non-deterministic scenarios, then describe the other three analytical ranking functions: RANK, DENSE_RANK and NTILE.

Comments

Plain text