IDENTITY() Function Isn't Reliable for Imposing Order on a Result Set

T-SQL is a powerful tool, but it doesn't provide native language constructs to do all the operations you need to do, such as creating rank within a result set. I'm not going to describe how to impose order within a result set (read Itzik Ben-Gan's T-SQL 2005 column "Ranking Functions" to see how to rank results in SQL Server 2000 and to learn about SQL Server 2005's new RANK() function). Instead, this tip highlights a dangerous technique for trying to calculate rank that I've seen frequently at customer sites over the past few months. The examples in this tip come from the Microsoft article "INF: How the IDENTITY Function Behaves When It Is Used in SELECT INTO Queries That Have an ORDER BY Clause and a TOP Operator or a SET ROWCOUNT Statement."

Imagine you need to rank the data in the data set that Listing 1 creates based on an ORDER BY clause on Col1. T-SQL doesn't provide a construct for determining rank, so the following technique, which uses the IDENTITY() function, is tempting to many SQL Server users:

SELECT	EmpId, EmpSalary, 
	LastName, SalaryRank 
	=IDENTITY(int, 1, 1)
INTO 	ListOfHighestPaid-
  Employees
FROM	 Employees
ORDER BY	 EmpSalary desc

This query creates a new table that adds a rank column to the result set and tries to use the ORDER BY clause to define rank. For example, if the Employees table contained the three rows that Figure 1 shows, the end user might want the new ListOfHighestPaidEmployees table to order the data as Figure 2 shows. The IDENTITY() function creates a unique value for each row, so you might think that SQL Server would assign the identity values based on the ORDER BY clause. However, the IDENTITY() function doesn't necessarily create its values based on the ORDER BY clause.

The Microsoft article I noted earlier shows several different results that the sample query might return—some answers match the position of rows that the ORDER BY clause specifies, and others don't. You might assume that the query will rank the results based on the columns you list in the ORDER BY clause, but SQL Server might apply the IDENTITY() function to rows before the relational engine applies the ORDER BY clause to the result set. In other words, SQL Server might assign the identity values before it computes the ORDER BY clause. Parallel queries are a good example of why the relational engine might assign identity values before it processes the ORDER BY clause. Say you have more than five rows in your data set and you break the query into four parallel threads. Imagine that the Employees table has 100,000 rows and SQL Server breaks the query into four steps. Each step might be gathering its own set of rows and assigning identity values as SQL Server processes the rows. However, SQL Server wouldn't apply the ORDER BY clause until all four threads are finished gathering rows and SQL Server serializes each of the parallel streams back into a single step. In that context, using IDENTITY() might not give you the results you want.

You might argue that Microsoft should have implemented the IDENTITY() function differently to ensure that it would respect the ORDER BY request. However, that's not how the function works. The good news is that SQL Server 2005's new RANK() and DENSE_RANK() T-SQL extensions will let you write short and efficient code to solve ranking problems.

Comments

Plain text