Ranking Functions

In my May article “Calculating Row Numbers in SQL Server 2005,” I discussed how to calculate row numbers in deterministic scenarios by using the ROW_NUMBER() function, one of four new analytical functions in SQL Server 2005. Here, I’ll complete the discussion about row numbers in a nondeterministic scenario, then I’ll show you the other three new analytical functions in SQL Server 2005—RANK(), DENSE_RANK() and NTILE(). Listing 1 creates and populates the Orders table that I use in my examples.

Nondeterministic Row Numbers

Row numbers are deterministic when the sorting criteria are unique (e.g., orderid)—as long as the query’s input (the table) doesn’t change, the query’s output (the specific row numbers) will be the same. However, if the sorting criteria aren’t unique (e.g., qty), the result isn’t deterministic. For example, the following SQL Server 2005 query specifies the nonunique qty column as the sorting criteria and returns all the Orders table’s row numbers, as Figure 1 shows:

SELECT ROW_NUMBER() OVER(ORDER BY qty) AS rownum, qty, orderid
FROM dbo.Orders
ORDER BY rownum

Notice that duplicate values exist in Figure 1’s qty column. The minimum qty value, 10, appears four times. The four rows with the qty value 10 will get row numbers 1 through 4, but you have no guarantee of which row will get which row number. Adding one or more columns to the sorting criteria (e.g., qty and orderid) makes the criteria unique and gets a deterministic result. See “Calculating Row Numbers in SQL Server 2005” for examples of how to calculate deterministic row numbers by using a combination of columns in the sorting criteria.

Guaranteeing deterministic results is important for some applications. For example, if you use row numbers to return results in pages (a different portion of the result at a time), and you ask for rows 1 through 3 in one query and rows 4 through 6 in another query, you don’t want to return the same rows twice. But, if you specify a nondeterministic sorting criteria, such as the qty column, order 30001 might get row number 1 in the first run of the query, and row number 4 in the second run, thus falling on different pages in the query’s first and second run. If you specify deterministic sorting criteria, you ensure that the same row will have the same row number every time you run the query.

The RANK() and DENSE_RANK() Functions

The RANK() and DENSE_RANK() functions are two other new analytical ranking functions in SQL Server 2005. The functions are similar to the ROW_NUMBER() function because they produce ranking values according to sorting criteria, optionally partitioned into groups of rows. (For details about how to use the PARTITION BY clause to calculate analytical ranking functions within groups of rows, see “Calculating Row Numbers in SQL Server 2005.”) However, the ROW_NUMBER() function produces a different value for each row. The RANK() and DENSE_RANK() functions produce the same value for all rows that have the same values in the sort column. In other words, the RANK() and DENSE_RANK() functions are always deterministic. For example, if you use the qty column as the sorting criteria for all rows in the Orders table, Figure 2’s rnk and drnk columns show the RANK() and DENSE_RANK() functions’ result values.

The RANK() function finds how many rows have qty values lower than the active row’s qty value, then adds 1. For example, order 30004 has a quantity of 20. There are five orders with smaller quantities, so the rank value for order 30004 is 6 (5+1). The DENSE_RANK() function shows how many distinct qty values are lower than the active row’s qty value plus 1. Here, order 30004 gets a dense rank value of 3 (2+1) because there are only two distinct quantities smaller than 20—10 and 15. Put simply, the RANK() function tells you how many rows have values that are smaller than the current one and the DENSE_RANK() function tells you how many values are smaller than the current one. From these definitions you can infer that RANK() might have gaps between the results (1,1,1,1,5,6,6,6,9,9,11) and DENSE_RANK() won’t have gaps (1,1,1,1,2,3,3,3,4,4,5). The first four orders get a rank value of 1, and the fifth order gets the value 5. No orders get the rank values 2, 3, or 4. Dense rank doesn’t have those gaps—the first four orders get the dense rank value 1 and the fifth order has dense rank value 2.

In SQL Server 2000, to calculate rank and dense rank values, you have to translate the function definitions into T-SQL queries. Listing 2 gets from SQL Server 2000 the desired results that Figure 2 shows. The code’s first subquery counts the number of qty values in the Orders table lower than the active row’s qty value and adds 1 to produce the values in the rnk column. The code’s second subquery counts the number of distinct qty values lower than the active row’s qty value and adds 1 to produce the values in the drnk column.

Techniques for calculating rank and dense rank in SQL Server 2000 are slow. In SQL Server 2005, the RANK() and DENSE_RANK() functions scan the data only once, which is a big performance improvement. The following SQL Server 2005 query gets the desired result set more quickly than the SQL Server 2000 query:

SELECT orderid, qty,
  RANK()       OVER(ORDER BY qty) AS rnk,
  DENSE_RANK() OVER(ORDER BY qty) AS drnk
FROM dbo.Orders AS O1
ORDER BY qty

NTILE

NTILE() is the fourth new analytical ranking function that SQL Server 2005 introduces. The NTILE() function lets you separate rows into a desired number of groups according to specified sorting criteria, optionally within partitions. For example, let’s say you want to separate the orders in the Orders table into four groups, using the qty column as the sorting criteria to analyze each of the four groups of orders separately. The NTILE() function generates a tile (group) number (ranging from 1 to 4) for the different orders according to quantity. The N in the function’s name represents a variable number of groups that you specify as the function’s argument. For example, you can separate the rows into 3 groups by specifying NTILE(3). In this case, N is equal to 3. The Orders table has 11 rows. To calculate the number of rows in each group, you need to divide the number of rows by the number of requested groups. Dividing 11 by 4 gives a quotient of 2 with a remainder of 3—so three groups will contain 3 rows, and the remaining group will contain 2 rows. According to ANSI SQL-99, the NTILE() function adds an additional row to each of the first r groups, where r is the remainder of num_rows ¸ tile_size. The following SQL Server 2005 code calculates NTILE(4) values for the rows in the Orders table, using qty as the sorting criteria:

SELECT orderid, qty,
  ROW_NUMBER() OVER(ORDER BY qty) AS rownum,
  NTILE(4)     OVER(ORDER BY qty) AS ntile4
FROM dbo.Orders AS O1
ORDER BY rownum

Figure 3 shows the query’s results. Notice that the first three groups have three rows each and the fourth group has two rows. If you run the same query again, this time requesting eight tiles, the first three groups will have 2 rows each and the rest will have one row each, as Figure 4 shows.

With SQL Server 2000 and earlier releases, writing a query equivalent to NTILE() is possible but complex because the remaining rows from the division num_rows ¸ tile_size aren’t evenly distributed by the ANSI-TILE algorithm among the different groups—the algorithm adds the rows to the first groups. A solution that uses an even-distribution algorithm is simpler to write and more correct statistically than the ANSI definition of NTILE(), which doesn’t use an even-distribution algorithm. Listing 3 shows a SQL Server 2000 query that implements NTILE(8) with an even distribution of a larger group. Figure 5 shows the result.

The query that generates the derived table RN uses techniques I discussed last month and earlier in this article to calculate row numbers (rownum column) for each row in the Orders table. The derived table TS has a single row and a single column (tilesize) that returns the tile size (num_rows ¸ 8). The outer query performs a cross join between RN and TS to match all rows from RN with the tile size from TS, thus letting expressions in the query’s SELECT list refer to columns from both derived tables. With rownum and tilesize values available, calculating the tile number is easy—subtract 1 from rownum, divide the result by tilesize, then add 1. Finally, floor the result (truncate the fraction). Again, SQL Server 2005’s NTILE() function runs faster than SQL Server 2000’s alternative because it scans the data only once.

A Wish Fulfilled

T-SQL programmers have wanted a way to calculate row numbers and other ranking values in SQL Server for a long time. SQL Server 2005’s new analytical functions let you write short and efficient code, compared to SQL Server 2000 alternatives, which are long, complex, and slow. So far, I’ve covered only a small portion of the T-SQL enhancements in SQL Server 2005. There’s plenty more to come!

Comments

Plain text