Skip navigation

Making TOP Better

In the first article in this series, "TOP Troubles," January 2006, InstantDoc ID 48616, I talked about the problematic nature of the TOP option as implemented in SQL Server 2005. Specifically, I discussed the ambiguity of the ORDER BY clause when you use TOP. ORDER BY serves as the logical filter for TOP—determining which rows will be returned—but ORDER BY also determines the order of the rows in the output cursor. When you use TOP, the two sorting functions aren't separated, which gives you less control over your query results. I asked you to consider how you'd design a new TOP option for SQL Server, a TOP that's unambiguous and provides full control over the different sorting functions. Here I offer my own suggestion for creating an improved TOP, and I'll also describe some alternative techniques supported in SQL Server 2005 that provide TOP’s missing functionality.

The Ideal TOP


If you think about it, SQL Server 2005 implements ANSI SQL:1999 constructs that separate the two different sorting functions you might need for some queries: controlling the calculation order and sorting the output. You might have already guessed that I’m referring to the OVER clause provided with ranking functions. Within an OVER clause's parenthesis, you can specify the order in which you want the ranking values to be calculated. For example, the following query calculates row numbers for orders according to OrderID sort:

 SELECT ROW_NUMBER() OVER(ORDER BY OrderID) AS RowNum,
  OrderID, CustomerID, EmployeeID, OrderDate
FROM dbo.Orders; 

Remember that here, ORDER BY serves only one function: determining the order of calculation of row numbers. Since no ORDER BY clause is specified in the outer query, the result is a valid table instead of a cursor; hence there’s no guarantee that the output rows will be in any particular order. Furthermore, such a query is allowed as a table expression.

In contrast, the following query has a separate ORDER BY clause:

 SELECT ROW_NUMBER() OVER(ORDER BY OrderID) AS RowNum,
  OrderID, CustomerID, EmployeeID, OrderDate
FROM dbo.Orders
ORDER BY OrderID; 

So you get a cursor back rather than a table, and the order of the rows in the output is guaranteed. Also, such a query isn't allowed as a table expression. Furthermore, remember that you can specify a PARTITION BY clause with ranking functions in which you request that the calculations are performed for partitions or groups of rows independently.

Now think about it... wouldn’t it make perfect sense to support an OVER clause with the TOP option, very similar to ranking functions? One that supports an ORDER BY clause to serve as TOP’s logical filter only? Then, an outer ORDER BY clause would serve only as the output-sorting tool. For example, consider the following pseudo query:

 SELECT TOP(3) OVER(ORDER BY OrderID)
  OrderID, CustomerID, EmployeeID, OrderDate
FROM dbo.Orders; 

Here you ask for the three rows with the lowest OrderID values, but since you didn’t specify an outer ORDER BY clause, the result would be a table. On the other hand, specifying an outer ORDER BY clause, as the following pseudo code example shows, would guarantee the order of the output rows and return a cursor:

 SELECT TOP(3) OVER(ORDER BY OrderID)
  OrderID, CustomerID, EmployeeID, OrderDate
FROM dbo.Orders
ORDER BY OrderID; 

Another advantage in separating the two functions of the ORDER BY clause is that you could use one ORDER BY list as TOP’s logical filter and have the output sorted in a different manner, like this pseudo code:

 SELECT TOP(3) OVER(ORDER BY OrderID)
  OrderID, CustomerID, EmployeeID, OrderDate
FROM dbo.Orders
ORDER BY OrderDate; 

Supporting a PARTITION BY clause would also make perfect sense here. This type of clause would let you provide solutions for “TOP n for each group” requests. For example, you could return the three orders with the lowest order IDs for each customer by using a statement like this pseudo code:

 SELECT TOP(3) OVER(PARTITION BY CustomerID ORDER BY OrderID)
  OrderID, CustomerID, EmployeeID, OrderDate
FROM dbo.Orders
ORDER BY CustomerID, OrderID; 

Real-World TOP Alternatives


My previous pseudo code examples illustrate the imaginary "TOP on steroids" that I'd have liked Microsoft to implement in SQL Server 2005. Unfortunately, these TOP fantasies won't solve practical needs. The good news is that alternatives exist in SQL Server 2005 to provide the functionality that I believe TOP lacks. This more practical "TOP" is actually based mainly on the ROW_NUMBER function. You calculate row numbers according to any given ordering and partitioning requirements, create a table expression out of the query with the row number calculation, then in an outer query filter only the row numbers that you need. For example, the following code returns the three orders with the lowest order IDs:

 WITH OrdersRN AS
(
  SELECT ROW_NUMBER() OVER(ORDER BY OrderID) AS RowNum,
    OrderID, CustomerID, EmployeeID, OrderDate
  FROM dbo.Orders
)
SELECT *
FROM OrdersRN
WHERE RowNum <= 3; 

Notice that here, ultimately the ORDER BY clause serves only one function—filtering. The query returns a valid table and not a cursor. If you want the output sorted, you can specify an ORDER BY clause in the outer query, and the outer query doesn’t have to have the same ORDER BY clause as in the inner query. For example, the following sample code uses this technique to sort the output by OrderDate:

 WITH OrdersRN AS
(
  SELECT ROW_NUMBER() OVER(ORDER BY OrderID) AS RowNum,
    OrderID, CustomerID, EmployeeID, OrderDate
  FROM dbo.Orders
)
SELECT *
FROM OrdersRN
WHERE RowNum <= 3
ORDER BY OrderDate; 

And if you need to partition the data and return the three orders with the lowest order IDs per customer, just add a PARTITION BY clause to do so, like this:

 WITH OrdersRN AS
(
  SELECT ROW_NUMBER() OVER(PARTITION BY CustomerID ORDER BY OrderID) AS RowNum,
    OrderID, CustomerID, EmployeeID, OrderDate
  FROM dbo.Orders
)
SELECT *
FROM OrdersRN
WHERE RowNum <= 3
ORDER BY CustomerID, RowNum; 

The ROW_NUMBER technique performs extremely efficiently because it calculates row numbers based on a single scan of the data. Furthermore, SQL Server’s optimizer has sophisticated logic to handle row numbers and their interaction with other query elements (e.g., filtering, sorting). For example, when you filter rows WHERE RowNum

Work Around the Limitations


I've suggested an alternative design for SQL Server 2005's TOP option that I feel is stronger and more straightforward than the current TOP implementation. Although it's highly unlikely that Microsoft will change the design of TOP anytime soon, you can use my alternative techniques to supplement TOP and provide the functionality that it lacks. Next month, I’ll continue discussing TOP and provide solutions for TOP data modifications on steroids.

TAGS: SQL
Hide comments

Comments

  • Allowed HTML tags: <em> <strong> <blockquote> <br> <p>

Plain text

  • No HTML tags allowed.
  • Web page addresses and e-mail addresses turn into links automatically.
  • Lines and paragraphs break automatically.
Publish