TOP Troubles

Although the T-SQL TOP option in SQL Server 2005 doesn’t conform to the ANSI SQL standard, it nonetheless meets some important practical needs of SQL Server users; for example, in providing a tool to answer requests such as returning information about the x most recent orders. Unfortunately, the current implementation of TOP has several limitations. Its paradigm is fuzzy and not in accord with the relational model, and TOP lacks support for some key functionality that SQL Server practitioners greatly need. This article kicks off a series in which I’ll explore TOP’s limitations; suggest an alternative, more robust and complete implementation of TOP, and discuss options available in SQL Server 2005 that address TOP’s shortcomings.

I shared my thoughts about TOP with the Microsoft SQL Server development team. If you have suggestions for features you'd like to see in future SQL Server versions, you can submit them to [email protected]. Program managers from the SQL Server development team monitor this mailbox and consider each request. Submitting such requests gives you an opportunity to affect the development of future SQL Server versions and be proactive in making your needs known to Microsoft.

The Relational Model

Before delving into TOP, let's start by discussing some basic terms and definitions related to ANSI SQL and the relational model that are crucial to understanding what's missing in TOP. The relational model, which is based on set theory, and in turn ANSI SQL, which is based on the relational model, deal with logical entities represented in your database as tables. A table is a set, and a set has no predetermined order to its rows.

ANSI SQL distinguishes a table from a cursor. (Don’t confuse the term cursor here with cursors that you declare; here, the term cursor is more conceptual and not specific to the cursors that you can declare.) A table is logical and has no predetermined order to its rows, whereas a cursor is physical and has some order to its records. A query expects a table as input in the FROM clause:

SELECT  <col_list> FROM  <table>;

Per ANSI, a query that has no ORDER BY clause returns a table, with no guarantee of any specific order. A valid table must also have column names for all result columns, and the result column names must be unique. A query that meets these three requirements (no ORDER BY, all result columns have names, all result column names are unique) yields a valid table and can be used as a table expression.

A table expression is any form of an encapsulated query that yields a valid table, including a view, inline table valued function, derived table, and common table expression (CTE). You can specify a table expression as an input to an outer query, typically in the FROM clause, as the following example, which uses a derived table (run in the Northwind database), shows:

SELECT *
FROM (SELECT OrderID, CustomerID, EmployeeID, OrderDate
      FROM dbo.Orders) AS D;

On the other hand, a query that has an ORDER BY clause returns a cursor and isn't allowed as a table expression. Therefore, the following attempt fails:

SELECT *
FROM (SELECT OrderID, CustomerID, EmployeeID, OrderDate
      FROM dbo.Orders
      ORDER BY OrderID) AS D;

A cursor is nonrelational since it’s a physical object that has order to its records. The client application expects a physical object (cursor/record set); thus a query that includes an ORDER BY clause is allowed if the result is returned back to the client, as opposed to being served as input to an outer query, as this example shows:

SELECT OrderID, CustomerID, EmployeeID, OrderDate
FROM dbo.Orders
ORDER BY OrderID;

In fact, the result of a query that has no ORDER BY clause (although it's a valid relational table) must be converted to a cursor if that result is returned to the client:

SELECT OrderID, CustomerID, EmployeeID, OrderDate
FROM dbo.Orders;

But keep in mind that in such a case, no particular order is guaranteed. Per ANSI, when an ORDER BY clause isn't specified, the database product (i.e., SQL Server) can return the rows in any order.

So far, you’ve seen only one function of the ORDER BY clause—determining the physical order of records in a cursor. However, ANSI defines another function for the ORDER BY clause—logically determining precedence of rows for window-based calculations. In previous T-SQL 2005 articles, I've discussed ranking functions (e.g., ROW_NUMBER) and other window-based calculations (e.g., aggregates). (For more information about these topics, see "OVER Clause Simplifies Aggregate Window Calculations," December 2005, InstantDoc ID 48297; "Ranking Functions," May 2004, InstantDoc ID 42646; and "Calculating Row Numbers in SQL Server 2005," April 2004, InstantDoc ID 42302.) Queries that perform those operations include an OVER clause in which you use ORDER BY for logical reasons, such as the following sample query:

SELECT ROW_NUMBER() OVER(ORDER BY OrderID) AS RowNum,
  OrderID, CustomerID, EmployeeID, OrderDate
FROM dbo.Orders;

Here, ORDER BY determines the precedence of rows based on OrderID for the row-number calculation. Still, the result of the query is a set or a valid table and guarantees no particular order. Thus, such a query is allowed in a table expression, as the following sample CTE shows:

WITH OrdersRN AS
(
  SELECT ROW_NUMBER() OVER(ORDER BY OrderID) AS RowNum,
    OrderID, CustomerID, EmployeeID, OrderDate
  FROM dbo.Orders
)
SELECT * FROM OrdersRN;

Conversely, the following query returns a cursor that has a predetermined order and, therefore, isn't allowed as a table expression and fails:

WITH OrdersRN AS
(
  SELECT ROW_NUMBER() OVER(ORDER BY OrderID) AS RowNum,
    OrderID, CustomerID, EmployeeID, OrderDate
  FROM dbo.Orders
  ORDER BY OrderID
)
SELECT * FROM OrdersRN;

Remember that ANSI also lets you specify a PARTITION BY clause to limit the window that will be used for the window-based calculation. For example, the following query calculates row numbers based on OrderID order for each CustomerID separately:

SELECT ROW_NUMBER() OVER(PARTITION BY CustomerID ORDER BY OrderID) AS RowNum,
  OrderID, CustomerID, EmployeeID, OrderDate
FROM dbo.Orders;

Again, because this query has no ORDER BY clause to determine the order of the result, the query returns a table. When the result is converted to a cursor before it's returned to the client, you have no guaranty of any particular order.

To sum up, you should keep in mind the following terms, which are important for our discussion about TOP: table, cursor, and table expression. Also, remember that per ANSI, ORDER BY can serve one of two functions: determining physical order of records in a cursor or determining logical order and precedence of rows in window-based calculations (but not both uses at the same time). You need to specify a different ORDER BY clause for each function. For example, in the following query you calculate row numbers based on OrderID logical order, while sorting the output by CustomerID and OrderDate:

SELECT ROW_NUMBER() OVER(ORDER BY OrderID) AS RowNum,
  OrderID, CustomerID, EmployeeID, OrderDate
FROM dbo.Orders
ORDER BY CustomerID, OrderDate;

Finally, also remember that window-based calculations can have a PARTITION BY clause that limits the window provided to the calculation as input.

TOP Conundrum

The implementation of TOP in SQL Server 2005 diverges in some respects from the relational model and ANSI SQL. Let's look at a few sample queries that highlight some of TOP's inconsistencies. Consider, for instance, this query:

SELECT TOP(3) OrderID, CustomerID, EmployeeID, OrderDate
FROM dbo.Orders
ORDER BY OrderID;

Here, the ORDER BY clause serves two functions: It's the logical filter for the TOP option, and it determines the physical order of the result. This scheme deviates from the ANSI SQL standard, which differentiates the ORDER BY functions in features that it supports. Is the result of the query a table or a cursor? Well, the answer is fuzzy.

Typically (or per ANSI), a query that has an ORDER BY clause, which determines the order of the result, is supposed to return a cursor and shouldn’t be allowed in a table expression. Then again, there’s no way for you to separate the ORDER BY clause for result sorting and the ORDER BY for logical-filtering functions. So the above query returns a cursor.

However, because SQL Server's designers wanted to allow TOP queries in table expressions, you’re allowed to do something like this:

SELECT *
FROM (SELECT TOP(3) OrderID, CustomerID, EmployeeID, OrderDate
      FROM dbo.Orders
      ORDER BY OrderID) AS D;

Now what’s the function of ORDER BY, and will this query guarantee the order of the result or not? Again, the answer is fuzzy. The only thing you can say for sure is that ORDER BY here serves as the logical filter of the TOP option. However, since the outer query doesn’t have an ORDER BY clause, you can't guarantee that the output will be sorted.

The irony here is that a query with an ORDER BY clause doesn't return a relational result set and therefore isn't allowed in a table expression. However, in using the current nonrelational implementation of the TOP option, the result set becomes relational and is allowed in table expressions. This leads to some obscure abuse of the TOP option, which I’ve seen programmers do, such as this example:

SELECT *
FROM (SELECT TOP(100) PERCENT OrderID, CustomerID, EmployeeID, OrderDate
      FROM dbo.Orders
      ORDER BY OrderID) AS D;

Remember that because the outer query has no ORDER BY clause, technically, SQL Server doesn't have to guarantee any order of the result.

Another limitation of TOP is that it doesn’t support a partitioning element. So, how do you return the three most recent orders for each employee? Certainly, you can use a subquery similar to this:

SELECT OrderID, CustomerID, EmployeeID, OrderDate
FROM dbo.Orders AS O
WHERE OrderID IN
  (SELECT TOP(3) OrderID
   FROM dbo.Orders AS I
   WHERE I.EmployeeID = O.EmployeeID
   ORDER BY OrderDate DESC, OrderID DESC)
ORDER BY EmployeeID, OrderDate DESC, OrderID DESC;

However, such solutions have performance limitations in terms of the way the optimizer can handle them and are limited to tables that have a single column key. Ideally, TOP in SQL Server 2005 would also support a partitioning element.

Finally, as a result of SQL Server 2005's support for modifications by using TOP, things become even fuzzier. Consider the following code:

DELETE TOP(3) FROM dbo.Orders;

Which rows will be deleted? The answer is any three rows that SQL Server happens to stumble into first. TOP with modifications doesn't let you specify an ORDER BY clause, so the term TOP here is really a misnomer. There are cases, of course, where you want to modify rows based on some logical-ordering filter.

Stay Tuned for More TOP

I’d like you to ponder the issues I raise in this article and consider how you'd implement the TOP option if you were its designer. In upcoming articles, I’ll offer my own suggestions for improving TOP and provide alternatives to TOP’s missing functionality that you can implement in SQL Server 2005.

Comments

Plain text