Over the TOP

Downloads
16191.zip

Understanding the TOP clause's capabilities and limitations

Editor's Note:Send your experts-only T-SQL tips to Itzik Ben-Gan at [email protected]. If we use your tip in the magazine, you'll recieve $100 and an exclusive T-SQL Black Belt shirt.

T-SQL lets you limit the number of rows in a query's result by specifying a search condition in the WHERE clause. The search condition is a logical expression, and it returns only rows for which the logical expression evaluates to TRUE. In some cases, however, you might want to limit the rows in the result to a specific number of rows without specifying a logical condition. To limit the number of rows, you can use the TOP clause in your SQL Server 2000 and 7.0 queries. But TOP has limitations. This article shows how to use the TOP clause and discusses ways to overcome the TOP clause's restrictions, including a method that Dr. Tom Moreau, president of Toronto-based Brockman Moreau Consulting, submitted.

TOP Queries, ANSI, and the Relational Model

Using the TOP n option, you can limit the result of a SELECT query to a fixed number of rows. The TOP n PERCENT option limits the result to a more scalable number of rows or bases the result on the percentage of rows supplied, rounded up to the nearest integer. The TOP WITH TIES option specifies that, in addition to returning the rows that the query would get without the option, you also want the rows with the same values as those in the last row of columns that the ORDER BY clause specifies. For example, to get the five orders that have the lowest OrderID from the Orders table in the Northwind database, you can issue the following query:

SELECT TOP 5 OrderID, CustomerID, EmployeeID, OrderDate
FROM Orders AS O1
ORDER BY OrderID

TOP has little meaning unless you specify the ORDER BY clause. Without ORDER BY, a request such as "Give me the first five orders" has little meaning. You can't guarantee which rows the query will return; the results depend on the plan that the optimizer chooses to process the query and the data's physical layout on disk. You'd only want to use TOP without an ORDER BY clause when you want a partial number of sample rows and you don't care which rows you get.

TOP is neither in accord with the relational model nor ANSI-compliant. The relational model deals with sets, and ANSI SQL lets you manipulate those sets and limit the number of rows in a query's result by using logical conditions that define which rows to return. ANSI SQL contains no options that assume a particular physical data layout that can determine which rows to return.

According to the relational model, a table is a set of a particular entity's occurrences. Rows in a table are in no specific order. Thus, a query with an ORDER BY clause doesn't return a table. If you use the ORDER BY clause in a view or derived table, the query will fail. But SQL Server lets you use the ORDER BY clause in a TOP query for derived tables and views. In an upcoming article on sorting techniques, I'll show you how to overcome this apparent contradiction.

Although TOP queries aren't ANSI-compliant, they provide a solution to many problems that would be too performance-intensive if you wrote them in an ANSI-compliant manner. However, TOP queries can take you only so far. They can't solve all problems associated with limiting the number of rows in a result.

Getting Rows m to n

A TOP query can't get rows m to n in the order that the ORDER BY clause specifies. For example, suppose you want to return orders 6 to 10 from the Orders table; as you page through the orders in groups of five rows, you want to retrieve the second group of rows. You can't respond with a TOP query. However, this request has an ANSI-compliant solution (I use SELECT * for brevity; you should select only the required columns):

SELECT *
FROM Orders AS O1
WHERE (SELECT COUNT(*) FROM Orders AS O2
  WHERE O2.OrderID <= O1.OrderID) BETWEEN 6 AND 10
  ORDER BY OrderID

This query's poor performance is noticeable, especially with large tables. For each key (OrderID), the query counts the number of keys that are less than or equal to that key. If it's unique, the first key will have 1, the second 2, and so on. This query provides the result's ordinal position, or row number, and lets you limit the result to rows 6 to 10. You can improve performance by orders of magnitude with a TOP query; in my tests, the following query incurred about 1/15 of the previous query's performance cost:

SELECT *
FROM (SELECT TOP 5 *
  FROM (SELECT TOP 10 * FROM Orders ORDER BY OrderID) AS O1
ORDER BY OrderID DESC) AS O2
ORDER BY OrderID

The problem with this query is that it sorts the input three times. This approach can be very time-consuming if you page through the rows in groups, requesting a different group of rows each time.

To avoid the overhead of multiple sorts, you can store the rows and a one-time calculated row number in a temporary table and repeatedly select from the temporary table. If the base table doesn't already have an IDENTITY column, you can easily archive the table with a SELECT INTO query that uses the IDENTITY() function to calculate the row numbers. The following SELECT INTO query generates an error because it attempts to create a destination table with two IDENTITY columns. One column is the OrderID column that already has the IDENTITY property in the Orders table, and the other column is the result column of the IDENTITY() function.

SELECT IDENTITY(int, 1, 1) AS RowNum, OrderID, CustomerID, EmployeeID, OrderDate
INTO #Orders
FROM Orders
ORDER BY OrderID

If the base table already has an IDENTITY column, as the Orders table does, you can manually create a temporary table with an IDENTITY column and populate the table with an INSERT SELECT statement, as the following code shows:

CREATE TABLE #Orders
(
  RowNum     int      NOT NULL IDENTITY(1, 1)
  PRIMARY KEY,
  OrderID    int      NOT NULL,
  CustomerID nchar(5) NULL,
  EmployeeID int      NULL,
  OrderDate  datetime NULL
)
INSERT INTO #Orders(OrderID, CustomerID, 
  EmployeeID, OrderDate)
  SELECT TOP 10 OrderID, CustomerID, EmployeeID, 
  OrderDate
  FROM Orders
  ORDER BY OrderID

Now it's easy to return rows 6 to 10.

SELECT *
FROM #Orders
WHERE RowNum BETWEEN 6 AND 10

Tests that I performed using the SELECT INTO statement show that SQL Server 7.0 might produce undesired results, whereas SQL Server 2000 produces the desired results. The query execution plans show that SQL Server 7.0 sometimes calculates IDENTITY values before sorting the rows, so the row numbers become meaningless. SQL Server 2000 calculates IDENTITY values after the sort, providing the desired result. When I use an INSERT SELECT statement, both versions calculate IDENTITY values after the sort.

Paging Through the Result Rows

If you want to page through the sorted orders in groups of rows, you can store the whole set of orders with their row numbers in the temporary table, as the following code shows:

TRUNCATE TABLE #Orders
INSERT INTO #Orders(OrderID, CustomerID,
  EmployeeID, OrderDate)
  SELECT OrderID, CustomerID, EmployeeID,
    OrderDate
  FROM Orders
  ORDER BY OrderID

Then, use a simple stored procedure that accepts the page number and the number of rows in each page as parameters and calculates the range of desired row numbers to return:

CREATE PROC usp_GetOrders
  @page     int,
  @pagesize int
 AS
SELECT *
FROM #Orders
WHERE RowNum BETWEEN @pagesize * (@page - 1) + 1
  AND @pagesize * @page

For example, if you want the second page of five rows to get rows 6 to 10, you would invoke the stored procedure as follows:

EXEC usp_GetOrders
@page     = 2,
@pagesize = 5

Another way to page through the rows is to use a simple TOP n query to retrieve the first group of rows. For example,

SELECT TOP 5 *
FROM Orders
ORDER BY OrderID

You can save the value of the last row's column (by which you sort the result) in a variable and use the value in your next query. For example, the query above returned OrderID 10252 as the last order in the result. The following query will retrieve the next group of rows:

SELECT TOP 5 *
FROM Orders
WHERE OrderID > 10252
ORDER BY OrderID

If you're using ADO as your client database interface, you can use the built-in paging capabilities of its recordsets.

Getting TOP n Rows for Each Key

Some requests are a bit more complex than those above. Suppose you want to limit the number of orders for each customer—for example, you want only the three orders with the lowest OrderIDs for each customer. Dr. Tom Moreau, president of Brockman Moreau Consulting in Toronto, provided the following solution:

SELECT CustomerID, OrderID
FROM Orders AS O1
WHERE OrderID IN (SELECT TOP 3 OrderID
  FROM Orders
  WHERE CustomerID = O1.CustomerID
  ORDER BY OrderID)

Moreau's solution is a tricky way of using the TOP clause. You can't use a simple TOP query to request the TOP n rows for each customer, but you can ask for orders that fall within the TOP n orders for a particular customer. If you want to limit the number of orders for each customer but you don't care which orders the query returns, you don't need to specify the ORDER BY clause in the subquery.

You can also use the temporary table solution for this problem. You can store the orders sorted by CustomerID and OrderID in the temporary table with the autogenerated row numbers, as the following code shows:

TRUNCATE TABLE #Orders

INSERT INTO #Orders(OrderID, CustomerID,
  EmployeeID, OrderDate)
  SELECT OrderID, CustomerID, EmployeeID, OrderDate
  FROM Orders
  ORDER BY CustomerID, OrderID

Now, you can request the three orders in which the customer's row number is less than that customer's minimum row number plus 3:

SELECT O.CustomerID, O.OrderID
FROM
    #Orders AS O
  JOIN
    (SELECT CustomerID, MIN(RowNum) AS MinRowNum
    FROM #Orders
    GROUP BY CustomerID) AS M ON  O.CustomerID =
	  M.CustomerID
	  AND O.RowNum < M.MinRowNum + 3
ORDER BY O.CustomerID

Using a UDF to Get Rows m to n

You can use a user-defined function (UDF) to get rows m to n. In the same way that the temporary table solution does, the UDF populates a table variable with the table rows and the row number, which the IDENTITY column generates. Listing 1 shows the script that creates the ufn_GetOrders() function.

The function first stores all the rows from the Orders table in the @Orders table variable, then deletes all irrelevant rows. A more efficient way might be to store the TOP @lastrow rows in the table variable and then delete all rows where RowNum is less than @firstrow. The problem is that the n part of the TOP n query can't be a variable; it must be a constant. This problem usually occurs when you need a small number of rows in a large table, but the solution is nice when you're using small tables.

To invoke the UDF to get orders 6 to 10, issue the following query:

SELECT * FROM ufn_GetOrders(6, 10)
ORDER BY RowNum

Extending the Capabilities

This article shows you how to extend the capabilities of TOP queries and provide solutions beyond TOP to limit the number of rows in the result. Using these techniques will help you write better-performing code when you need to manipulate groups of rows.

Comments

Plain text