SQL Server 2005 introduces two enhancements to the TOP clause. Now you can specify an expression as an input to the TOP keyword, and you can also use TOP in modification statements (INSERT, UPDATE, and DELETE). Here, we'll look at sample uses of these enhancements.
Using TOP with Expressions
In SQL Server versions before SQL Server 2005, the TOP option supported only a constant as input. In SQL Server 2005, TOP can accept any self-contained (i.e., not correlated) expression in parenthesis as input. The expression can be as simple as a variable name or as complex as a self-contained subquery, and you must specify it in parenthesis. For example, the following syntax is valid:
USE AdventureWorks; DECLARE @n AS INT; SET @n = 3; SELECT TOP(@n) * FROM Sales.SalesOrderHeader ORDER BY SalesOrderID;
The following syntax isn’t valid:
DECLARE @n AS INT; SET @n = 3; SELECT TOP @n * FROM Sales.SalesOrderHeader ORDER BY SalesOrderID;
For backward compatibility, SQL Server 2005 will still support specifying constants in SELECT queries without parenthesis, but I suggest you start using the new syntax so you can get used to it.
The ability to use expressions is handy when you want to limit the number or percentage of returned rows according to input arguments supplied to a routine. For example, suppose you have a stored procedure or a function that accepts two arguments: @pagesize and @numpages. @numpages represents the requested number of pages; @pagesize represents the number of rows in a page. The routine is supposed to return the first @numpages of rows from the Sales.SalesOrderHeader table in the AdventureWorks database, sorted by SalesOrderID. The query to do so looks like this (using local variables to test the code):
DECLARE @pagesize AS INT, @numpages AS INT; SET @pagesize = 5; SET @numpages = 3; SELECT TOP(@numpages*@pagesize) SalesOrderID, CustomerID, OrderDate FROM Sales.SalesOrderHeader ORDER BY SalesOrderID;
If you also want to return the page number, you can do so by using the ROW_NUMBER function, as in the following example. (I discuss using the ROW_NUMBER function in the Web-exclusive articles "Ranking Functions," May 2004, InstantDoc ID 42646 and "Calculating Row Numbers in SQL Server 2005," April 2004, InstantDoc ID 42302.)
DECLARE @pagesize AS INT, @numpages AS INT; SET @pagesize = 5; SET @numpages = 3; SELECT TOP(@numpages*@pagesize) (ROW_NUMBER() OVER(ORDER BY SalesOrderID) - 1) / @pagesize + 1 AS page, SalesOrderID, CustomerID, OrderDate FROM Sales.SalesOrderHeader ORDER BY SalesOrderID;
(row_num - 1) / @pagesize + 1
calculates the target page number. Table 1 shows the output for the previous query.
As I mentioned earlier, the expression you use as the input to the TOP option can even be a self-contained subquery. To see an example of using TOP in this way, first run the code in Listing 1, which creates the table T1 with the numeric column col and populates it with five values. Suppose you need to calculate the median of the values in T1. If the number of rows in T1 is odd, you’re supposed to return the middle value. If the number of rows in T1 is even even, you’re supposed to return the average of the two middle values.
By running the following query, you return all values up to and including the middle value(s):
SELECT TOP(SELECT COUNT(*)/2 + 1 FROM T1) col1 FROM T1 ORDER BY col1;
For an odd number of rows, the expression
returns the ceiling of half the number of rows. For example, if the table has five rows, you’ll get the lowest three back. The last is the median. For an even number of rows, this expression returns half the number of rows plus 1. For example, if the table has six rows, you’ll get the lowest four. The average of the last two is the median.
To extract only the values that need to participate in the median calculation, create a derived table out of the previous query (call it D1), and use an outer query to extract the correct number of values, like this:
SELECT TOP(SELECT 2 - COUNT(*)%2 FROM T1) col1 FROM (SELECT TOP(SELECT COUNT(*)/2 + 1 FROM T1) col1 FROM T1 ORDER BY col1) AS D1 ORDER BY col1 DESC;
2 - COUNT(*)%2
produces 1 for an odd number of rows in the table and 2 for an even number. Because the outer query reverses the order of rows from the derived table D1, you get the highest within D1. This means that you get the middle value in an odd case and the two middle values in an even case.
Finally, create a derived table from the previous query (call it D2), and calculate the average of the values from D2 in the outer query, like this:
SELECT AVG(col1) AS median FROM (SELECT TOP(SELECT 2 - COUNT(*)%2 FROM T1) col1 FROM (SELECT TOP(SELECT COUNT(*)/2 + 1 FROM T1) col1 FROM T1 ORDER BY col1) AS D1 ORDER BY col1 DESC) AS D2;
The query returns 3.00 for the sample data I used to populate T1 because the table has five rows and 30.00 is the middle point. Add another row to the table, and rerun both queries, like this:
INSERT INTO T1 VALUES(700.00); SELECT TOP(SELECT 2 - COUNT(*)%2 FROM T1) col1 FROM (SELECT TOP(SELECT COUNT(*)/2 + 1 FROM T1) col1 FROM T1 ORDER BY col1) AS D1 ORDER BY col1 DESC; SELECT AVG(col1) AS median FROM (SELECT TOP(SELECT 2 - COUNT(*)%2 FROM T1) col1 FROM (SELECT TOP(SELECT COUNT(*)/2 + 1 FROM T1) col1 FROM T1 ORDER BY col1) AS D1 ORDER BY col1 DESC) AS D2;
Now you get 35.00 as the median because the table has an even number of rows and 30.00 and 40.00 are the two middle values; 35.00 is the average of the two.
Using TOP with Modifications
You can also now use TOP in statements that modify data, including INSERT, UPDATE, and DELETE statements. Be aware that, unlike SELECT statements, you can't specify an ORDER BY clause in modification statements. SQL Server 2005 will modify the first n qualifying rows that are physically accessed in a query. This method is useful when you don’t care which qualifying rows are modified. Examples of scenarios in which the order of modifications isn’t important are purging historic data in batches or applying large updates in batches.
To demonstrate how to use TOP with modifications, first run the following code, which creates an empty copy of the schema of AdventureWorks.Sales.SalesOrderDetail in tempdb and names the new table BigTable:
SET NOCOUNT ON; USE tempdb; SELECT SalesOrderID, SalesOrderDetailID + 0 AS SalesOrderDetailID, CarrierTrackingNumber, OrderQty, ProductID, SpecialOfferID, UnitPrice, UnitPriceDiscount, LineTotal, rowguid, ModifiedDate INTO BigTable FROM AdventureWorks.Sales.SalesOrderDetail WHERE 1=2;
The SalesOrderDetail table currently contains 121,317 rows. Suppose you want to load test data into BigTable by performing a UNION ALL between several copies of SalesOrderDetail. However, you want to end up with exactly 500,000 rows. You don’t care about the order. To achieve this, you specify TOP(500000) in the INSERT statement, like this:
INSERT TOP(500000) INTO BigTable SELECT * FROM AdventureWorks.Sales.SalesOrderDetail UNION ALL SELECT * FROM AdventureWorks.Sales.SalesOrderDetail UNION ALL SELECT * FROM AdventureWorks.Sales.SalesOrderDetail UNION ALL SELECT * FROM AdventureWorks.Sales.SalesOrderDetail UNION ALL SELECT * FROM AdventureWorks.Sales.SalesOrderDetail;
If you specified the TOP option in the first SELECT query, doing so would have affected only the first SELECT’s rows and not the result of the UNION ALL activity.
Before I demonstrate using TOP to purge data in batches, let's first create a clustered index on the ModifiedDate column in BigTable:
CREATE CLUSTERED INDEX idx_ModifiedDate ON BigTable(ModifiedDate);
Suppose you need to purge all data where ModifiedDate is earlier than 2003. You can imagine that in a production environment purging historical data can affect millions of rows and take a very long time. If you run a DELETE statement to achieve this task, it will run as a single transaction. Locks might escalate to a table lock, and the space that the transaction consumes in the transaction log can't be freed until the transaction finishes. To avoid lock escalation and to free transaction-log space, you can purge the historical data in batches by using the TOP option in the DELETE statement, like this:
DECLARE @BatchSize AS INT; SET @BatchSize = 5000; WHILE 1 = 1 BEGIN DELETE TOP(@BatchSize) FROM BigTable WHERE ModifiedDate
Each DELETE is automatically considered as a separate transaction. As soon as the DELETE affects less than @BatchSize rows, the last batch has been deleted, and the loop breaks. When the purging code finishes, run the following query and verify that 387,252 rows remain in the table:SELECT COUNT(*) FROM BigTable;
Similarly, you can issue a large UPDATE by using the TOP option to update data in batches.
If the order of modification is important to you, you can use a common table expression (CTE) that filters the relevant rows by using a SELECT query with the TOP option, then delete the data from the CTE. For example, say you want to delete the oldest 5000 rows that have a ModifiedDate earlier than 2003 from BigTable. To achieve this, run the code in Listing 2. Note that to test the code and see that it really works correctly, you should repopulate BigTable using the code I provided earlier. The Batch CTE isolates the oldest @BatchSize rows that have a ModifiedDate earlier than 2003 from BigTable, according to a ModifiedDate sort. The outer DELETE statement deletes all rows isolated by the CTE.
Better than SET ROWCOUNT
Before SQL Server 2005, programmers and DBAs typically used the SET ROWCOUNT session option to limit the number of affected rows. The advantages of using this session option instead of the TOP option were that SET ROWCOUNT allowed a variable as an input and also affected modifications. In SQL Server 2005, the TOP option supports an expression as input and also works with modifications, so you no longer need to use the SET ROWCOUNT option. When generating an execution plan, the optimizer can rely on the TOP option’s input to determine which access method to use. Queries that use TOP can potentially perform better than those that use the SET ROWCOUNT option because with SET ROWCOUNT the optimizer doesn't “sniff” the input.
In the next version of SQL Server after SQL Server 2005, Microsoft will no longer support the use of SET ROWCOUNT in modification queries. Thus, it’s a good practice to start replacing code that uses SET ROWCOUNT with code that uses TOP, so you can start reaping the benefits of Microsoft's enhancements to TOP.