Indexing for Sort Performance

The right index can dramatically improve SQL Server's sort performance. Defining a clustered index on a sort column, for example, forces the database to store data in sorted order, letting you retrieve data without additional sorting. Note that SQL Server 7.0 and earlier releases let you create indexes only in ascending order, so if your query requires data in descending order, you'll probably have to perform additional sorting and use internal worktables to get data in the right order. However, SQL Server 2000 lets you create indexes in ascending or descending order.
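As a minimal sketch of the SQL Server 2000 option, the following statements assume an orders table with customerid and orderdate columns, like the example table used later in this article. The index name i_cust_date_desc is hypothetical. Specifying a direction for each key column lets the index order match an ORDER BY clause that mixes ascending and descending columns:

```sql
-- Sketch for SQL Server 2000 or later; the index name is hypothetical.
CREATE INDEX i_cust_date_desc ON orders (customerid ASC, orderdate DESC)

-- The ORDER BY clause asks for the same column order and directions
-- as the index keys.
SELECT customerid, orderdate
   FROM orders
   ORDER BY customerid ASC, orderdate DESC
```

Whether the optimizer actually uses the index still depends on its cost estimates, so check the execution plan before relying on it.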

SQL Server 7.0 performs a sort operation when you use the ORDER BY clause. SQL Server's query optimizer also might use a sort operation to process a query that uses GROUP BY, DISTINCT, or UNION. In contrast, you can use the FAST index hint to avoid sorting the data. This hint tells the SQL Server query optimizer to use a nonclustered index that matches the ORDER BY clause, thereby eliminating the need for a sort operation. Let's look at how SQL Server processes GROUP BY, DISTINCT, and UNION clauses to sort data, then explore how different indexing techniques can improve the performance of queries that need ordered data.

Sorting with GROUP BY

You use the GROUP BY clause to organize data into groups; all rows in a group have the same value for the column or columns you specify in the GROUP BY clause. For example, the query

SELECT customerid
  FROM orders
  GROUP BY customerid

sorts data into groups based on the customerid column. SQL Server then returns one row for each distinct column value. You can group a table by any combination of its columns. In pre-SQL Server 7.0 releases, SQL Server always sorts data before it forms the groups. But SQL Server 7.0 can use a technique called hashing instead of sorting to group columns.

Hashing, an alternative to indexes, provides fast access to particular table rows. During hashing, SQL Server applies a hash function, h(), to the value of the column on which the database builds the hash. The function maps each column value to a number that identifies one entry in a hash table, which has a fixed number of buckets, and it aims to spread the values uniformly across those buckets. A bucket is an array of pointers that serves as an index to a page. Each bucket contains index entries made up of two values: the value of the hashed column and a pointer to the corresponding row. SQL Server creates one index entry for each table row. (For more information about hash joins, see Itzik Ben-Gan and Kalen Delaney, "Advanced JOIN Techniques," December 1999.)

The SQL Server query optimizer decides whether to use sorting or hashing to form groups based on which yields the fastest performance. If you've defined a clustered index for the column that you're specifying in the GROUP BY clause, the query optimizer often uses sorting because the clustered index sorts all table rows physically on the disk. The optimizer also might use sorting when you've defined a nonclustered index for the GROUP BY column if this method provides the fastest processing route. However, if you haven't defined an index for the column in the GROUP BY clause, the query optimizer might use hashing instead. SQL Server also provides two query-processing hints—HASH GROUP and ORDER GROUP—that you can use to control the GROUP BY operation. The HASH GROUP hint forces the optimizer to use hashing to form the groups; the ORDER GROUP hint forces the optimizer to use sorting.
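As a sketch of how you specify these hints, the following queries add an OPTION clause to a GROUP BY query against the orders table. The COUNT(*) column and its alias are added only for illustration:

```sql
-- Force the optimizer to use hashing to form the groups
SELECT customerid, COUNT(*) AS order_count
   FROM orders
   GROUP BY customerid
   OPTION (HASH GROUP)

-- Force the optimizer to use sorting to form the groups
SELECT customerid, COUNT(*) AS order_count
   FROM orders
   GROUP BY customerid
   OPTION (ORDER GROUP)
```

Comparing the two execution plans shows which grouping technique each hint produces. In most cases, you'll want to leave the choice to the optimizer and use the hints only for testing.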

When you use the GROUP BY clause with all pre-SQL Server 7.0 releases, you receive query results in the order of the GROUP BY columns. But SQL Server 7.0's output order depends on the query optimizer's chosen (or forced) grouping technique. If the optimizer uses sorting, SQL Server displays ordered output. If the optimizer uses hashing, the database displays the table's rows as they appear in the hash table, which might or might not be the order in which you expected to see results. By definition, without an ORDER BY clause, you have no inherent order to data in a relational database, so to guarantee ordered output, use the ORDER BY clause.
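For example, a query such as the following sketch (the COUNT(*) column is illustrative) returns the groups in customerid order regardless of whether the optimizer forms them by sorting or by hashing:

```sql
-- ORDER BY guarantees the output order; GROUP BY alone does not
SELECT customerid, COUNT(*) AS order_count
   FROM orders
   GROUP BY customerid
   ORDER BY customerid
```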

DISTINCT and UNION

You use the DISTINCT clause to eliminate duplicate values in a column. As with the GROUP BY clause, pre-SQL Server 7.0 releases always sort the data to eliminate duplicates. But the SQL Server 7.0 optimizer might use sorting or hashing, depending on cost, to eliminate duplicates.

Let's look at some examples to see how the query optimizer processes the DISTINCT clause under different conditions. First, you need to run the setup code in Listing 1 to create the example orders table that we use throughout the rest of this article. Our orders table is similar to Northwind's orders table except that we increased our table to 100,000 rows to simulate a production environment for performance and optimization tests.

You can now run the following query, which sorts table rows based on the SELECT statement's DISTINCT clause and a nonclustered index on the orderid column (make sure you turn on the Show Execution Plan option on Query Analyzer's Query menu so that you can see the optimizer's plan):

USE Northwind
SELECT DISTINCT orderid
   FROM orders
   WHERE customerid = 'WHITC'

Screen 1 shows that the optimizer uses sorting to process this query. If you don't use the index on orderid, the same query uses hashing instead of sorting to get its results. For example, if you run the query without the WHERE clause, the optimizer uses an execution plan like the one in Screen 2.
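If you prefer a text plan to the graphical one, a sketch such as the following uses SET SHOWPLAN_TEXT with the version of the query that has no WHERE clause. This is an alternative to the Show Execution Plan option, not the method used to produce the screens. GO is the batch separator that Query Analyzer recognizes, and SET SHOWPLAN_TEXT must be the only statement in its batch:

```sql
USE Northwind
GO
-- Return the estimated plan as text instead of executing the query
SET SHOWPLAN_TEXT ON
GO
SELECT DISTINCT orderid
   FROM orders
GO
-- Switch back to normal execution
SET SHOWPLAN_TEXT OFF
GO
```

While SHOWPLAN_TEXT is on, SQL Server returns the plan rows but doesn't run the query, so you see operators rather than result rows.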

The UNION operator combines the result sets of two queries into a single result set that contains all rows appearing in either or both inputs. UNION processing differs depending on whether you use the ALL option. If you specify ALL, SQL Server returns all resulting rows, including duplicates. If you don't specify ALL, the optimizer processes UNION the same way it processes the DISTINCT clause, removing all duplicate rows.
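The following sketch contrasts the two forms against the example orders table. The date cutoff is an arbitrary value chosen for illustration, and the sketch assumes that orderdate is a datetime column:

```sql
-- UNION: duplicates are removed, so each customerid appears once
SELECT customerid FROM orders WHERE orderdate < '19970101'
UNION
SELECT customerid FROM orders WHERE orderdate >= '19970101'

-- UNION ALL: every row from both inputs is returned, duplicates included
SELECT customerid FROM orders WHERE orderdate < '19970101'
UNION ALL
SELECT customerid FROM orders WHERE orderdate >= '19970101'
```

Because UNION ALL skips duplicate removal, its plan needs no sort or hash step for that purpose, which makes it the cheaper choice when duplicates are acceptable or impossible.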

Using Indexes to Optimize Sorts

Now, let's look at how different indexes can improve the performance of queries that need ordered data. The following query retrieves all rows in the orders table that have the customerid value WHITC and uses the ORDER BY clause to sort the results by the orderdate column:

USE Northwind
SELECT orderid, customerid, orderdate
   FROM orders
   WHERE customerid = 'WHITC'
   ORDER BY orderdate

To create a nonclustered index on the customerid column, use the statement

CREATE INDEX i_customerid ON orders (customerid)

Screen 3 shows the query's execution plan after you create the nonclustered index on customerid. The query execution plan shows that the query optimizer first accesses the existing index, then sorts the resulting set of rows. Can you improve the query's performance by creating a second nonclustered index on the sorted column? Or is it better to define a composite index—an index you build on more than one column—instead of two separate indexes? To find the answer, let's look at two more query examples.

The following statement creates a second, separate, nonclustered index on the sorted orderdate column:

CREATE INDEX i_orderdate ON orders (orderdate)

Screen 4 shows the query's execution plan after you add the second, separate index; you can see no difference between this execution plan and that for the query with just one index in Screen 3.

Creating a composite nonclustered index for the query, however, yields significant performance improvement. To create a composite index on customerid and orderdate, issue the following statement:

CREATE INDEX i_custom_order ON orders (customerid, orderdate)

Screen 5 shows the execution plan for the query after you create the composite index. This execution plan shows two advantages over the previous plans. First, the query performs only two instead of three operations to get its results (the SELECT statement doesn't count as an operation). Second, the optimizer doesn't use sorting, which is usually a costly operation. The optimizer doesn't need to sort because the composite index i_custom_order provides row access, satisfying the query condition (customerid = 'WHITC'), and orders the result set according to the orderdate column.
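To compare the plans side by side without dropping any index, you can force a particular index with a table hint. The following sketch assumes that both i_customerid and i_custom_order exist on the orders table. It is a testing aid, not the way the screens in this article were produced:

```sql
-- Force the single-column index; the plan is expected to include a sort
SELECT orderid, customerid, orderdate
   FROM orders WITH (INDEX(i_customerid))
   WHERE customerid = 'WHITC'
   ORDER BY orderdate

-- Force the composite index; the index order already matches ORDER BY
SELECT orderid, customerid, orderdate
   FROM orders WITH (INDEX(i_custom_order))
   WHERE customerid = 'WHITC'
   ORDER BY orderdate
```

Index hints override the optimizer's choice, so remove them once you finish testing.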

Two Better Options

By eliminating the need for a sort, a composite index reduces the number of operations the query optimizer performs. But two other indexing techniques—using a covering index and creating a clustered index for the column in the ORDER BY clause—can improve query performance even more.

A covering index, which is a nonclustered index on all the columns required to satisfy a query, improves query performance because SQL Server can answer the query by accessing only the index's B-tree structure, without reading the table's data pages. With a covering index, index entries are smaller than entries for the corresponding rows, so you need significantly fewer I/O operations to find the result set. The following statement shows how to create a covering index on the customerid, orderdate, and orderid columns:

CREATE INDEX i_custom_orderdate_id ON orders
   (customerid, orderdate, orderid)

Screen 6 shows the corresponding execution plan for the query after you create the covering index. As you might expect, the covering index reduces the query's execution plan to only one operation, omitting both the sort operation and the bookmark lookup for each row. (For a detailed discussion of covering indexes, see "New Features for Query Processing," July 1999.)
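An index covers a query only while the query references nothing outside the index keys. As a sketch of the contrast, the following query adds the shippeddate column, which we assume exists in the example table because Northwind's orders table has one. The index i_custom_orderdate_id doesn't contain that column, so the optimizer would be expected to do extra work to fetch it from the table:

```sql
-- shippeddate is an assumed column and isn't part of i_custom_orderdate_id,
-- so the index no longer covers the query
SELECT orderid, customerid, orderdate, shippeddate
   FROM orders
   WHERE customerid = 'WHITC'
   ORDER BY orderdate
```

Check the execution plan to see how the optimizer retrieves the extra column in your environment.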

Creating a composite clustered index on customerid and orderdate sorts the orders table's rows first by customerid values, then by orderdate values, and the clustered index keeps the rows in that sorted order. With this index, the optimizer must find only the first row that satisfies the query condition (customerid = 'WHITC'); it can then read and display the corresponding rows without having to sort them. To create a composite clustered index on customerid and orderdate, issue the following statement:

CREATE CLUSTERED INDEX c_custom_order ON orders
   (customerid, orderdate)

Screen 7 shows the query's execution plan after you create the composite clustered index. The optimizer again has to perform just one operation instead of two.

Comparing the Results

Table 1 shows the impact on sort performance for each indexing method we've discussed. As you can see from the average execution times, composite clustered indexes and covering indexes made the best showing in our tests. And by defining the appropriate composite nonclustered index, you can also omit the sort operation and boost query execution. If you have queries that need ordered data, invest some time in investigating which indexes work best in your environment. You'll reap the rewards in significantly faster query performance.
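One way to run such a comparison in your own environment is to turn on SQL Server's I/O and timing statistics before you run the query under each index configuration. The following is a sketch of that approach, not necessarily the method behind the numbers in Table 1:

```sql
USE Northwind
GO
-- Report logical and physical reads, plus CPU and elapsed time
SET STATISTICS IO ON
SET STATISTICS TIME ON
GO
SELECT orderid, customerid, orderdate
   FROM orders
   WHERE customerid = 'WHITC'
   ORDER BY orderdate
GO
SET STATISTICS IO OFF
SET STATISTICS TIME OFF
GO
```

Run the query several times for each index and compare averages, because the first run often includes physical reads that later runs satisfy from cache.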
