We all want our SQL Server applications to run as fast as possible. And query tuning—making sure SQL Server can process each query your application executes as quickly as possible—is an area in which you can make a significant impact on overall application performance. For the past 14 months, I've devoted this column to the topic of query tuning for optimum performance, paying particular attention to choosing the best indexes for the types of queries your applications are performing. This month, I conclude the series by summarizing some important points to keep in mind when you're designing the best indexes for your queries.
The most important piece of advice I can give you to start with is that you can't design indexes in a vacuum. You need to know the kinds of queries your users will be running. Without that information, you have no way to know which indexes might be most helpful. To determine the kinds of queries your users will run, you need to ask the following questions about each of your tables:
1. Which columns in the table will most likely be used as search arguments (that is, which columns will appear in conditions in WHERE clauses)?
2. Will queries access a large percentage of rows in the table or only a few rows (or one)?
3. Which columns will SQL Server need to return to the client application most often?
4. Will queries use GROUP BY or TOP to summarize the data?
5. Will SQL Server need to return the results to the client application in a particular order?
6. Will the table be used for SELECT queries, data-modification queries, or both?
7. How many rows will be updated at a time?
8. Which columns will be updated frequently?
9. Which columns will be used in joining tables?
Determining the answer to question 1 will help you find the answer to question 2. When you know which columns will be used as search arguments, you can execute a simple GROUP BY query with a COUNT(*) aggregate on any column that's used as a search argument to determine the distribution of each possible data value. For example, if a query containing a WHERE clause that references the customerID column will frequently access your orders table, you could run the following query to check the distribution of customerID values:
SELECT customerID,
       orders = COUNT(*),
       percent_of_total = COUNT(*) * 100.0 / (SELECT COUNT(*) FROM orders)
FROM Northwind.dbo.orders
GROUP BY customerID
ORDER BY 3 DESC
The above query returns 89 rows—one for each customerID that appears in the orders table. The second column contains the number of times that customer appears in the table (the number of orders placed), and the third column expresses that count as a percentage of the total number of orders. Because the ORDER BY 3 DESC clause sorts on the third column in descending order, the query lists first the customers who account for the largest percentage of the total orders.
Figure 1 shows the first row this query returns. This result means that the most active customer has 31 orders, or 3.7 percent of the total number of orders. No customer has an excessive number of orders (e.g., 500), so this experiment tells you that no query searching for a single customerID value in the WHERE clause will return a large percentage of rows. And as you know, you can consider nonclustered indexes for columns used in search arguments in queries that access only a small percentage of the table.
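When the distribution check confirms that a search-argument column is selective, a nonclustered index on that column becomes a reasonable candidate. As a sketch (the index name is my own choice, not part of the Northwind sample):

```sql
-- Nonclustered index to support WHERE customerID = '...' lookups
-- that return only a small percentage of the table's rows
CREATE NONCLUSTERED INDEX IX_orders_customerID
ON Northwind.dbo.orders (customerID)
```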
Big on Indexes
Knowing whether a query will return a large percentage of rows helps you determine how and when a nonclustered index will be useful. I discussed one special case of nonclustered indexes—the creation of covering indexes—in my August 2001 ("Time for a Tune-Up") and September 2001 ("The Big Cover-Up") query-tuning columns. Knowing the answer to question 3 is important in determining whether a covering index is possible. Unlike noncovering indexes, which need to include only the columns in the query's WHERE, ORDER BY, or GROUP BY clauses, a covering index must include all the columns in a query, including columns to be returned in the SELECT list.
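For example, suppose a frequent query returns only the order date and freight charge for a single customer. A composite nonclustered index on all three columns covers that query, so SQL Server can answer it entirely from the index leaf level without touching the data pages. This is a sketch assuming that hypothetical query shape:

```sql
-- Covering index: the WHERE column leads, followed by
-- every column the SELECT list needs
CREATE NONCLUSTERED INDEX IX_orders_cust_date_freight
ON Northwind.dbo.orders (customerID, orderDate, freight)
```

```sql
-- A query this index covers:
SELECT orderDate, freight
FROM Northwind.dbo.orders
WHERE customerID = 'QUICK'
```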
If you have crucial queries that return a large percentage of rows in a table (perhaps more than 5 or 10 percent) and include too many columns to be good candidates for covering indexes, consider putting a clustered index on the columns in the query's WHERE clause. One type of query that might fall into this category is a query based on a range of data values. For a query that requests all information about all customers whose ZIP codes are between 80000 and 89000, you could put a clustered index on the ZIP code column. You must take great care in choosing the clustered index for a table, though, because you only get one such index. (You can have up to 249 nonclustered indexes on a table, depending on the number of non-index statistics you have.)
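Because a clustered index keeps the data rows themselves in key order, a range query can read all the qualifying rows sequentially. Assuming a hypothetical customers table with a zip column, the ZIP code example might look like this:

```sql
-- Clustered index to support range queries on ZIP code
CREATE CLUSTERED INDEX CIX_customers_zip
ON customers (zip)
```

```sql
-- Range query that benefits: qualifying rows are stored contiguously
SELECT *
FROM customers
WHERE zip BETWEEN '80000' AND '89000'
```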
Clustered indexes are useful for queries that return a large number of rows or a range of values, but they're also useful for the queries you identified in questions 4 and 5, which perform grouping or sorting operations. If rows need to be returned to the client in a particular order or grouped and summarized based on the values in a particular column, a clustered index on the ORDER BY or GROUP BY column could boost performance. If the GROUP BY query returns data based on only one or two columns, you could add a nonclustered index on those columns, but if many columns in the table need to be summarized, a clustered index might be a better choice.
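For instance, a clustered index on the grouping column delivers the rows already sorted, so SQL Server can aggregate them without a separate sort step. A sketch (again, the index name is my own):

```sql
-- Clustered index on the GROUP BY column; rows arrive presorted
CREATE CLUSTERED INDEX CIX_orders_customerID
ON Northwind.dbo.orders (customerID)
```

```sql
SELECT customerID, COUNT(*), SUM(freight)
FROM Northwind.dbo.orders
GROUP BY customerID
```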
If the answers to questions 1 through 5 give you multiple possible choices for a clustered index, you can look at some other factors to help you decide how to index. For example, instead of building a clustered index on a column that determines the sort order, you can return the data unsorted and have the application perform the sort.
Besides knowing the types of queries that your users will run, you need to know the relative priority and frequency of the queries. If one query is run every time a customer calls for information about an account, that query might have higher priority than a query that generates a report overnight once a month. You might want to let the frequent or high-priority queries determine the indexing on a table.
Is the Data Changing?
If the answer to question 6 is that a table is mainly used for updating, you can look at questions 7 and 8 to refine the types of update operations you'll be performing. For tables used primarily for retrieval, questions 7 and 8 won't be much help. But if the answer to question 7 tells you that most of the rows will be updated all at once, you have a bulk operation. For bulk operations, you have to decide whether to keep any indexes on the table while the batch operation runs. (See my August 2002 column, "Tame Bulky Inserts," InstantDoc ID 25521, for index considerations for bulk-modification operations.)
If the answer to question 7 tells you that only a small percentage of rows in a table will change, look at the search arguments in the relevant queries. Consider indexing the queried columns in the same way you would index search-argument columns for a SELECT query. Also, keep in mind that all data-modification operations can require some index maintenance. You can minimize that maintenance by finding the answer to question 8, determining which columns will be updated, and not indexing those columns. You can also minimize index maintenance by making sure that the updated columns aren't variable-length columns, which can increase in size during modification and cause the row to move to a new page. (You can find the data type of existing table data by using the sp_help command.)
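To check whether a column you plan to update is variable length, you can examine the Type and Length columns that sp_help reports. Run this in the Northwind database:

```sql
-- sp_help lists each column's data type and length;
-- varchar, nvarchar, and varbinary columns can grow on UPDATE
EXEC sp_help 'orders'
```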
My general recommendation for tables that are mainly used for data modifications—such as an orders table, in which inserts are occurring continuously—is to have as few indexes as possible. You want to have enough indexes to help you find the data rows to modify, but you want to limit the number of indexes because data modification causes index maintenance.
For tables that are mostly used for data retrieval, the number of nonclustered indexes you can have is limited only by the available disk space. Because indexes require maintenance only when the underlying data changes, you can consider creating any index that might help a retrieval query.
If a table is used for both retrievals and data-modification operations, you need to find a balance between having enough indexes to find the data you need and having so many indexes that maintaining them becomes costly. However, rather than try to design a set of indexes that are good for both retrieval and modification operations, you'll probably have better overall performance if you choose either retrieval or modification and design your indexes as if that's the only type of operation you'll be doing.
If a table is used for both modification and retrieval, you need to be concerned with one other factor: data fragmentation. In a table that's used only for retrieval, fragmentation won't occur, and in a table that's used only for modification of a few rows at a time, fragmentation generally won't matter. But if you perform both kinds of operations, the modifications will create fragmentation, which then slows down any retrievals that return multiple rows in a range or in sorted sequence. In such a case, you'll need to rebuild or defragment your indexes regularly. (For more information about fragmentation and defragmenting indexes, see my article "Keep SQL Server Up and Running," December 2000, InstantDoc ID 15742.)
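In SQL Server 2000, you can measure fragmentation with DBCC SHOWCONTIG and remove it with DBCC INDEXDEFRAG (or rebuild the index with DBCC DBREINDEX). A sketch, using a hypothetical index name:

```sql
-- Report fragmentation statistics (scan density,
-- logical fragmentation) for the table's indexes
DBCC SHOWCONTIG ('orders')
```

```sql
-- Defragment one index without holding a long-term table lock
DBCC INDEXDEFRAG (Northwind, orders, IX_orders_customerID)
```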
The answer to question 9, finding which columns are used for JOIN operations, will give you further possibilities for indexing because indexes on your join columns can often improve performance. However, indexing for JOIN operations is a complex topic, which I won't review here. You can take a look at my Inside SQL Server columns in March 2002 through May 2002, in which I discussed JOIN operations and the choice of indexes to support joins.
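For example, when the orders and order details tables are joined on their orderID columns, a nonclustered index on the join column of the larger table gives the optimizer an efficient way to locate matching rows. A sketch against the Northwind sample tables:

```sql
-- Index the join column on the many side of the join
CREATE NONCLUSTERED INDEX IX_orderdetails_orderID
ON Northwind.dbo.[order details] (orderID)
```

```sql
SELECT o.customerID, od.productID, od.quantity
FROM Northwind.dbo.orders AS o
JOIN Northwind.dbo.[order details] AS od
  ON o.orderID = od.orderID
```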
Heaps Need Not Apply
Reviewing your tables and queries to determine the answers to the nine questions I pose in this column will help you become intimately familiar with the kinds of queries your users execute—and will help you choose the right indexes for those operations. In general, I recommend that you give every table a clustered index unless you have a compelling reason not to. Remember that SQL Server uses the clustered index key as the bookmark in every nonclustered index's leaf-level rows, and thus having a clustered index is like having an extra column in the nonclustered index. For retrieval operations, you can easily end up with many more covered queries than you planned for. For INSERT and UPDATE operations, having a clustered index avoids the overhead involved with forwarding pointers, as I discussed in my July 2002 column ("Indexing Updates," InstantDoc ID 25198). Try to keep your clustering key as short as possible and place it on a static column or columns.
By now, you're aware that query tuning is a huge topic. I hope I've given you enough information about the way SQL Server organizes its data and chooses the indexes to use that you can start designing your own indexes. Alternatively, consider using the Index Tuning Wizard (ITW) to get initial suggestions about the best indexes, but don't take those recommendations as the final word. Use your own understanding of your applications and your data to fine-tune those suggestions after testing them thoroughly. The ITW doesn't consider multiuser concurrency and possible blocking problems, nor does it consider indexes on temporary tables, to name just a couple of its limitations. The more representative a sample of actual user queries you can supply as input, the better the ITW's recommendations will be. But again, don't take what the Tuning Wizard tells you as final because testing and tuning is a never-ending process.