By using simple tuning techniques, you can make a big difference in the performance of certain queries. For example, the technique I demonstrated in "Are You in Tune?" July 2001, involved expressing in the form of a SARG any conditions in a query's WHERE clause that referenced indexed columns. A SARG is a search condition, expressed in a particular format; for details, see last month's column. Let me demonstrate how you can verify what I told you about SARGs and indexed columns. Then, let's explore ways that you can tell whether SQL Server is using an index and whether using the index is useful.
Northwind's Strange Index Design
To begin, let's look at some examples from the Northwind database, which automatically installs with SQL Server 7.0 and later. (If you've experimented with the Northwind database, you might want to rebuild it so that you're working with the same data and table structures that I am.) You might have noticed that Northwind has some problems with its index design. If you use the sp_helpindex procedure to examine the existing indexes on the existing tables, you'll see that in some cases, SQL Server has created multiple indexes of the same type on the same columns. Nothing in SQL Server prevents this duplication; the only restrictions are that you can have only one clustered index and that a table can't have two indexes with the same name. But you can create two indexes with different names on the same column. For example, look at the indexes on the orders table by executing the following statement:
EXEC sp_helpindex orders
In the output, note that both the EmployeeID column and the CustomerID column have two nonunique nonclustered indexes. Although SQL Server allows this duplication, you get no benefit from multiple indexes on the same set of columns. In fact, if you're modifying data in the table, you actually incur a penalty because SQL Server needs to maintain all the indexes to reflect the data modifications. I have no idea why the Northwind database has this strange set of indexes on its tables, but I want you to be aware of it.
Now let's look at the following query:
-- Query 1 USE Northwind SET STATISTICS IO ON SET STATISTICS PROFILE ON SELECT * FROM orders WHERE customerID = 'CENTC'
This query uses a SARG on the CustomerID column, which has a nonclustered index. You can use either of two SET options to find information about the use and benefit of an index. If you use SET SHOWPLAN_TEXT ON before you run the query, SQL Server reports the execution plan it used, but it doesn't actually execute the query. To execute the query and see the results, you have to use SET SHOWPLAN_TEXT OFF. You can use SET STATISTICS_IO ON to see the I/O cost of running the query, but you won't get any information about statistics if SHOWPLAN_TEXT is on, because the query won't run. Therefore, SHOWPLAN_TEXT and STATISTICS IO are mutually exclusive.
Now, turn on another option called STATISTICS PROFILE. This option lets the query execute and shows the execution plan—and you can use it in conjunction with STATISTICS IO.
You might be aware that SQL Server has a utility for looking at execution plans in a graphical format. (Using Query Analyzer, choose the Display Estimated Execution Plan option from the Query menu.) I prefer using the graphical plan for complex queries. But for these simple queries—for which I want to copy a portion of the plan into this article, for example—the text-based plan is useful.
The first detail to note in the STATISTICS IO results from Query 1 is that SQL Server performed four logical reads, meaning that it had to access four pages of data to find the information it needed to return one row of output. The value returned for physical reads indicates how many of those pages SQL Server had to read from disk, but I won't deal with that value right now. Other factors besides the use or non-use of indexes can impact the physical reads. In the STATISTICS PROFILE output, I'm interested only in whether and how SQL Server used an index. An index seek means that SQL Server used the index in a vertical manner, starting from the root and traversing through the levels of the index to find the data it needed. An index scan means that SQL Server used the index in a horizontal manner, scanning all or part of the leaf level of the index.
For Query 1, the relevant output from the plan is
This line means that SQL Server used a seek operation on the nonclustered index CustomerID to find the relevant row.
Now, let's compare Query 1 with a similar query that doesn't contain a SARG. To run this query, make sure that you stay in the same Query Analyzer window or that the same SET options are on:
-- Query 2 SELECT * FROM orders WHERE substring(customerID, 1,3) = 'CEN'
This query returns the exact same row of output as Query 1 but at a higher cost. This query needed 21 logical reads, and the plan shows the following line:
|--Clustered Index Scan(OBJECT:(\[Northwind\].\[dbo\].\[Orders\].\[PK_Orders\])
Because the leaf level of a clustered index is the data itself, scanning a clustered index is the same as scanning the table. SQL Server is basically performing a table scan here, reading every page of the table, because it isn't using an index to help find the necessary row. Because the query has no SARG involving the CustomerID column, the optimizer doesn't even consider using an index.
In "Are You in Tune?" I demonstrated that in some cases, you could easily convert certain non-SARG expressions into SARGs; this is one such case. Query 2 is looking for customer ID values that start with the letters CEN. Using a SARG, you can rewrite the query as
-- Query 3 SELECT * FROM orders WHERE customerID LIKE 'CEN%'
The logical reads and the execution plan are the same for this query as for Query 1.
Similar but Different
All three of the previous queries returned the same row of results. The following query, involving an equality comparison, looks similar to Query 1. However, if you run it with the two SET options on, you'll notice significantly different behavior:
-- Query 4 USE Northwind SET STATISTICS IO ON SET STATISTICS PROFILE ON SELECT * FROM orders WHERE customerID = 'ERNSH'
For this query, 30 rows in the orders table satisfy the condition in the WHERE clause, and the number of logical reads is 21, the same as in Query 2. The optimizer decided to scan the clustered index in this case. If you think that the optimizer should have used the nonclustered index, you can try overriding the optimizer by using an optimizer hint. The word "hint" is misleading when applied to SQL Server queries. Usually, a hint is a gentle suggestion of something to consider; however, a SQL Server query hint is actually a directive. When you use a hint in a query, the optimizer has no choice but to comply. I recommend using hints for testing purposes only. The main reason for using them is just what I'm doing here—verifying that SQL Server's optimizer did indeed make the right choice.
Here's Query 4 rewritten, this time using a hint to force SQL Server to use the index that it used in Query 1:
-- Query 5 SELECT * FROM orders (index = CustomerID) WHERE customerID = 'ERNSH'
The execution plan shows that SQL Server obeyed the hint and used the nonclustered index on CustomerID, but the logical reads value of 72 shows that this solution wasn't the best one. A nonclustered index has only a pointer to a data row in its leaf level; for each data value that meets the condition in the WHERE clause, SQL Server must follow that pointer to the table itself. The execution plan output refers to this operation as the bookmark lookup operation (the pointers in the leaf level of the nonclustered index are the bookmarks).
The optimizer's job is to decide whether following all the bookmarks for multiple rows or just scanning every page in the table will be more efficient. In this example, following all the bookmarks from the nonclustered index to the data cost 72 page scans, whereas scanning the whole table cost only 21. Apparently, the optimizer made the right decision. In general, using a nonclustered index is useful only if the query will return a very small percentage of rows. I know of no absolute rule about what percentage of rows causes the optimizer to choose a table or clustered index scan instead of seeking through a nonclustered index. Query 4 returns 30 rows, which is about 3 percent of the 830 rows in the table. If I run a similar query, looking for the name 'WELLI' instead of 'ERNSH', the nine rows returned are about 1 percent of the table and the optimizer chooses the nonclustered index seek. I estimate that the cutoff is at about 1 percent of the rows, but it depends on how many rows are on each page and the size of the index key compared with the size of the full rows. You can get a better feel for this cutoff point by running some tests to determine how often, and for what kinds of values, the optimizer chooses a nonclustered index.
When Is a Nonclustered Index Like a Clustered Index?
In most cases, you can easily see where a clustered index would be useful, because a clustered index controls the order of storage of the data itself. Because SQL Server stores the data in the order of the clustered index's key columns, clustered indexes can be a big help if you frequently need to sort the data. If the data is already in the right order because of a clustered index, you don't need any additional sorting during query execution. And if you use a clustered index seek to find the data, SQL Server doesn't need to perform any additional bookmark lookup because when SQL Server finds the index keys it needs, it's found the entire row with all the data columns.
Clustered indexes can be useful when you're looking for a range of data—for example, you want to find all the customers whose postal code is between 80000 and 89999. If the postal code column has a clustered index, all the appropriate rows are stored close together in the table. The clustered index can be useful even if you're interested in a large percentage of rows in the table.
One of the biggest problems in choosing the best indexes is that you can have only one clustered index on a table. You can have up to 249 nonclustered indexes, but in general, those indexes are useful only if you're going to return a tiny percentage of the rows in the table.
However, SQL Server offers one additional indexing technique that makes a nonclustered index behave like a clustered index. I'm interested in two specific benefits of clustered indexes. First, after SQL Server finds the correct index key, it doesn't have to follow any pointers for additional lookups. Second, SQL Server contiguously stores all the data that meets a certain criteria. You can get both of these benefits with nonclustered indexes if you make the nonclustered index a covering index. With a covering index, all the data necessary to satisfy the query is part of the index keys and resides in the index's leaf level. A covering index should include not only the columns you reference in the WHERE clause but all columns you reference in the SELECT list, as well as any columns in the GROUP BY or ORDER BY clauses.
For example, say you want to find all the CustomerID values that start with the letter C, so you use the following query:
-- Query 6 USE Northwind SET STATISTICS IO ON SET STATISTICS PROFILE ON SELECT DISTINCT customerID FROM orders WHERE customerID LIKE 'C%'
All you're interested in is the CustomerID data, and every CustomerID value is stored in the CustomerID index's leaf level. The CustomerID index is thus a covering index for the query. SQL Server can perform an index seek operation to find the first value that starts with C, then proceed forward though the index leaf level because all the other C values are contiguous. So, SQL Server doesn't need to access the data at all.
Even though Query 6 returns more result rows, the logical reads value indicates that this query takes even fewer reads than Query 1 because SQL Server doesn't have to access the data pages. Also, because all the CustomerID values are in order in the nonclustered index's leaf level, SQL Server can perform the DISTINCT operation quite efficiently. Every time SQL Server encounters a new value while scanning the CustomerID values that start with C, it will be the first occurrence of that value. So SQL Server can return the distinct values by returning only the new values it encounters during its scan of the leaf level.
Query 6 is an example of a covered query—one that is completely satisfied by a covering index. Queries involving simple aggregates also can frequently use covering indexes. In Query 7 below, I want to know how many rows have a CustomerID value of 'ERNSH'. Because every CustomerID value is in the index's leaf level, you can count the relevant rows right in the index and never access the data itself, the logical page accesses value is very small, and the query runs very quickly:
-- Query 7 SELECT COUNT(*) FROM orders WHERE customerID = 'ERNSH'
If you've already built a clustered index on the table, covering indexes are a wonderful feature that can help you speed up a query that returns many rows. Next month, I'll look at how SQL Server uses statistics to determine which index is the best to use when executing a query.