Statistically Speaking

Downloads
22075.zip

One of the most important aspects of query tuning is making sure that appropriate indexes are available. In "Are You in Tune?" July 2001, I defined the physical differences between the two kinds of indexes: clustered and nonclustered. In "Time for a Tune-Up," August 2001, and "The Big Cover-Up," September 2001, I showed how you can use indexes to make your queries perform more efficiently. To fully understand indexes and query tuning, you need to consider one more important aspect of their relationship: How does SQL Server decide which index, if any, to use for optimum query performance?

For example, let's look again at Query 1, which I've used in previous articles:

-- Query 1:
USE Northwind
SELECT * FROM orders
WHERE customerID = 'CENTC'

Because only a few rows satisfy this query's search argument, the optimizer chooses a nonclustered index for a seek on the customerID column. If you change the customerID value you're looking for to ERNSH, the optimizer chooses to scan the whole table because many more rows have that value. The query plan says that SQL Server is performing a clustered index scan, but because the leaf level of a clustered index is the data itself, scanning the clustered index is equivalent to scanning the whole table.

Without running the queries, how did the optimizer know that only a few rows would satisfy the request for customerID = 'CENTC' and many rows would satisfy the request for customerID = 'ERNSH'? The optimizer uses index statistics to get the information it needs to determine whether an index is useful. You can think of statistics as a histogram consisting of samplings of values for the index key (or the first column of the key, for a composite index) based on the current data. SQL Server stores these histograms in the sysindexes table's statblob field, which is an image data type. I discussed the organization and use of statistics for SQL Server 7.0 in "SQL Server Statistics: A Useful Query Optimizer Tool," August 1999. Microsoft changed the internal organization of statistics for SQL Server 2000, but the principles are the same, so I won't go into detail in this article. Just be aware that the index statistics tell the optimizer approximately how many times a particular value in a search argument appears in the data. In general, if the number of occurrences is very small, a nonclustered index might be useful. If the optimizer estimates that the number of occurrences of a value is so large that seeking through the nonclustered index and following the pointers for every qualifying data row will be more expensive, the optimizer chooses to have SQL Server scan the whole table.

By examining the query plan, you can see the optimizer's estimate of the number of qualifying rows. When you use the graphical plan in Query Analyzer, you can put your mouse pointer over the icon for the index SQL Server will use, and a yellow pop-up information box shows an estimated row count value. If you're using SET SHOWPLAN_ALL ON to look at the query plan, EstimatedRows is one of the result columns.

In the query plan for Query 1 with either 'CENTC' or 'ERNSH' as the search value, the optimizer's estimate is exactly correct. The optimizer's estimate is usually this accurate only when the table is very small or the number of distinct values is very small. The Orders table contains only 89 distinct customerID values, and because statistics contain detailed information about up to 200 specific data values, the statistics can exactly represent every distinct value. For indexes on larger tables with thousands or millions of rows or distinct values, the estimates aren't always exact, but they're usually close.

As of SQL Server 7.0, you can also create statistics on columns even without creating an index on those columns. In fact, by default, SQL Server 2000 and 7.0 build statistics on any column you use in a search argument, if that column doesn't already have an index. These automatically created statistics have names that start with _WA_sys. In SQL Server 7.0, using the system stored procedure sp_helpindex to ask for a report on all the indexes returns the names of these statistics along with the regular indexes. You can't drop these "indexes" by using the DROP INDEX command; instead, you have to use the DROP STATISTICS command. However, these statistics can be quite useful and they take up very little space, so you probably have no reason to drop them. In SQL Server 2000, these statistics still exist, but the rewritten sp_helpindex procedure doesn't return nonindex statistics. To see a list of all the statistics on a SQL Server 2000 table, you can use the stored procedure sp_helpstats.

So, what can you use statistics for? Suppose you had statistics on the Orders table's customerID column but no index on that column. Creating statistics without an associated index might seem like wasted effort: Even if the statistics told the optimizer that only one row satisfied a query, what good would it do? Without an index to help find that row, SQL Server still has to scan the entire table. However, even without a corresponding index, you can benefit from having statistics because they help the optimizer figure out the best way to process join operations. (Join processing is a huge topic, so I'll save that discussion for another month.)

A few months ago, I realized another benefit of having statistics on a column. This capability didn't exist until SQL Server 7.0 Service Pack 2 (SP2); it also exists in SQL Server 2000. To understand this use of column statistics, keep in mind the principle behind covered queries, which I discussed in "The Big Cover-Up." The leaf level of a nonclustered index contains every key value. In general, SQL Server uses a nonclustered leaf-level scan only when the index completely covers the query so that SQL Server doesn't need to access the data pages. Also remember that if an index is composite, the index keys in the nonclustered index's leaf level contain all the columns that make up the index.

SQL Server sorts the data in any index by the leftmost indexed column. So in general, if you don't have a search argument on the leftmost column, an index isn't useful. Index statistics just store the histogram for the first column. Therefore, an index on ProductID and Quantity in the Order Details table in the Northwind database would have a histogram of the distribution for only the ProductID column. If you want statistics on the Quantity column, you must either create a separate index on that column or create column statistics.

What if you run a query looking for a unique value in the Quantity column? Because Quantity is the second column of the composite index, SQL Server can't use this index to help find the one satisfying row. However, if you had statistics on the Quantity column, the optimizer would know how many times the specific value of Quantity occurred. The optimizer could then choose to scan the nonclustered index's leaf level, which contains all the value combinations for ProductID and Quantity, to find the pointer to the row containing the Quantity value you're looking for.

To see when the optimizer uses this technique, let's look at a table larger than anything in the sample Northwind database. Even with 2155 rows, the Order Details table is only slightly larger than one extent, and the optimizer usually doesn't decide to do anything complicated for tables that small. So, let's make a copy of Order Details and increase its size tenfold. Then, we'll update some rows so that the number of rows with the desired value is very small.

Listing 1 shows the script that you can use to generate the data for the test. After you run the code in Listing 1, only one row with a Quantity value of 54 exists in the test table, OrderDetails. But no index on the table has the Quantity column as the leftmost column. I want SQL Server to realize that only one row with this value exists. Then, I want the optimizer to decide that scanning the nonclustered index, which contains all the combinations of ProductID and Quantity values, will find that row more efficiently than a complete table scan can find it. Although SQL Server will then need to follow the bookmark from the nonclustered index to the data row containing Quantity 54, that process will still be more efficient than a table scan.

You can run the following query to see that only one row has the value 54:

-- Query 2:
SET STATISTICS IO ON
SELECT * FROM OrderDetails
WHERE Quantity = 54

You can also run sp_helpstats on the OrderDetails table (or sp_helpindex if you're using SQL Server 7.0) to see that statistics exist on the Quantity column.

However, the query plan for Query 2 shows that SQL Server didn't use the new trick I described; it just scanned the table. The estimated number of rows shows that the statistics haven't been updated since I ran the UPDATE statement from Listing 1. SQL Server still thinks that the table contains 10 rows that have a Quantity value of 54. But if I force SQL Server to update the statistics, the optimizer will choose the index scan:

UPDATE STATISTICS OrderDetails
GO
SET SHOWPLAN_TEXT ON
GO
SELECT * FROM OrderDetails
WHERE Quantity = 54

Figure 1 contains the output of the query plan after the statistics were updated.

Note that the revised plan involves both a nonclustered-index scan and a bookmark lookup. Before SQL Server 7.0 SP2, this combination of steps rarely occurred. SQL Server usually scanned a nonclustered index because it was a covering index, so it didn't need to perform bookmark lookups to access the data pages. But the SQL Server query optimizer is a work in progress. The development team at Microsoft is constantly evaluating query performance and looking for ways to make the optimizer even smarter. Microsoft doesn't always document the changes because making performance enhancements isn't like fixing bugs. Getting a new-and-improved optimizer is just a fringe benefit of upgrading, either to the next service pack or to the next release of SQL Server.

Comments

Plain text