Column Statistics Give the Optimizer an Edge

Downloads
97809.zip

Executive Summary:
Microsoft SQL Server 2005 can create statistics on unindexed table columns--called column statistics-which along with index statistics can help the SQL Server query optimizer determine the most efficient query plans. To enable column statistics, you need to make sure the auto_create_statistics database option is set to ON. Column statistics can help the query optimizer choose a better plan in specific cases, such as for queries involving multiple tables or tables that already have indexes.

Using statistics on unindexed columns can help determine the best query plans

SQL Server 2005 generates and automatically maintains statistics for every index you create, a feature that can help you in your query performance-tuning efforts. (See “Making the Most of Automatic Statistics Updating,” October 2007, InstantDoc ID 96767 for more information about automatic updating of index statistics.) In addition to providing statistics for indexes, SQL Server can create statistics on unindexed table columns—called column statistics. Along with index statistics, column statistics help the SQL Server optimizer create efficient query plans. By understanding how and when SQL Server creates column statistics and how they’re used, you can recognize situations when you could use them to improve the optimizer’s performance.

Working with Column Statistics

By default, SQL Server automatically creates statistics every time an unindexed column is referenced in a query’s WHERE or ON clause. This default applies only when the auto_create_statistics database option is set to ON, which is the default setting in the model database and thus in all new databases you create.

The examples I’ll employ in my discussion of column statistics use copies of the Person.Contact and Human Resources.Employee tables from the AdventureWorks database. To follow along, first make sure that auto_create_statistics is set to ON in the AdventureWorks database, by running the following statement:

ALTER DATABASE AdventureWorks
SET auto_create_statistics ON;

Next, run the code in Listing 1, which creates the two tables in the database. You can see all statistics that exist on a specified table by using the sp_helpstats system procedure. By default, sp_helpstats shows only the column statistics, but when you add a second parameter, 'ALL', the procedure will return both column statistics and index statistics.

To see how sp_helpstats works, run the following code to view the statistics on the dbo.Contact table before you run any queries against it:

EXEC sp_helpstats
‘dbo.Contact’, ‘ALL’;

(Some code in this article wraps to multiple lines because of space constraints in print.) You should see the message This object does not have any statistics or indexes. However, if you query the dbo.Contact table, then examine the statistics again, as the following code shows, you should see different results.

SELECT * FROM dbo.Contact
WHERE FirstName like ‘K%’;
GO
EXEC sp_helpstats ‘dbo.Contact’;

Your output should look similar to that in Table 1.

The results (without the 'ALL' parameter) show you the automatically created statistics on the FirstName column, but they’d also return any statistics you created manually by using the CREATE STATISTICS command. The automatically created statistics have a system-generated name that always starts with the eight characters “_WA_Sys_.” These eight characters are followed by a hexadecimal value indicating the ordinal position of the column in the table (the fourth column, in this case) and a hex value representing the table’s object_id. For the output in Table 1, the hex value 76EBA2E9 is equivalent to the decimal value 1995154153, which the object_name function confirms is the object_id of the dbo.Contacts table. Manually created statistics can have any name you choose as long as it’s a legal identifier name.

SQL Server 2005 also provides a catalog view called sys.stats that lets you list your column statistics, but to get the basic information provided by the sp_helpstats procedure, you’d need to join sys.stats with sys.stats_columns and sys.columns and include special processing for the cases where manually created statistics included multiple columns. If you also want to return information on index statistics, you must also include the sys.indexes view.

For the most part, SQL Server manages column statistics the same as index statistics, regardless of whether column statistics are automatically or manually created. If the auto_update_statistics database option is set to ON, column statistics will be automatically updated as your data changes. The point at which SQL Server will update the statistics is based on the same recompilation threshold that index statistics use—namely, when 20 percent of the rows in the table are modified. The UPDATE STATISTICS command will update all statistics on a table, both index and column statistics. Manually created statistics can be on multiple columns, but unlike index statistics, there’s no column or byte limit on column statistics.

Statistics Changes in SQL Server 2005

Microsoft completely rewrote the algorithms for SQL Server 2005 that determine how it gathers statistics, but you shouldn’t notice this change in your applications when you migrate from an earlier SQL Server version, except for perhaps better performance because the algorithms make the statistics more accurate.

There’s one difference you might notice after upgrading to SQL Server 2005, if you’re used to examining the database catalog and index ID values in sysindexes. In SQL Server 2000, the sysindexes.indid column is a tinyint, and values 2 through 250 are used for both index ID and statistics ID values. A clustered index always has a value in the indid column of 1. So although the documentation says you can have 249 nonclustered indexes on a single table, if you have many statistics on a table, in SQL Server 2000 you can run out of possible indid values before you create anywhere near 249 nonclustered indexes.

For example, suppose you have a table with 249 or more columns and no indexes. (SQL Server allows up to 1,024 columns in a table.) If you execute the sp_createstats procedure in a database, SQL Server will create statistics on every unindexed column in the entire database, and your table will get 249 statistics, with indid values ranging from 2 to 250. If the table has more than 249 columns, the sp_createstats command will actually report an error, telling you that it can’t generate some of the requested statistics. If you then try to create any nonclustered indexes, you’ll get another error because no more indid values are available.

In SQL Server 2005, the index ID values are stored as ints. The sysindexes compatibility view has rows for all indexes and all column statistics. The sys.stats catalog view contains rows for all statistics, both index statistics and column statistics. Each row in sys.stats has a stats_id value that’s unique for the corresponding object. Every index will have a corresponding statistics row in sys.stats with the same name (that is, the name of the index is the same as the statistics name), and the index_id column in sys .indexes is the same as the stats_id value in sys.stats. However, not every statistics row has a corresponding row in sys.indexes. There’s no overlap between the stats_id values for column statistics and the stats_id values for index statistics. This means that in SQL Server 2005, you’re no longer limited in the number of statistics you can have on a table, and having many column-level statistics won’t affect the number of nonclustered indexes you can have.

When Are Column Statistics Useful?

In general, statistics are used to give the optimizer information about how many rows will satisfy a given predicate in a WHERE clause or an ON clause. However, you might wonder what good it will do the optimizer to know how many rows will satisfy a given condition if there’s no index to use to access those rows. For example, you should be aware that a nonclustered index is typically a good choice only when just a few rows need to be accessed. The SQL Server optimizer uses the index statistics to determine whether there are only a few qualifying rows; if so, the optimizer will choose to use the nonclustered index. Column statistics might tell the optimizer that there are only a few rows, but without an index SQL Server can’t find those few rows efficiently.

For a single-table query with no indexes on the table at all, the only purpose that column statistics might serve is to tell you what columns your queries are referencing, to give you a clue about what columns you might want to build indexes on. However, for queries involving multiple tables, or tables that already have indexes, column statistics can serve a very definite purpose. I’ll show you two examples where the existence of column statistics can help SQL Server choose a better plan.

Choosing a Join Order

If you ran the script in Listing 1, you have a Contact table with 19,972 rows and an Employee table with 290 rows. Neither table has an index. Let’s build an index on the join column in the Employee table:

CREATE INDEX Emp_ContactID
ON Employee(ContactID);

The most efficient join plans usually occur when one table has an index on the join column and the other table is quite small or is returning only a few rows. But you have an index on the join column in the small table, and the other table is quite large. If you run the script in Listing 2 and look at the plan for joining these two tables, you’ll see that the optimizer chooses to do a hash join, which is rarely the most efficient type of join operation.

However, if you add a very restrictive WHERE clause and run the code in Listing 3, you should now see a nested loops join in the optimizer’s plan. Because of the automatic statistics on the FirstName column, the optimizer knows that only a few rows from the Contacts table will be joined to the Employees table, so the filtered Contacts table will be accessed first, followed by the index seek on the Employees table.

If auto_create_statistics wasn’t set to ON, however, the optimizer would have no way of knowing how many rows might satisfy the WHERE condition and wouldn’t risk performing the nested loops join with possibly thousands of rows. To see this situation, turn off the SHOWPLAN option and change the auto_create_statistics setting to OFF, by using the following code:

SET SHOWPLAN_TEXT OFF;
GO ALTER DATABASE AdventureWorks
SET auto_create_statistics OFF;

Now rerun the code in Listing 1 to recreate the two tables, then rerun the code in Listing 3. You should see that the optimizer will no longer choose to use the efficient nested loops join and will use the hash join instead. Column statistics on the FirstName column, even without an index, are crucial to helping the optimizer to choose a good join order and join algorithm.

Secondary Nonclustered Index Columns

Another situation in which column statistics can help is when values in the secondary columns in a nonclustered index are distributed very unevenly. You should be aware that for any composite statistics, SQL Server keeps a histogram for the first column only. To see how this works, set SHOWPLAN OFF and turn on auto_create_statistics, as follows:

SET SHOWPLAN_TEXT OFF;
GO
ALTER DATABASE AdventureWorks
SET auto_create_statistics ON;

Then rerun Listing 1. If you build the following index on the Contacts table, SQL Server will create a histogram of values for only the LastName column:

CREATE INDEX NameIndex ON Contact(LastName, FirstName);

Run the code in Listing 4 to view the plans for two similar queries. One looks for first names that start with K, and there are 1,255 of them. The other looks for first names that start with Y, and there are only 37 of them. Although the NameIndex doesn’t have distribution statistics for FirstName, SQL Server has built column statistics. The optimizer realizes that for the names starting with Y, there are few enough that it’s more efficient to scan the leaf level of the nonclustered index and perform a lookup into the base table 37 times. For the 1,255 rows with a first name starting with K, it’s more efficient for the optimizer to do a table scan, as there are only 568 pages in the table. Reading 568 pages in a scan is better than scanning all 101 pages of the nonclustered index, then following 1,255 pointers to the table, which would require another 1,255 page reads. But when looking for names starting with Y, reading 101 pages in the nonclustered leaf, plus doing 37 table lookups, is better than a table scan.

To see the effect of not having column statistics turned on in this case, you could again turn off auto_ create_statistics, as follows:

SET SHOWPLAN_TEXT OFF; GO ALTER DATABASE AdventureWorks SET auto_create_statistics OFF;

Run Listing 1 again to recreate the tables, and rerun the CREATE INDEX command to build the NameIndex index. Then run Listing 4 again, and you’ll see that now the optimizer chooses to do a table scan on both tables. Without column statistics on the FirstName column, the optimizer can’t know that an index scan plus table lookup will be more efficient than just scanning the entire table.

Keeping the Optimizer Informed

As you’ve seen, by allowing SQL Server to both automatically create column statistics and keep all statistics updated as your data changes, you can ensure that the optimizer has additional useful data for determining query plans. Enabling column statistics can go a long way toward helping the optimizer come up with the most favorable plans for your queries.

Comments

Plain text