Discover the Benefits of Filtered Indexes and Statistics

Executive Summary: Microsoft SQL Server 2008’s filtered indexes and statistics feature lets you improve query performance without wasting disk space, offers more detailed and more accurate distribution statistics and therefore better selectivity estimates from the optimizer, and provides an alternative to the commonly requested unique constraint with multiple NULLs.

SQLServer 2008 introduces a very cool new feature—filtered indexes and statistics. This new feature lets you improve the performance of your queries without wasting disk space; offers more detailed and more accurate distribution statistics, which provide better selectivity estimates from the optimizer; and provides an alternative to the commonly requested unique constraint with multiple NULLs. I’ll use several examples to introduce you to the concept of filtered indexes and statistics and to demonstrate their use.

Filtered Indexes and Statistics

In SQL Server 2008, you can define nonclustered indexes and statistics on a subset of rows rather than having to create them on the whole set of rows from a table. You achieve this task by specifying a predicate in a WHERE clause as part of the CREATE INDEX or CREATE STATISTICS statement. For example:

CREATE INDEX idx1 ON dbo.T1(keycol) WHERE col1 = 5;

The SQL Server 2008 implementation lets you specify fairly simple predicates. You can use IN; the comparison operators IS NULL, IS NOT NULL, =, <>, !=, >, >=, !>, <, <=, !<; and the logical operator AND. We can assume that Microsoft will continue to expand this list in future releases. Also, for now, filtered indexes cannot be created on indexed views, or using computed columns, user data types (UDTs), HIERARCHYID, or spatial types.

The ability to define filtered indexes provides several benefits when you frequently query specific subsets of rows. Compared with creating ordinary indexes, creating indexes on a specific subset of rows results in savings beyond the obvious disk space saved. For example, only changes in the subset of rows in the underlying table will involve modifying the corresponding index rows. Index maintenance operations such as index rebuilds and reorganizations will naturally be faster and cheaper. Consider also the fact that the maximum number of steps represented in distribution statistics (histograms) is limited. Because a filtered index represents only a subset of the rows from the table, each step in the histogram represents fewer rows compared with a nonfiltered index. As a result, filtered indexes and statistics are more accurate.

A scenario in which you might find filtered indexes useful is when you need to index only known values, but the column you want to index has a large number of NULLs and you never need to filter NULL values. Typically, every row from the table has a corresponding row in the index, including when the index key is NULL. But if those NULLs are not relevant to your queries, you will benefit from defining the index on the subset of rows with only the known values in the index key column. As an example, suppose that you want to create an index that will support queries against the Sales.SalesOrderHeader table in the AdventureWorks database, filtering a known currency rate ID. You can create an index on the subset of rows that have a known value in the CurrencyRateID column like so:

USE AdventureWorks;
CREATE NONCLUSTERED INDEX idx_currate_notnull
  ON Sales.SalesOrderHeader(CurrencyRateID)
  WHERE CurrencyRateID IS NOT NULL;

Here’s an example for a query (call it Query 1) that filters all rows with a certain known currency rate ID:

SELECT *
FROM Sales.SalesOrderHeader
WHERE CurrencyRateID = 4;

Because the optimizer is smart enough to realize that this filter is supposed to exclude NULLs, it will consider using the index idx_currate_notnull. Then, whether the index is used depends on the filter’s selectivity. As the execution plan for this query shows (see Figure 1), the index was used for the query.

When you use a range filter predicate to create a filtered index, the optimizer considers using the index even when the query requests a subinterval of the range. As an example, the following index is defined on the Sales.SalesOrderHeader table, with the Freight column as the key, based on the predicate Freight >= $5000.00:

CREATE NONCLUSTERED INDEX idx_freight_5000_or_
   more
  ON Sales.SalesOrderHeader(Freight)  HERE Freight >= $5000.00;

The following query (call it Query 2) demonstrates a request for a subset of rows from the table based on a subinterval of the range defined in the filtered index idx_freight_5000_or_more:

SELECT *FROM Sales.SalesOrderHeaderWHERE Freight BETWEEN $5500.00 AND $6000.00;

Figure 2 shows the plan for this query. As you can see in the execution plan, the index idx_freight_5000_or_ more was used.

Filtered indexes can also have an INCLUDE clause, which is useful when you want the index to cover certain queries and thus avoid lookups. For example, the following code creates an index on the OrderDate column; includes the columns Sales- OrderID, CustomerID, and TotalDue; and filters only rows where TerritoryID is equal to 5:

CREATE NONCLUSTERED INDEX idx_territory5_
  orderdate
 ON Sales.SalesOrderHeader(OrderDate)
 INCLUDE(SalesOrderID,
CustomerID, TotalDue)
 WHERE TerritoryID = 5;

Consider the following query (call it Query 3):

SELECT SalesOrderID,
CustomerID, OrderDate, TotalDue
FROM Sales.SalesOrderHeaderWHERE TerritoryID = 5;

Figure 3 shows the execution plan for this query. As you can see in the plan, the optimizer uses the index idx_territory5_orderdate to provide complete coverage. Because the query requests all rows in which the territory ID is 5, the optimizer chooses a full scan of the leaf of the index.

Continue to page 2

Add to the query a filter based on the index key column—OrderDate in our case—as the following query shows (call it Query 4):

SELECT SalesOrderID, CustomerID, OrderDate,
 TotalDue
FROM Sales.SalesOrderHeader
WHERE TerritoryID = 5
 AND OrderDate >= ‘20040101’;

Figure 4 shows the execution plan for this query. The plan shows that the optimizer efficiently uses a seek operation (plus a partial scan) against the index idx_ territory5_orderdat e, because the index not only covers the query but also sorts the data by OrderDate. Only the relevant range of rows is scanned within the index since OrderDate is the first (and only) key in the index.

Similar to filtered indexes, SQL Server 2008 also lets you create filtered statistics. For example, the following code creates filtered statistics on the OrderDate column in the Sales.SalesOrderHeader table, only for the subset of rows from the table where the territory ID is 4:

CREATE STATISTICS stats_territory4_orderdate ON
 Sales.SalesOrderHeader(OrderDate) WHERE TerritoryID = 4;

Filtered indexes can also be used in conjunction with another new SQL Server 2008 feature—sparse columns, which don’t consume any storage space for NULLs. Sparse columns is a topic for another article, but for now I’ll just say that in SQL Server 2008 you will be able to create a large number of sparse columns (as many as 30,000) and support those columns with a large number of indexes/statistics (as many as 1,000).

Unique Constraint with Multiple NULLs

You can define a filtered index as unique if you need to. Doing so can help the optimizer and can be used to enforce uniqueness. For example, DBAs sometimes need to enforce a unique constraint that allows multiple NULLs (i.e., prevents duplicates in known values, but not with NULLs). Although SQL Server doesn’t support such a constraint, filtered indexes offer an easy alternative. The following code shows an example:

IF OBJECT_ID(‘dbo.T1’, ‘U’) IS NOT NULL DROP TABLE dbo.T1;

CREATE TABLE dbo.T1(col1 INT NULL, col2 VARCHAR(10) NOT NULL);

CREATE UNIQUE NONCLUSTERED INDEX idx_col1_
  notnull
 ON dbo.T1(col1)
 WHERE col1 IS NOT NULL;

As you can see, a unique index is created on T1.col1, with a filter that excludes NULLs. This means that uniqueness will be enforced on known values but not on NULLs because NULLs aren’t part of the index. To test the index, try to insert two rows with the same known col1 value:

INSERT INTO dbo.T1(col1, col2) VALUES(1, ‘a’);
INSERT INTO dbo.T1(col1, col2) VALUES(1, ‘a’);

The first INSERT will be successful, but the second will fail with the error message in Figure 5. Conversely, you will have no problem inserting multiple rows with NULL in col1:

INSERT INTO dbo.T1(col1, col2) VALUES(NULL, 'a');
INSERT INTO dbo.T1(col1, col2) VALUES(NULL, 'a');

When you’re done experimenting with filtered indexes, run the following code for cleanup:

USE AdventureWorks;
DROP INDEX Sales.SalesOrderHeader.idx_currate_
   notnull;
DROP INDEX Sales.SalesOrderHeader.idx_
   freight_5000_or_more;
DROP INDEX Sales.SalesOrderHeader.idx_
   territory5_orderdate;
DROP STATISTICS Sales.SalesOrderHeader.stats_
  territory4_orderdate;DROP TABLE dbo.T1;

Benefits

Filtered indexes and statistics let you optimize your queries for certain subsets of rows but don’t incur the same costs as regular indexes. Filtered indexes consume less storage space, are faster and cheaper to maintain, and can be used as an alternative to a unique constraint that allows multiple NULLs.

Comments

Plain text