Get Accurate Estimates with Filtered Indexes

One of the benefits of using a filtered index is that it can help improve cardinality estimates, which can help the optimizer make better choices. Consider a case in which you have a filter with two predicates involved, as in the following query:

USE AdventureWorks2012;

SELECT SalesOrderID, SalesOrderDetailID, OrderQty
FROM Sales.SalesOrderDetail
WHERE ProductID = 785
  AND ModifiedDate >= '20070101';

Suppose that you have the following composite covering index to support such queries:

CREATE INDEX idx1
  ON Sales.SalesOrderDetail(ProductID, ModifiedDate)
    INCLUDE(SalesOrderID, SalesOrderDetailID, OrderQty);

This is a simplified example. Imagine that you needed to do something with the result, such as group and aggregate it or order it. The optimizer needs a cardinality estimate to know which aggregate algorithm to use and how much memory to allocate if a memory grant is required.

SQL Server doesn't support a multi-column histogram; therefore, it must resort to alternative methods to make the estimate. The index creation triggered the creation of a histogram on ProductID. Executing the query will also trigger the creation of a histogram on ModifiedDate. When I ran the query in SQL Server 2014, I got the estimate that Figure 1 shows.

SQL Server 2014 uses an algorithm called exponential backoff to make the estimate in our case:

 *  * SQRT() * SQRT(SQRT())...

The number of rows in the table is 121,317. The estimate for the predicate against ProductID is 220. The estimate of the predicate against ModifiedDate is 97493.2. Applying the exponential backoff algorithm to these arguments, you get:

121317 * (220./ 121317) * SQRT(97493.2 / 121317) = 197.2

As you can see in the plan, that's the estimate the optimizer came up with. The actual result is 109.

In SQL Server 2012, under the independency assumption, the estimate is simply the product of the percentages. So in our case it would be 176.7. You can test it in SQL Server 2014 by using the query trace flag 9841, forcing it to use the legacy cardinality estimator. Either way, as you can see, the estimates aren't perfect.

The estimates can be improved with a filtered index. Suppose that you had a very small number of products (not the case in AdventureWorks, but suppose it were the case)—small enough that it's not a problem for you to create a filtered index per distinct product ID to support queries like ours. Here's an example for such a filtered index for product 785:

CREATE INDEX idx1
  ON Sales.SalesOrderDetail(ModifiedDate)
    INCLUDE(SalesOrderID, SalesOrderDetailID, OrderQty)
  WHERE ProductID = 785
WITH (DROP_EXISTING = ON);

SQL Server creates a histogram on the ModifiedDate column for the subset of rows where the product ID is 785. You can request to see the histogram by running the following code:

DBCC SHOW_STATISTICS('Sales.SalesOrderDetail', 'idx1') WITH HISTOGRAM;

This means that, theoretically, SQL Server can now make a cardinality estimate for our filter that's as good as an estimate based on a multi-column histogram would have been. The interesting thing, though, is that when I ran our query after the creation of the filtered index, I got the estimate shown in Figure 2.

The optimizer did choose to use the filtered index, but the estimate seems to be worse now—it's based on the predicate ModifiedDate >= '20070101' alone. There was no real reason for the optimizer to try to come up with an accurate estimate because the query doesn't do anything with the filtered rows. If, for example, you add ORDER BY OrdQty at the end of the query, you'll see that the estimated number of rows coming out of the Sort operator will be 109, like the actual number.

I learned a trick from Paul White: If it bothers you (as it does me) that you don't get the correct estimate in the original query (albeit a benign case), all you need to do is add the filtered column ProductID as an included column in the index definition, like so:

CREATE INDEX idx1
  ON Sales.SalesOrderDetail(ModifiedDate)
    INCLUDE(SalesOrderID, SalesOrderDetailID, OrderQty, ProductID)
  WHERE ProductID = 785
WITH (DROP_EXISTING = ON);

When I reran the original query after the change in the index, I got the correct estimate, shown in Figure 3, based on the filtered histogram.

Figure 3: Estimate with Filtered Index and Included Element

I found this trick to be a pretty cool one, especially because the fix is so simple. You can find Paul's explanation of the logic in "Incorrect row estimation given with a filtered index."

When you're done, run the following code for cleanup:

DROP INDEX idx1 ON Sales.SalesOrderDetail;

A special thank you to Paul White for this T-SQL tip.

Comments

Plain text