Implicit or Explicit Index Keys

As I've mentioned, I get many article ideas from questions that people ask me directly or that come up on the public newsgroups. When the same question comes from several directions at once, I start thinking that maybe it's time to write about that question. My idea for this month's column started when a student in my class asked me a question after I described the structure of nonclustered and clustered indexes. As I mentioned in my November 2004 column, "Get Into Index Structures," when a table has a clustered index, any nonclustered index will use the clustered index key as a pointer to keep track of where in the table to find the row corresponding to a particular nonclustered index key. For example, if your clustered index is on Social Security number and you have a nonclustered index on last name, the leaf level of the nonclustered index will contain every last-name value and the corresponding Social Security number. SQL Server treats the Social Security number as if it were another key column of the nonclustered index.

The student asked me about the performance implications of explicitly declaring the clustered key as part of the nonclustered index. Specifically, he was asking if there was any reason not to define a nonclustered index on both last name and Social Security number in a scenario like the example above. SQL Server won't duplicate columns if they're included in both the clustered key columns and the nonclustered key columns, so I said that if the table had a clustered index on Social Security number, then a nonclustered index on last name and a nonclustered index on last name and Social Security number would be identical.

Less than 2 days after the class, the same question came up on the private MVP newsgroups. One MVP mentioned that Peoplesoft applications tend to explicitly name the clustered index columns in their nonclustered index definitions. He mentioned that this method has the benefit of explicitly naming all the columns included in the index without you having to check or remember exactly which columns the clustered index contains. However, this doesn't really affect storage resources or query performance.

I thought about this topic some more and realized that explicitly listing the clustered index columns in a nonclustered index could actually affect performance. This could happen if you listed the columns in a different order than the one that SQL Server assumes by default. I kept thinking about this subject and eventually decided to put it to the test. If explicitly listing the clustered key columns in a nonclustered index could give a performance benefit, I wanted to find an example of it.

To see this behavior for yourself, run Listing 1's script, which creates a copy of the Order Details table in the Northwind database, then copies the table into itself nine times, for a total of 1,103,360 rows. It then copies that big table into one called BigDetails, adding an identity column.

First, I wanted to make sure that there's no index size difference between when you explicitly include the clustered index columns in a nonclustered index definition and when SQL Server uses the clustered keys automatically. To check this, I created a clustered index on the table:

CREATE CLUSTERED INDEX Order_index
ON BigDetails(OrderNum, Discount)

Next, I created a nonclustered index on two other columns:

CREATE INDEX OrderID_Price
ON BigDetails(OrderID, UnitPrice)

I used the undocumented sp_MSindexspace procedure to check the size of this index:

EXEC sp_MSindexspace BigDetails, OrderID_Price

The reported size was 28,504KB. I then recreated the nonclustered index and included the clustered index columns:

CREATE INDEX OrderID_Price
ON BigDetails(OrderID, UnitPrice, OrderNum, Discount)
WITH DROP_EXISTING

Checking the size of this index, I found it to be the same 28,504KB. This nonclustered index has all the columns in the same order as they would be in the original nonclustered index, which has just two defined keys. The index starts with OrderID and UnitPrice and then includes the clustered keys OrderNum and Discount.

Now I wanted to find a query that would perform better if I changed the order of the keys in the nonclustered index, compared to if I didn't specify the clustered index columns. It took me so long to find such a query that I started to think maybe I was wrong. I wondered whether SQL Server overrode the order I specified for the keys when I included the clustered keys in the nonclustered index definition. To test this hypothesis, I ran a query that would take advantage of the order I used in the index creation. I first created the nonclustered index with the columns in a different order:

CREATE INDEX OrderID_Price
ON BigDetails(OrderID, Discount, UnitPrice, OrderNum)
WITH DROP_EXISTING

The following query is a covered query, which means everything the query needs is in the leaf level of the nonclustered index, so SQL Server doesn't have to go to the data pages to process this query. In the leaf level of this index, I assumed that that the keys were exactly in the order I specified when I created the index.

SELECT  OrderID, UnitPrice, OrderNum, Discount
FROM BigDetails
ORDER BY OrderID, Discount, UnitPrice, OrderNum

When I looked at the query plan for this query, I saw that SQL Server was scanning the nonclustered index and not performing any sort operation. That meant it was taking advantage of the sorted order of the data in the leaf level of the nonclustered index. As a final proof, I rewrote the query to order the data in the same order it would be in with only OrderID and UnitPrice in the nonclustered index (and OrderNum and Discount defined in the clustered index):

SELECT OrderID, UnitPrice, OrderNum, Discount
FROM BigDetails
ORDER BY OrderID, UnitPrice, OrderNum, Discount

The plan for this query included a sort operator, showing that SQL Server was storing the data in a different order when I explicitly listed the columns in the nonclustered index definition than when I didn't.

Having proved that I was on the right track, I kept looking for a query that would perform better when I included the clustered index columns in the nonclustered index definition and used a specific column order. The following query shows a performance difference:

SELECT OrderID, OrderNum, UnitPrice, Discount
FROM BigDetails
WHERE OrderID = 11077
AND Discount = .02 AND UnitPrice <20

To verify the difference between explicitly including the clustered index key and not including it, rebuild the nonclustered index on only the two columns that aren't included in the clustered index:

CREATE INDEX OrderID_Price
ON BigDetails(OrderID, UnitPrice)
WITH DROP_EXISTING

If you SET STATISTICS IO ON and run the preceding SELECT query, you should see that it takes SQL Server 25 logical reads to return the 512 rows of data.

Now rebuild the index to include the clustered columns, but in a different order, so that the columns used in the WHERE clause are listed before those not used and the most selective columns are listed before the least selective:

CREATE INDEX OrderID_Price
ON BigDetails(OrderID, Discount, UnitPrice, OrderNum)
WITH DROP_EXISTING

If you now rerun the SELECT query, you should see that it takes SQL Server only four logical reads to return the same 512 rows of data, which is about 20 percent of the work to run the query with the original index.

Just like the ORDER BY queries, these are covered queries. One advantage of including the clustered index columns in the nonclustered indexes (explicitly or not) is that you increase your chance of having a covered query. For this reason, I limited my testing to finding covered queries that would show a performance difference.

You can get performance benefits from explicitly listing the clustered index columns in a nonclustered index definition, but only if you want them in something other than the default order. My testing showed that many queries behave the same no matter what order you define the keys in, but some perform better with a different ordering. If you're considering building a covering index for a query, make sure you test various permutations of the index column order before deciding on the best solution.

Explicitly including your clustered index key columns in your nonclustered index definitions has one more advantage. Suppose you know that you'll always need a composite index on Column1, Column2, and Column3. If Column3 is your clustered index key, you could leave it out of the nonclustered index definition and still have the index you need. However, if you later drop or redefine your clustered index, you might not have the composite index on the same three columns anymore. By explicitly including all the necessary columns in the nonclustered index definition, you'll always have those columns no matter what happens to any other indexes.

Another MVP pointed out the drawback of including the columns. SQL Server has a size limitation of 900 bytes for all index key columns combined. If you don't explicitly list the clustered index columns in the nonclustered index definition, they don't count against the 900-byte limit (but SQL Server treats the columns as index keys for the nonclustered index). If you list the columns explicitly and they bring your total key size to more than 900 bytes, index creation fails.

Comments

Plain text