Q: I’ve heard that SQL Server stores the clustering key (the columns that define the clustered index) in the nonclustered indexes. Why is that? And is this a factor I should consider when choosing my clustering key?
A: Yes, the clustering key is stored in each and every nonclustered index on the table, and this is a big factor in choosing a clustering key. The clustering key is stored in the nonclustered indexes because SQL Server uses the clustering key to look up the corresponding row from a nonclustered index into the base table. However, why SQL Server does this isn’t intuitive or obvious, so I’ll explain using the typical analogy for nonclustered indexes—indexes in the back of a book. All the indexes in a book point to the same data, but they let you look something up in different ways. In a book, the actual data is found using a page number. In a SQL Server table with a clustered index, the data is found using the clustering key.
So, why doesn’t SQL Server use the page number to find the table records exactly as an index in the back of a book? If the table doesn’t have a clustered index, it actually does use the page numbers! The data records in a heap (i.e., a table without a clustered index) can be referenced using only their physical location in the data file, rather than their logical location in the clustered index. This can lead to interesting performance issues if the heap data record needs to change location for any reason. When the table has a clustered index, it doesn’t matter what the physical location of the data record is—it can always be found using the clustering key.
So the nonclustered indexes must include the clustering key to enable the data records to be found (for instance, to retrieve table columns that are in a query SELECT list, but aren’t present in the nonclustered index the query optimizer chose to process the query). But what is a good clustering key choice? Given that the clustering key is included in all nonclustered indexes, it makes sense that the clustering key should be as small as possible and that it shouldn’t change so that the nonclustered indexes don’t have to change in turn.
Ultimately, what you’re really looking for is a clustering key that’s narrow, static, and unique, as well as one that doesn’t cause excessive fragmentation. This combination is the part that doesn’t jump out at you when you’re defining your schema/indexes, and, as a result, goes unnoticed until you see problems. Typical problems include inserts that are running slower or even timing out; queries that are running slower because of excessive I/Os; databases growing rapidly (especially the transaction log) because fragmentation (i.e., page splits) takes significantly more log space; and overall data bloat (e.g., wasted space, wasted cache, wasted space in backups). So, although these assessments should be true, they’re not always true and that’s often the problem.
Now you know why the clustering key is in the nonclustered indexes, as well as how it affects the nonclustered indexes. These facts should influence your choice of the clustering key. Some examples of good clustering keys include clustering on an identity and clustering on a combination of columns.
A clustering key that is an identity column is naturally unique, narrow, and static, and it naturally reduces fragmentation by limiting inserts to the end of the table. A clustering key that is a combination of columns such as (OrderDate, OrderID) in tables that have varying access patterns (and possibly partitioning) is also naturally unique and static, is relatively narrow, and naturally reduces fragmentation by limiting inserts to the end of the table. In addition, a combination clustering key can be used for range queries. Although I wouldn’t choose a clustering key based on range queries, I would certainly leverage it if it were available.
A cluster key based on a GUID can be OK, but there are many issues to contend with depending on how the GUID is being generated. Using the client (e.g., .NET) to generate the GUIDs can be extremely problematic. GUIDs are definitely unique, relatively narrow (however, at 16 bytes they’re quite a bit wider than an integer-based identity), and should be static (although this would be application dependent). However, client-generated GUIDs create a tremendous amount of index fragmentation because there’s no pattern to their inserts.
Using the server-side function NEWID() to generate the GUIDs can be extremely problematic for exactly the same reasons as client-side GUID generation. Generating GUIDs using the server-side function NEWSEQUENTIALID() offers the same benefits as for client-side GUIDs (unique, relatively narrow, static), but there are some potential problems. Instead of creating a lot of index fragmentation, these GUIDs are relatively sequential. They’re generally ever-increasing values, but because of how they’re generated, the base value on which they’re defined might change (typically because of a server restart or time change on the server, but there are other factors). As a result, they significantly reduce index fragmentation but they don’t eliminate it. Most tables see only 5 to 10 percent index fragmentation instead of the 99 percent index fragmentation that a client-side or NEWID()-generated GUID causes.