Why Doesn’t SQL Server Use the RID for Lookups from a Nonclustered Index into a Clustered Table?

Q: Why doesn’t SQL Server use the RID for lookups from a nonclustered index into a clustered table?

A: Actually, that is almost how versions prior to SQL Server 7.0 worked. In prior versions (before 7.0) SQL Server used a volatile RID – which would follow the row on every update. More specifically, when a row was updated in these earlier versions and the row needed to physically move within the clustered index, then all of the nonclustered indexes needed to be updated as well. Given that you can have 249 nonclustered indexes on each table, the work to update these nonclustered indexes could become terribly expensive (in terms of time, disk space, logging, etc…). As a result, the architecture was changed in SQL Server 7.0 to use the clustering key (instead of this volatile RID) – if the table is clustered. This is an optimization so that even when a row must be relocated the nonclustered indexes don’t need to be updated (as long as the clustering key value doesn’t change). And, it’s important to note that there’s a similar optimization in heaps (tables that do not have a clustered index) where they use a fixed/physical RID for lookups. Even when the record is relocated this “fixed” RID does not change. Instead, SQL Server using a forwarding record to point from the original (ie. fixed) physical RID to the new location of the record. If the row moves a second or third time, the point will always only point once – from the original RID to the new location. This allows you to have more nonclustered indexes without any additional overhead when records relocate. I’m not suggesting that you create a lot of nonclustered – there are other forms of overhead but this particular expense has been eliminated. All of these optimizations – in both heaps and clustered tables – reduce the overall metadata modifications, logging and therefore improve performance. And, this is yet-another reason for choosing a specific clustering key. Choosing a clustering key value that’s heavily updated would be a very bad choice.

Some have also asked me why the SQL Server team chose to use the clustering key and not just use the RID approach for clustered tables. For this, there are multiple reasons. First and foremost, data can move a lot more in a clustered table as the table’s order has to be maintained. A fixed/physical RID really wouldn’t work here. Second, and this is a bit of an optimization as well, the RID is an eight byte value that takes space to store it within the nonclustered indexes and it’s not “real” data. If actual data could be used within nonclustered indexes then we might be able to save space and get more uses out of our nonclustered indexes because they’ll automatically have an additional column of data in the index? And, when the clustering key is narrow, then you won’t unnecessarily waste space. However, this can go horribly wrong too. What if you were to place the clustered index on a column such as LastName? SQL Server does allow you to create this clustering key; however, behind the scenes SQL Server will enforce uniqueness by adding an integer to the values that are non-unique. This value is added to the data row (at insert) and then is automatically part of every nonclustered index. This value is necessary to to make sure that every row is identifiable from the nonclustered indexes but it wastes both time and space.

So, that also brings me back to one of the more common points I’ve made about the clustered index. Choose it wisely and choose it well. You really want a value that’s naturally unique, naturally narrow and naturally static. Ideally, the value should also reduce fragmentation but not having too many insertion points.

Comments

Plain text