Clustered Indexes Based Upon GUIDs

GUIDs and Clustered Indexes

Some things go great together: chocolate & peanut butter, movies & popcorn, Mork & Mindy, to name a few. However, there are even more things that don't mesh well: brushing your teeth & drinking lemonade, Mondays & happiness, myself & reaching the top shelf in the grocery store . . . GUIDs and Clustered Indexes.

Today, we're going to look at the last of those itemized couples. I'll briefly discuss why this combination is bad and what can be done to make it better (to some extent). This is a topic I'll be touching on in more detail in one of my IT/DEV Connections talks in Las Vegas in 2014 on Assessing Your Indexing Strategy.

What exactly are GUIDs and why are they so bad? Well, let's not give GUIDs a societal complex here—GUIDs (inherently) aren't bad—but they can be used in ways that lend to very bad performance issues inside of SQL Server. It's that issue that we'll be looking at here. "GUID" stands for "Globally Unique Identifier," is a 16-byte-length data type available in SQL Server that is often used for "uniquifying" a table row due to the unlikely chance that it will ever exceed the amount of possible values that exist in its range when compared to other data types used for such tasks:

Integers: -2^31 through 2 ^ 31 equal to 4,294,967,294 possible combinations.
BigInt: -2^63 through 2 ^ 63 equal to 18,446,744,073,709,551,614 possible combinations.
GUIDs: 2^122 equal to 5,316,911,983,139,663,491,615,228,241,121,400,000 possible combinations.

(My thanks to Bing and Blogger MrDee for doing the theorhetical maths heavy lifting. Also, did you know that you can't paste an accurate value for 2 ^ 63 into Microsoft Excel or the Windows Calculator application? You learn something new each day).

So the good news is that you can use GUIDs to increase the availability for unique rows in a table using a single column (we'll not discuss options such as multi-column keys here as another alternative but it exists). The bad news is that it's going to cost you an additional 12 bytes (compared to integer data-typing a column) or 8 bytes (if you're comparing bigint to GUIDs.) There is also the possibility of injecting institutional fragmentation from the time you insert your second row into a table using GUID as a clustering key unless you take precautions.

GUIDs Are Not Sequential

GUIDs, by default, are not sequential and there lies the crux of the fragmentation dilemma. To illustrate the use of GUIDs I'm going to run the following code to load up some records into a table where a GUID-typed column is used as the clustered index key and as the primary key constraint:

CREATE TABLE [dbo].[tblGUID_test]
 (
 [id] [UNIQUEIDENTIFIER] NOT NULL,
 [name] [CHAR](969) NOT NULL,
 date_stamp DATETIME NOT NULL,

 CONSTRAINT [PK_tblGUID_test] PRIMARY KEY CLUSTERED (id) 
  WITH (PAD_INDEX = OFF, FILLFACTOR = 100)
 )
GO

If I insert rows into the table (each row ends up being 1,000 bytes in length using the data types above), we get the following results from a call against the sys.dm_db_index_physical_stats:

INSERT INTO tblGUID_test(id, name, date_stamp) 
VALUES (NEWID(), 'FOO', GETDATE());
GO 80000

SELECT ixP.index_level
 , CAST(ixP.avg_fragmentation_in_percent AS DECIMAL(5,2)) AS avg_frag_pct
 , ixP.page_count 
 , ixP.avg_record_size_in_bytes AS avg_bytes_per_record
 , CAST(ixP.avg_page_space_used_in_percent AS DECIMAL(5,2)) AS avg_page_fill_pct
FROM sys.dm_db_index_physical_stats (DB_ID(), OBJECT_ID('tblGUID_test'), 1, NULL, 'detailed') ixP
 INNER JOIN sys.indexes I 
  ON I.index_id = ixP.index_id 
   AND I.object_id = ixP.object_id;

From the "get-go," you'll see that we're fragmented almost to 100 percent just by the use of GUIDs instead of a sequential process using ints or bigints coupled with an auto-incremented process via the IDENTITY property of a table column.

To illustrate how quickly GUID-based indexes become fragmented lets go back to the beginning. I'm truncating the table and staring from scratch by entering in 8 records. This is the state of fragmentation after those first 8 records:

This is to be expected; we've yet to fill a page. How about adding one more?

After that 9th record, we encounter our first page split. Since the GUIDs are flowing into the clustered index non-sequentially this means we're going to split the pages and balance the contents in a (roughly) equal fashion. We have 9 rows in the table so it can't be a equal distribution. What happens if we add just two more rows? Surely since the pages are only 55 percent full and we know we can fit 9 rows per page we'll still have two pages in the leaf level of the index, correct?

Guess again. We've encountered another page split with just adding two more rows. We're not utilizing space very well are we, what with the fact we're using 16-byte GUIDs rather than a 4-byte INT or 8-byte BIGINT column instead and even though we're specifying a 100 percent fill factor on the index we've not reached that mark since prior to the first page split. Let's try one last little experiment with some mathematical overtones. . .

We have three index pages that are roughly 50 percent full (give or take). We know we can fit nine rows per page and our fill factor is 100 percent. Theoretically, we should be able to insert 12 rows: (9 per page * 50 percent rounded to 4 since we can't insert a half record) * 3. As we've seen already, the liklihood of filling these three pages at 99 percent or so fill is not very likely without an index rebuild so the question is just how bad is this going to fragment the index? Let's find out...

After adding 12 more rows to the table:

Well, that's awkward . . . adding two rows, which was 2/11 of the total count of rows caused a page split and added one more page to the leaf level of the index. Adding 12 more rows, which was 12/23 of the total count of rows caused a page split as well—but only one—even though the proportion of pages added overall was much greater than when we only added our 10th and 11th row to the table and encountered our second page split.

Let's try one more experiment really quick; we're going to truncate the table and add all 23 rows again. Since we're using the GO # command we've been adding these all along as single transactions—we've just reported the results/impact on sys.dm_db_index_physical_stats after 8, 9, 11 and 23 inserts. This means that we should see similar results if we go through the process all over again and report out after we've inserted 23 rows:

TRUNCATE TABLE tblGUID_test;
GO

INSERT INTO tblGUID_test(id, name, date_stamp) 
VALUES (NEWID(), 'FOO', GETDATE())
GO 23

SELECT ixP.index_level
 , CAST(ixP.avg_fragmentation_in_percent AS DECIMAL(5,2)) AS avg_frag_pct
 , ixP.page_count 
 , ixP.avg_record_size_in_bytes AS avg_bytes_per_record
 , CAST(ixP.avg_page_space_used_in_percent AS DECIMAL(5,2)) AS avg_page_fill_pct
FROM sys.dm_db_index_physical_stats (DB_ID(), OBJECT_ID('tblGUID_test'), 1, NULL, 'detailed') ixP
 INNER JOIN sys.indexes I 
  ON I.index_id = ixP.index_id 
   AND I.object_id = ixP.object_id;
GO

Well, results are close at least. Due to the random nature of GUID values, we've encountered better fragmentation results internal on the pages—we're still filling the pages with the same number of rows so the overall average page fullness remains unchanged since we're still dealing with a four-page index at this point. Consequently, running the script that truncates and loads the table again with 23 rows does result in varying index page counts and fragmentation. I've run this multiple times and return with 50 percent or 75 percent average fragmentation across four leaf-level pages filled approximately 71 percent full as we've seen here but have also ended up with a five page leaf-level index with 40 percent average fragmentation that is 57 percent full on average.

In other words—consistently inconsistent. I'm a DBA, I hate things that are consistently inconsistent.

By Contrast, INT and BIGINT Clustering Keys Using IDENTITY Property

If you're not concerned about exhausting the available values that are possible with INT or BIGINT, then I'd always advocate for using one of those data types for a surrogate clustering key when compared to a GUID. Unlike a GUID, your able to INSERT sequentially, you're able to accomplish the ability to uniquely identify a row in a table, and you're able to do so with less space consumed much less fragmentation.

Since INT and BIGINT behave the same, let's pick INT, which is 4-bytes in length instead of 16-bytes with our GUID example. We'll create a new table with a row length of 1000 like our GUID example and insert 8, 9, 11, and 23 rows once again to compare the behavior versus that of the GUID clustering key:

CREATE TABLE [dbo].[tblINT_test]
 (
 [id] [INT] IDENTITY(1,1) NOT NULL,
 [name] [CHAR](981) NOT NULL,
 date_stamp DATETIME NOT NULL,

 CONSTRAINT [PK_[tblINT_test] PRIMARY KEY CLUSTERED (id) 
  WITH (PAD_INDEX = OFF, FILLFACTOR = 100)
 )
GO

--=====================================================
--RUN IN BATCHES OF 8, 1, 2, 12 REPORTING ON OUTCOMES:
--=====================================================
INSERT INTO tblINT_test(name, date_stamp) 
VALUES ('FOO', GETDATE())

Results initially look quite similar for GUIDs v. INT:

We end up with one less page (25 percent savings over the GUID option), and as a result, a better page fullness. However, we really don't care about outcomes when looking at such small tables—as DBAs we're concerned about what happens when our databases are like our grandfathers: large, unruly, bitter and smelling a bit like the trunk of a 1974 Oldsmobile. Let's compare what we saw earlier with the GUID-based clustered index with 80,000 rows versus the same count of rows using the INT key alternative:

80,000 Row GUID-Based Clustered Index

80,000 Row INT-Based Clustered Index

10,000 v. 14,712 leaf level pages when comparing INT-based v. GUID-based clustered indexes
Sequential INT-based clustered indexes result in almost no fragmentation at scale—you'd get similar results with BIGINT
A more narrow B-Tree index with INT-based clustered indexes compared to GUID-based clustered indexes (36 intermediate pages compared to 66 in my example.)

INTs, BIGINTs and Even SMALLINTs . . . for the Win!

When architecting your underlying table structure, I'd steer clear of using GUIDs. Examine INTs or BIGINT columns based upon your expected scale (if you know for certain your scale is small you can also examine SMALLINT). GUIDs come with a great deal of overhead and performance hit just to increase the pool of available values open for uniquely-identifying rows in a table. You'd be better off adding a secondary column to deal with scale issues rather than rely on GUIDs in my opinion. The space concerns and institutional fragmentation alone are reasons to keep plenty of space between your tables and GUIDs.

Comments

Plain text