Column Cardinality, Statistics, and Query Plans: Order Can Matter Tim Ford, https://www.flickr.com/photos/fordplay/sets/

Column Cardinality, Statistics, and Query Plans: Order Can Matter

It All Starts with Statistics . . . But First Let's Talk About Indexes

I want to take this discussion back to the simplest explanation of what an index is for a second (bear with me all you SQL pros out there, I want to make sure the SQL Server wiz-kids of tomorrow understand this before moving on).

Think of an index in a database as you would a phone book, you know, those things your parents used back in the 1970s to look up phone numbers before placing a call and that I sit on so as to see above the dinner table. Now, using that phone book, what would you do if instructed to look up all the "Trevor Moon" entries; you’d do so by turning to the first page where you find "Moon, Trevor" and then scan the remaining records one at a time until you run out of Trevors. In database parlance, you just employed an index (if one was created on last name and first name), to perform a seek to the first record that met the criteria, and then did a range scan until you ran out of matching values.

Related: Physical ordering of records in an index

Indexes are constructs built inside a database that allows for identifying subsets of data by column(s) for easier retrieval. The alternative would be to scan all the values in a table every time you’re looking for any record or records. The Query Optimizer (QO) needs to have an idea about how the data is distributed in an index when creating an execution plan. If the data is distributed evenly, it may decide to do a scan of the entire index (or a table scan if there is no matching clustered index). If the data is distributed in a less uniform fashion, it may determine to build a plan that will seek an index then look for a specific value or values that match the search predicate. At the heart of it all are the statistics that are built for each index that determine both the density (the uniqueness of the data) and the distribution (how the various column values are distributed) in any index.

Statistics Are Required

At this point the important thing to understand is that statistics for an index are required to allow the QO to make the right judgment calls when creating a correct plan (or at least a "good enough" plan), and those statistics provide information about density and distribution of the underlying data values. Now, what if I were to tell you that the entire set of statistics can be "gamed" or thrown off balance if an incorrect column is chosen for an index (or for the sake of this discussion on index order) the wrong column is the first column in a multi-column index?

Statistics in terms of both density and distribution weigh heavy upon the first column in an index, and hence, the crux of our discussion: Chosing the right first column for an index.

How The First Column of an Index Impacts Statistics

Statistics for data distribution are sorted into a series of "bins"—up to 200 per index—that the query optimizer can then use to make judgment calls when estimating how many rows are returned based upon the distribution of the data—crucial information when building a plan. If you place a column with low cardinality—very homogenous data where most rows for the columns are the same value or are the same value—you'll end up with very few bins. Matter of fact, if all the values are constant for that column you actually end up with a single bin, rather than a better distributed set of bins like perhaps what would be afforded if you took that classic bell curve shape and carved it equally in a vertical fashion across the x-axis into (up to) 200 sections. This is why it’s recommended you should usually place your most-selective column into the first position of a multi-column index.

In the case of low cardinality columns leading the index, the QO now thinks that all the data is distributed in such a fashion even though that is probably (hopefully), not the case. Your chances for getting saddled with an improper plan are now much higher than if your index was led with a column whose data distribution better represented the data distribution for the full index.

Now the fun part. Showing code and pictures to illustrate this point . . .

Real World Example

I've taken production code from a very large table—150 million rows—and masked the data, dropped the first 2 million rows into two identical tables in a new database. Here are the table creation structures (sans population):

CREATE TABLE dba_stats.dbo.tbl_control 
	(
		ID INT NOT NULL IDENTITY(1,1)
		, CLOCK_TICK DECIMAL(16,2) NOT NULL
		, CLOCK_TIME DATETIME NOT NULL
		, METRIC_CODE INT NOT NULL
		, [USER_ID] VARCHAR(6) NOT NULL
		, AUTH_CODE INT
		, ACTION_CODE BIT NOT NULL
	);


CREATE TABLE dba_stats.dbo.tbl_improve 
	(
		ID INT NOT NULL IDENTITY(1,1)
		, CLOCK_TICK DECIMAL(16,2) NOT NULL
		, CLOCK_TIME DATETIME NOT NULL
		, METRIC_CODE INT NOT NULL
		, [USER_ID] VARCHAR(6) NOT NULL
		, AUTH_CODE INT
		, ACTION_CODE BIT NOT NULL
	);

At this point, we have two identical tables. I've gone on to populate them in the same fashion with the same data. Now, we'll switch them up by implementing different indexes on each table:

ALTER TABLE dba_stats.dbo.tbl_control 
	ADD CONSTRAINT PK_tbl_control PRIMARY KEY CLUSTERED (ID) 
		WITH (FILLFACTOR = 100) ON [PRIMARY]
GO

CREATE NONCLUSTERED INDEX [NC_control_ACTION_CODE_METRIC_CODE] 
	ON dba_stats.dbo.tbl_control
(
	[ACTION_CODE] ASC,
	[METRIC_CODE] ASC
) WITH (FILLFACTOR = 100)
GO

CREATE NONCLUSTERED INDEX [NC_control_ACTION_CODE_CLOCK_TIME] 
	ON dba_stats.dbo.tbl_control
(
	[ACTION_CODE] ASC,
	[CLOCK_TIME] ASC
) WITH (FILLFACTOR = 100)
GO

 

ALTER TABLE dba_stats.dbo.tbl_improve 
	ADD CONSTRAINT PK_tbl_improve PRIMARY KEY CLUSTERED (ID) 
		WITH (FILLFACTOR = 100) ON [PRIMARY]
GO

CREATE NONCLUSTERED INDEX [NC_improve_METRIC_CODE_ACTION_CODE] 
	ON dba_stats.dbo.tbl_improve
(
	[METRIC_CODE] ASC,
	[ACTION_CODE] ASC
) WITH (FILLFACTOR = 100)
GO

CREATE NONCLUSTERED INDEX [NC_improve_CLOCK_TIME_ACTION_CODE] 
	ON dba_stats.dbo.tbl_improve
(
	[CLOCK_TIME] ASC,
	[ACTION_CODE] ASC
) WITH (FILLFACTOR = 100)
GO

One thing I'll tell you before we look at the data, is that ACTION_CODE is an extremely low cardinality column. All values are identical across all rows in that column. This is quite different from the METRIC_CODE and CLOCK_TIME columns which have a much higher cardinality, as can be seen in the partial results of the queries below: 

SELECT ACTION_CODE, COUNT(1) 
FROM dbo.tbl_control 
GROUP BY ACTION_CODE 
ORDER BY 2 DESC;

SELECT METRIC_CODE, COUNT(1) 
FROM dbo.tbl_control 
GROUP BY METRIC_CODE 
ORDER BY 2 DESC;

SELECT ACTION_CODE, COUNT(1) 
FROM dbo.tbl_control 
GROUP BY ACTION_CODE 
ORDER BY 2 DESC;

SELECT CLOCK_TIME, COUNT(1) 
FROM dbo.tbl_control 
GROUP BY CLOCK_TIME 
ORDER BY 2 DESC;

We're going to want to focus on the ACTION_CODE v. CLOCK_TIME columns to illustrate where low cardinalilty has an effect on performance. If we look at the stats for the two indexes that rely on the combination of those two columns in each table, we can draw some clear distinctions on the effect of low cardinality columns in that first position. 

The two indexes are:

  • NCI_control_ACTION_CODE_CLOCK_TIME on the "Control" table with the low cardinality ACTION_CODE column in the first positon and the higher cardinality CLOCK_TIME column in the second spot
  • NCI_improve_CLOCK_TIME_ACTION_CODE on the "Improve" table with the columns reversed in order.

Running DBCC SHOW_STATISTICS ("table_name," "index_name"); will provide us with three sections of metadata on our statistics for the identified index:

  • Header – general metadata about the set of statistics
  • Density – the density (uniqueness) of the various column(s) in the index
  • Histogram – the table that defines the distribution of the data. (This section delineates the bins in table form.)

Taking a look at the "bad" index first, with low cardinality in the front and good cardinality following, we see that our bin count is 1 and our density is extremely poor:

DBCC SHOW_STATISTICS ("tbl_control", "NC_control_ACTION_CODE_CLOCK_TIME");

The headers show us the total number of rows and rows sampled as well as steps; lesser-importance columns for this discussion have been omitted. Since the Rows and Rows Sampled columns match, you can determine the statistics were based from a full scan—they were created after a rebuild where all rows were scanned as part of the rebuild. If the Rows Sampled count was lesser than the Rows count, this would tell us that the statistics were based upon a sample of rows instead. We also can see how current the statistics are based upon the Updated column.

The next section, the density section, shows us the uniqueness of values in combination of columns as they are ordered in the index.  The first column is 100% non-unique, it's only after CLOCK_TIME is added to the index that uniqueness occurs in any form. Since this is a non-clustered index, the clustering key is ultimately also included in the index (step 3) in this case the clustering key is unique so 100% uniqueness occurs after all three index columns are taken into consideration.

Finally, we have the Histogram. Not much to see in this index since there is a single bin due to the low cardinality of the first column.  Every record falls into a single bin since there is no distribution of values. The QO will act as though the distribution for the entire index should be based from a single bin as a result.  We'll see the ramifications when we look at the execution plan for a query against this table at the end of this article.

What about when the index is simply ordered in a better manner with the same columns and data but with a higher cardinality column in the first position?

DBCC SHOW_STATISTICS ("tbl_improve", "NC_improve_CLOCK_TIME_ACTION_CODE");

In contrast we have much better density and 155 steps (bins) when compared to a single bin for the index having a low cardinality column leading with the same data and columns in the index but simply in a different ordinal position within the index!

By the time you get to step two in the density section both indexes appear identical statistically since the same combination of columns are present there and in step 3.  However you now have more bins - better represented distribution of data - and the Query Optimizer (QO) believes the data in the index to be distributed in a manner that is more like the actual data IN the index.  What does this mean for query executions?  That's where the real interest falls after all, right?

The Effect on Query Plans and Executions

To drive the point home, we're going to run the same query against both tables (using OPTION (RECOMPILE) to avoid any possible plan caching scenario):

------------------------------------------------------
---  QUERY CONTROL TABLE
------------------------------------------------------
SELECT METRIC_CODE, ACTION_CODE, [USER_ID]
FROM dba_stats.dbo.tbl_control
WHERE CLOCK_TIME >= '2009-10-08 00:00:00' 
	AND CLOCK_TIME < '2009-10-08 01:00:00'
OPTION (RECOMPILE)
GO


------------------------------------------------------
---  QUERY IMPROVE TABLE
------------------------------------------------------
SELECT METRIC_CODE, ACTION_CODE, [USER_ID]
FROM dba_stats.dbo.tbl_improve
WHERE CLOCK_TIME >= '2009-10-08 00:00:00' 
	AND CLOCK_TIME < '2009-10-08 01:00:00'
OPTION (RECOMPILE)
GO

The results do not matter, but suffice to say, the same results are returned for both queries against both tables:

(1499 row(s) affected)
Table 'tbl_control'. Scan count 1, logical reads 6075


(1499 row(s) affected)
Table 'tbl_improve'. Scan count 1, logical reads 4607

The I/O statistics are the first sign that something about using a higher cardinality column for the leading column in an index is a better choice:  A 25% reduction in I/O is a good start wouldn't you agree? If we compare the execution plans for both queries, we'll see why I/O was reduced for the second query:

It's because we ended up with a plan that performs a seek (with an associated range scan) that then employs a key lookup for specific values inside the table's clustered index, rather than performing a clustered index scan when the statistics are based upon the distribution of the low cardinality column (ACTION_CODE).

But Wait . . . I Said "Order CAN Matter"

Sometimes the QO will still decide that the cost of performing a table or clustered index scan is still cheaper than any other option.  Such is the case when we look at a query that would touch the indexes using the combination of ACTION_CODE and METRIC_CODE.  Here are those relevant indexes:

  • NCI_control_ACTION_CODE_METRIC_CODE on the "Control" table with the low cardinality ACTION_CODE column in the first positon and the higher cardinality METRIC_CODE column in the second spot.
  • NCI_improve_METRIC_CODE_ACTION_CODE on the "Improve" table with the columns reversed in order.

If we were to take a look at the statistics for these two indexes, we see similar results as we did with those indexes associated with ACTION_CODE and CLOCK_TIME:

DBCC SHOW_STATISTICS ("tbl_control", "NC_control_ACTION_CODE_METRIC_CODE");

DBCC SHOW_STATISTICS ("tbl_improve", "NC_improve_METRIC_CODE_ACTION_CODE");

In running a sample query against both tables, results in plans—even though the indexes and statistics are different—are arguably better for the tbl_improve table:

------------------------------------------------------
---  QUERY CONTROL TABLE
------------------------------------------------------
SELECT METRIC_CODE, ACTION_CODE, [USER_ID]
FROM dba_stats.dbo.tbl_control
WHERE METRIC_CODE = 134000 AND ACTION_CODE = 1
OPTION (RECOMPILE)
GO


------------------------------------------------------
---  QUERY IMPROVE TABLE
------------------------------------------------------
SELECT METRIC_CODE, ACTION_CODE, [USER_ID]
FROM dba_stats.dbo.tbl_improve
WHERE METRIC_CODE = 134000 AND ACTION_CODE = 1
OPTION (RECOMPILE)
GO

Be Sure to Consider the Possibilities of Key Order in Performance Tuning

I hope that this article has provided you with yet one more option to consider when performance tuning: key ordinal position in composite indexes as well as column selection for any index. When it comes right down to it, selectivity/cardinality/column choice has major impacts on plan creation, which in turn, is the crux to most performance tuning matters on your SQL Server instances.

Related: Behind the Scenes with SQL Server Included Columns and Covering Indexes

Hide comments

Comments

  • Allowed HTML tags: <em> <strong> <blockquote> <br> <p>

Plain text

  • No HTML tags allowed.
  • Web page addresses and e-mail addresses turn into links automatically.
  • Lines and paragraphs break automatically.
Publish