SQL Server 2008 introduced new features to help you work more efficiently with your data. In "Efficient Data Management in SQL Server 2008, Part 1," I explored the new sparse columns feature, which lets you save significant storage space when a column has a large proportion of null values. In this article, I'll look closer at column sets, which provide a way to easily work with all the sparse columns in a table. I'll also show you how you can use filtered indexes with both sparse and nonsparse columns to create small, efficient indexes on a subset of the rows in a table.
Working with Column Sets
A column set can add functionality to a table with sparse columns. As explained in part 1, a column set is an untyped XML column that you can use to get and set all sparse column values as a set. Remember that you can't add a column set to an existing table that already has sparse columns. If you attempt to execute an ALTER TABLE statement on such a table, you'll get one of Microsoft's new, informative error messages explaining the problem. You have three options to solve this problem:
- drop and recreate the table, defining the column set along with the sparse columns
- drop all the sparse columns, then execute an ALTER TABLE statement that adds them back along with the column set
- create an entirely new table
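The second option might look like the following sketch. It assumes a hypothetical table named Dog with a single sparse column, [Weight]; note that dropping the column discards its data, so you'd copy the values somewhere safe first.

-- Assumes a hypothetical Dog table with one sparse column, [Weight].
-- Drop the existing sparse column...
ALTER TABLE Dog DROP COLUMN [Weight];
-- ...then add it back together with the column set in one statement.
ALTER TABLE Dog ADD
    [Weight] INT SPARSE NULL,
    Characteristics XML COLUMN_SET FOR ALL_SPARSE_COLUMNS;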
I'll take the third option, using the code that Listing 1 (below) shows to create a table called DogCS, which includes the Characteristics column set. The code also inserts data for several dogs.
Because it has a column set, DogCS will behave differently from the Dog table in part 1. When you execute the following SELECT statement, you get the results that Figure 1 shows:
SELECT * FROM DogCS
Notice that none of the sparse columns are included in the result set. Instead, the contents of the column set are shown as a snippet of XML data. The results in the figure might be a little hard to read, so here's the complete value of the Characteristics column set field for Mardy, the first row of the results:
<BirthDate>1997-06-30</BirthDate><Weight>62</Weight><OnGoingDrugs>Metacam, Phenylpropanolamine, Rehab Forte</OnGoingDrugs>
Adding a column set to a table schema could break any code that uses a SELECT * and expects to get back individual field data rather than a single XML column. But no one uses SELECT * in production code, do they?
You can retrieve the values of sparse columns individually by including the field name in the SELECT list. For example, the following statement produces the results that Figure 2 shows:
SELECT [DogID],[Name],[LastRabies],[Sleddog],[Handedness],[BirthDate],
    [BirthYear],[DeathDate],[Weight],[Leader],[Rescue],[OnGoingDrugs],
    [SpayNeuter]
FROM DogCS
Similarly, you can explicitly retrieve the column set with the following statement:
SELECT [DogID],[Name],[Characteristics] FROM DogCS
You can also mix and match sparse columns and the column set in the same SELECT statement, which lets you access the data in different forms for use in client applications. The following statement retrieves the Handedness, BirthDate, and BirthYear sparse columns, as well as the column set, and produces the results that Figure 3 shows:
SELECT [DogID],[Name],[Handedness],[BirthDate],[BirthYear],
    [Characteristics]
FROM DogCS
Column sets really become useful when you insert or update data. You can provide an XML snippet as the value of the column set, which populates the associated sparse columns. For example, the following code adds the information for Raja:
INSERT INTO DogCS ([Name],[LastRabies],[Sleddog],[Rescue],[SpayNeuter],
    Characteristics)
VALUES ('Raja', '11/5/2008', 1, 1, 1,
    '<BirthDate>…</BirthDate><Weight>…</Weight>');
Instead of providing a value for each sparse column, the code provides a single value for the column set in order to populate the sparse columns, in this case Raja's birthdate and weight. The following SELECT statement produces the results that Figure 4 shows:
SELECT [DogID],[Name],BirthDate,Handedness,[Characteristics]
FROM DogCS WHERE Name = 'Raja';
Notice that the BirthDate field has the data from the column set value, but because no value was set for Handedness, that field is null.
You can also update existing sparse column values using the column set. The following code updates Raja's record and produces the results that Figure 5 shows:
UPDATE DogCS
SET Characteristics = '<Weight>53</Weight><Leader>1</Leader>'
WHERE Name = 'Raja';

SELECT [DogID],[Name],BirthDate,[Weight],Leader,[Characteristics]
FROM DogCS WHERE Name = 'Raja';
But what happened to Raja's birthdate from the INSERT statement? The UPDATE statement set the field to null because the code didn't include it in the value of the column set. So you can update values using a column set, but it's important to include all non-null values in the value of the column set field. SQL Server sets to null any sparse columns missing from the column set value.
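When you want to change a single value without disturbing the others, you can bypass the column set and update the sparse column directly:

-- Updates only the Leader sparse column; the other sparse values are
-- untouched because the column set isn't involved in this statement.
UPDATE DogCS SET [Leader] = 1 WHERE Name = 'Raja';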
One limitation of updating data using column sets is that you can't update the value of a column set and any of the individual sparse column values in the same statement. If you try to, you'll receive an error message: The target column list of an INSERT, UPDATE, or MERGE statement cannot contain both a sparse column and the column set that contains the sparse column. Rewrite the statement to include either the sparse column or the column set, but not both. In other words, write different statements to update the different fields or use a single statement and include all sparse fields in the column set.
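For example, a statement like the following fails with that error, because it assigns both the column set and one of its underlying sparse columns:

-- This raises the error described above: [Weight] is a sparse column
-- covered by the Characteristics column set.
UPDATE DogCS
SET Characteristics = '<Leader>1</Leader>',
    [Weight] = 53
WHERE Name = 'Raja';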
If you update the column set field and include empty tags in either an UPDATE or INSERT statement, you'll get the default values for each of those sparse columns' data types as the non-null value in each column. (You can mix and match empty tags and tags with values in the same UPDATE or INSERT statement; the fields with empty tags will get the default values, and the others will get the specified values.) For example, the code in Listing 2 (below) uses empty tags for several of the sparse columns in the table, then selects the data. As Figure 6 shows, string fields have an empty string, numeric fields have zero, and date fields have the date 1/1/1900. None of the fields have null values.
The column set field is the XML data type, so you can use all the methods of the XML data type to query and manipulate the data. For example, the following code uses the query() method of the XML data type to extract the weight node from the XML data:
SELECT Characteristics.query('/Weight') FROM DogCS WHERE Name = 'Raja'
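Similarly, the value() method extracts a single value as a relational type, which is often more convenient than the XML fragment that query() returns:

-- value() casts the first Weight element to INT; it returns NULL if
-- the element is missing from the column set value.
SELECT Characteristics.value('(/Weight)[1]', 'INT') AS WeightValue
FROM DogCS
WHERE Name = 'Raja';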
Column Set Security
The security model for a column set and its underlying sparse columns is similar to that for tables and columns, except that for a column set it's a grouping relationship rather than a container. You can set access security on a column set field like you do for any other field in a table. When accessing data through a column set, SQL Server checks the security on the column set field as you'd expect, then honors any DENY on each underlying sparse column.
But here's where it gets interesting. A GRANT or REVOKE of the SELECT, INSERT, UPDATE, DELETE, or REFERENCES permissions on a column set column doesn't propagate to sparse columns. But a DENY does propagate. SELECT, INSERT, UPDATE, and DELETE statements require permissions on a column set column as well as the corresponding permission on all sparse columns, even if you aren't changing a particular sparse column! Finally, executing a REVOKE on sparse columns or a column set column defaults security to the parent.
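Here's a sketch of that behavior, assuming a database principal named TestUser exists (the user name is illustrative):

-- A GRANT on the column set column does NOT propagate to the
-- underlying sparse columns...
GRANT SELECT ON DogCS (Characteristics) TO TestUser;
-- ...but a DENY on a sparse column does propagate, so it's honored
-- when TestUser tries to read the data through the column set.
DENY SELECT ON DogCS ([Weight]) TO TestUser;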
Column set security can be a bit confusing. But after you play around with it, the choices Microsoft has made will make sense.
Filtered Indexes
Filtered indexes are a new feature in SQL Server 2008 that you can use with sparse columns—although you might find them helpful even if a table doesn't have sparse columns. A filtered index is an index with a WHERE clause that lets you index a subset of the rows in a table, which is very useful with sparse columns because most of the data in the column is null. You can create a filtered index on the sparse column that includes only the non-null data. Doing so optimizes the use of tables with sparse columns.
As with other indexes, the query optimizer uses a filtered index only if the execution plan the optimizer develops for a query will be more efficient with the index than without. The optimizer is more likely to use the index if the query's WHERE clause matches that used to define the filtered index, but there are no guarantees. If it can use the filtered index, the query is likely to execute far more efficiently because SQL Server has to process less data to find the desired rows. There are three primary advantages to using filtered indexes:
- They improve query performance, in part by enhancing the quality of the execution plan.
- The maintenance costs for a filtered index are smaller because the index needs updating only when the data it covers changes, not for changes to every row in the table.
- The index requires less storage space because it covers only a small set of rows. In fact, replacing a full-table, nonclustered index with multiple filtered indexes on different subsets of the data probably won't increase the required storage space.
In general, you should consider using a filtered index when the number of rows covered by the index is small in relation to the number of rows in the table. As that ratio grows, it starts to become better to use a standard, nonfiltered index. Use 50 percent as an initial rule of thumb, but test the performance benefits with your data. Use filtered indexes in these kinds of scenarios:
- for columns where most of the data in a field is null, such as with sparse columns; in other words, where the table includes a small number of non-null values in the field
- for columns that represent categories of data, such as a part category key
- for columns with distinct ranges of values, such as dates broken down into months, quarters, or years
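A category-based filtered index from the second scenario might look like the following sketch; the Production.Part table, its columns, and the category value are all hypothetical:

-- Hypothetical Production.Part table: index only the rows in one
-- frequently queried category, keeping the index small.
CREATE NONCLUSTERED INDEX fiPartCategoryFasteners
ON Production.Part (PartNumber)
WHERE PartCategoryID = 4;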
The WHERE clause in a filtered index definition allows only simple comparison logic. It can't reference computed columns, user-defined type columns, spatial columns, or hierarchyid columns. You also can't use LIKE for pattern matching, although IN, IS, and IS NOT are allowed.
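The following sketch contrasts a predicate that's allowed with one that isn't, reusing the DogCS table from earlier in this article:

-- Allowed: simple comparisons and IS NOT NULL, combined with AND.
CREATE NONCLUSTERED INDEX fiDogCSLeaders
ON DogCS ([Weight])
WHERE [Leader] IS NOT NULL AND [Weight] > 40;

-- Not allowed: LIKE can't appear in a filtered index predicate,
-- so this statement raises an error.
-- CREATE NONCLUSTERED INDEX fiDogCSNames
-- ON DogCS (Name)
-- WHERE Name LIKE 'M%';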
Working with Filtered Indexes
Let's look at an example of using a filtered index. The following statement creates a filtered index on the DogCS table:
CREATE INDEX fiDogCSWeight ON DogCS([Weight]) WHERE [Weight] > 40;
This filtered index indexes only the rows where the dog's weight is greater than 40 pounds, which automatically excludes all the rows with a null value in that field. After you create the index, you can execute the following statement to attempt to use the index:
SELECT * FROM DogCS WHERE [Weight] > 40
If you examine the estimated or actual execution plan for this SELECT statement—which Figure 7 shows—you'll find that SQL Server doesn't use the index. Instead, it uses a clustered index scan on the primary key to visit each row in the table. There are too few rows in the DogCS table to make the use of the index beneficial to the query.
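If you want to see what the plan would look like with the index anyway, you can force it with a table hint; this is a diagnostic sketch, not something you'd normally put in production code:

-- Force the optimizer to use the filtered index so you can compare
-- this plan against the clustered index scan it chooses on its own.
-- Selecting only [Weight] lets the index cover the query.
SELECT [Weight]
FROM DogCS WITH (INDEX(fiDogCSWeight))
WHERE [Weight] > 40;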
To explore the benefits of using filtered indexes, let's look at an example using the AdventureWorks2008 Sales.SalesOrderHeader table. It has a CurrencyRateID field that contains very few non-null values. This field isn't defined as a sparse column, but it might be a good candidate for making sparse.
The following code creates a filtered index on the CurrencyRateID field in the table, including all rows with a non-null value in the field:
CREATE NONCLUSTERED INDEX fiSOHCurrencyNotNull ON Sales.SalesOrderHeader(CurrencyRateID) WHERE CurrencyRateID IS NOT NULL;
Next, we run the following SELECT statement:
SELECT SalesOrderID, CurrencyRateID, Comment FROM Sales.SalesOrderHeader WHERE CurrencyRateID = 6266;
Figure 8 shows the estimated execution plan. You can see that the query is able to benefit from the filtered index on the data.
You're not limited to using filtered indexes with just the non-null data in a field. You can also create an index on ranges of values. For example, the following statement creates an index on just the rows in the table where the SubTotal field is greater than $20,000:
CREATE NONCLUSTERED INDEX fiSOHSubTotalOver20000 ON Sales.SalesOrderHeader(SubTotal) WHERE SubTotal > 20000;
Here are a couple of queries that filter the data based on the SubTotal field:
SELECT * FROM Sales.SalesOrderHeader WHERE SubTotal > 30000;

SELECT * FROM Sales.SalesOrderHeader
WHERE SubTotal > 21000 AND SubTotal < 23000;
Figure 9 shows the estimated execution plans for these two SELECT statements. Notice that the first statement doesn't use the index, but the second one does. The first query probably returns too many rows (1,364) to make the use of the index worthwhile. But the second query returns a much smaller set of rows (44), so the query is apparently more efficient when using the index.
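To keep track of the filtered indexes in a database and their predicates, you can query the catalog views; sys.indexes gained has_filter and filter_definition columns in SQL Server 2008:

-- Lists every filtered index in the current database along with
-- its filter predicate.
SELECT OBJECT_NAME(object_id) AS table_name,
       name AS index_name,
       filter_definition
FROM sys.indexes
WHERE has_filter = 1;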
You probably won't want to use filtered indexes all the time, but if your situation fits one of these scenarios, they can make a big difference in performance. As with all indexes, you'll need to take a careful look at the nature of your data and the patterns of usage in your applications. Yet again, SQL Server mirrors life: it's full of tradeoffs.
Sparse columns, column sets, and filtered indexes are useful tools that can make more efficient use of storage space on the server and improve the performance of your databases and applications. But you'll need to understand how they work and test carefully to make sure they work for your data and with how your applications make use of the data.
Listing 1: Creating the DogCS table with a column set and inserting sample data

CREATE TABLE DogCS (
    DogID INT NOT NULL PRIMARY KEY IDENTITY(1,1),
    Name NVARCHAR(20) NOT NULL,
    LastRabies DATE NULL,
    Sleddog BIT NOT NULL,
    Handedness NVARCHAR(5) SPARSE NULL,
    BirthDate DATE SPARSE NULL,
    BirthYear INT SPARSE NULL,
    DeathDate DATE SPARSE NULL,
    [Weight] INT SPARSE NULL,
    Leader BIT SPARSE NULL,
    Rescue BIT NULL,
    OnGoingDrugs NVARCHAR(50) SPARSE NULL,
    SpayNeuter BIT NULL,
    Characteristics XML COLUMN_SET FOR ALL_SPARSE_COLUMNS
);

INSERT INTO DogCS ([Name],[LastRabies],[Sleddog],[BirthDate],[Weight],
    [Rescue],[OnGoingDrugs],[SpayNeuter])
VALUES ('Mardy', '11/3/2005', 0, '6/30/1997', 62, 1,
    'Metacam, Phenylpropanolamine, Rehab Forte', 1);

INSERT INTO DogCS ([Name],[LastRabies],[Sleddog],[BirthYear],[Leader],
    [Rescue],[SpayNeuter])
VALUES ('Izzi', '11/3/2005', 1, 2001, 0, 1, 1);

INSERT INTO DogCS ([Name],[LastRabies],[Sleddog],[Rescue],
    [OnGoingDrugs],[SpayNeuter])
VALUES ('Jewel', '9/23/2007', 0, 1, 'Rehab Forte', 1);

INSERT INTO DogCS ([Name],[LastRabies],[Sleddog],[BirthYear],[Leader],
    [Rescue],[SpayNeuter])
VALUES ('Casper', '10/17/2007', 1, 2002, 1, 1, 1);

INSERT INTO DogCS ([Name],[LastRabies],[Sleddog],[BirthYear],[Weight],
    [Leader],[Rescue],[SpayNeuter])
VALUES ('Chance', '9/23/2007', 1, 2002, 36, 1, 1, 1);

INSERT INTO DogCS ([Name],[LastRabies],[Sleddog],[BirthDate],[Weight],
    [Leader],[Rescue],[OnGoingDrugs],[SpayNeuter])
VALUES ('Daikon', '10/17/2007', 1, '2/14/1997', 50, 1, 0,
    'Potassium bromide, DMG', 1);
Listing 2: Inserting a row with empty tags in the column set value

INSERT INTO DogCS ([Name],[LastRabies],[Sleddog],[Rescue],[SpayNeuter],
    Characteristics)
VALUES ('Star', '11/5/2008', 0, 0, 1,
    '<Handedness/><BirthDate/><BirthYear/><Weight/><Leader/><OnGoingDrugs/>');

SELECT [DogID],[Name],[LastRabies],[Sleddog],[Handedness],[BirthDate],
    [BirthYear],[DeathDate],[Weight],[Leader],[Rescue],[OnGoingDrugs],
    [SpayNeuter]
FROM DogCS WHERE Name = 'Star'