Skip navigation
Efficient Data Management in SQL Server 2008, Part 1

Efficient Data Management in SQL Server 2008, Part 1

Use sparse columns to help reduce data storage requirements

Even in this era of dirt-cheap terabyte storage devices, efficient data storage and storage management is still a prime concern for both DBAs and developers. So it's particularly vexing that in a typical SQL Server database, null data takes up storage space—potentially several bytes for each field in each row, depending on the data type of the field. SQL Server has to record that there is no data in some way.

Related: Efficient Data Management in SQL Server 2008, Part 2

Sparse columns are a tool Microsoft provides in SQL Server 2008 to help minimize data storage space. Sparse columns provide an efficient way to store data that's predominantly null because they require no disk storage space for the null values. This situation sounds great, but there's a cost: In a field marked as sparse, non-null data takes up a bit more storage space than if the field weren't sparse—either two or four additional bytes, depending on the data type of the field. The Microsoft article "Using Sparse Columns" has all the details, but it boils down to four bytes for most data types and either two or four bytes for data-dependent-length data types, such as varchar, varbinary, XML, and hierarchyid.

Microsoft recommends using sparse columns for fields in a table when the overall storage space savings will be 20 to 40 percent. Sparse columns usually make sense when there's a lot of null data in the field, which often happens when the database supports an object-relational mapping architecture where several concrete classes inherit from an abstract class and the data is all saved in a single table. In that case, not all of the derived concrete classes use all the available fields, and you're wasting storage space even with null data. (You could implement an object-relational mapping support database with multiple tables, but it would be more complex and harder to maintain.)

Restrictions on Sparse Columns

As you'll see later in this article, making a column sparse is as easy as using the SPARSE keyword in a CREATE or ALTER TABLE statement. But SQL Server has a number of restrictions on the use of sparse columns. The most important restrictions are:

  • A sparse column must be nullable, of course. It makes no sense to make a non-null column sparse because it can never include null data.
  • You can't use ROWGUIDCOL or IDENTITY properties on the column.
  • The column can't have a default value. This restriction makes sense when you consider that a default value makes it less likely that you'll have null data in the field.
  • You can't use sparse columns in a user-defined table type.
  • If a table has sparse columns, you can't compress it at either the row or page level.

Those restrictions aren't all that onerous, but there are a few other things to consider. You can't use sparse columns with all data types; forbidden types are geography, geometry, text, ntext, timestamp, image, and user-defined data types. You can't make varbinary(max) fields that use FILESTREAM storage sparse. You also can't mark a computed column as sparse, but you can use a sparse column in the calculation of a computed column.

You could face difficulties if you need to change the schema of a table that contains sparse columns. Changing the schema requires some work space within a row, so if the row size is near 4,009 bytes (close to half the maximum size), adding or removing a sparse column can fail. SQL Server needs to make a copy of the data, but might not have room to do it. In practice, you can save the data to a temporary table, drop and recreate the table with the sparse columns, then move the data back—a bit of a hassle, but not the end of the world.

A table with sparse columns has a slightly reduced maximum row size: 8,018 instead of the usual 8,060. SQL Server needs the additional space for a sparse vector that it uses to manage the sparse columns in the table. Additionally, all fixed-length non-null sparse columns have a maximum length of 8,019 bytes. But the really good news is that you can have up to 30,000 sparse columns in a table, so you can get just about as carried away as you want with a table schema.

All of these restrictions are manageable and generally don't severely restrict the use of sparse columns.

Column Sets

A table with one or more sparse columns can also have a column set field. A column set is an alternative way to view and work with all of the sparse columns in a table. A column set is an untyped XML column that you can use to get and set all sparse column values as a set. It's similar to a computed column in that the column set isn't physically stored in the table, but you can update the values of the sparse columns through the column set.

A column set simplifies working with sparse columns in a table because it aggregates the underlying relational columns into an XML snippet. Even if your table has only a few sparse columns, a column set can simplify working with the data. I'll show you how this works in part 2 of this article.

Like sparse columns, column sets have a few restrictions. Some of the more significant restrictions you're likely to face include:

  • You can't change a column set after it exists in the table. Instead, you must drop it, then re-create it.
  • You can't add a column set to a table that already has sparse columns. This restriction is probably the most severe.
  • You can have only one column set per table; the column set encapsulates all of the sparse columns in that table.
  • There can be no constraints or default values on the column. Remember, it's essentially an updateable computed column that stores no data directly.

There are other restrictions documented in the Microsoft article "Using Column Sets," so be sure to check out the details.

Good Idea or Bad Design?

Sparse columns can solve certain kinds of problems with database design, although they might offend database design purists. You should never use sparse columns as an excuse for poor database and table design. Some uses of sparse columns denormalize data, which you should only do with good reason. Also, look at the performance and storage changes for your data. You might see an improvement in either—or you might not—depending on the nature of the data and the patterns of use in your applications.

Always use careful database design and implementation. As cool as sparse columns are, they aren't appropriate for every scenario, particularly when you’re tempted to violate normalization rules in order to cram more fields into a table.

Working with Sparse Columns

To explore the features of sparse columns, we'll work with a table called Dog, which contains information about the dogs in my household, stored in the tempdb database. I have 31 dogs; most of them are working or retired sled dogs here in Fairbanks, Alaska. The code in Listing 1 creates the table.

CREATE TABLE Dog
(
  DogID INT NOT NULL PRIMARY KEY IDENTITY(1,1),
  Name NVARCHAR(20) NOT NULL,
  LastRabies DATE NULL,
  Sleddog BIT NOT NULL,
  Handedness NVARCHAR(5) SPARSE NULL,
  BirthDate DATE SPARSE NULL,
  BirthYear INT SPARSE NULL,
  DeathDate DATE SPARSE NULL,
  \\[Weight\\] INT SPARSE NULL,
  Leader BIT SPARSE NULL,
  Rescue BIT NULL,
  OnGoingDrugs NVARCHAR(50) SPARSE NULL,
  SpayNeuter BIT NULL
)

Some of the dogs are pets, others are working sled dogs, and all have various characteristics that I want to record. Some columns apply to all dogs; others apply just to the working sled dogs. And some of the columns apply in theory to all the dogs, but contain data only for a few. Table 1 lists the columns and some information that forms the basis for deciding whether each column is a candidate to be a sparse column.

As you can see, there are some decisions inherent in the design that might be good or bad. But let's go with this structure. After the table is created, you can populate it using regular INSERT statements. The code in Listing 2 inserts the data for several dogs. A T-SQL row constructor wouldn't work because each insert must have the same fields. But in this case, not all dogs have information for all fields, and it would be cumbersome to include all the null values necessary to create each row.

INSERT INTO Dog (\\[Name\\], \\[LastRabies\\], \\[Sleddog\\], \\[BirthDate\\],
  \\[Weight\\], \\[Rescue\\], \\[OnGoingDrugs\\], \\[SpayNeuter\\])
  VALUES ('Mardy', '11/3/2005', 0, '6/30/1997', 62, 1,
  'Metacam, Phenylpropanolamine, Rehab Forte', 1);
INSERT INTO Dog (\\[Name\\], \\[LastRabies\\], \\[Sleddog\\], \\[BirthYear\\],
  \\[Leader\\], \\[Rescue\\], \\[SpayNeuter\\])
  VALUES ('Izzi', '11/3/2005', 1,  2001, 0, 1, 1);
INSERT INTO Dog (\\[Name\\], \\[LastRabies\\], \\[Sleddog\\], \\[Rescue\\],
  \\[OnGoingDrugs\\], \\[SpayNeuter\\])
  VALUES ('Jewel', '9/23/2007', 0, 1, 'Rehab Forte', 1);
INSERT INTO Dog (\\[Name\\], \\[LastRabies\\], \\[Sleddog\\], \\[BirthYear\\],
  \\[Leader\\], \\[Rescue\\], \\[SpayNeuter\\])
  VALUES ('Casper', '10/17/2007', 1, 2002, 1, 1, 1);
INSERT INTO Dog (\\[Name\\], \\[LastRabies\\], \\[Sleddog\\], \\[BirthYear\\],
  \\[Weight\\], \\[Leader\\], \\[Rescue\\], \\[SpayNeuter\\])
  VALUES ('Chance', '9/23/2007', 1, 2002, 36, 1, 1, 1);
INSERT INTO Dog (\\[Name\\], \\[LastRabies\\], \\[Sleddog\\], \\[BirthDate\\],
  \\[Weight\\], \\[Leader\\], \\[Rescue\\], \\[OnGoingDrugs\\], \\[SpayNeuter\\])
  VALUES ('Daikon', '10/17/2007', 1, '2/14/1997', 50, 1, 0,
  'Potassium bromide, DMG', 1);

Figure 1 shows the contents of the Dog table after you've inserted the data. Notice that the sparse columns have very little non-null data, so we mostly made the right decisions about which columns to make sparse. The one column that we might want to reconsider is the Leader column, which has only two nulls in the 6 rows of data. But it might make sense to make it sparse when all the dogs are inserted into the table, instead of just this sample.

Figure 1: The Dog table with inserted data
Figure 1: The Dog table with inserted data

There's nothing functionally different about a table that includes sparse columns; only the storage has changed. All the typical INSERT, UPDATE, and DELETE statements work as with any other table that doesn't have sparse columns, and you can wrap operations in transactions as usual. For example, the following statements works on the Dog table as you'd expect:

SELECT * FROM Dog WHERE Leader = 1;

SELECT Name, Handedness, Rescue
  FROM Dog WHERE Leader IS NOT NULL;

Even this slightly more complicated example has the expected results:

BEGIN TRANSACTION
  DELETE FROM Dog
    WHERE \\[Weight\\] < 40 OR
    DeathDate IS NOT NULL;
  SELECT * FROM Dog;
ROLLBACK TRANSACTION

Figure 2 shows the results of executing the three previous statements.

Figure 2: The results of executing statements on the Dog table
Figure 2: The results of executing statements on the Dog table

You won't see much, if any, difference in storage requirements in this trivially small table. But start adding 10 or 100 million records and you'll see some significant storage savings, particularly if the ratio of non-null to null values in the sparse columns stays small.

Work More Efficiently

In part 2 of this article, I'll explore column sets and how you can use them to manage access to a large number of sparse columns in a table, as well as explore filtered indexes. These indexes aren't limited to use with sparse columns; they let you create a table index on a subset of table rows. Along with sparse columns, you'll find filtered indexes to be another tool in SQL Server 2008 to help you work more efficiently.

Hide comments

Comments

  • Allowed HTML tags: <em> <strong> <blockquote> <br> <p>

Plain text

  • No HTML tags allowed.
  • Web page addresses and e-mail addresses turn into links automatically.
  • Lines and paragraphs break automatically.
Publish