Improve the Performance of Data Warehouse Queries with Columnstore Indexes

The columnstore index is one of the most exciting new features available in SQL Server 2012. Combined with new query processing algorithms, columnstore indexes are designed to improve the performance of data warehouse queries by several orders of magnitude. I'll explain what a columnstore index is, how it improves the performance of your data warehouse queries, and some of the limitations on the current release. I'll also provide a few examples for how to start using this new technology.

What Is a Columnstore Index?

The traditional storage approach used by SQL Server and most other databases is to store rows on a data page, which is called a row store. Row stores in SQL Server include heaps (i.e., tables with unordered data) and b-tree structures (e.g., clustered and standard nonclustered indexes). Columnstore indexes dedicate entire database pages to store data from a single column. A columnstore index is actually divided into segments, which consist of multiple pages, and each segment is stored in SQL Server as a separate BLOB.

The column-based storage approach is not new and has been used before on some other databases, although Microsoft claims that SQL Server is the first major database vendor to offer a pure column-based approach. Columnstore indexes are based on Microsoft's VertiPaq technology, which is also used in SQL Server Analysis Services (SSAS), PowerPivot for Excel, and SharePoint.

New query processing algorithms are an integral part of the columnstore index technology. These algorithms are designed to process data in batches and are very efficient at handling large amounts of data. A batch is stored as a vector in a separate area of memory and typically represents about 1,000 rows of data. Batch processing introduces a set of new operators, which process a batch of rows at a time. This is different from the standard query processing mode, where query operators process only one row at a time.

Columnstore indexes are designed for data warehouses, where queries typically scan many records but use only a few columns. They can be created in the fact tables of a data warehouse, although in some cases, they can be useful for extremely large dimension tables, too (e.g., those containing more than 10 million rows).

Reasons for the Improvement Boost

There are several reasons for the increased performance of data warehouse queries using columnstore indexes:

1. Highly compressed data. With columnstore indexes, data from the same column is stored contiguously and will usually have repeated or similar values, which can often be compressed more effectively than in a row store. Columnstore compression isn't the same as the row and page compression available in SQL Server 2008; the data is compressed using VertiPaq compression algorithms. Heavily compressed data improves performance, because it requires fewer disk I/O operations. In addition, more data can fit into memory.

2. Reduced I/O. In a row store, SQL Server always reads all the columns of a row, including the columns that aren't used by the query. Because typical star join queries use only 10 percent to 15 percent of the columns in a fact table, using a columnstore to read only those columns can represent a savings of 85 percent to 90 percent in disk I/O.

3. Batch processing. As mentioned previously, the new batch execution mode processes data in batches, which is more efficient for large amounts of data.

4. Improved buffer pool usage. Having highly compressed data and reading only the columns that are required improves buffer pool usage, because more data can be kept in memory.

5. Segment elimination. As mentioned previously, columnstore indexes are divided into segments. SQL Server maintains metadata containing the minimum and maximum values of the columns in each segment. These values are checked by the storage engine and compared against query filter conditions to avoid reading segments that are not needed.

Columnstore indexes can provide other benefits as well, such as reducing or eliminating the need to rely on prebuilt aggregates (e.g., summary tables, indexed views, OLAP cubes) because data warehouse queries can now be executed faster than before. Columnstore indexes are also more flexible than aggregates. If a query changes, the columnstore index will still be useful, whereas a prebuilt aggregate might no longer be able to support it. In addition, the only action required to take advantage of these performance improvements is to define a columnstore index in your fact tables. There is no need to change your queries, use any specific syntax, or do anything else.

However, keep in mind that although columnstores can speed up the performance of star join queries in data warehouses, they might not be a good choice in other situations where row stores might perform better. Columnstores are optimized for data warehouse workloads and not intended as a replacement for the operators currently used in OLTP queries.

The query optimizer will select the correct index and operations according to the query, so you can get the best of both worlds. For example, in the case of very selective queries returning only a few records, the query optimizer might rely on row stores, such as clustered or nonclustered indexes, to find the records quickly. There are no seeks in columnstore indexes. And even in the rare cases in which the query optimizer is not making a good selection, you still have the choice of using hints.

Optimizing Queries

When optimizing a query, the query optimizer can use any of the available access methods, which now include columnstore indexes. As always, this will be a cost-based decision. You still have the choice of using hints when the query optimizer isn't giving you a good execution plan. When you want to force the use of a columnstore index, you can use the existing INDEX hint with the columnstore index's name or ID. When you want to prevent the use of a columnstore index, you can use the new IGNORE_NONCLUSTERED_COLUMNSTORE_INDEX hint.

As previously mentioned, query operators can now run in batch or row mode. The query optimizer decides which mode to use. Operators usually run in row execution mode, but when a columnstore index is available, some operators might run in batch mode. Currently, the operators that can run in batch mode are Scan, Project, Filter, Hash Aggregate, Hash Join, and Batch Hash Table Build.

Execution plans will show both the estimated execution mode and actual execution mode. Most of the time, they'll have the same value -- either batch or row. A plan showing an estimated execution mode of batch and an actual execution mode of row might be evidence of a performance problem. Limitations on memory or threads can cause one operation to dynamically switch from batch mode to row mode, degrading the performance of the query execution.

If the estimated plan showed a parallel plan and the actual plan switched to serial, you can infer that there weren't enough threads. If the plan actually ran in parallel, you know that memory was the problem. Some of the new batch operators might use hash tables, which are required to fit in memory, so if not enough memory is available at execution time, SQL Server might dynamically switch the operation back to row mode, where standard hash tables can be used and can spill to disk. Switching to row execution mode due to memory limitations can be caused by bad cardinality estimations, so you might consider verifying and updating the statistics for your tables. Note that columnstore indexes don't have optimizer statistics of their own, so statistics refers to those belonging to the other objects used in the query.

Limitations

Perhaps the most noticeable limitation of columnstore indexes in SQL Server 2012 is that its data isn't updatable. In other words, no INSERT, DELETE, UPDATE, MERGE, or other update operations are allowed. Not being able to update data might seem like a huge disadvantage, but it might not be a big problem, as the target for this technology is data warehouses that are usually read-only most of the day and might just require an update once daily. A few workarounds to update data using columnstore indexes are explained in the next section. Updatable columnstore indexes are planned for a future release of SQL Server, according to Microsoft.

A second limitation is that columnstore indexes are nonclustered indexes, so they still require the main table, which could be either a clustered index or a heap. This mainly means that you'll end up with two copies of the same data. Microsoft has said that this limitation will go away in a future release of SQL Server, which will have a columnstore index as the main table.

Finally, some data types aren't allowed. According to SQL Server 2012 RC0 Books Online (BOL), the following data types can't be used in a columnstore index:

binary and varbinary
ntext, text, and image
varchar(max) and nvarchar(max)
uniqueidentifier
rowversion (and timestamp)
sql_variant
decimal (and numeric) with precision greater than 18 digits
datetimeoffset with scale greater than 2
CLR types (hierarchyid and spatial types)
xml

Updating Data

As mentioned previously, columnstore indexes aren't updateable in the current SQL Server 2012 release. If you need to update data in a columnstore index, there are some workarounds:

1. Drop or disable the columnstore index, update or load the new data, then create or build the columnstore index again.

2. Use partition switching. Let's examine two scenarios, depending on whether you want to load new data or update existing data. If you want to load new data, you can load it into a staging table, create the columnstore index, then switch the staging table into an empty partition in the main table. If you need to update existing data, you can switch a partition into a staging table, drop or disable the columnstore index, update the data, create or build the columnstore index again, then switch the staging table back into the main table.

3. Use UNION ALL. In this scenario, you have two tables: a fact table with a columnstore index and an updateable table with the same schema and the most current data. You query both tables at the same time using UNION ALL. You'll need to verify the execution plan to make sure that batch processing is being used by the heaviest operations. The work on the small table will be performed in row mode, but this shouldn't be a problem because it's only for a relatively small number of records. This workaround might not work in all cases, and you might need to write separate queries, subqueries, or common table expressions (CTEs) for each table that contributes to the UNION ALL operation, then combine all the results. Again, you'll need to verify the plan to make sure batch processing is being used on the most expensive operations.

Using the New Technology

If you'd like to try using a columnstore index, I created the CreateColumnstoreIndex.sql script, which you can download by clicking the link at top of this page. This script will work on either SQL Server 2012 Release Candidate 0 (RC0), which you can download from the Microsoft SQL Server website, or SQL Server 2012 Community Technology Preview 3 (CTP3). To follow this example, you'll also need the AdventureWorksDWDenali database, which is available at the Codeplex website.

To begin, run CreateColumnstoreIndex.sql. This script creates a new fact table named FactInternetSalesNew, copies some data into it, and creates a columnstore index that includes all the columns in the table. Note that when you create a columnstore index, the order of the column names specified in the CREATE INDEX statement doesn't matter because the data will be organized by the VertiPaq algorithm.

Note that you might get a memory grant error when running the CREATE INDEX statement. To troubleshoot this problem, you can implement one of the options discussed in the "What can I do if I do not have enough memory to build the columnstore index?" section of the Microsoft article "SQL Server Columnstore Index FAQ." For example, one of the options is to change the resource governor settings to allow the statement to access more memory.

After the table and index are created, try running the star join query shown in Listing 1.

SELECT d.CalendarYear,
SUM(SalesAmount) AS SalesTotal
FROM dbo.FactInternetSalesNew AS f
JOIN dbo.DimDate AS d
ON f.OrderDateKey = d.DateKey
GROUP BY d.CalendarYear
ORDER BY d.CalendarYear

Figure 1 shows part of the plan that's created. Notice the new Columnstore Index Scan operator on the right. The properties of the Columnstore Index Scan operator are shown in Figure 2. It includes some new properties not available in previous versions of SQL Server. As lines 3 and 4 in the property list show, the actual and estimated execution mode is Row. This execution mode was selected because the table isn't large enough to require batch execution mode. Also notice that the Storage property is ColumnStore (line 5); the only other choice is RowStore.

Figure 1: Execution plan showing the new Columnstore Index Scan operator

Figure 2: Columnstore Index Scan operator properties showing that the row execution mode is being used

Next, use the undocumented ROWCOUNT and PAGECOUNT options of the UPDATE STATISTICS statement to simulate a larger table by running the command

UPDATE STATISTICS

dbo.FactInternetSalesNew

WITH ROWCOUNT = 10000000,

PAGECOUNT = 1000000

Keep in mind that these options aren't supported. It's safe to use them here only because you'll delete the table at the end of the exercise. Do not use these options on tables that you would like to keep, even in a test or development environment.

After the UPDATE STATISTICS command is executed, the query processor will believe that the fact table has 10 million rows and 1 million database pages. Running the star join query in Listing 1 again will produce a parallel plan. Figure 3 shows part of the plan. Notice that it includes the new Batch Hash Table Build operator, whose purpose is to build the hash table used by the hash join on its left. If you look at the properties of the Columnstore Index Scan operator for this query (see Figure 4), you'll see that this time the batch execution mode is being used.

Figure 3: Execution plan showing the new Batch Hash Table Build operator

Figure 4: Columnstore Index Scan operator properties showing that the batch execution mode is being used

If you want to disallow the use of a columnstore index in a query, you can use the new IGNORE_NONCLUSTERED_COLUMNSTORE_INDEX hint. For example, if you run the code in Listing 2, you'll get the plan shown in Figure 5.

SELECT d.CalendarYear,
SUM(SalesAmount) AS SalesTotal
FROM dbo.FactInternetSalesNew AS f
  JOIN dbo.DimDate AS d
  ON f.OrderDateKey = d.DateKey
GROUP BY d.CalendarYear
ORDER BY d.CalendarYear
OPTION (IGNORE_NONCLUSTERED_COLUMNSTORE_INDEX)

As you can see, the plan is now directly using the FactInternetSalesNew base table, without touching the existing columnstore index.

Figure 5: Execution plan showing the result from using the IGNORE_NONCLUSTERED_COLUMNSTORE_INDEX hint

Because the number of records and pages in the FactInternetSalesNew table was altered for this exercise, drop this table after you have finished testing it. Use the command

DROP TABLE dbo.FactInternetSalesNew

To Learn More

I provided only a brief introduction to columnstore indexes. If you'd like to learn more about them, check out the resources in the Learning Path.

LEARNING PATH

For more information about columnstore indexes, see:

Columnstore Indexes

Columnstore Indexes: A New Feature in SQL Server known as Project "Apollo"

Columnstore Indexes for Fast Data Warehouse Query Processing in SQL Server 11.0

SQL Server Columnstore Index FAQ

SQL Server Columnstore Performance Tuning

Understanding Segment Elimination

Comments

Plain text