Indexing the Data Warehouse

Indexing a data warehouse is tricky. If you have too few indexes, the data loads quickly but the query response is slow. If you have too many indexes, the data loads slowly and your storage requirements go through the roof but the query response is good. Indexing in any database, transactional or warehouse, most often reduces the length of time it takes to see query results. This is especially true with large tables and complex queries that involve table joins. (See also, "Design your Data Warehouse for Performance" and "Data Warehouse Workloads and Use Cases").

Some of the variables that you’ll want to take into account when indexing the data warehouse are the type of data warehouse you have (i.e., primarily archive or primarily near real-time), how large the dimensions and fact tables are (and if the fact tables are partitioned), who will be accessing the data and how they'll do so, and whether access will be ad hoc or via structured application interfaces. These variables will determine how your indexing scheme should be structured. Here’s a simple plan for indexing the relational tables that comprise a portion of your data warehouse. (Although I only explain how to index dimensions and fact tables in this article, I explain how to index the staging database in the Web-exclusive sidebar “Indexing the Staging Database.”) Note that the relational tables are those that are managed by SQL Server’s relational data engine, not those managed by the SQL Server Analysis Services (SSAS) engine.

Indexing Dimensions

You’ll want to index the dimension key (primary key), which is a surrogate key, not a “natural” or transactional key such as customer name or customer ID. Note that you shouldn’t cluster on the dimension key.

The dimension will contain a natural or transactional key (e.g., a transaction number or identifier), which we’ll call a business key, from the source system. Although the business key might not be unique—as in the case of a type 2 response to slowly changing dimensions—create a clustered index on the identity column, which you can see in Figure 1. The Customer and the Product dimensions have a clustered index built on the business key. By clustering on this key, you can enhance the query response when the business key is used in the WHERE clause. The expression in the WHERE clause is often used to search the dimensional data, and having the dimension records pre-sorted makes the query response faster.

Clustering by the business key might also help you avoid lock escalation (i.e., row to table, intent-exclusive to exclusive) during the extraction, transformation, and loading (ETL) process, which could happen if the surrogate key was the cluster key and all the rows were being added at the end of the file. If the exclusive locks escalated from row to table, that would block read, other ETL load, or utility operations on the table and cause diminished query response and even application timeouts. If your users and applications can live with some latency, you can get around this problem by having them query database snapshots and reserving the original database exclusively for data loads.

In Figure 1, the Date dimension and the Time dimension have no external datasource or business key. Instead of creating an identity-style primary key for these two tables, consider using a smart key, with a YYYYMMDD format for Date and an HHMMSSSSS format for Time (you can use fewer second positions, depending on how fine a time granularity you need to measure), and clustering on it. The values will maintain index order, range queries will be simplified in the fact table, and you’ll need one less join when querying because the primary key will contain the date (or time).

For large type 2 slowly changing dimensions (i.e., where you add a new row to record the change), you might want to create a four-part non-clustered index that includes the business key, the record begin date, the record end date, and the surrogate key. For efficiency and to prevent escalating storage requirements, INCLUDE the record end date and the surrogate key when creating the index instead of making them part of the index key, as shown in the following command:

CREATE NONCLUSTERED INDEX MyDim_CoveringIndex
  ON (NaturalKEY, RecordStartDate)
INCLUDE ( RecordEndDate, SurrogateKEY);

This command creates a covering index that can be useful during the ETL process and load operations and for historical queries. When you make RecordEnd- Date and SurrogateKEY INCLUDEs instead of part of the index key, the SQL Server engine stores these two values at only the leaf level of the index tree, thus reducing the storage requirements. By having these two columns in the index (i.e., creating a covering index), the SQL Server relational engine can get the data that it needs solely from the index during load and some query operations, without having to access data from the underlying dimension.

If there are other columns in the dimension that will be used continuously for searching, sorting, or grouping, create non-clustered indexes on those columns as you would in a transactional database. If there’s an embedded hierarchy in a dimension, such as the Category- SubCategory-ProductID hierarchy in the Product dimension, then consider indexing the components of the hierarchy if it will enhance query performance and won’t inhibit data loading.

Indexing the Fact Table

Indexing the fact table is similar to indexing a dimension, although you must account for partitioning. You’ll want to index and cluster on the date key or a combined date plus time. Because business intelligence (BI) analysis always seems to involve a date/time component, the fact table will have a date (or datetime) key, and clustering on this key will help with cube-building. Also, if the data records are already stored in date or datetime order, historical queries will have an execution advantage. If the fact table has more than one date or datetime column, cluster on the column that’s used most often for querying or cube-building.

If the fact table is partitioned on the date column, use that column as the clustering key. When you see the same column to create the clustered index that you used to create the partitions and creating the index in the same file group that holds the partitioned fact table, SQL Server will automatically partition the index the same way that the fact table is partitioned (i.e., the index will have the same partitioning function and column as the fact table). When the index is partitioned the same way the fact table is partitioned, the table and its index are said to be aligned, which makes for an optimal operational situation, especially if you anticipate creating additional partitions or making frequent partition switches.

Next, create a non-clustered index on each of the foreign keys in the fact table, and consider combining the foreign key and the date key, in that order, similar to CustomerKEY + DateKEY in Figure 1. Creating a non-clustered key on the foreign keys works especially well if one or more of the associated dimensions is a type 2 slowly changing dimension. Rows with the same foreign key value will be searched in ascending date order, which will enhance the historical query response. Note that you’ll want to retain relational integrity when dealing with the foreign keys. For more information about how to do so, see the Web-exclusive sidebar “Retaining Relational Integrity in the Data Warehouse.”

Modifying Your Indexing Scheme

Over time, your data warehouse will change to accommodate what’s happening in your organization, and you’ll have to modify your indexing scheme. Most data warehouse/BI systems will access these relational tables directly, so you can use tried-and-true transactional methods for tuning indexes, such as evaluating the query and data mix and adjusting it accordingly. If your relational data warehouse is used only to stage SSAS structures, then you might not need any more indexes than those we’ve talked about. SSAS tends to use the same queries over and over again, so you can run the Index Tuning Wizard and tune exactly for that set of queries. Start simple, evaluate thoroughly, and build conservatively when indexing your data warehouse.

Comments

Plain text