SQL Server 2005, and especially Analysis Services 2005, has embraced industry-standard dimensional modeling like a long-lost friend. When you build Analysis Services databases in SQL Server 2000, you have to compromise in creating the dimensional model. Specifically, you have to create multiple OLAP virtual dimensions and use member properties to capture attributes that don't fit into hierarchical dimensions.
SQL Server 2005 Analysis Services uses attribute-based dimensions, so every attribute in a dimension is automatically treated as a standalone hierarchy. Now, you can use the metadata layer that defines Analysis Services cubes (the Unified Dimensional Model—UDM) to define one customer dimension instead of a half-dozen artificially separate dimensions. One benefit of this deeper dimensional support is that data warehouse and BI system developers no longer need to convert standard dimensional-modeling techniques to the limited world view of previous SQL Server versions. Instead, you can build dimensions that realistically represent how your business operates and can change as your business changes. In this article, we define dimensional models, describe the basic elements and techniques that support them, and offer a dimensionally based data architecture for your data warehouse and BI system.
Facts and Dimensions
What do we mean by dimensional modeling? Basically, a dimensional model contains two entity types: facts and dimensions. We keep the measurements of a given business process in entities called fact tables in the relational world, or measure groups in Analysis Services. In its classic form, a relational fact table contains foreign keys and numeric fact fields that measure the business process. Essentially, a fact table is a normalized structure for efficient storage and fast access of large, relational data sets.
For example, many organizations have a business process called Orders that tracks customer requests for goods or services. Figure 1 shows a simplified OrderLineItems fact table model based on the Adventure-Works Cycles sample database in SQL Server 2005. The three non-key numeric fields in the fact table are the quantities, dollar values, and shipping costs of goods ordered: the measurements of the Orders business process. The foreign keys in the OrderLineItems fact table link to the second type of dimensional entity, dimension tables, which we'll discuss in a bit.
The level of detail in the fact table is called its grain. The granularity of the OrderLineItems fact table in Figure 1 is at the order line-item level.The finer the grain, or the lower the level of detail, the more flexibility your model will have. You can always roll the data up to a higher level if needed, but you can never drill down if you don't have the detail in your initial database. Our team designs fact tables at the lowest level of detail available, the atomic level. Depending on your organization's data volumes and your users' performance needs, you might not be able to provide atomic-level granularity, but make sure business users understand what they're losing in terms of flexibility.
You'll most likely end up with fact tables at different levels of granularity in your BI environment. For example, in an Orders Forecast table, you wouldn't want the same level of granularity as the OrderLineItems fact table. Forcasting orders by day, customer, and product is just too much work. Instead, the Orders Forecast table would probably be at the month and product level, with a customer level possibly thrown in for large customer accounts or businesses that have relatively few customers. The OrderLineItems table in Figure 1 has data that could be at two different levels. The OrderQuantity and OrderDollar values are at the line-item level, but the Shipping-Amount might be generated at the Orders table level. In Figure 1, we've allocated shipping costs to the line-item level to avoid the need for two fact tables. Such an allocation is generally a good solution as long as the business users support the allocation rules. In some cases, it doesn't make business sense to allocate items this way, and instead we use fact tables at two levels of granularity to describe a given business process.
One final note about fact tables: Notice that the OrderNumber and LineItemNumber in Figure 1 are not foreign keys for separate dimensions. Their purpose is to tie together the rows of each order. Because all the descriptive attributes of an order reside in their respective dimensions, Order-Number and LineItemNumber become degenerate dimensions (or fact dimensions in Analysis Services).
A quick glance at Figure 1's example data rows reveals that the fact table is meaningless by itself—it has no business context. Any given business process has several relatively independent entities or objects that participate in the process. In dimensional modeling, these objects are called dimensions and the properties of a dimension are called attributes. Attributes describe the members of the dimension and provide a rich business context for analytic purposes. For example, the AdventureWorks Product dimension might have several standalone attributes such as color, size, and weight. Most dimensions also have several attributes that relate to other attributes in a hierarchical fashion.The product dimension might have a hierarchy in which products are members of a group and groups are members of a category. Figure 2 shows a simplified Product dimension and some example rows for Adventure WorksCycles.
In the transaction system, attributes and hierarchies are often fully normalized into separate tables that use foreign keys to link to products. We build dimension tables on the relational platform by denormalizing the descriptive attributes and hierarchies into one table for each dimension. These denormalized dimensions still present the same information and relationships as the normalized model; nothing is lost except complexity.
Even with a simple product dimension like the one in Figure 2, it's easy to imagine how the attributes will be used as row and column headers in a query or report. For example, a user request to see order dollars by color and size would reveal order patterns between size and color. All of the natural ways end users describe their business should find their way into dimension attributes.
Figure 2's Product dimension seems to have two keys: ProductKey and ProductBusinessKey. ProductKey is the dimension table's primary key and is a surrogate key, usually an integer, which is added as part of the process of creating the dimension table. ProductBusinessKey is the key from the transaction system. Although including both keys might seem redundant, there are several compelling reasons to assign and manage your own surrogate keys:
- They help isolate the data warehouse and BI system from operational changes such as a merger, acquisition, or system migration.
- You can add dimension rows for values such as Not Applicable or Date TBD that aren't in the transaction system.
- Surrogate keys let you integrate data from multiple sources that might have different business keys for the same entity, such as a customer.
- In some cases, the simple integer surrogate key provides better join performance than a more complex alphanumeric key.
If those reasons aren't enough, you must use a surrogate key if you're planning to track changes in attribute values over time.
Managing Changes in Dimension Attributes
Even though we think of dimension attributes as fixed properties (e.g.,a product's color), some attribute values can change over time. We call this Slowly Changing Dimensions. Dimensional modeling offers two basic techniques for tracking attribute changes in dimensions. The first, a Type 1 change, is to simply overwrite the old value with the new one. The second way to track attribute changes, a Type 2 change, is a bit more complex. When a Type 2 attribute changes, you create a new dimension row with the new value and a new surrogate key, and mark the old row as no longer in effect. Going forward, the fact table uses the new surrogate key to link new fact measurements to the dimension row that contains the new attribute value. Fact rows created before the change will continue to join to the old dimension row with the old key and return the old attribute value. With a Type 2 change, all measurements before the change will be associated with the old value, and all measurements after the change will be associated with the new value.
We often see the technical people on BI teams dismiss Type 2 dimension-change management as too complicated. Fortunately, SQL Server 2005 Integration Services (SSIS) makes managing Type 2 dimensions relatively easy. It's essential that business users—not developers—decide which attributes must have historical tracking. For some analyses, it's vital for the business user to know an attribute's value at the time an event happened rather than the value that the attribute has today.
Bringing Together Facts and Dimensions
A business process dimensional model brings together the facts and dimensions for a given business process. For example, Figure 1's OrderLineItems fact table involves three dimensions: a dimension for the customer who placed the order, the order date, and the product the customer ordered. Each of these dimensions is independent of the business process. For example, a product can exist in inventory even if it never sells. Figure 3 shows the business process dimensional model for the OrderLineItems business process from Figure 1. (You might also see the term star schema used to describe the business process dimensional model; the terms are essentially interchangeable.)
The OrderDate and RequestedDate dimensions in Figure 3 are actually the same underlying dimension. We call this reuse of the same dimension in one model dimension role-playing. The underlying Date dimension exists in almost all business process dimensional models and can be reused through views, aliases, or physical copies of the base table. The Date dimension supports analyses that trend across dates or compare time periods (i.e., nearly all business analyses). Analysis Services 2005 calls this a Time dimension and provides a wizard and MDX functions to help you create date-based calculations.
Although a business-process dimensional model such as the orders model in Figure 3 is powerful from an analytic perspective, this power is dramatically multiplied when you extend it across the enterprise. Many dimensions are involved in multiple business processes. For example, the Product dimension at a company such as AdventureWorks Cycles is tied not only to orders, but also to manufacturing, inventory, shipping, and returns. These separate business-process dimensional models must use the same product dimension with the same keys to reliably support true, cross-business-process analysis. If your company's inventory managers want to calculate inventory turns, they sum the total sales quantity by product from orders, sum the inventory quantities at the end of the period by product from inventory, and divide the two to get inventory turns by product.This type of analysis works only if the two business processes use the exact same product dimension and the same keys; that is, they use a conformed dimension. Conformed dimensions are the cornerstone of the enterprise-enabled BI system.
Getting all the people in an enterprise to agree on conformed dimensions is an organizational challenge for the data modeler and data steward.The organization must agree on how to define its product,customer,employee, or facility dimension tables. The definition includes the list of attributes, attribute names, hierarchies, and the business rules needed to define and derive each attribute in the table. Getting this enterprise-wide agreement can be hard, political work, and the effort grows with the number of employees and divisions in your organization, but it's not optional. Creating conformed dimensions is the only way to ensure that you're comparing apples to apples (assuming you are selling apples).
The idea of reusing dimensions across multiple business processes leads us to the concept of the data warehouse bus matrix, which is a useful way to map out conformed dimensions and the business processes they support. The classic bus matrix has dimensions across the columns and business processes down the rows. Table 1 shows an abbreviated version of the AdventureWorks Cycles bus matrix, which gives us the sense that the Orders business process is more complicated than we've been led to believe. The bus matrix includes a Promotion dimension and an Employee dimension in addition to the Date, Product, and Customer dimensions we've already seen.
Ultimately, the bus matrix is the enterprise-BI data roadmap. You implement the BI system one business process at a time. As you implement a business process, build conformed dimensions that will support other business processes as they come online. In Analysis Services 2005, you see the same concept as the data warehouse bus matrix in the Dimension Usage tab of the Cube Designer.
A Dimensional Approach for SQL Server 2005
Given that SQL Server is so dimensionally oriented, what should your data warehouse and BI data architecture look like? We recommend creating a three-tiered data solution like the one that Figure 4 shows. This solution includes a staging or extraction, transformation, and loading (ETL) database; a relational dimensional database; and an OLAP cube in Analysis Services.
The ETL database is the construction site for the BI system. Business users generally aren't allowed access to this first tier. Within the ETL database, you manage slowly changing dimensions, assign surrogate keys, integrate data from multiple sources, and deal with business rules, data quality checks, and complex transformations. The underlying data model of the ETL database depends on many factors, including data volumes, referential integrity issues, data quality, the source system data model, and source system extract resources. For example, the ETL database might contain normalized versions of the dimension attributes and hierarchies to support the construction of the relational dimensional database and the Analysis Services database. Specifically, when Analysis Services builds a dimension, it issues a SELECT DISTINCT query against every attribute. Running these queries can be time-consuming for a large dimension table with dozens of attributes. This re-normalization is called snowflaking, and although it can be useful for maintenance and data load purposes, we encourage you to always present properly formed, denormalized dimensions to your users.
The second tier, the Relational Database in Figure 4, is a user-accessible, relational, atomic-level database built on dimensional models. This tier serves as the data foundation and is the cleansed, aligned, validated, atomic-level detail system of records for the data warehouse and BI system. The relational platform lets you store and manage the full data set. It also provides a level of ad hoc flexibility for certain kinds of queries (e.g., joins to new tables not in the original model) that are difficult to do in Analysis Services.
For the third tier, you build an OLAP cube in Analysis Services that replicates the relational dimensional database and functions as the primary user-query engine. Our practice is to build this tier as a MOLAP cube.The Analysis Services layer provides consistent performance by pre-aggregating, indexing, and compressing the data and delivers enhanced analytic functionality by using the MDX query language. Keeping the detailed data foundation in relational form lets you manage, redesign, and rebuild the Analysis Services layer when the need arises.
Why Dimensional Modeling?
Dimensional modeling supports our two main data warehouse and BI design principles: usability and performance. (For a quick explanation of why we use dimensional modeling, see "Modeling Myths.") The denormalization of the dimension tables improves usability by pulling all the related attributes together into one table, dramatically decreasing the total number of tables a user has to deal with. A business process that might require hundreds of tables in the OLTP system can be reduced to a dozen or so tables in its dimensional form.
From a performance perspective, dimensional models can help in several ways. Denormalizing the dimension tables during the load process reduces the number of joins the query engine must do at query time. Beyond this, the relational optimizer understands that the dimensional model is a large fact table with simple joins to several relatively small dimension tables in which the majority of constraints reside; the optimizer can formulate a query strategy that takes advantage of this structure. This strategy, known as a star join, can greatly improve the speed of most BI queries. Awareness of star-join optimization is built into the SQL Server 2005 relational engine. One of Analysis Services' main purposes is to improve performance by pre-aggregating across dimensions and up hierarchies within dimensions. Analysis Services can do this because it's dimensionally based.
Dimensional modeling is a powerful approach to building a robust, flexible data foundation for data warehousing and BI, and SQL Server 2005 wholeheartedly embraces the dimensional approach. This article only scratches the surface of dimensional modeling. To read more, consult the references listed in the Related Reading box.
RALPH KIMBALL AND MARGY ROSS
JOY MUNDY AND WARREN THORNTHWAITE