Data Warehousing: Dimension Basics

I covered the basics of the fact table and how it fits into the dimensional model in "Data Warehousing: Measuring the Facts," September 2007. Now let's address the fundamental design of the dimension - the "arms" of the star schema. Dimensions are tables that contain textual explanations of the coded and key values that are present in the star schema's fact table. The topic of dimensional modeling encompasses a large body of work, so if I miss something that you think should have been included, it's probably for lack of space and not for lack of interest or importance.

Defining Dimensions

Dimensions are a common way of analyzing data. In the article above and in "Discover the Star Schema," July 2007, I implied that the dimensional model is user-driven; now I'm going to say it loud and clear: The dimensional model is laid out for ease of use and to be logical and understandable to business users. Dimensions define the dimensional model and spell out the who, what, where, when, why, and how of the situation. They add context and meaning to the measures contained in the associated fact table.

In a dimension table, there is one row per product, one row per customer, one row per facility, one row per time unit, or one row per geographic region, and the granularity or level of detail of these rows depends on how finely detailed the business user wants the data. For instance, is it enough to know the average dollar value of each sale? Or do you need to know the average dollar value of a sale for each day of the year? Do you need to know how much of the sale value was for perishable itemsversus nonperishable items? Do you need to know purchases down to the line item level? It's very important to understand what the business reporting needs are before you start designing a data warehouse. You simply cannot be a cowboy and start slinging data around with data warehousing and dimensional design - there's too much at stake: Too much data is involved in a data warehouse, the project visibility is too high, and it's too costly to undo careless mistakes. You must plan. You must know how granular the level of detail needs to be.

Structuring Dimensions

At this point, if you're thinking "cube," think about the flat surfaces of the cube - those are the dimensions. Dimensions control query filtering - for example, "...where region like West" in T-SQL or "WHERE (Region.West)" in MDX - and supply almost all labeling in the output result sets that are turned into reports. Dimensions provide the "by" words for a report - "sales by month by product," "inventory by category by region." When you think about it, dimensions are the natural way a business user would talk about business. Each dimension contains data from a single domain, such as the time, product, or geographic area domain. Dimensions act like decoders when you "flatten" the lookup table hierarchy, as Figure 1 shows. If the three tables on the left were part of an entity relationship diagram (ERD) for an online transaction processing (OLTP) database, Product_Category and Product_Subcategory would be lookup tables in a hierarchy, with Product_Category higher in the hierarchy than Product_Subcategory. The purpose of a lookup table is twofold: to enforce domain integrity by limiting the list of codes that can be used in the modified table, and to help reduce data redundancy by adhering to a production level of third normal form.

In a dimensional model, you'd consider flattening the hierarchy of products, categories, and subcategories, as you can see in the Product_ With_Category table. Each row in the Product_With_Category dimension contains all the category content that's relevant for that row, in addition to all the product information. You don't need the one-to-many relationships of an ERD because you shouldn't have to worry about domain integrity. The data has already gone through the extraction, transformation, and loading (ETL) process and has (presumably) been scrubbed and validated before loading into the data warehouse. But what about redundancy? With a data warehouse, the goal is not to reduce redundancy - data warehousing involves a different mindset from the transactional, operational database. If you're going to be building multidimensional cubes, you're better off flattening the hierarchies than not.

Flattening hierarchies also makes the join paths between the dimensions and their associated fact tables much simpler, resulting in better performance and faster query results (we hope). The one-to-many relationships between the dimension and its associated fact table, illustrated in Figure 2 aren't meant to protect data integrity. These relationships are meant to associate keys in the fact table with the expandeddefinitions, which are found in the dimensions.

In any database there are hierarchies. Geographic hierarchies occur where zip codes aggregate into counties, counties into states, states into countries. With temporal hierarchies, hours aggregate into days, days into weeks and months, months into quarters, and quarters into years. In spatial hierarchies, rooms aggregate into buildings, and buildings into campuses or city blocks, and so on.

Hierarchies such as these and the product category-subcategory hierarchy in Figures 1 and 2 are used in reporting and summarizing results. Depending on what information users require, they might use the hierarchies differently from other users. One of the most common scenarios is the dichotomy between marketing and manufacturing: What manufacturing considers one product might be many products to marketing, as when hospital scrubs were co-opted by young adults as the latest fashion statement. When generating reports, manufacturing might not want to use product subcategories, whereas marketing would.

The flattening example in Figure 1 is the type of denormalization you'll often see in a dimensional model. We denormalize to avoid an excessive number of join operations. In future columns I'll examine dimensional denormalization more closely, but for now, just remember to denormalize in such a way that you lose nothing except the OLTP structural complexity.

Dimensional Granularity

In my September article cited above, I explained that fact tables have "grain," or levels of detail. It turns out that dimensions also have grain, and the granularity of a dimension depends on the reporting requirements - just like the fact table's granularity. Unlike any other database modeling scheme, dimensional modeling is truly customer-driven.

How fine-grained do you want your dimensions to be? Obviously, the level of detail of the dimension table has to be consistent with the level of detail of the fact table that it modifies. If you have a fact table that tells you that customer A bought product B at store C in region F on date D and paid for the purchase using E, and that the sale was credited to sales associate G, then all those dimensions, A through G, had better be able to define and describe the A through G identifiers in the fact table.

One school of thought holds that a data warehouse should contain only summary data and that all available detail data should be stored in the associated OLTP source databases. That approach might work in some cases. However, I believe that fact tables and dimensions are an archive of historical activity and should contain the finest level of detail that you'll ever need for a report. One of the largest data warehouses in the world belongs to Wal-Mart - it's measured in petabytes (thousands of terabytes) and holds one and a half years of data at the individual sale level. If you need to be able to analyze individual sales in your organization, you'll need the finest level of granularity. It's better to create a data warehouse that stores data with a fine level of detail and not need that detail, than to start with summarized data and then realize you need more detail.

Understandability and Performance

The twin properties of a data warehouse are understandability and performance. By flattening hierarchies and not losing any data in the process, you make the dimension more understandable to the people who use the data. What can you do to enhance performance? You guessed it - flatten the hierarchies! One of the most common user activities in a dimensional model is drilling down to get more details. Drilling down means adding more row headers to the result set. Drilling up simply takes away those row headers. You could drill down through the product category into a specific product subcategory until you found all the products in that subcategory, and never leave the Product_With_Category dimension.

The Sales_Territory dimension in Figure 2 is an excellent example of a dimension suited for drilldown. You can start with the Americas (zone), move on to the United States, Western region, Colorado, Denver County, 80208, and do all that with a single table, meaning no joins (at least not until you want some sales facts that relate to zip code 80208). Sales_Territory is a richly defined geographic dimension with a flattened hierarchy.

Flattening hierarchies is the same as eliminating snowflaking in the star schema. Hierarchies expressed as separate tables (such as the three tables on the left side of Figure 1) create snowflakes, which complicate the user presentation and often intimidate users, causing them to shy away from your data warehouse. Snowflaking (normalizing the hierarchy) causes most data warehouse applications to run relatively slowly because of all the joins in the background, lowering the quality of the user experience. The savings on disk space that you achieve with snowflaking isn't worth it. For more discussion about snowflaking and disk space, see "Are You Really Saving Anything By Normalizing?") And finally, snowflaking defeats the purpose of bitmap indexing, which SQL Server Analysis Services uses internally to optimize query response. To learn more about bitmap indexing and when it's used, see "What's a Bitmap Index?"

Dimensions Illuminate Facts

When you analyze data in a data warehouse, you start from a dimension. Dimensions add meaning and context to the fact table in a dimensional model and make the data accessible in an intuitive way. Because there's so much more to say about dimensions, I'll continue this discussion in a future article. Until then, you can go to the Forum link at the beginning of this article to communicate with me and other readers about the fascinating topic of dimensions and data warehousing.

Comments

Plain text