Data Warehousing: Measuring the Facts

In "Discover the Star Schema," July 2007, I reviewed the basics of dimensional modeling. You can use dimensional modeling to create an enterprise data warehouse that makes it easier for business intelligence (BI) end users to access and understand data. Now let's examine the differences between entity relationship (ER) and dimensional models, the components that make up the dimensional model, and why those pieces are designed the way they are.

Dimensional vs. ER Modeling

Dimensional modeling is a logical design technique. Unlike ER modeling, which consists of conceptual, logical, and physical data modeling, dimensional modeling is made up of only logical and physical modeling.

There are sharp contrasts between ER and dimensional modeling. ER modeling is a design discipline that seeks to represent business rules as highly detailed relationships between business elements that are materialized as database tables. You can extrapolate the business rules from the types and cardinalities of the relationships in an ER model. The primary goal of ER modeling is to remove all non-key data redundancy.

Dimensional modeling, however, seeks to represent data in a logical, understandable manner. In dimensional modeling, you can control data redundancy by conforming dimension and fact tables. A table that's been conformed can be used in more than one dimensional data model and is the same no matter how it's used. The relationships in a dimensional model don't represent business rules; instead, they're navigational paths used to help write reports or create graphs.

Many data modeling software packages support dimensional modeling. Some even let you generate SQL Data Definition Language (DDL) scripts so that all you have to do to create a data warehouse/data mart is run those scripts.

Determining the Facts

In the world of BI, data elements that can be summed, averaged, or otherwise mathematically manipulated are called measures. A dimensional model is designed to deliver numbers to BI users, and measures are the core of the model. In Figure 1, the fact table Reseller_Sales contains both measures and keys. Measures are data elements such as OrderQuantity, DiscountAmount, and TaxAmount that are used to determine all types of statistical information, such as the percentage of sales that were discounted or the total sales tax collected per region. Like foreign keys in an ER diagram (ERD), keys in a dimensional model are expressions of a relationship to a dimension table. In contrast to foreign keys in an ERD, however, the relationship isn't intended to enforce referential integrity; after all, the data has already been through the extraction, transformation, and loading (ETL) process, having been scrubbed and validated before being loaded into the data warehouse. Instead, the relationship's function is to associate keys in the fact table with the expanded definitions, which are found in the dimension tables.

In a data warehouse environment, you rarely retrieve just one record. Typically, hundreds, thousands, or even millions of records are retrieved in a single query, and the most logical thing to do with record sets that large is to perform a mathematical process on their data. That's why the section of a fact table that contains the numeric and additive columns is arguably the most important section.

You can start your dimensional design by reviewing the ER model of the transactional data source (the operational database). You can usually identify potential fact tables by locating the associative tables that represent the many-to-many (M:N) relationship in an ER model. An ER model will break down into multiple dimensional diagrams: The number of diagrams is determined by the number of functions in the organization and the organization's BI reporting needs.

To create a dimensional model from an ER model, first separate the ER model into its discrete business processes. Each business process can be expressed as a data mart—a modular, highly focused, richly detailed, incrementally designed component of the enterprise data warehouse. Although there are many ways to approach and implement an enterprise data warehouse, the data mart approach lets you tackle the data warehouse project business-process-by-business-process and produce deliverables for your user community.

If this is your first excursion into dimensional modeling, start with a single-source data mart; don't try to tackle multiple-source data marts until you've acquired the skill set necessary to design a complex data warehouse. Examples of single-source data marts include retail sales, purchase orders, shipments, and payments. An example of a multiple-source data mart is customer profitability, which combines revenue and costs that often come from separate transactional sources (e.g., the sales database and the inventory database, respectively).

For each preliminary data mart, identify the M:N relationships from the transactional model that are comprised of numeric and additive non-key data. These relationships will most likely be the associative tables in the ER model. Designate these as fact tables. It's not unusual for a large ER model to produce a dimensional model that has from 10 to 25—or even more—fact tables.

Understandability and Performance

The dimensional model, which becomes a star schema when physically implemented, is a lean, mean performance machine. When faced with a star schema, the database optimizer can depend on heuristics (i.e., rules) rather than on more time-consuming cost-based algorithms to resolve a query. First, the optimizer can constrain the dimension tables referenced in the query. Then, using the filtered key values from the dimension tables, the optimizer can resolve the many-way join to the fact table in a single sort-merge pass.

Note that as of right now, I haven't been able to confirm that the SQL Server 2005 Analysis Services (SSAS) optimizer uses heuristics rather than a cost-based analysis. However, I do know that SSAS uses thin-client architecture to minimize the load on the client computer, ensuring that SSAS will have scalable performance. The SSAS calculation engine is entirely server-based, so all queries are resolved on the server, optimizing the use of the corporate network. Each query, regardless of its complexity, requires just one round trip between the client and the server. For more information about dimensional modeling and SSAS, see "Dimensional Modeling Basics," April 2006.

The dimensional model is extensible and expandable; it can accommodate changes in user behavior, new data elements, and new ways of analyzing the data. You can easily change a fact table in place by executing the ALTER TABLE command—the table doesn't have to be dropped and recreated and the data doesn't have to be reloaded. You can add new (and unanticipated) columns to the fact table as long as they're consistent with the fact table's existing grain (as I explain later).

Going with the Grain

The grain of a fact table is the level of detail that the table captures. When designing a fact table, choose the grain very carefully—don't aggregate (i.e., summarize) prematurely. For example, a fact table whose records contain weekly sales summaries rather than individual sale amounts is a coarse-grained, aggregated fact table that doesn't support reports on individual sales.

If you aggregate prematurely, you won't be able to gracefully accommodate new sources of data. Design your table to have the finest possible granularity of data: the finer the granularity, the more robust the design. As you can see, the granularity of Reseller_Sales (the fact table in Figure 1) is at the order line item (SalesOrderLineNbr). That's the lowest and best grain for the data in a sales fact table. From this point, you can easily aggregate the fine-grained data into coarser measures, such as product or weekly sales, as required by the demands of the report-writing end users. Once you've captured the data at its finest grain, you can easily add new attributes to both the fact and the dimension tables.

Finely grained data consists of unsummarized detail records from an operational data source. Individual line items in a sales transaction, individual deposits and withdrawals in a banking transaction, individual line items on a shipping invoice, and individual attendance at an event can all be considered fine-grained data.

For performance reasons, you might find it useful to aggregate fact table records, and that's fine. Just remember that summarized or aggregated versions of the individual fact records are separate records that need to be stored in different fact tables that have a coarser grain. It's important not to mix detail and summary records in the same table: Store them separately. You'll be rewarded when you start using BI query and reporting tools that combine the facts along various dimensions.

Data Marts and Fact Tables

A single fact table does not a data mart make—at least, not usually. Most data marts consist of multiple coordinated fact tables that have similar structures because they're all derived from one business supply chain or value chain. An example of a manufacturing supply chain would be ingredient purchasing, ingredient inventory, bill of materials, manufacturing process control and costs, packaging, and finished goods inventory. Many companies have software systems that manage the flow of control through the supply chain. Usually, there's a data source for each step of the supply chain. Each data source can translate into a set of data marts and has at least one fact table. After the manufacturing step, the flow of control is called a value chain. An example of a retail value chain is ship from manufacturing, reseller inventory, reseller shipments, retail inventory, and retail sales.

The purpose of the data warehouse is to provide BI end users with a single source of information regarding supply and value chains so they can follow the data as it flows through the business cycle from beginning to end.

When doing dimensional designs, it helps to understand what kinds of information you can request from the data. For example, from fine-grained fact tables such as Reseller_Sales, you can analyze behavior and frequencies and perform behavior counts, such as the number of times a customer places an order on the date of a storewide inventory sale. You can do time-of-day analysis (e.g., determine whether online sales increase during the lunch hour) and queue analysis (e.g., determine highway capacity and traffic flow, packet traffic on IP networks, and customer movement within an online store). You might be able to do some sequential behavior analysis (e.g., does action A always result in action B?), which could in turn warn you of possible fraud or customer cancellation intent (e.g., this reseller is planning to drop out of your company's partner program). Finally, every retail data mart should lend itself to basket analysis (e.g., do people really buy more beer if it's placed near the baby diapers?) and missing-basket analysis (e.g., what didn't work the way we thought it would and why?). These are all valuable measures that can be derived only with the assistance of a fine-grained fact table containing records at the individual transaction level.

Building a User-Friendly Data Structure

A dimensional model lets you build a user-friendly data structure that makes data access intuitive. The fact table is the core of the dimensional model and contains keys from the dimension tables and the raw numbers that BI users will turn into information. Constructing the fact table so that it contains the finest grain of transactional data available will let you expand and extend your data warehouse as necessary. The dimensional model conforms easily as you coalesce multiple data marts into an enterprise data warehouse and supports a very wide variety of analytical questions that will help you determine your next move in the business world.

Comments

Plain text