Data Warehousing: Dimensional Role-Playing

Role-playing isn’t limited to the Broadway stage; data warehouse dimensions can also assume roles. Dimensional roles can be used not only for precision when you’re defining the purpose or function of a set of rows from a dimension, but also to easily maintain the data because there’s only one table to manage instead of many tables to synchronize.

Dimensional role-playing occurs when a single dimension needs to be part of the same fact table many times. For example, dates and times appear in most types of analysis because business activities happen in a timeframe and objects exist in time. Time is almost always used when calculating and evaluating measures in a fact table. Each event constitutes a link to the time dimension. For example, a customer order was taken on a specific date; the order was picked and packed at the warehouse on one or more dates; the order was shipped to the customer on a date; the order has a delivery date; the order’s payment was received on a date; and sometimes, the order was returned on a date and a refund was issued on a date. All of these dates need to be recorded, stored, and analyzed because they all play a role in the retail sale life cycle.

Defining Roles

When many same-type events, such as dates, are present in a single fact table, you have an accumulating snapshot fact table. (Of course, there are other types of fact tables, which I’ll discuss in upcoming articles.) Figure 1 shows the Shipping star schema. The fact table (i.e., factShipping) is an accumulating snapshot; it contains dates for multiple events, such as scheduled and actual departure and arrival dates and times, where each shipment starts and ends, and the buyer and the seller. We could record a lot more discreet date and time events in this fact table, but no others are necessary for this example.

This star schema isn’t a cube; it’s a set of relational tables that can be managed and manipulated using SQL Server Enterprise Manager or SQL Server Management Studio (SSMS). Clearly, it isn’t the OLTP tables that are used to initially capture the retail events because OLTP tables are more highly normalized than the tables in Figure 1. Also, the join or associative table, which is the OLTP counterpart to the OLAP fact table, would typically have fewer attributes than this fact table. These tables are containers that hold the source data for the OLAP Services/Analysis Services multidimensional databases (cubes). Another name for a multidimensional database is the staging database or data store. The staging database receives verified and validated data via the extraction, transformation, and loading (ETL) processes from various external data sources, including OLTP databases. (This staging database is also an excellent reporting database because the veracity of its data content and the logical orderliness of its architecture make sense to business users. Business users armed with Microsoft Excel can build ad hoc reports without having to rely on overworked business intelligence (BI) programmers and analysts.)

Take a look at the views in Figure 1, which you can identify because view names start with “v” (e.g., vSeller). (Note that in Figure 1, dimension table names start with “dim” and fact table names start with “fact.”) I’ve built views on each of the three dimensions, Time, Person, and Location. In this star schema, I’m linking the views rather than using the dimensions to link to the fact table. Now it’s easy to visualize that each shipment has a buyer and a seller, a starting and an ending location, and multiple date and time events associated with each record in the fact table. The views (aka roles) serve as virtual dimensions.

When creating a dimensional role, you can filter data. For example, if the company won’t ship orders on weekends or holidays, it’s easy to constrain the shipping date views with the following selection criteria:

select …
from dimTIME
where DayNumberOfWeek between 2 and 6
and Holiday = 0

Or you can separate buyers from all other people when creating the vBuyer view by running the following command:

create view vBuyer
AS
select …
from dimPerson
where PersonType = ‘buyer’

Supertypes and Subtypes

If something seems familiar about this discussion of dimensional roles, the reason might be that I’ve discussed the concepts of supertypes and subtypes in previous columns. Dimensional role-playing combines data subtyping and filtering into one operation. For example, in Figure 1, dimPerson is a supertype and vBuyer and vSeller are virtual subtype entities. Using the WHERE criterion, the WHERE clause of the query that creates the subsets of data for each of the views acts as a data filter and establishes the subtype, which then ensures that only buyers are listed in the vBuyer dimensional view and only sellers are listed in the vSeller dimensional view.

Each of the roles that these dimensions play in the accumulating snapshot fact table is a little different, and it’s important to relabel the view dimension and its attributes to better express the role meaning. I’ve done that in Figure 1: For example, the time dimension has morphed into vScheduledShipDate, and TimeKEY has become SchedShipDateKEY. There’s little chance that a business user would use vScheduledShipDate under the assumption that it represented arrival times. Remember, one characteristic of a data warehouse is ease of understanding, and simple naming conventions make the roles easy to understand.

The second characteristic of a data warehouse is performance. As anyone who works in data warehousing will tell you, proper indexing is crucial to good performance, so you should consider indexing views. Because of the simplicity of the dimensional table structure—a surrogate key, a flattened set of attributes—the dimensions lend themselves nicely to serving as base tables for indexed views. Also, the indexed view is dynamically updated as the base table changes, so you only have to manage the dimension table when you’re executing ETL operations.

Dimensional Role-Playing

If you’re still having trouble understanding how dimensional roleplaying works, visualize the last time you flew in an airplane. There was a scheduled departure date/ time, an actual departure date/time, a scheduled arrival date/time, and an actual arrival date/time. Then there was the time on the ground at the point of departure (calculated as the time interval between when the jet pulled away from the gate until the wheels were off the ground), the time spent in the air traveling from the departure airport to the arrival airport, and the time on the ground at the point of arrival (calculated as the time interval between touch down and when the jet docked at the gate). These times are critical for conventional airline flight analysis, and each would have an entry in a fact table that would be used to calculate airline flight on-time percentages.

Comments

Plain text