Data Warehousing: Slowly Changing Dimensions

Change is inevitable; nothing stays the same for long. That’s why such careful precautions are taken to minimize the impact of change in transactional database design. We choose primary key candidates, which have the least possibility of being modified, because we know the complexity that’s involved in changing a primary key value in a typical OLTP database. Change even happens in data warehouses, especially in data warehouses whose sole purpose is to archive historical data. In fact, historical archiving mandates that you have a scheme in place to capture changes that will happen to people, places, and things as time passes.

In data warehousing, we’re faced with developing templates or standard ways of defining and addressing content changes in the data warehouse. You’ve probably heard of slowly changing dimensions, which are dimensions whose content changes over time. I’m not talking about just adding records to the dimension; I’m talking about changing existing records. Although this seems counterintuitive in a data warehouse environment, it’s not an unusual situation, and it’s often resolved by modeling the change on a sequence of templates. Which template you use depends on the kind of slow change your data warehouse dimension content is experiencing.

Your Response to Change

When change happens, how will you react? With slowly changing dimensions, you can respond in one of three ways: you can overwrite old data with new data (I call this a type 1 response), you can create a new dimension record for the time period that defines the change (I call this a type 2 response), or you can create an “old value” column in the dimension record to store the previous value (I call this a type 3 response). The response type you choose will depend on your organization’s requirements for maintaining and archiving data.

Type 1 Response: Overwrite the Old Data Value

This response type involves overwriting the old data value when the old value has no significance and can be discarded, such as when you’re correcting an error. Typically, you’ve written the extraction, transformation, and loading (ETL) routines correctly, so there shouldn’t be any errors, but despite our best efforts, mistakes happen sometimes.

Before you decide to respond to a slowly changing dimension by overwriting the old data value, make sure that you can discard a previous value. Some organizations, such as financial and audit organizations, are so bound by compliance requirements that overwriting the old data value isn’t an option.

Type 2 Response: Create a New Dimension Record

This response type is useful when you have audit trails and other changes that neatly partition history. For example, prior to 12:14:55 a.m. the flow converter was running at 95 percent capacity; after this time it was operating at only 50 percent capacity. Adding a record to capture this change in flow rate is the technique most often used to track measurable values, such as employee job performance, customer status, and product availability. To effectively partition time so that there’s no overlap in the periods measured, add an EffectiveDate (i.e., the date/time this change first appeared) column and an EndDate (after which date/time this value is no longer valid) column to each record, as shown in Table 1.

Note that in Table 1, the current condition (i.e., the current job description) is represented by the row in which the end date is NULL. As you recall from transactional database design theory, NULL can mean that at the time the measurement was taken there was no value; that there might have been a value, but at the time, no one knew what that value was; or that the measurement wasn’t applicable to this record. In this case, there’s no value for the end date on record number four because that record represents the current situation.

Before using a type 2 response to change, make sure that the data mining software your company is using can understand and compensate for NULL—not every product can. If you determine that you can’t use NULL, leave the end date as an empty string (which SQL Server 2005/2000 will convert to a value of 1/1/1900 if you’re using a datetime data type) and add another column called CurrentRecord that can be flagged on or off.

Don’t confuse the dates in this slowly changing dimension with the dates in any associated fact table. The dimension dates are used to partition sets of records into time slices. It’s not appropriate to join these dimension dates to the dates in a fact table.

When you’re querying a type 2 dimension and the query keys on an identifier that represents just a single time slice, then you can expect only a single record to be returned. If you query attributes that aren’t part of the slowly changing dimension criteria, multiple records will be returned.

When employing a type 2 response, you must use a surrogate (e.g., the SQL Server IDENTITY property) for the primary key because you’ll have multiple versions of the same record as that record changes over time. I don’t recommend trying to use a meaningful natural key concatenated to a date/time value as a dimensional primary key.

Type 3 Response: Create an Old Value Column

“Soft” changes (i.e., tentative changes or changes resulting because you want to track a single attribute value for a limited time—examples of soft changes might be sales regions or geographic areas) lend themselves to a type 3 response. In this design, you create an additional column in the slowly changing dimension for the old value. When change happens, just copy the current value into the old value column (e.g., Old Discount Percent in Web Table 1, InstantDoc ID 97409) and insert the new value into the current value column. There’s no need to create a new dimension record. The type 3 response is best used when the time slices induced by change aren’t clear-cut (in other words, the change was gradual) or where the time slices overlap.

DimDiscountID	Discount Type	Low Quantity	High Quantity	Discount Percent	Old Discount Percent	Change Date
1	Initial Customer	NULL	NULL	10	8	1/1/2007
2	Volume Discount	100	999	6.7	5.7	1/1/2007
3	Big Volume Discount	1000	NULL	7.7	6.7	1/1/2007
4	Good Customer Discount	NULL	NULL	7.5	5	1/1/2007
Web Table 1

The benefit to using a type 3 response is that all production applications will continue to return the most current results, with no need to modify any queries. You can access the previous value by creating a new query that refers to the old value column. If you want to incorporate a limited history, add another column called Change Date with a default value of getdate() or current_timestamp, as shown in Web Table 1. Then you can not only access the previous value, but also see when the change occurred.

The problem with using a type 3 response is that when you have more than one soft change per dimension, the table schema begins to get very messy, especially if you’re tracking when each change happened. The row size in bytes can quickly grow, adversely affecting file I/O and adding to the length of time necessary for query responses and joins. If you have multiple soft changes in one row of the slowly changing dimension, then you must strictly adhere to strong naming standards for the old value and Change Date columns. Failure to do so could result in confusion and erroneous data on your reports. This response type is best used with a single soft change for each dimension.

Rates of Change

This discussion of slowly changing dimensions wouldn’t be complete without mentioning the rates of change. Although these dimensions are called slowly changing dimensions, they don’t always change slowly. Even more important, slow change to one organization could be rapid change to another.

When you want to capture change in a dimension, and the change is in disjointed time slices, then it’s usually best to create a new column to record the change for dimensions that are a “normal” size, whether they’re changing slowly or rapidly. I say “usually” because it’s very hard (if not impossible) to set absolute rules for designing data warehouses; sometimes you simply have to look at the specific situation and make a judgment call.

What constitutes a “normal” size for a dimension? Typically about 300 to 100,000 records. For example, a company might have 100,000 employees in its HR warehouse, and a value in each employee record might be modified once a month. Even if the employee record is long (e.g., 2000 bytes), it’s still a good candidate for the type 2 response mentioned above.

Rapidly changing dimensions aren’t too much of a problem when the dimensions are small, but how do you handle rapidly changing monster dimensions? Rapidly changing monster dimensions warrant their own article because each design technique is moderately complicated and has drawbacks. Next month, I’ll discuss how to manage rapidly changing monster dimensions.

Comments

Plain text