Data Warehousing: Rapidly Changing Monster Dimensions

Monster dimensions require special handling and advanced planning, especially when you’re dealing with rapidly changing monster dimensions. Rapidly changing dimensions (i.e., dimensions whose content changes as a result of input errors or changes over time that you’re recording in place) aren’t too much of a challenge when the dimensions are small. However, monster dimensions consume large amounts of maintenance time and CPU cycles because of their size, so you need to plan carefully when you’re dealing with them. Let’s explore two different methods for managing rapidly changing monster dimensions: first, splitting the dimension, and second, creating a set of non-overlapping value ranges for each rapidly changing attribute and then creating rows that include all possible range combinations.

Split the Dimension

You can manage monster dimensions by separating the stable attributes from the attributes that change frequently (i.e., the “hot” attributes), and then link the split dimensions through a fact table. Doing so will shorten the length of the records that are being changed as the hot attributes are modified. This vertical partitioning offers great flexibility. You get better performance because you’re not reading and writing all the customers’ data all the time. One way to manage huge files, such as monster dimensions, is to define multiple data files per file group and to spread the data files across multiple disks. By doing so, you’ll exploit Windows’ native parallel I/O, which optimizes disk reads and writes.

Figure 1 shows a split monster dimension. At one time, all the attributes were in one large Customer dimension. But the number of rows grew so large that updates and retrievals became cumbersome, so the dimension was split into the Customer and CustDemographics dimensions. Customer contains stable and slowly changing data (customers’ names and addresses). Note that each customer record includes a current address and a previous address, which is an example of creating an “old value” column in the dimension record to store the previous value. (For more information about creating an old value column in the dimension, see “Data Warehousing: Slowly Changing Dimensions,” January 2008.) Here, this approach lets you deal with the slowly changing address information.

In this case, we’re not interested in maintaining an address history, but rather in how a change of address might affect the customers’ buying habits. CustDemographics contains all the rapidly changing attributes (i.e., income, education, family size and status, credit, and purchases) that correspond to factors influencing customer buying habits.

As with all dimension tables, Customer and CustDemographics will each have a surrogate primary key. Unlike in transactional database design, however, there’s no direct relationship between Customer and CustDemographics. Instead, Customer is related to CustDemographics through a fact table, as shown in Figure 1. When you relate the two dimensions through a fact table, you’re adhering to a single-level star schema, which is crucial when dealing with very large data warehouses. I don’t recommend snowflaking the design or creating a direct one-to-many (1:M) relationship between two dimensions because inter-dimensional relationships complicate the scenario and make it more difficult for business users to traverse the schema.

Band Hot Attributes

One downside to splitting a dimension is that if you’re adding a new record to CustDemographics whenever change occurs, you’ll have to edit and update both CustDemographics and the fact table. Each person in the Customer dimension will have exactly one corresponding row in the CustDemographics dimension, so the relationship between each dimension and the fact table will be 1:1, instead of 1:M. You’ll also be wrestling with not one but two monster-sized tables.

There are two ways around this problem: You can overwrite the values in CustDemographics as change occurs, or you can “band” the values in CustDemographics. (Note: Use this approach only if you don’t care about the history of change.) Banding an attribute means creating a set of non-overlapping value ranges for each rapidly changing attribute, and then creating rows in the CustDemographics dimension that include all possible range combinations. For example, if the values for Cust_Income (in Figure 1) were banded rather than discreet, you would see boundary values, such as $20,000 or $40,000, instead of discreet numbers (e.g., $75,357). You pre-populate CustDemographics with all possible combinations of these banded values, but you can predict how many rows will be in the dimension. If you have six attributes and each attribute is limited to 10 possible values, then the maximum number of rows in the dimension will be 106, or 1,000,000 rows. Now the relationship between each dimension and the fact table is 1:M, and you can capture change as it happens, within the limits defined by the banded values. When a customer’s income increases or decreases enough to move it from one income band to another, then the change is recorded in the fact table.

Figure 2 shows banding. Note that the boundary values (e.g., $20,000, $40,000) are “range right” (i.e., a value equal to $20,000 would be in the $20,000–$40,000 range). Banding causes CustDemographics to become a static list of values. Now, instead of having to edit or update both CustDemographics and the fact table, you have to edit or update only the fact table.

Also, banding reduces the fact table’s size. In the unbanded case, the relationship between CustDemographics and the fact table is 1:1, so for every row in CustDemographics there’s a corresponding row in the fact table. With banding, although the CustDemographics dimension is larger (because it contains all possible combinations of banded boundary values), not every combination will be used, and not every change in a customer demographics attribute will be recorded in the fact table.

The fact table contains an entry for each customer during the first recorded time period—in this case, the year 2005. However, you can expect some customers to not have yearly entries in the fact table after the first recorded time period. For example, you can see in Figure 2 that CustKEY 1 (Abby) has an entry for 2005 and 2006, but not for 2007. The implication is that the changes that happened to Abby in 2007 weren’t enough to change her demographics from 2006.

One drawback to banding is that by restricting fact table sampling to ranges instead of discreet values, you can miss fine details. Also, once you’ve chosen the boundary values, you’re stuck with them. You can combine two adjacent values (i.e., make a wider band), but you can’t create smaller bands and stay consistent with data you’ve already collected.

Be wary of creating too many bands because the total number of rows in the banded dimension is the number of boundary values raised to the power of the number of attributes you’re banding. Consider splitting CustDemographics even further. Dimensions such as income/education and family demographics offer even more flexibility than the original CustDemographics dimension.

If you’ve been following the discussion thus far, you’re probably already asking “Isn’t this demographic fact table going to be monstrously large?” The short answer is yes, and next month, I’ll discuss how to partition monstrously large fact tables.

Comments

Plain text