When you’re designing a data warehouse, you’ll occasionally run into attributes from the source database(s) that don’t fit into neat, tight star schemas. We’ve all seen OLTP tables that are full of flag fields and yes/no attributes, many of which are used for operational support and have no documentation except for the column names and the memory banks of the person who created them. So how should you handle open-ended text and comment attributes, many of which are badly designed in the OLTP schemas? Not only do those types of attributes not integrate easily into conventional dimensions such as Customer, Vendor, Time, Location, and Product, but you also don’t want to carry bad design into the data warehouse. However, some of the miscellaneous attributes will contain data that has significant business value, so you have to do something with them.
There are three conventional ways to deal with these attributes: discard all of the miscellaneous attributes, eliminating them from the dimensional design; incorporate the miscellaneous attributes into the fact table; or make each miscellaneous attribute a separate dimension. However, all of these options are less than ideal. Discarding the data can be dangerous because the miscellaneous values, flags, and yes/no fields might contain valuable business data. Including the miscellaneous attributes in the fact table could cause the fact table to swell to alarming proportions, especially if you have more than just a few miscellaneous attributes. The increased size of the fact table could cause serious performance problems because of the reduced number of records per physical I/O. Even if you tried to index these fields to minimize the performance problems, you still wouldn’t gain anything because so many of the miscellaneous fields contain flag values such as 0 and 1; Y and N; or open, pending, and closed. (For more information about indexing, see “Indexing Dos and Don’ts,” January 2003.) And if you make each miscellaneous attribute a separate dimension, it will most likely result in a complicated dimensional design. For example, a star schema for a shop floor or manufacturing activity would include the standard dimensions of people (e.g., customers, vendors, employees), time, location, and inventory or product, which is a tight, straightforward design for a four-dimensional cube. But once you start incorporating 10 or 20 additional dimensions, each of which is a yes/no attribute, a status field, or an open-ended comment, you’re looking at a much more complicated star schema and associated cube.
Determining the Value of Miscellaneous Attributes
You can determine which miscellaneous attributes have business value through a process called discovery. You have to ask the right people the right questions to determine if these miscellaneous flag and text attributes are truly useful, and you can’t be satisfied with only one person’s opinion. For example, someone in the production department might consider the open-ended comments to be truly important, while someone in the sales department might not think so. Your discovery process has to be all-encompassing and thorough. You must understand how the data in transactional databases is used and for whom it has value before you can determine if the miscellaneous attributes should be retained or discarded. Here are my suggestions for how to handle flag or comment attributes that have business value.
Handling Flag Fields and Yes/No Attributes
I’m sure you’ve seen many examples of flag fields and yes/no data that have no documentation regarding how they’re used. This scenario is especially common in legacy systems and databases that were created without solid, underlying design principles. Column names such as Completed, Packed, Shipped, Received, Delivered, and Returned (each with yes/no data values) are very common, and they do have business value. Instead of discarding flag fields and yes/no attributes, I suggest placing them all into a junk dimension that’s organized as shown in Figure 1.
The junk dimension shown in Figure 1 represents an order-fulfillment system; the column headers show some of the possible statuses an item that has been ordered can have. Row 1 indicates that the item ordered has been picked out of the warehouse, packed for shipment, shipped, delivered, received, returned for a refund, and restocked in the warehouse. Row 9 shows an item on order that’s waiting to begin the order-fulfillment process. The rows in between indicate items that are in various stages of the orderfulfillment process. This example is very simple because the process is so linear and sequential, and NULL conditions aren’t allowed in this transactional database. As in any dimensional design, each of the rows in the fact table will be associated with a row in this junk dimension. Even if your set of miscellaneous attributes isn’t as sequential or you have to create a dimension that contains all possible combinations of yes and no, there will still be only 256 records in the entire dimension. As with all dimensional design, don’t forget to add an identity column to the junk dimension as a primary key and to include the junk key column in the fact table as a foreign key.
Handling Comment and Open-ended Text Attributes
You can handle comment data and free-form, openended text fields by creating a special text-based junk dimension. If the verbiage in these fields is potentially valuable, create the text-based junk dimension with two columns, the key column and the text column, as shown in Figure 2. Include the text junk key column in the fact table as a foreign key, and don’t forget to add a “no comment” record to the text-based junk dimension for those facts that have no associated text. Most likely, these text fields in the source systems will be used only sporadically, so the text-based junk dimension will be much smaller than its associated fact table. Figure 2 also shows how the yes/no attribute junk dimension described earlier will relate to the fact table.
Keeping Your Data Warehouse Design Simple
You want to keep the data warehouse design as simple and straightforward as possible, so that users will be able to access data easily. Miscellaneous attributes that contain business value are a challenge to include in your data warehouse design because they don’t fit neatly into conventional dimensions, and if improperly handled, can cause the data warehouse to swell in size and perform suboptimally. By placing miscellaneous attributes into junk dimensions, you can circumvent both of these problems.