SSIS Novices’ Guide to Data Warehouses: Flattening While Staging the Data

In "SSIS Novices' Guide to Data Warehouses: Moving Data Into the Data Warehouse," I showed you the basic structure of a data warehouse whose databases contain sets of tables that store raw, staged, and dimensionally modeled data. These tables are referred to as the Raw tables, Stage tables, and Dimensional tables, respectively. I also showed you how to create a SQL Server Integration Services (SSIS) package called the Raw package. This package is used to move a near exact copy of your source data from an external location (probably the transactional database and server) to the Raw tables.

The next step in building a data warehouse is moving the data from the Raw tables to the Stage tables, which is referred to as staging the data. During the staging step, many changes are made to both the data's structure and content. The changes are driven by business rules and dimensional architectural needs.

When explaining the kind of changes that take place in the staging step, I like to refer to something called the Five F Words. Although some aspects of data warehousing can be frustrating, none of these F words will get you in trouble at work. In fact, only one has four letters, and that word is Flag.

As Figure 1 shows, the Five F Words are Flatten, Fix, Flag, Filter, and Figure. In this article, I'll cover Flatten. I like to start with this F Word because flattening data is often the hardest concept for new warehouse designers to grasp, especially if they come from a transactional database design background.

Figure 1: The Five F Words

Flattening 101

Designing a transactional database is all about getting it into third normal form (3NF). Normalization becomes second nature to database developers, which is why data warehouse design often feels very foreign to database developers who are building their first data warehouse. Many tables within warehouses are denormalized or flattened when compared to their transactional system counterparts. Transactional systems need to load new data quickly and can return report data slowly, as reporting is a secondary, less crucial activity. Warehouses can load data slowly (often in the wee hours of the morning), but should return report and ad-hoc query data quickly, as reporting is the primary function of a data warehouse. Denormalization or flattening of data enables fast reporting.

To denormalize a database, you combine one or more tables into a single table. In a standard, no-frills dimensional database design, you flatten the various tables of the transactional system so that relationships between the tables form a star pattern. This is referred to as a star schema. (Other schemas exist, but I recommend that new designers always start with the star schema.)

As a general rule, transactions and events from the transactional database become facts, whereas lookup tables and profile data become dimensions. The fact table becomes the center of the dimensional model; each dimension hangs off the fact table, just one key relationship away from the center of the model.

Knowing which tables need to be flattened is the essence of dimensional database design. New data warehouse developers shouldn't expect to get it right the first time, every time. But take heart in the idea that it will get easier the more you do it.

Flattening StateUBags' Transactional System

In "SSIS Novices' Guide to Data Warehouses: Moving Data Into the Data Warehouse," I presented a sample scenario in which the StateUBags website sells backpacks and messenger bags to college students. As Figure 2 shows, StateUBags' transactional database system uses an Order table to track sales. The Order table rows have child records in the Order Detail table.

Figure 2: Transactional database model

To move from a fully normalized model to a denormalized dimensional model, you typically need to combine, add, and remove tables. For this example, you need to combine the Brand, Line, Product, and Manufacturer tables into a single denormalized table, as Figure 3 shows. You then link this new dimension to the Fact Orders table through the Product ID. Similarly, you need to combine the Color and Color Group tables into a single denormalized table, which links to the Fact Orders table through the Color ID.

Figure 3: Dimensional database model

To demonstrate how this will create a denormalized structure, suppose that product "A" is always the color "Daffodil" and the Color Group for "Daffodil" products is always "Happy Colors." In a normalized structure, product "A" would be defined in the product table with a list of attributes. One of these attributes would be a link to the "Daffodil" value in the Color table. A link in the Color table would then note that "Daffodil" is in the Color Group named "Happy Colors."

If you have 1,000 transactions in which product "A" was sold, all 1,000 rows would point back to a single row in the Color table, which would in turn point to a single row in the Color Group table. The word "Daffodil" would be stored in the database just once in that single row in the Color table and the label "Happy Colors" would also appear just once in a single row in the Color Group table. When reports were run from the normalized database for product "A", the word "Daffodil" and the label "Happy Colors" would be assigned back to product "A" through a lookup process.

In a normalized model, it's considered redundant and very bad practice to store "Daffodil" in every "A" row if "A" always resolves to the color "Daffodil". However, in a dimensional model, this is exactly what you do. You flatten things out so that if you have 1,000 transactions in which product "A" was sold, you have 1,000 rows that include either the actual word "Daffodil" or a key tying back to the "Daffodil" row. Likewise, with the denormalization of Color and Color Group into a single table, each Color assigned to the "Happy Colors" group will have the words "Happy Colors" repeated within its row. So, if "Tangerine" and "Cornflower" are both in the "Happy Colors" group along with "Daffodil", you'll now see the phrase "Happy Colors" appear in all three rows.

Another flattening you need to perform is with the Order and Order Detail tables. These tables form the core of the Fact Orders table. During this flattening, you also need to include the Product table because this is an easy way to get the Style, Pattern, and Color IDs. (The Product table is used twice in the stage loads.) Similarly, the Customer and Customer Address tables are part of the flattening to get the customer IDs and customer zip codes.

The StateUBags staff wants to be able to look at orders by zip codes rather than by individual customers. This creates the need for a new Geography table and obviates the need for the Customer and Customer Address tables in the dimensional model. The Geography table links to the Fact Orders table through a Zip Code field. You can build the Geography dimension using zip code data that's readily available on the web. Finally, you can use the Pattern and Style tables from the transactional system in the dimensional model with very little change.

Now that you know the general process for getting from the transactional database model to the dimensional model, I'll walk you through the process for creating the Color dimension table and Fact Orders table. At the top of the page, you'll find a script, BuildYourFirstDW_Pt2.sql, that creates the database (SUBStage) and the tables needed to follow along. In addition, you'll find the completed SSIS package (From_Raw_to_Stage.dtsx). To download the script and SSIS package, click the "Download the Code" icon at the top of this article.

Creating the Color Dimension Table

In "SSIS Novices' Guide to Data Warehouses: Moving Data Into the Data Warehouse," I showed you how to create the Raw package and add it to an SSIS project. Now you need to create an SSIS package to hold the Stage table loads. Typically, you would create separate SSIS packages for the different Stage table loads, but for simplicity let's place several load processes into a single package.

With your project open in Business Intelligence Development Studio (BIDS), right-click the SSIS Packages folder and click New SSIS Package. After the package has been added to your BIDS project, change the name to From_Raw_to_Stage.dtsx. When prompted, confirm that you want to rename the package object as well.

Now you need to create two data connections for your new SSIS package. Right-click somewhere in the Connection Managers area (it's at the bottom of the center pane) and choose New OLE DB Connection. Create one connection that points to the server and database that house your Raw tables. Create another connection that points to the server and database that house your Stage tables. You can name these connections to your liking, but I'll refer to them here as the Raw and Stage connections. Note that if your Raw and Stage tables are on the same server and in the same database, you can create a single connection. However, I recommend creating two connections anyway because it'll give you the flexibility to separate the table groups in the future.

In the Toolbox, locate the Data Flow Task and drag it onto the design surface of the Control Flow tab. Edit the name of the Data Flow Task to something meaningful. For this example, I'll name it Load Color Dimension Staging Table.

Double-click the new Data Flow Task, which will move you to the Data Flow tab. On the Data Flow tab, locate the Data Flow Sources group in the Toolbox, find the OLE DB Source component within that group, and drag it onto the design surface. Change the name of the source component to something meaningful, such as Flatten Color and Color Group tables.

Double-click the new source component to open the OLE DB Source Editor. As Figure 4 shows, choose Connection Manager, then select the connection that points to the Raw tables. In the Data access mode drop-down list, choose SQL command. In the SQL command text box, type or paste a SQL query that flattens two or more lookup tables from your source system.

Figure 4: Source component for the Color dimension load

Listing 1 shows the code that flattens the Color Group and Color tables. Click the Preview Button to confirm that the SQL command you entered is valid and returns the data you expect it to. If the results look good, close the preview window and click OK to close the OLE DB Source Editor.

SELECT
    cg.ColorGroupID
  , ColorGroupCode
  , ColorGroupName
  , ColorID
  , ColorCode
  , ColorName
FROM [SUBStage].[dbo].[rawColorGroup] cg
JOIN [SubStage].[dbo].[rawColor] c
ON cg.ColorGroupID = c.ColorGroupID

On the Data Flow tab, locate the Data Flow Destinations group in the Toolbox, find the OLE DB Destination component within that group, and drag it onto the design surface. Rename the destination component to something like Load stgColor. Click the source component to give it focus; two arrows will appear. Drag the green arrow from the source component to the destination component to connect the two.

Next, double-click the destination component to open the OLE DB Destination Editor. Choose Connection Manager, then select the connection that points to the Stage tables. In the Data access mode drop-down list, choose Table or view. You don't have a destination table yet, so click the New button next to the Name of the table or the view drop-down list. SSIS automatically writes the code to create the table. Carefully review the CREATE TABLE statement and edit it to meet your needs. In this case, you need to:

Change the name of the table to stgColor.
Add a new identity column named DWColorID.
Add a new column named StartDate.

(I'll discuss the purpose of the DWColorID and StartDate columns shortly.) Figure 5 shows the edited CREATE TABLE statement. Click OK to create the table and close the Create Table dialog box.

Figure 5: Destination component of the Color dimension load

Next, click Mappings, which is in the left pane of the OLE DB Destination Editor. Just by clicking mappings and viewing that screen, the warning message alerting you to perform the column mappings will be satisfied. You're not likely to need to make any column mapping changes at this point. Click OK to close the OLE DB Destination Editor and return to the Data Flow tab.

At this point, you need to add an SSIS transformation that will enable you to put dates into the StartDate fields. Click the green arrow connecting the source and destination components and delete it. Drag the source and destination components so they are a bit further apart on the screen. Locate the Data Flow Transformation type called Derived Column in the Toolbox. Drag a Derived Column transformation between the source and destination components. Rename it to something like Add a StartDate Column. Drag a green arrow from the source component to the Derived Column transformation, then drag a green arrow from the Derived Column transformation to the destination component.

Double-click the Derived Column transformation to open the Transformation Editor. Name the derived column StartDate. It will be a new column, not a replacement for an existing column, so leave the default value of under the Derived Column heading. For the expression, enter

(DT_DATE)GETDate()

Click OK to close the Transformation Editor.

You now need to map the derived column to the destination column. Double-click the destination component and go to the Mappings screen. Map the StartDate column under Input Column to the StartDate column under Destination Column, as Figure 6 shows. Notice that in Figure 6 nothing is mapped to the DWColorID column. This is acceptable because that column is set up as an identity column, so it will populate itself. Click OK to close the Destination Editor. Your stgColor table load is complete.

Figure 6: Manual mapping of the StartDate column

At this point, you need to add a step that will truncate the new stgColor table immediately prior to its load. Go to the Control Flow tab of the SSIS package. Drag an Execute SQL Task from the Toolbox onto the design surface. Rename the Execute SQL Task to something like Truncate stgColor. Drag the green arrow from the Execute SQL Task to the Data Flow Task (i.e., the Load Color Dimension Stage Table task). Double-click the Execute SQL Task to edit it. Open the Connection property and select the connection that points to the Stage tables. Open the SQLStatement property and enter the statement

TRUNCATE TABLE stgColor

You might be thinking it's counterintuitive to set up a new unique ID column (DWColorID) to use as the data warehouse key if you're going to truncate and reload it every day. You're correct in thinking this doesn't make much sense. Here's the scoop: Advanced data warehouses, especially those with large data sets, typically have sophisticated processes that increment the loads daily. In a data warehouse that uses incremental loads, you don't truncate and reload every row every day. Instead, you preserve the data from previous days, perform comparisons, and conduct incremental changes. For this example, you're creating the data warehouse key (DWColorID) and date field (StartDate) so that the structure is in place. That way, you can switch from full daily reloads to incremental loads without having to change the dimensional schema.

The process to create the Color dimension table is complete. As Figure 7 shows, it's made up of a truncate step and a load step, which are put into a Control Flow component called a Sequence container. (Sequence containers are used to better organize the flow in the package.) As your next steps, you would create similar two-step truncate and load processes for the other dimensions, except for the Geography dimension. The process to create the Geography dimension is significantly different because it uses data found on the web. For now, I'll skip that discussion so I can talk about the all important task of flattening as it's used to create the Fact Orders table. (I'll cover the Geography dimension in the next article in this series.)

Figure 7: Two-step truncate and load process

Creating the Fact Orders Table

To create the Fact Order table, you need to add new Data Flow Task to the SSIS package and add an OLE DB Source component to that task, following the instructions I provided in the "Creating the Color Dimension Table" section. After opening the source component, you need to add SQL code that flattens the Order, Order Detail, Product, Customer, and Customer Address tables into a set of denormalized rows. Listing 2 shows this code.

SELECT
    o.OrderDateISO
  , o.OrderID
  , OrderDetailID
  , o.CustomerID
  , ZipCode
  , od.ProductID
  , StyleID
  , ColorID
  , LineID
  , PatternID
FROM rawOrder o
JOIN rawOrderDetail od
ON o.OrderID = od.OrderID
JOIN rawCustomer c
ON o.CustomerID = c.CustomerID
JOIN rawCustomerAddress ca
ON o.CustomerAddressID = ca.CustomerAddressID
JOIN rawProduct p
ON od.ProductID = p.ProductID

By using a Data Flow Transformation called a Lookup, you can use the ColorID values in the original Product table to look up the DWColorID values in the Color dimension table. The DWColorID values are eventually loaded into the Fact Orders table. To set up this transformation, drag a Lookup transformation onto the design surface of the Data Flow tab. Drag a green arrow from the source component to the Lookup transformation. Double-click the Lookup transformation to edit it. In the General screen, use the following settings: Full Cache (the fastest), OLE DB Connection, and Redirect rows to no match output.

Go to the Connection screen. You'll be using the stgColor table you just loaded, so select the Stage connection. In the Use a table or view drop-down list, choose the stgColor table.

Go to the Columns screen, where you'll see the columns from the fact table on the left and the columns from the lookup table on the right. Create a connection between ColorID in the left box and ColorID in the right box, as Figure 8 shows. Think of this connection as an equals sign. Select the DWColorID check box to note that you want to pass that column out as the "looked-up" column. Click OK to return to the Data Flow tab.

Figure 8: Configuration of the DWColorID lookup column

Add a Derived Column transformation, renaming it to something like Set DWColorID to zero for failed lookups. This transformation will create a derived column named DWColorID and fill in the dummy value of zero for any ColorID that wasn't found in the stgColor table. Drag a green arrow from the Lookup transformation to the Derived Column transformation. Because there are two expected outputs from the Lookup transformation, SSIS will prompt you for a selection. You want to select No Match Output because only those rows that failed the lookup need to have a zero assigned as their DWColorID.

Now drag a Union All transformation onto the Data Flow tab. Name it something like All rows now have DWColorID assigned. Drag the remaining green arrow from the Lookup transformation (the Match Rows output) and connect it to the Union All transformation. Then drag the green output arrow from the Derived Column transformation and connect it to the Union All transformation. This completes the DWColorID lookup process.

After creating this lookup process, you need to create similar lookup processes for the other dimensions, such as Pattern and Style. For example, you'd use the PatternID values in the original Pattern table to look up the DWPatternID values in the Pattern dimension table. After you're done with all the transformations, it would be time to load all the looked-up values (e.g., DWColorID values, DWPatternID values) as well as some of the original source values into the Fact Orders table. Figure 9 shows this flow with just two lookup processes (of the implied five lookups) for illustration purposes.

Figure 9: Two lookup processes in a fact table load

Stay Tuned

In this article, I introduced you to the Five F Words of the staging step and covered one of them-Flatten-in detail. I demonstrated how to flatten two or more tables to create dimension and fact tables. I also demonstrated how to use the Derived Column, Lookup, and Union All transformations. In the next article, I'll discuss the other four F Words (Fix, Flag, Filter, and Figure).

Comments

Plain text