There are two ways that data warehouses often get built. Occasionally, someone in the database trenches pitches the idea to the higher-ups, but more often someone at the top issues a directive to build one. Being the recipient of such a directive can be frightening, especially if you have to quickly build one and you have little or no experience with using SQL Server Integration Services (SSIS). Fear not—I'll guide you in building a data warehouse, while conveying some of the philosophical foundations of data warehousing. However, I won't get into the esoteric topics that seasoned business intelligence (BI) professionals like to discuss when they geek-spar.
Providing instructions on how SSIS novices can build a data warehouse is too big to be covered in just one article, so this is the first article in a series. In this article, I'll show you the basic structure of a data warehouse and how to make an SSIS package move data from the source system into that structure.
Setting the Stage
Suppose you're the database developer and DBA for a small company, StateUBags, that sells backpacks and messenger bags online to students at your state's major university. You built a database that enables an online shopping-cart application. Your shopping-cart back end is a SQL Server database named SUBTrxn (short for StateUBags Transactions). When building SUBTrxn, you paid careful attention to getting the data into third-normal form. You applied the most genius keys and indexes, so SUBTrxn has solid referential integrity. You regularly partition and archive old sales data to keep the tables small and light.
Besides being the database developer and DBA, you're also the point person for providing reports. You hate this responsibility because you're constantly bogged down by one-off requests for data sets. You often have to resurrect data from backups, which is so cumbersome that report requesters receive their data sets long after they would've been useful. You have a folder full of SQL code that you tool and retool constantly in an effort to speed up your report-writing activities. You can't remember how most of your stored procedures work, but they seem to work well enough. You're constantly tiptoeing around land mines in the result sets by explaining the very valid reasons why the measures in one report don't match the measures in another. You hope no one with any SQL expertise ever sees the one stored procedure you rely far too heavily on—the one with the nested subquery pulling from a view that pulls from a view.
Recently, the boss of your boss's boss sent word down the command chain that he'd like you to build a data warehouse. You're thrilled, yet scared since you're not experienced with using SSIS.
The only requirement you've been given is "The suits want to be able to analyze sales, in an ad-hoc fashion, over the last five years." Your boss assures you that all you have to build is the database portion. After you complete it, a contractor will build a SQL Server Analysis Services (SSAS) cube on top of it. You have only two weeks to build it.
How do you go about building the database portion of a data warehouse? I'll walk you through the steps using a SQL Server 2008 R2 machine. Before you start, though, it helps to have a simplified representation of the current pre-warehouse environment. As you grow the current environment into a fully operational data warehouse, you'll be able to refer back to this representation as the starting point.
Figure 1 shows the current environment for StateUBags. As you can see, data is being exchanged between the SUBTrxn database and the website that the customers use to view and purchase products, as well as view their purchase history. The back end uses slow, fragmented SQL scripts and stored procedures to output ad-hoc reports.
Step 1: Add the Databases
The first step is to add databases that will house sets of tables containing raw, staged, and dimensionally modeled data. I'll refer to these sets as the Raw tables, Stage tables, and Dimensional tables, respectively.
The Raw tables house near-exact copies of the source data (in terms of both the data's structure and values) from the transactional or other external system. As a general rule, data warehouse developers try to make the Raw tables exact copies. Data warehouse developers should only depart from this rule when there are compelling and defensible reasons to do so. What constitutes compelling and defensible? The burden of proof rests on the developer who made the decision. So, make your Raw tables as "raw" as possible—and if you decide to calculate (aka figure), fix, filter, flatten, or flag before the data lands in the Raw table, be prepared to defend your decision.
The data in Stage tables significantly differs from that in the Raw tables in terms of the data's structure values. When you move the data from the Raw tables to the Stage tables, you apply business rules, flatten two or more tables into one, mark records for filtering, or perform other activities. In other words, anything you need to do to make the data ready for the data warehouse, you do during the move from Raw to Stage.
The Dimensional tables store the data long-term in a format that's tailor-made for use by SSAS cube files. Dimensional tables can be fact tables or dimension tables that come together in star, snowflake, and constellation schemas.
For the example used in this article, you need to create two new databases—SUBStg and SUBDW—as seen in Figure 2.
The SUBStg database will contain the Raw and Stage tables. Placing the Raw and Stage tables in the same database is often practical, especially in the early stages of development when the warehouse is still small enough for that to be feasible. As your data warehouse grows, you might need separate databases for the Raw and Stage tables. You might even want to have these databases on separate servers.
The SUBDW database will contain the Dimensional tables. These fact and dimension tables will be structured expressly for serving up data to be used by SSAS cubes.
Step 2: Check for Software
Now that you have your databases in place, you need to make sure that you have the necessary software. First, you need to confirm that you have SSIS installed on your machine. In Control Panel, select Administrative Tools, click Services, and look for SQL Server Integration Services in the list. You'll be using an SSIS package—specifically, an extraction, transformation, and loading (ETL) package—to create the tables in the data warehouse.
Second, you need to make sure that you have Business Intelligence Development Studio (BIDS) installed. Choose All Programs on the Start menu, click Microsoft SQL Server 2008 R2, and look for Business Intelligence Development Studio in the list. If you don't find BIDS listed, you'll likely need to re-install it from the SQL Server installation DVD. You'll be using BIDS to edit the ETL package.
Step 3: Create the Raw Package
The next step is to create the ETL package that will move data from the source system (i.e., the SUBTrxn database) into the Raw tables, as Figure 3 shows. I'll refer to this ETL package as the Raw package.
You can use the Import/Export Wizard in SQL Server Management Studio (SSMS) to create the Raw package. Although this isn't a common practice, it's a useful method to know even if you decide not to use it much in the long term. It provides an easy-to-learn, reliable path into SSIS development.
In SSMS, locate your source database (in this case, the transactional database) in the Object Explorer pane. Right-click that database, select Tasks, and choose Export Data to bring up the Import/Export Wizard. You've probably used this wizard to move data around in your databases, but you might not have saved your work as an SSIS package.
When the wizard appears, you might see an informational screen. If you do, click Next to proceed to the first configuration screen. Following the guidelines shown in Table 1, complete the first three configuration screens.
On the fourth configuration screen titled Select Source Tables and Views, you'll be presented with a list of all the tables in the database you chose as your source database. When you select a table's check box, the wizard fills in the Destination column with a table that has the same name. As Figure 4 shows, you can change the existing table's name to the name of a table that doesn't yet exist. The wizard will then create the new table for you and populate it with data from the source table. For this example, you need to add the word raw to the beginning of each table name. Prepending the Raw tables' names with raw and the Stage tables' names with stg is an easy way to keep the two types of tables separate when they're stored in the same database.
Try to select at least 10 tables from your source list. Almost any lookup table in your transactional database is a good candidate to become a dimension table in your warehouse, so choose nine lookup tables for now. You'll want to choose at least one table that represents a transaction table or an event table. Most of the time, transactions and events eventually become fact tables. (I'll be covering how to make these decisions in a future article.)
If you haven't used the Import/Export Wizard much in the past, I recommend that you click the Edit Mappings button after you're done selecting and renaming the tables. It's helpful to know about the options that are available to you.
When you're finished with the Select Source Tables and Views screen, click Next to proceed to the Save and Run Package screen. Here you need to configure some important options:
- Clear the Run Immediately check box and select the Save SSIS Package check box. At this point, you only want to save the package.
- Elect to save the package in the file system instead of the default option of SQL Server.
- Set the Package Protection level to Do not save sensitive data.
Click Next to advance to the Save SSIS Package screen. Type in a name for the Raw package and use the Browse button to navigate to the location in the file system where you want to save it. (Don't save it inside a database.) Click Finish to save the Raw package as a .dtsx file. In the screen noting that the operation was successful, click Close.
If you were to double-click the .dtsx file, it would open a window that would allow you to run the Raw package. However, executing the package at this point wouldn't be useful, so you don't need to run it. Instead, you need to incorporate this package in a project.
Step 4: Add the Raw Package to a Project
It's now time to create the project that will contain the Raw package. Begin by launching BIDS. On the File menu, choose New and select Project. Under Project Types, choose Business Intelligence Projects and select the Integration Services Project template. Name the solution as you see fit, and put it in any location that works for you. Click OK.
When your newly created project opens, you'll see the layout BIDS uses for SSIS projects. As part of the automated project setup process, BIDS has created a package named Package.dtsx and opened it in the main window. There are four tabs across the top of this window. You'll be using the Control Flow and Data Flow tabs, but not the other two tabs.
Before you can use the Control Flow and Data Flow tabs, though, you need to add the Raw package you created to the current project. In Solution Explorer, right-click the SSIS Packages folder and select Add Existing Package. In the dialog box that appears, select File System from the drop-down list for the Package location option since you saved the package in a file system instead of a database. Click the button labeled with an ellipsis (...), locate and select the Raw package, click Open, then click OK.
Note that you should use the Add Existing Package option to add a package rather than manually moving it into the project folder. If you manually move the package, it won't retain the right filename—BIDS will append the number 1 to it. When you use the Add Existing Package option and point to the file in a location outside of the project folder, BIDS makes a copy of it in the project folder without renaming it.
After the package is added to your project, double-click it to open it in the main window. Now that this package is open, the main window will show details about it. First, take a look at the Connections Managers area, which is in lower portion of the window. Notice that you have two Connection Managers: DestinationConnectionOLEDB and SourceConnectionOLEDB. Double-click to open each connection manager and look around. When you have a good sense of what properties it contains, click Cancel.
Now look at the Control Flow tab in the main window. It'll show the tasks in the package. The number of tasks you'll see is based on the number of tables you elected to copy in the Import/Export Wizard's Select Source Tables and Views screen:
- If you chose five or fewer tables, you might only have two tasks on your Control Flow tab. Following along should still be possible, but if you find it overly confusing, you might want to start over and choose more tables the second time around.
- If you selected six or more tables, you'll see that the wizard created two tasks for each group of five tables.
As Figure 5 shows, the sample Raw package lists four tasks. The first five tables are covered by the first two tasks. The second five tables are covered by the third and fourth tasks. The two types of tasks are Execute SQL Tasks and Data Flow Tasks.
Although the Import/Export Wizard already created the tasks, looking at their configurations is helpful, especially if you're new to SSIS. So, double-click Preparation SQL Task 1, which is an Execute SQL Task, to open it. Execute SQL Tasks are simple to set up and run. The most important settings are Connection, which you use to pick the connection manager you want to run your SQL code on, and SQL Statement, which you use to paste your code into the SQL Statement box. The connection manager is recorded in the Connection property and T-SQL code is written to the SQLStatement property. If you hover over or open the SQLStatement property for Preparation SQL Task 1, you'll see that it simply creates the Raw tables. Close the task.
In the Control Flow tab, right-click Preparation SQL Task 1 and select Execute Task. The task will turn yellow, then green. When you execute a task this way, you're automatically switched to debugging mode. To return to the design mode, select Stop Debugging on the Debug menu. Now, execute the task again. This time the task will turn yellow, then red. This occurs because those tables were created when you ran the task the first time.
To avoid this problem, you need to change the code in both of the Preparation SQL Tasks. Rather than dropping and recreating the tables each day, let's truncate them. Double-click Preparation SQL Task 1 to open its properties page. On the General tab, locate the SQLStatement property. When you place your cursor in the SQL Statement field, a button labeled with an ellipsis appears. Click it to launch a larger edit window.
In the SQL Statement code, replace each CREATE TABLE statement with a TRUNCATE TABLE statement. Save the edits, and return to the Control Flow tab. Execute the task again and watch it turn green this time. (If it turns red, check the syntax in your code). Return to the design mode, and repeat these steps for Preparation SQL Task 2.
When you're done changing the code, double-click Data Flow Task 1 in the Control Flow tab. It will seem as if you're opening the task but you'll actually move to the Data Flow tab in the main window. The Data Flow Task drop-down list near the top of this tab lets you move between different Data Flow Tasks without having to return to the Control Flow tab. (If you have only one Data Flow Task, the drop-down list will contain only one entry.)
Take the time to explore the tasks in the Data Flow tab so that you become familiar with how they're configured. When you are satisfied that you understand their inner workings, return to the Control Flow tab, right-click Data Flow Task 1, and execute it. Return to the Data Flow tab and watch the task run as it gives you status information using yellow, red, and green indicators, as Figure 6 shows. Another key bit of information it provides is a row count, which can be helpful in more complex packages. Return to the design mode by stopping the debugger.
Step 5: Test the Raw Package
It's time to test the entire Raw package. In Solution Explorer, right-click the package and choose Execute Package. (If you aren't given that option, make sure that you stopped the debugger.)
When you execute a package, each task will run in succession, turning from yellow (i.e., running) to green (successfully completed the task) or red (failed to complete the task). When every task turns green, shout "Green!" and put two fists in the air. This is an essential step in SSIS package building.
If a task turns red, you can try to fix the problem immediately or you can temporarily disable the task. To do the latter, you need to return to the design mode by stopping the debugger, right-clicking the task you want to disable, and choosing Disable.
Disabling tasks within the Control Flow tab is useful for troubleshooting because it allows you to continue testing other parts of the package when one or more tasks have outstanding issues or incomplete logic. However, disabling a task only solves the problem in the short term. You'll still need to find and fix the problem later. Note that you can't disable Data Flow Tasks, no matter what tab you're on.
So far, you created two databases to house the data that will be in the data warehouse. You also learned a quick and easy way to build the SSIS package that will move data from the source system into the Raw tables. Although SSIS experts would probably build the SSIS package from scratch instead of using SSMS's Import/Export Wizard, the wizard is a boon for SSIS novices who must build one quickly. In the next article, I'll show you how to use the same technique to build the package that will move data from the Raw tables into the Stage tables.