
Build a Simple ETL System with SSIS

Get the big picture of basic package design and construction

Downloads
93114.zip

You probably know that you can use SQL Server 2005 Integration Services (SSIS) to build an Extract, Transform, and Load (ETL) system to populate a data warehouse. But how do you do it? What does an ETL application look like? SSIS boasts so many features that newcomers typically feel as if they've upended a jigsaw puzzle on their desktop. It's hard to put the puzzle together without seeing the big picture, especially when pieces from other puzzles are mixed in or when some pieces are missing! Our goal in this article is to provide that big picture: We'll cover basic SSIS package design and construction, and in doing so we'll provide a foundation for later studying incremental load techniques for dimensions and facts, as well as variables, scripting, and basic process auditing.

Establishing a Source


Before you start building an ETL system to populate your data warehouse, you should have developed the relational dimensional model to meet the business requirements of your user community. (For some foundational information, see "Dimensional Modeling Basics," April 2006.) You should have the source-to-target mapping that specifies where each column in the target data warehouse comes from, including a brief outline of the transformation rules. Also, you should have thought through some design and architectural concerns so that your ETL system is logical and consistent.

In this article, we use the familiar AdventureWorks database as the source system. We've already designed an abbreviated target dimensional model to hold orders data. This target model, which Figure 1 shows, has only four dimensions, including Date, which plays multiple roles. The model is simple (e.g., it has no Customer dimension), but it serves as a reasonable design pattern for building an ETL system. It illustrates dimensional-design best practices, including surrogate keys and several ways of handling changes to attributes' values.

We typically build two ETL systems: The first system loads historical data into the data warehouse, and the second system handles incremental loads. The historical load has to process more data and deal with historical incongruities, whereas the incremental load identifies new and changed rows and processes those changes. Some of the logic in the two systems will be identical, but you should plan to create two systems.

Before you start developing SSIS packages, you should do some planning. For each table in the data warehouse, think through—and document—all the data sources and transformations. The detailed system specifications you write could easily be dozens of pages long. At the very least, create a high-level map, as Figure 2 shows. In this map, the target data-warehouse tables are at the bottom, and the sources for each target table are at the top. Between the sources and targets, you should document the kinds of transformations that need to occur. (In this case, those transformations are simple.) Always note whether the dimension includes only Type 1 attributes (which are updated in place), Type 2 attributes (for which you track history), or both. Specify where you'll be capturing process metadata and performing data-quality checks.

Modularize Your Design


Although it's theoretically possible to write a single SSIS package that would populate all the tables in your data warehouse database, we don't recommend that you do so. Keep it simple by writing one package to populate each table. You should create a master package to kick off the processing of each table-specific child package. Master packages consist primarily of Execute Package tasks.

The first step is to create a project in SQL Server Business Intelligence Development Studio (BIDS) to hold the ETL application. We typically create one solution to hold all the SSIS packages, then create two projects for the historical and incremental loads. Open BIDS, and choose File, New Project. Highlight the SSIS project template. Name the project AWOrders_Hist, name the solution AWOrders, and select the Create directory for solution check box. Save the solution in a convenient location.

The next step is to create shared data sources to connect to the source databases and the target data warehouse database. By using shared data sources, all your packages will be able to share the same connection information. In our example, we're using AdventureWorks as the primary source database, and the new database AWOrders is the target. To create a shared data source, right-click Data Sources in the BIDS Solution Explorer pane and choose New Data Source.

Next, create a new empty package called Promotion, then create a package-specific connection for the source database AdventureWorks and for the target database AWOrders. Right-click the Connection Manager section at the bottom of the design space and choose New Connection From Data Source. The only Control Flow task you'll need for the first draft of the Promotion package is a Data Flow task. Find it in the Control Flow toolbox, and drag it onto the main design surface. You can edit the Data Flow task by double-clicking it or by choosing the Data Flow tab of the design surface.

Setting Up the Data Flow


The Control Flow and Data Flow design panes in BIDS look similar, but they're quite different, and their names help clarify their purposes. The Control Flow is where you control the package's logical flow. You can direct the package to take different paths under different conditions. For example, in the Control Flow, you can halt package execution if the extracted data fails a test such as a minimum number of rows.

The Data Flow pane is where data is extracted, transformed, and written to a table or file. The Data Flow efficiently processes batches of data (e.g., 10,000 rows) at a time. You can't halt package execution from within the Data Flow.

Extracting and Archiving Data


Data Flow tasks can be complicated, so we'll start with a simple one. In Figure 2, you can see that the Promotion dimension is sourced from only one table and requires little transformation. On the Data Flow tab, pull out the OLE DB Source from the toolbox pane and place it on the design space. Configuring the OLE DB Source is trivial: Simply point it to the package connection for AdventureWorks, and enter a source query. The default is to source from a table or view, but we recommend that you always write an explicit source query, as Listing 1 shows.

One reason for the explicit query is that it lets you rename columns and perform trivial transformations such as type casts. You have many opportunities within the Data Flow pipeline to do the renaming and casting, but we recommend doing it as early as possible. The renaming and casting puts negligible additional load on the source database. It's vital that your source queries minimize the load on the source system. Be particularly cautious about joining tables, and work closely with your source system DBAs to ensure that they approve of your extract queries.
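Listing 1 itself is in the article download, but a minimal sketch of the kind of explicit source query we mean, written against Sales.SpecialOffer, looks something like the following. The renamed output columns and the cast are illustrative assumptions, not necessarily what Listing 1 uses.

-- Sketch of an explicit source query for the Promotion extract.
-- Output column names and the cast are illustrative; see Listing 1
-- in the article download for the query we actually use.
SELECT
    SpecialOfferID             AS PromotionAlternateKey,
    Description                AS PromotionName,
    CAST(DiscountPct AS float) AS DiscountPercent,
    [Type]                     AS PromotionType,
    Category                   AS PromotionCategory,
    StartDate                  AS PromotionStartDate,
    EndDate                    AS PromotionEndDate,
    MinQty                     AS MinimumQuantity,
    MaxQty                     -- NULLs are handled later, in the Data Flow
FROM Sales.SpecialOffer;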

Be sure to rename the objects in your package. In this case, rename the OLE DB Source as Source from Sales SpecialOffer. By doing so, you ensure that you'll be able to make sense of what your package does when you look at it a few months later.

Your next step is to save a copy of the extracted data before you perform any additional transformation. Archiving the extract is a best practice—not a technical requirement. Your internal audit department will be happy to see that your ETL system automatically archives extracted data for an established time frame (typically a month or two). Archived extracts can be invaluable in situations in which you need to modify the ETL system and re-run a month of loads, for example.

To archive the extracted data, drag the Multicast transform onto the design surface and hook it up to the source. Multicast makes two or more identical copies of the data stream. Add a Raw File Destination to the design surface, and hook it up to the Multicast. The Raw File Destination writes a file format that's unique to SSIS; it's easy to configure and writes data very fast. However, the only way to read a raw file is from within an SSIS package, so if you want a person to be able to look at the output file, use the Flat File Destination instead. To set up a Raw File Destination, simply specify the destination file path and name, then go to the Input Columns tab to specify the columns you want to store.

Transforming the Data


Whereas archiving the extracted data is an important peripheral task, your primary goal is to transform and write the data to the target table. In Figure 2, you can see that the transformation involves fixing up a NULL value in the MaxQty column, which identifies the maximum quantity for which a marketing promotion is valid. If there is no maximum quantity (i.e., if the promotion applies no matter how many units are purchased), the source system puts NULL in MaxQty. You'll want to replace that NULL value with the largest possible integer that will fit in the column—that is, 2,147,483,647.

You could perform this transformation with SQL Server in the extract query, but we recommend moving all nontrivial transformations into SSIS. This transformation barely qualifies as nontrivial, but it makes a good example. Pull out a Derived Column transform, and hook it up to the Multicast. Replace the contents of the MaxQty column with an expression written in the SSIS expression language, as Figure 3 shows. (No doubt Microsoft had good reasons for coming up with a whole new language for SSIS expressions.) As always, rename the transform—in this case, to Replace NULLs.
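For comparison, the in-query version of this fix that we just recommended against would be a one-line change to the extract query (same source table as before):

-- The extract-query alternative to the Replace NULLs Derived Column:
-- substitute the largest int value when MaxQty is NULL.
-- We keep this logic in the Data Flow instead.
SELECT
    SpecialOfferID,
    COALESCE(MaxQty, 2147483647) AS MaxQty
FROM Sales.SpecialOffer;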

Next, drag out a second Derived Column transform and hook it up to the Replace NULLs transform. Create two new columns, InsertAuditKey and UpdateAuditKey, and set them both to -1. Adding audit keys to all tables in your data warehouse is a good idea, and we'll discuss that further in a future article. Setting the keys to -1 will work for now.

You might wonder why we told you to create two Derived Column transforms in a row. You can perform multiple transformations in one Derived Column transform, so what's the value of separating them? The value is in readability and repeatability. The first transform has to do with real data, and the second one has to do with auditing metadata. You can include the auditing metadata transform as a standard component of your packages because it's always configured the same way. Sure, it's marginally less efficient when the package runs, but the difference is tiny.

Loading the Data


Finally, it's time to write the data. Drag out an OLE DB Destination, and configure it to point to the DimPromotion table in AWOrders. Go to the OLE DB Destination Editor's Mappings pane and ensure that each target column is fed by the correct source column. Because you changed column names in the source query, everything matches up correctly, so you shouldn't need to edit anything to achieve the mappings that you see in Figure 4.
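The article doesn't reproduce the target table's DDL, but a plausible sketch of DimPromotion, based on the columns discussed so far, looks like this. Only the surrogate key, MaxQty, and the two audit keys are named in the text; the other column names and data types are assumptions.

-- Plausible sketch of the DimPromotion target table (column names and
-- types other than MaxQty and the audit keys are assumptions).
CREATE TABLE dbo.DimPromotion (
    PromotionKey          int IDENTITY(1,1) NOT NULL PRIMARY KEY, -- surrogate key
    PromotionAlternateKey int           NOT NULL, -- SpecialOfferID in the source
    PromotionName         nvarchar(255) NOT NULL,
    DiscountPercent       float         NOT NULL,
    PromotionType         nvarchar(50)  NOT NULL,
    PromotionCategory     nvarchar(50)  NOT NULL,
    PromotionStartDate    datetime      NOT NULL,
    PromotionEndDate      datetime      NULL,
    MinimumQuantity       int           NOT NULL,
    MaxQty                int           NOT NULL, -- NULLs replaced with 2,147,483,647
    InsertAuditKey        int           NOT NULL, -- -1 placeholder for now
    UpdateAuditKey        int           NOT NULL  -- -1 placeholder for now
);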

Whenever you write data to a table, you should explicitly handle errors. If for any reason a row can't be inserted into the target table, the default behavior is to fail the transform, and hence the package. Go to the Error Output tab of the Destination adapter, and change it so that error rows are redirected instead. For now, simply add a Raw File Destination to the error flow to dump any bad rows into a file, and add a Data Viewer to the error flow by right-clicking the red arrow and choosing Data Viewers, Add, OK. Several kinds of data viewers are available, but we've found the Grid to be the most useful.

Figure 5 illustrates the completed Data Flow. You can test the package by right-clicking the package name in the Solution Explorer pane and choosing Execute Package. If everything works, you should see 16 rows added to the DimPromotion table. The error flow Data Viewer appears, but there are no error rows. The Data Flow in Figure 5 is a template for all dimension-table historical-load packages: extract, archive the extract, transform, and load.

If you re-execute the package, you'll notice that it adds 16 more rows to the dimension table. That's because the dimension's primary key is a surrogate key, and we haven't added any checks to ensure that we're not adding the same row twice. It's common for historical-load packages to delete any existing rows from the dimension table before processing begins. In the Control Flow, add an Execute SQL task to the design space, and set it up to truncate the dimension table. Connect the two tasks with a precedence constraint so that the Execute SQL task runs before the Data Flow.
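A minimal statement for that Execute SQL task might look like the following; the dbo schema is an assumption, and if a fact table already references DimPromotion through a foreign key, TRUNCATE will fail and you would use DELETE instead.

-- SQLStatement for the Execute SQL task that clears the dimension
-- before a historical load (schema name assumed).
TRUNCATE TABLE dbo.DimPromotion;
-- DELETE FROM dbo.DimPromotion;  -- use this form if foreign keys prevent TRUNCATE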

3 More Dimension Tables


We have three more dimension tables to load: Date, Product, and Currency. The packages that populate them are structured the same way as the Promotion package, with two Control Flow tasks: an Execute SQL task that pre-deletes existing dimension rows and a Data Flow task that extracts, transforms, and loads. As you can see in Figure 6, the Data Flow for Product is somewhat more complex than for Promotion, but it's still basically the same. All the packages described in this article are available for download from the Microsoft Data Warehouse Toolkit Book Website at http://www.msftdwtoolkit.com.

The only new piece of the Product package is the Lookup transform. If you refer back to Figure 2, you'll see that we source DimProduct from three tables. In the interest of keeping source system queries simple, our source query included only two of those three tables. (We have only about 500 rows for the Product dimension, so we actually could have joined a dozen tables without hurting anything.)

If you can't or won't join tables in the source query, you can often use the Lookup transform to perform that operation within the Data Flow, as you see in Figure 6. For every row in the flow, Lookup looks up one or more columns in a secondary table. When you set up the Lookup transform, you first specify the connection for the database in which the lookup table resides. Next, you specify a source query that brings back only the columns you need from the lookup table. The default configuration points to the entire table, but because the high-performance Lookup caches its reference data in memory, you want to conserve that memory by bringing back only what you need. Finally, you go to the Columns tab of the Lookup Transformation Editor and specify the join column. On the same tab, check the columns in the lookup table that you want to add to the data flow.
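A trimmed lookup query of the kind described might look like the following. We're assuming the lookup reference is Production.ProductSubcategory joined to Production.ProductCategory to pick up category names; which table the article's Product package actually looks up isn't spelled out here.

-- Reference query for the Lookup transform: return only the join column
-- and the columns to add to the flow, to keep the in-memory cache small.
-- The choice of tables is an assumption for illustration.
SELECT
    psc.ProductSubcategoryID,     -- join column
    psc.Name AS SubcategoryName,  -- columns added to the data flow
    pc.Name  AS CategoryName
FROM Production.ProductSubcategory AS psc
JOIN Production.ProductCategory AS pc
    ON pc.ProductCategoryID = psc.ProductCategoryID;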

After you've tested each dimension table's package individually, create a master package to tie together their execution, as Figure 7 shows. The master package contains Execute Package tasks. Execute Package tasks are easy to configure: You need only identify the package to be executed.

Finishing Up


You're off to a good start, but you have a lot more to do. Your next step is to finish up the historical load by working through the process of loading the fact table, although that step can wait for now.

In a future article, we'll discuss the next part of the ETL process, which includes creating packages to handle incremental processing and fact table loads. The challenge for incremental dimension processing is to take care of changes in dimension attribute values. An enterprise-class SSIS system will use variables extensively, which among other things will let you replace the placeholder values in your auditing columns. Most production SSIS systems also use some scripting to tie things together.

Most SSIS systems use only a subset of the tasks and transforms that are available. In the simple packages that we described in this article, we used 3 of the 40 Control Flow tasks: Data Flow, Execute SQL, and Execute Package. And we used 7 of the 45 Data Flow objects: OLE DB Source and Destination, Multicast, Raw File Destination, Derived Column, Lookup, and Union All. You can use these basic building blocks to get started.
