Data warehousing means more than having a large database; it also includes the process you use to manage your data warehouses and data marts. To discuss this process, we'll organize data warehousing into three tasks: data acquisition, data transformation, and data presentation. Data acquisition is moving data from anywhere to your warehouse. After you acquire the data, the second task is data transformation: You scrub and validate the acquired data and then map the source data to a uniform set of values (i.e., you transform the data). You can add data to your warehouse only after you validate, transform, and verify the integrity of the data. The third task, data presentation, entails getting the cleansed data from the warehouse to users so they can easily access it.
Data Transformation Services (DTS) provides the framework and tools for developing, automating, and managing the first two tasks. In this article, we'll discuss how DTS carries out data acquisition and data transformation.
Without DTS, transferring data from the source systems into the warehouse is laborious because source data is often in different formats and runs on different platforms. You can acquire data in two ways: Purchase an expensive third-party tool that provides native drivers for accessing each target system and a scheduling utility to coordinate the transfers. Or you can resort to the lowest-common denominator by extracting data as an ASCII text file and using batch processing to import it into the target system. Most organizations choose the second option. Although this option appears simple, by the time you design, develop, and deploy the solution, it is as complex, difficult to maintain, and costly as the first option.
DTS uses OLE DB to tackle data acquisition. OLE DB is Microsoft's solution to enable Universal Data Access (UDA). OLE DB goes beyond ODBC's capability to access only relational data. OLE DB defines a set of COM interfaces that let you access and manipulate any data type. Acting as an OLE DB consumer, DTS extracts data from any data source that acts as an OLE DB provider (i.e., offers a native OLE DB interface) or any data source that offers an ODBC interface. For example, a company hosts its inventory tracking system in a DB2 database running on a mainframe, hosts its accounting system in an Oracle database running under UNIX, and stores its human resources data in a SQL Server database. The DTS Package Designer creates an OLE DB connection to each data source to pull all this information into a SQL Server warehouse. As part of the transformation process, DTS later maps the source data fields to destination data fields through the data source connections.
In data transformation, you scrub, validate, aggregate, and summarize data after you acquire it and before it reaches its target storage location. Without DTS, this process consists of several steps. First, you must establish an intermediate holding area to store the source data, for example, staging tables in the database or storing text files on the file system. Second, you must develop applications to scan, analyze, and validate the intermediate data. Finally, you must develop additional routines to move the cleansed data from the holding area into the warehouse. Although these steps are necessary to ensure the integrity of the data, they increase complexity and add points of failure. The higher level of complexity and increased risk of failure result in higher manageability costs.
DTS' name comes from the data transformation task. DTS uses the Data Pump to simplify the data transformation process and reduce the number of failure points. The DTS Data Pump is a high-speed, in-process COM server that exposes the data stream for direct manipulation by any COM-compliant language, as Figure 1 shows. The Data Pump programmatically exposes the data stream for manipulation via ActiveX scripting languages. Developers can then validate, manipulate, aggregate, and redirect information.
Beyond the Data Pump, DTS exposes a complete set of COM interfaces that let developers create, maintain, and execute packages and let packages interact with external applications. This capability lets developers use any COM-compliant language to develop highly customized applications that use DTS' built-in functionality.
The data acquisition task requires a means to schedule and monitor data extractions. In addition, the data transformation task requires flow control. What happens when a job fails? How do you automate your process and coordinate parallel and serial steps? How do you set up conditional execution (e.g., run job B only if job A's return code is 40)? Answers to these questions are as important as the data extraction and transformation jobs.
DTS uses built-in scheduling and workflow functionality to handle such scenarios. Each package contains one or more steps that define the flow and execution of tasks within the package. In addition, DTS uses precedence constraints to further define the workflow. Figure 2 shows an example of data transformation steps executed serially and in parallel based on precedence. Figure 3 shows data transformation steps executed conditionally. DTS' ability to control and manage its processing fills a void that looms over most enterprise warehousing projects.
Putting DTS to Work
To illustrate how DTS works, we'll use the Package Designer to script a package that simulates bringing multiple data sources together into one data store. This process is a common warehousing scenario.
Suppose a large corporation needs to combine the employee data from its two subsidiary companies into one data mart and refresh that data weekly. We'll use the Employee tables from SQL Server 7.0's Northwind and pubs sample databases to simulate the subsidiary companies. We'll create a new database called HRMart to represent the parent company's data mart. Listing 1 details the HRMart schema.
We must create a new DTS package to begin the database assignment. A package is a self-contained description of all the tasks required to complete an import, export, or transformation process. Enterprise Manager, which you see in Screen 1, provides the graphical interface for DTS package creation, management, and execution. To create a new package, we right-click the Data Transformation Packages folder, and click New Package. This action opens the DTS Package Designer, which provides tools to create and edit packages.
Next, we must define the source and destination data stores for our transformation. DTS, acting as an OLE DB consumer, uses connections to define the OLE DB data provider that will furnish the data for processing. Within the Package Designer, the Data toolbar and Data menu display the available connection types. To add a connection to the package, drag it from the Data toolbar to the design window. Or, you can right-click in the design window and click Add Connection. After you add a connection to the package, you must configure it before the package can use it. Configuring connection properties is similar to configuring an ODBC connection, as Screen 2 shows. For the first part of our example, we'll create two connections, one for the pubs data and another for the HRMart data.
Now that we have established the data providers, we have to define the source and target data sets. You define the data sets in the Transform Data task. This task creates the Data Pump that will control the data stream and apply any transformations. To add this task, we must first select both the source and target connections. Then, we click the Transform Data icon (the arrow pointing to the right) or select Add Transform from the Workflow menu.
You must pay close attention to the order in which you select the connections when you add the Transform Data task. After you select the connections, you can't reverse the source and target connection without deleting and re-adding the task. To be safe, always select your source connection first and then your destination connection.
After we add the Transform Data task, we have to define the data sets. We right-click the task and select Properties. This action opens the Data Transformation Properties dialog box, which Screen 3 shows. On the Source tab, we define a query to return the employee data we need. To define the query, we directly enter our query in the SQL Query window or click Build Query to use the Query Designer. As the data source, we can use any valid SQL query such as a simple SELECT, VIEW, or stored procedure. No matter which method we use, we check our query for typos by clicking Parse Query. We're working with one source table; consequently, we could have selected the Table Name option and included the entire table in the query. After we define our query, we click Preview to view the Preview Data dialog box, which Screen 4 shows.
Next, we define the destination. For our example, the target table in the Table Name text box is HRMart.dbo.Employee. We don't make any changes to the Destination tab. To create a new destination table, we can click Create New. By default, DTS provides a destination table definition that matches the result set our source query defined.
Now that we have defined the source and destination data sets, we define how the columns map between the two. We map the columns on the Transformations tab of the Data Transformation Properties dialog box, which Screen 5 shows. By default, DTS maps each source column to a destination column in column order. If the source and destination columns don't match, we need to unmap the defaults and remap to the correct destinations. Next, we have to choose how to map our data. We can use the Column Copy option to remap each source column to a destination column. This method is the default. We can use the ActiveX script to map the individual columns. Or, we can use a combination of the Column Copy option and the ActiveX script. For this example, we'll use the combination method.
Before we add our transformation, we must delete the default column mappings by selecting and deleting each transformation line between the source and destination tables. For our example, we must unmap all the defaults. Our source query doesn't account for all the destination columns, so we have to give some destination columns default values. Next, we redefine the mappings for fname, minit, and lname by selecting them in the source table, selecting their corresponding columns in the destination table, and clicking New to create a New Transformation of type Copy Column.
To verify that we mapped the columns correctly, we double-click the transformation line between the two tables or right-click the transformation line, and click Properties. Screen 6 shows the Column Order dialog box in which we can adjust the mappings. The Column Order dialog box displays only the columns we select on the Transformations tab. We mapped the three columns with one transformation; however, we could have mapped them individually.
Next, we'll add the transformations for the remaining columns. For our example, we'll transform all the remaining columns in one ActiveX script. We start by selecting the remaining columns from the source and destination tables and clicking New to create a New Transformation of type ActiveX Script. Clicking New opens the ActiveX Script Transformation Properties dialog box, which Screen 7 shows. We can implement our custom transformations from this dialog box.
For our example, we'll use VB Script, but we can use any installed ActiveX scripting language, such as JScript or Perl Script. By default, DTS generates a script that maps the selected columns in column order. Again, we have to unmap these transformations and replace them with the correct transformations.
The DTSDestination and DTSSource objects represent the connections that we established to the pubs and HRMart databases. These objects expose the data stream and provide the interface in which we manipulate and transform the columns of data one row at a time. Using a script to transform the remaining columns lets us take advantage of all the functionality a scripting language offers, including the ability to leverage COM objects outside the DTS environment. We can click Parse to parse our script to check whether it's syntactically correct and Test to test the script's logic. The test script writes the data to a temporary file rather than to the destination table.
The Advanced tab in the Data Transformations Properties dialog box lets us increase control over how DTS performs the transformations. For our example, we'll use the default settings.
Next, we'll add a connection to the Northwind database and a Transform Data task that will move the data from Northwind's Employee table to HRMart's Employee table. We follow the same steps as we did with the pubs to HRMart transformations. Listing 2 details the SQL query you can use to extract data from the Northwind database. And Listing 3 shows the VB Script you can use to define the Northwind data into HRMart data transformations.
We now face the problem of coordinating the data refresh. We want to schedule a refresh of the HRMart data each week. To complete this assignment, we need to add an Execute SQL Task to our package to purge the HRMart Employee table. We can add an Execute SQL Task to the design sheet by dragging it from the Tasks toolbar; right-clicking in the design window, and selecting Add Task; or selecting Tasks, Add Tasks.
Next, we must define the SQL task. The Execute SQL Properties dialog box uses an existing connection to reach its target data. We'll use the previously defined HRMart connection. We can use any valid SQL statement, including stored procedures, in the Execute SQL Properties SQL statement window. We'll use a simple DELETE FROM Employee SQL statement to purge the table.
Now, we need to ensure that our tasks execute in the correct order. First, we need to purge the HRMart data. If the purge is successful, we need to parallel process the pubs to HRMart and Northwind to HRMart transformations. DTS lets developers assign precedence constraints to tasks, including On Success, On Failure, and On Completion. By assigning these constraints in different combinations, developers can implement serial, parallel, and conditional processing.
To establish task precedence, select the source step (HRMart Purge SQL task), then select the destination step (the pubs connection). Next, select Workflow, On Success. This action establishes the precedence with the pubs to HRMart transformation task. You use tasks only in workflow; although we graphically select the pubs connection as the destination step, this selection represents the beginning of the pubs to HRMart transformation task.
DTS offers another method to define precedence. To illustrate this alternative, we'll use it to define the HRMart purge and Northwind to HRMart precedence. First, right-click the Northwind to HRMart transformation task, and click Workflow, which opens the Workflow Properties dialog box. In this dialog box, define any precedence constraints necessary for this task. We select the HRMart Purge task as the source task and Success as the precedence constraint. This configuration tells DTS to run the Northwind to HRMart transformation task only after the HRMart Purge task successfully completes. We enabled the two transformation tasks to execute in parallel. You can use the Options tab in the Workflow Properties dialog box to improve control over the operation. For our example, we'll use the default settings.
Next, we need to save the package to our local server, as Screen 8 shows. In addition to standard SQL Server user security, DTS applies security at the package level. We omitted package-level security. Screen 9 shows our completed package.
SQL Server Agent
You can use the SQL Server Agent or the NT Schedule service to schedule the package for periodic execution. We used the following command-line syntax to execute the package (type as one line):
dtsrun /S Acorn /N "HRMart Refresh" /U sa /P
The dtsrun.exe utility lets you execute, retrieve, delete, and overwrite any DTS package. We supplied the server name (/S), the package name (/N), a SQL Server user ID (/U), and the user ID's password (/P) as parameters.
DTS brings a new dimension to data warehousing. We discussed how DTS attacks two troublesome warehousing tasks, data acquisition and data transformation, and demonstrated how easily you can use the DTS Package Designer to script a package. However, we barely touched on the various COM interfaces that DTS exposes for developing customized transformations. We'll explore some of these programmatic solutions in future articles.