First introduced in the 1930s as wallpaper cleaner, Play-Doh is now used by children to make all manner of things, and they’re limited only by their imaginations, manual dexterity, and the amount of Play-Doh available. Similarly, when you code transformations—the T in ETL (extraction, transformation, and loading)—you are limited only by your imagination and the data available. SQL Server Integration Services (SSIS) provides powerful tools for transforming the raw Play-Doh of data into useful, meaningful tools to help a business thrive. This article aims to demonstrate a few of the many ways you can handle data transformation tasks in SSIS.
Two types of transformation can occur in SSIS. First is a simple data type transformation, such as extracting data from a comma-delimited ASCII file (where the data is stored as an ASCII string) and transforming it into an integer. Second is changing the format of data, such as taking a flattened, comma-delimited file and storing the data more efficiently in third normal form.
In both cases, you are changing the form of the data but not its substance, in the same way the form of the Play-Doh is changed without changing its substance. If your hypothetical ASCII file data indicates that a company sold five widgets, you don't change the data to say that four or six were sold—you simply change the data type from an ASCII string to an integer. Another example would be that when you obtain derived data, such as the total sales for a day, you don't change the underlying facts but rather draw conclusions from them. You can also use the transformation step to ensure data fits predetermined business rules. If it doesn't, you typically want to raise an error or create exception reports so that any problems can be analyzed and dealt with appropriately.
Transformation can raise some rather profound philosophical questions, if you are of such a mind. For example, it raises questions of form and substance similar to those raised in Aristotle's Metaphysics or the writings of Marcus Aurelius and of Ignatius of Antioch on transubstantiation. The substance of the data—what it is in its essence—remains the same while its form—the data types—may vary. (Who says those philosophy credits and a Christian Brothers education were wasted? But I digress…)
It almost goes without saying that there are all manner of business rules that can or should be enforced during the ETL process. For example, dollar values of a sale not involving return items shouldn't be negative numbers, so you would have a rule (e.g., enforced by a constraint) that a monetary value associated with a sale must be greater than zero. You might have a rule that a sale should never involve a product the company doesn't actually sell, and you could enforce this rule with a foreign key constraint comparing the SKU in a product table with the SKUs involved in any given sale. If the constraint throws an error, you would handle it in a WTF (Where's This From?) error routine. For more information about the rules you should enforce in your SSIS environment, see "SSIS Logging Best Practices."
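The two rules above could be enforced declaratively. The following is a minimal sketch; the table and column names (LossPrevention.SalesDetail, LossPrevention.Products, ActualPrice, ReturnItem, ProductID) are assumptions for illustration, not the article's actual schema:

```
-- Hypothetical names: adjust to your own schema.
-- Rule 1: a non-return line item must have a positive sale amount.
ALTER TABLE LossPrevention.SalesDetail
  ADD CONSTRAINT CK_SalesDetail_PositiveAmount
  CHECK (ReturnItem = 1 OR ActualPrice > 0);

-- Rule 2: a sale may only reference a product the company actually sells.
ALTER TABLE LossPrevention.SalesDetail
  ADD CONSTRAINT FK_SalesDetail_Products
  FOREIGN KEY (ProductID) REFERENCES LossPrevention.Products (ProductID);
```

With constraints like these in place, a violation during the load surfaces as an error your package can catch and route to an exception report.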
Why do you need to transform data? The most obvious reason is that data stored and optimized for one system will rarely match the needs of a second system designed for other, usually radically different, purposes. Because you won't have direct mappings between sources and targets in the vast majority of data transfer situations, you must consider various options. You must decide where to perform the transformation. You could, for example, simply dump an entire large file into a staging table, then transform it using SQL while the data is being moved to the "permanent" tables. You could also use various tools, such as SSIS, to transform and move data on the application side. Or you could combine these approaches. Specialty ETL tools are available for particular needs or industries, such as Microsoft's Amalga for medical systems integration. (Such specialized tools are beyond the scope of this article.)
With this background, let's consider a hypothetical set of business requirements. Big Box Stores owns and operates diverse retail chains, including huge Big Box warehouse stores, large retail operations, and other formats. The stores operate in the United States, Canada, Mexico, and Western Europe, and the brands have modified their POS systems to meet local tax and regulatory requirements.
The loss prevention department has noticed that some employees are helping themselves to five-finger discounts. The staff members use various ruses to take money from cash registers, obtain goods without paying for them, or otherwise embezzle money or steal goods from Big Box. These activities typically unfold over periods of several days or weeks. For example, employees will make purchases using the employee discount at the store where they work, then return the product for full price at another store where they aren’t known. Alternatively, they might have accomplices return the goods to the employee for a full refund.
The various methods used to steal from Big Box fall into these recognized patterns, and a good deal of this theft can be uncovered by analyzing patterns of sales transactions. Standard ETL techniques will be used to import data concerning the stores, products, and employees to a database used to analyze these patterns and detect employee theft.
You have been tasked with building a system that will import comma-delimited files exported by the POS systems into a SQL Server database that will then perform the analysis. Data concerning each sale will be sent from the POS systems.
A full day's data can also be extracted from the POS systems in the comma-delimited format that I will discuss below. Alternatively, a web service could expose the same data using the natural hierarchy of sales header and sales detail.
The different POS systems use different data types to identify stores, employees, products, and sales transactions. The load job must account for this and properly relate the data from the store to the master data loaded in a separate application. The data will be sent in two comma-delimited files, one containing the sales header data and one containing the sales details. The data format is shown in Listing 1.
Sales Header
SalesID, StoreID, EmployeeID, EmployeeFirstName, EmployeeLastName, RegisterID, RegisterLocation, storeAddress, StoreCity, StoreProvince, StorePostalCode, CustomerID, CustomerFirstName, CustomerLastName, CustomerPostalCode, Date, Time, Method of Payment, CreditCardNumber, TotalSales, Amount Tendered, Change, PriorSalesID, Return

Sales Detail
SalesID, ProductID, Quantity, markedPrice, ActualPrice, ReturnItem, DiscountCode, DiscountPercent, DiscountDescription, OriginalPurchaseDate, OriginalPurchaseStore, OriginalPurchaseSalesID, originalCustomerID, OriginalFirstName, OriginalLastName, OriginalStoreID, OriginalRegisterID, OriginalEmployeeID
Your mission is to move this data into a data mart that will use a standard star schema for analysis. Big Box intends to prosecute employees for larceny or theft based on the evidence this system gathers. Because evidence gathered through this process must stand up in court, it's vital that the data be correct, with minimal errors or problems.
The Role of SSIS
SSIS is primarily an ETL and batch data processing tool. SSIS can easily read multiple files from a network drive and provides the tools to transform data, either before or after loading it into a database. One of the great things about SSIS in this situation is its flexibility. You can load all of the data in a single batch to a staging table, then move and transform the data while it's on the way to its final destinations using SQL, or you can transform the data on the application side and load it directly to the final tables. You can use any combination of those two methods that suits your environment. You can extend SSIS to monitor directories and load data when it becomes available. Finally, SSIS integrates easily into network operations center monitoring systems and provides the ability to guarantee data security and integrity as required for this application. Moreover, SSIS does not incur licensing costs because it ships with SQL Server.
Figure 1 shows the target schema. As you might expect, in this example, the target system does not match your sources. Additionally, you can expect the iron law of business intelligence (BI) systems will be strictly enforced here—no one ever tells the BI staff anything. Psychic abilities are simply part of the BI developer's job description. You can expect changes to be made to your source systems, or data to be entered in one system but not in another. You must design your systems accordingly.
Figure 1: An example target schema
Now that your sources and targets are defined, you must decide how you will transform your data so that it is properly loaded into the target tables. You can use the objects that come with SSIS out of the box, or simply load the data into staging tables and use SQL set operations to transform and move the data to target tables—there's no set way to perform this task.
You should follow some general rules when transforming data with SSIS. First, you should use SQL for set operations. That is why you have databases to begin with. "Non-set" operations should be handled within the SSIS package, using the objects available or with custom scripts. If there's a specific object on either system that makes your life easier, use it. Always default to the system that lets you write the most efficient code. Generally, this will be where you are most comfortable.
I will make certain assumptions for this illustration that would be risky in real-world applications. I will assume that all the data you will be loading is fresh and that there are no updates to previously loaded records. I will also assume that the source files will not be locked and will be a complete data set when you begin your loading process. There are numerous ways to control for file locking by the transfer process in the real world, including the use of empty trigger files created after the comma-delimited files have been written.
Transforming Data Using SQL
To begin transforming your data using SQL, create a single SSIS package for transferring the data, with the appropriate code in two separate sequence containers. You will need a staging table on your target database, which you can create using the script in Listing 2. Note the use of separate schema for staging and reporting tables, allowing for tighter security control and easier manipulation of storage and partitioning strategies.
CREATE SCHEMA staging AUTHORIZATION dbo
GO

CREATE TABLE staging.SalesHeaderSource(
    SalesID int NULL,
    StoreID int NULL,
    storeAddress nvarchar(75) NULL,
    StoreCity nvarchar(50) NULL,
    StoreProvince nvarchar(2) NULL,
    StorePostalCode nvarchar(5) NULL,
    EmployeeID int NULL,
    EmployeeFirstName nvarchar(25) NULL,
    EmployeeLastName nvarchar(25) NULL,
    RegisterID int NULL,
    RegisterLocation nvarchar(10) NULL,
    RegisterType nvarchar(30) NULL,
    DateOfSale nvarchar(50) NULL,
    TimeofSale nvarchar(25) NULL,
    Method_of_Payment nvarchar(25) NULL,
    CreditCardNumber nvarchar(max) NULL,
    CustomerID nvarchar(max) NULL,
    CustomerFirstName nvarchar(max) NULL,
    CustomerLastName nvarchar(max) NULL,
    CustomerPostalCode nvarchar(max) NULL,
    TotalSales numeric(9, 2) NULL,
    AmountTendered numeric(9, 2) NULL,
    Change numeric(9, 2) NULL,
    PriorSalesID nvarchar(76) NULL,
    ReturnTrans nvarchar(2) NULL
)
GO
Start an SSIS project, rename the default package to transform.dtsx, and drag a sequence container from the toolbox to the package. Rename the sequence container Load Header or a similar descriptive term. Place an Execute SQL task (named Prep Staging Table for this example), a Data Flow task (named Load Staging Table), and a second Execute SQL task (named Cleanse Return Data) in the sequence container. Link the three tasks, as Figure 2 shows. These three steps will handle your staging table load.
Figure 2: The three linked tasks
You should also create connection managers for both your source comma-separated value (CSV) file and target database. Simply right-click the area of the Connection Manager tab, select the appropriate new connection manager, and follow the steps in the resulting dialog box.
As I noted earlier, you will be simply deleting data previously loaded and loading fresh data. Your prep of the staging table will be the execution of a TRUNCATE TABLE statement. The Data Flow task will take data from the .CSV file—which will, of course, be in ASCII—convert the data to the appropriate data type, and load it into the staging table. Figure 3 shows the finished Data Flow task.
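The prep step is a one-line statement. A minimal sketch, using the staging table created in Listing 2:

```
-- Clear yesterday's load; all data arriving is assumed to be fresh.
TRUNCATE TABLE staging.SalesHeaderSource
```

TRUNCATE is preferable to DELETE here because it is minimally logged and resets the table in a single operation, which matters as daily file volumes grow.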
Figure 3: The finished Data Flow task.
The CSV file does not have column names in the first row. For ease of coding (and because we mere mortals cannot remember what is in a particular field from a numeric designation), you will need to go into the source's Connection Manager and rename your columns, as Figure 4 shows. Mapping these column names to the target fields in the database will make your life easier in the long term. Once you get to the data load phase of this process, SSIS will automatically map the source column names to the target fields if the names match. Even for data that needs to be converted to a different data type in the next step, I find it easier to keep track of what goes where if it is properly labeled at this stage.
Figure 4: Renaming your columns
Next, you will need to convert your ASCII data to the appropriate data types. Those of us of a certain age were spoiled by earlier versions of SQL Server doing ASCII-to-Unicode conversions for us under the covers. You must make this explicit with SSIS. Drag a Data Conversion task into the Data Flow and open it. Select all the fields in the source and convert them to the appropriate data types, as Figure 5 shows. For example, SalesID should be converted to a four-byte signed integer, and the store address should be converted to a Unicode string.
Figure 5: Converting fields to appropriate data types
Now map the data into the correct staging table fields, as Figure 6 shows, and the data will load to the staging table. Drag an OLE DB destination into the Data Flow and map the fields to the target fields. Remember that you should be mapping the correct data types. Make sure you have the correct values, particularly for data that comes from the source as a null. SSIS will treat this null data as a string and load the word null instead of a null value. Here, you will handle this transformation by running an UPDATE statement on the staging table after you complete the load:
UPDATE staging.SalesHeaderSource
SET PriorSalesID = NULL
WHERE ISNUMERIC(PriorSalesID) = 0
GO

UPDATE staging.SalesHeaderSource
SET ReturnTrans = 0
WHERE ISNUMERIC(ReturnTrans) = 0
GO
(I describe a second method for handling this issue later in this article.)
Figure 6: Mapping data into staging table fields
Now that your data is properly staged, you will move it to the final tables using a series of SQL statements. Recall that I made an (admittedly dangerous) assumption that applies to this step—all of your data is new. Despite this assumption, I will illustrate how to use a MERGE statement, new in SQL Server 2008 and a very convenient alternative to the hoops that you once had to jump through to "upsert" data. The data will be moved in a manner consistent with the hierarchy and the foreign key constraints I have defined. First, move the data with the MERGE statement, as shown in Listing 3. Next, for the sake of simplicity, you will execute a series of INSERT statements. For example, use the INSERT statement in Listing 4 for registers. You can find the rest of the INSERT statements in the downloadable code for this article.
BEGIN
    SET NOCOUNT ON

    MERGE LossPrevention.Stores AS TARGET
    USING (SELECT DISTINCT StoreID, storeAddress, StoreCity,
                  StoreProvince, StorePostalCode
           FROM staging.SalesHeaderSource)
        AS source (StoreID, storeAddress, StoreCity,
                   StoreProvince, StorePostalCode)
    ON (TARGET.StoreID = source.StoreID)
    WHEN MATCHED THEN
        UPDATE SET storeAddress = source.storeAddress,
                   StoreCity = source.StoreCity,
                   StoreProvince = source.StoreProvince,
                   StorePostalCode = source.StorePostalCode
    WHEN NOT MATCHED THEN
        INSERT (StoreID, storeAddress, StoreCity,
                StoreProvince, StorePostalCode)
        VALUES (source.StoreID, source.storeAddress, source.StoreCity,
                source.StoreProvince, source.StorePostalCode);
END;
GO
INSERT INTO [FindCrooks].[LossPrevention].[Registers]
    ([RegisterID], [StoreID], [RegisterLocation], [RegisterType])
SELECT DISTINCT RegisterID, StoreID, RegisterLocation, RegisterType
FROM staging.SalesHeaderSource
GO
Transforming Data Using SSIS
I just showed you how to execute a transformation on the database side, but this is not always the most practical or best solution. Often you can (and should) use the data transformation objects in SSIS to execute transformations before the data reaches the database. Here, you will start with a Data Flow task once again, but after bringing the data into the SSIS process, you will transform it within that process, rather than in the database. Drag a sequence container onto the package and name it Load Details. Next, drag a Data Flow task into the sequence container and name it Get details.
As in the earlier Data Flow, drag a flat-file source into the Data Flow. You will need a new connection object to the details CSV file and, as before, you should rename the columns to make them human-readable. Now I'll show you how to load two tables, SalesDetail and ReturnSales, using an SSIS multicast object.
The first problem you encounter is that your source system has used the word null for null values. SSIS interprets this value as a string holding the word null, so you need to transform this string, particularly in the numeric and integer columns. To do so, add a Derived Column transformation and connect it to the flat-file source. As an illustration, add a new column for the Original Customer ID value by substituting an empty string for the word null. Similarly, replace null in the DiscountPercent field with 0, as Figure 7 shows—a simple REPLACE expression, required because the field should hold a numeric value.
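The expressions in the Derived Column transformation would look something like the following sketch (written in the SSIS expression language; the column names follow the detail file layout in Listing 1):

```
-- New column for Original Customer ID: drop the literal word "null"
REPLACE(originalCustomerID, "null", "")

-- DiscountPercent must be numeric, so substitute 0 for "null"
REPLACE(DiscountPercent, "null", "0")
```

This is the application-side counterpart of the UPDATE statements used in the SQL approach: the cleansing happens in the pipeline before the data ever touches the database.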
Figure 7: Replacing null with 0
You must now perform your data type conversions, as you did earlier, converting ASCII to Unicode, and strings to the appropriate numeric or integer types as required. To preserve relational integrity, you need to get the store identifier, as well as other specific data that was provided in the sales header file. You can consider this a transformation of the data set by adding an additional field. To complete this task, drag a Lookup transformation onto the Data Flow. In the data transformation stage, you will have already converted the sales ID to a four-byte signed integer data type, enabling you to relate the sales header data to the sales detail data accurately. Now, relate the sales ID value in the source data set with the SalesID field in the Sales table, as Figure 8 shows.
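Conceptually, the Lookup transformation performs the equivalent of the following join; this is only an illustration of what happens in the pipeline (the staging.SalesDetailSource name is an assumption, and no such table exists in this approach):

```
-- What the Lookup does, expressed as SQL: enrich each detail row
-- with the StoreID from the matching sales header row.
SELECT d.*, s.StoreID
FROM staging.SalesDetailSource AS d
JOIN LossPrevention.Sales AS s
    ON s.SalesID = d.SalesID;
```

Rows whose SalesID finds no match in the Sales table fall out of the Lookup's match output, so decide deliberately how to route the no-match output rather than letting those rows fail the package.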
Figure 8: Relating data in the source data set with the Sales table
To complete the load, you will send your data to two separate tables. Drag a multicast task onto the Data Flow. Think of multicasts as the object Agent Smith used in the second and third Matrix films; it creates as many exact duplicates of the data as you desire. Multicasts will let you send duplicate data sets to multiple destinations. As you can see in Figure 9, you simply send these sets to the OLE DB destinations using fast loads for the tables. (See the web-exclusive sidebar, "Fast Loads," for more information about fast loads.)
Figure 9: Sending sets to OLE DB destinations
Recall that in this application, you are looking for sales that do not meet certain rules. These sales require closer inspection because they could indicate employee theft or other scams. In this case, you send the suspicious data to a flat file. In the real world, you would send it to an Alerts table to flag it for further review.
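One such rule, sketched as T-SQL you could run against the loaded tables (the ReturnSales column names are assumptions based on the detail file layout): a return that references an original sale that does not exist in the Sales table is exactly the kind of transaction the loss prevention department wants to see.

```
-- Sketch: flag returns whose claimed original sale cannot be found.
SELECT r.SalesID, r.OriginalPurchaseSalesID
FROM LossPrevention.ReturnSales AS r
LEFT JOIN LossPrevention.Sales AS s
    ON s.SalesID = r.OriginalPurchaseSalesID
WHERE s.SalesID IS NULL;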
I have only shown you a few of the many objects available in SSIS and T-SQL to transform data during an ETL process. In addition to the Data Flow transformations available out of the box in SSIS, you can create your own data flow transformations using script objects or T-SQL. Data transformations can turn into a truly fascinating process as you work to integrate the many disparate data sources that exist in most enterprises today.