Scripting Data Flow in SQL Server Integration Services

Downloads
94910.zip

When you develop Microsoft SQL Server solutions, you might need to create SQL Server Integration Services (SSIS) packages to retrieve the data you need. In most cases, SSIS packages contain at least one Data Flow task that extracts, transforms, and loads data into the target destination. For each Data Flow task, you must include the components necessary to carry out the data flow operations. However, sometimes the built-in data flow components can’t perform a specific task. In this case, you can often use the Script component to create a custom script that extends your SSIS package.

Getting Started

To illustrate how to create such a solution, let’s use SQL Server 2005’s AdventureWorks sample database. This database’s Product table includes a column named ProductNumber, which contains string values that identify each product. Although the product number values have a similar structure and all identify the product type and category, in some cases the product number value includes the product size. The product type and category, when combined into one value (i.e., the product number), uniquely identifies each product. The ProdNumber column in Table 1 provides examples of product number values as they’re stored in the ProductNumber column in AdventureWorks’ Product table. To complete my example solution, use the SSIS Script component to extract the category identifier (i.e., the fourth through seventh characters) and product size, when applicable, and to identify each product as one of four types: frame, bike, clothes, or miscellaneous.

To accomplish this task, you must set up a series of conditions within your script to determine how to extract the necessary values. In addition to the Script component, your solution must include other components to support the data flow. For example, you need to add a component to extract data from the AdventureWorks database and a component to add data to a new table within that database. The Script component extracts the necessary data as the data passes through the data flow pipeline. Figure 1 shows the data flow after you add the necessary components. You need to add and configure components in the following order: the OLE DB Source component, the Script component, then the SQL Server Destination component.

Note that before you can develop the solution’s control flow and data flow, you must first create an SSIS project in SQL Server Business Intelligence Development Studio, then create an SSIS package. For information about accessing Business Intelligence Development Studio and creating an SSIS project and package, see “Business Intelligence Development Studio” and “Integration Services in Business Intelligence Development Studio” in SQL Server 2005 Books Online (BOL).

Adding the OLE DB Source Component

Because you’re extracting data from the AdventureWorks database, you can use the OLE DB Source component. Configure the component to extract the Name and ProductNumber columns from the Product table (in the Production schema). Be sure to arrange the result set by product name. You can use the following Transact-SQL statement to retrieve the data:

SELECT Name, ProductNumber FROM Production.Product
ORDER BY Name

When you add the OLE DB Source component, you must add a connection manager that connects to the AdventureWorks database if one doesn’t already exist. You can add this connection manager when you configure the OLE DB Source component or before you configure the component. For information about adding a connection manager, see “Connection Managers” in SQL Server 2005 BOL. For more information about the OLE DB Source component, see the topic “OLE DB Source.”

Adding the Script Component

After you add the OLE DB Source component, you can add the Script component. You can add the Script component to any part of your data flow; however, you can’t use it within your control flow. Data flow includes the operations that retrieve, modify, and load data, whereas control flow, for the most part, doesn’t include those operations but instead manages a package’s workflow. To extend the control flow, you can use the Script task, the control flow counterpart to the Script component. For more information about the Script task, see my article “Scripting Control Flow in SQL Server Integration Services,” InstantDoc ID 94584.

To use the Script component within a package, add it to the data flow as you would a predefined component. Then, configure the component’s properties and create the script necessary to perform a set of tasks. For example, many administrators use the Script component to manipulate string values that you can’t easily use predefined components to manipulate.

When you add the Script component, you need to select one of the following three component types:

Source—Retrieves data from a data source and passes it downstream to the next component in the data flow pipeline
Destination—Receives rows from an upstream component and inserts the data into the target data source
Transformation—Receives rows from an upstream component; reads, modifies, or analyzes the data; then passes the rows to a downstream component

For our text-manipulation solution, select the Transformation option and connect the output from the OLE DB Source component to the Script component. You can then configure the Script component’s properties. To access the properties, double-click the Script component to open the Script Transformation Editor. The only properties that you need to configure for this solution are the input and output columns:

Input columns—On the Input Columns tab, select both columns (Name and ProductNumber) that are passed down the pipeline from the OLE DB Source component. Because you aren’t changing the values within the columns themselves—you’ll create new columns—retain the default values for the Output Alias property for each column, and leave the Usage Type values as ReadOnly.
Output columns—On the Inputs and Outputs tab, expand the Output 0 tree in the Inputs and Outputs pane. Output 0 is the default output. For this solution, Output 0 is the only output you need. Add three output columns to the output: ProdCategory (a string type with a length of 4), ProdSize (a string type with a length of 2), and ProdType (a string type with a length of 10).

Adding the Custom Script

After you configure the input and output columns, you’re ready to write your script, which you must do in the Microsoft Visual Studio for Applications (VSA) development environment. To access this environment, click the Design Script button on the Script tab of the Script Transformation Editor. Doing so opens the VSA window.

The VSA window. The VSA window, which Figure 2 shows, displays the default components and script associated with a Script component when you add it to the data flow. The window has two main panes. The left pane displays Project Explorer, which includes the References node and a node for each project item the default script uses. The References node must include all the namespaces the script references. For the text-manipulation solution, you don’t need to add namespaces.

The right pane includes the default script necessary to implement your solution. The script displayed in the VSA window is associated only with the Script component that you used to access the window. Any changes that you make to this script affect only the associated Script component. If you want to access the script for a different component, you must access the VSA window through that component.

For the text-manipulation solution and for many of your projects, you’ll simply add to the default script and not be concerned about the other features available through the VSA window, unless you need to add a reference to Project Explorer. You must write your script in Visual Basic .NET because the VSA environment doesn’t support any other languages.

The default script. The default script, which Listing 1 shows, includes the essential elements you need to get started. The script will vary depending on the type of Script component you set up when you add the component to your data flow. However, many of the essential elements are the same.

The default script begins with several comments. You might want to delete these comments to make your code less cluttered or to add comments that are specific to your solution.

The actual script begins with the Imports statements that call namespaces from the .NET Framework system libraries and the SSIS runtime libraries (shown in Callout A of Listing 1). Including these statements prevents you from having to use fully qualified names within the script body. For example, if you add an Imports statement to call the System.IO namespace, you can reference the FileInfo function without preceding it with System.IO.

The next element in the script (shown in Callout B of Listing 1) creates an instance of the ScriptMain class. The class provides the entry point for SSIS to call the script at runtime. In most cases, any script that you add will be within a method in the ScriptMain class. Also notice in Callout B that the ScriptMain class inherits the UserComponent class in the ComponentWrapper project item, which lets your script interact with the data as it passes through the data flow pipeline.

The last default element (shown in Callout C of Listing 1) is the ProcessInputRow method for Input0, the default input in the Script component. The script engine calls this method for each row that passes though the pipeline, letting you interact with each row without setting up a complex looping structure. For the text-manipulation solution, you need to use only the ProcessInputRow method, which means that you add your entire script within this method (indicated by the Add your code here comment in Listing 1).

The custom script. The script you add to the ProcessInputRow method consists primarily of a series of If statements that test for specific conditions based on the values in the ProductNumber column. Because of the way the values are defined, values that are longer than 7 characters include the product size in the product number value. In addition, values that are 10 characters long indicate either a bike or a frame, and values that are 9 characters long indicate clothing. The prefix BK precedes the bike product numbers, and FR precedes the frame product numbers. You can use this information to set up the necessary conditions to extract the data you need.

Listing 2 shows the entire custom script, without the default comments. As Callout A shows, you must declare a variable to hold the two-letter prefix from the product number values. This prefix indicates whether the products are bikes or frames. Declare the variable as a string and use the UCase function to retrieve the value in uppercase, to support case-sensitive comparisons later in the script. To retrieve the two-letter prefix, use the ProductNumber property of the Row object, which is the object the ProcessInputRow method creates for each row. The ProductNumber property holds the value of the ProductNumber column for the specific row. To extract the prefix, call the property’s Substring method and pass the arguments 0 and 2. The first argument indicates the starting position (based on a 0-based index) and the second value indicates the number of characters to extract.

Next, create an outer If statement (shown in Callout B of Listing 2) that determines whether the ProductNumber value is longer than 7 characters. If the value is longer than 7 characters, you can assume that a product size is associated with the value. For these products, determine the size and type and assign them to the corresponding properties. To determine a value’s length, call the Length property of the Row.ProductNumber property. To determine a product’s size, use the Substring function for the ProductNumber property to extract the last one or two characters, depending on the product type. Then, assign the extracted value to the Row object’s ProdSize property. To set the product type, assign the applicable value to the Row object’s ProdType property.

Then, create another If statement (shown in Callout C of Listing 2) within the outer If statement, to determine whether the length of the product number is 10. If it is 10, the product is a bike or a frame. In that case, extract the size by calling the Substring method and passing the arguments 8 and 2 to retrieve the size. Also, if the size is 10, implement an If-ElseIf-Else statement block within the preceding If statement. If the prefix is BK, set the product type to Bike; if the prefix is FR, set it to Frame. Otherwise, set the product type to Misc.

If the length of the product number value isn’t 10, you can assume that it’s 9 (because you already know it’s greater than 7, and 9 is the only other option because product numbers never have a length of 8). Define an Else statement that retrieves the size, which in this case is just one character, and that specifies Clothes as the product type.

You must then add the Else statement to the outer If statement. The Else statement runs when none of the previous conditions are met. Simply set the product type to Misc, and don’t extract any sizes.

After you determine the product type and set any sizes, the last step is to extract the product category from the original product number values (shown in Callout D of Listing 2). Because you must extract the product category for all product types, you don’t need to test for a condition before extracting the value. Instead, you simply use the Substring method and pass the arguments 3 and 4. You then assign this value to the Row object’s ProdCategory property.

Adding the SQL Server Destination Component

The final step in developing your solution is to add a SQL Server Destination component to your data flow. However, a table must first exist to hold the extracted data. For the text-manipulation solution, you can simply add the table that Listing 3 shows (i.e., NewProducts) to the AdventureWorks database. Notice that the NewProducts table definition includes the original two columns extracted from the Product table, as well as the additional columns that you’re outputting from the Script component.

After you add the table, you can add a SQL Server Destination component to the data flow and connect the output from the Source component to the SQL Server Destination component. Then, configure the destination component by mapping each input column passed down the pipeline to the available destination columns in the NewProducts table. Because you added the NewProducts table to the AdventureWorks database, you can use the same connection manager that you used for the OLE DB Source component. For more information about the SQL Server Destination component, see “SQL Server Destination” in SQL Server 2005 BOL.

A Rich Environment

To test your solution, run the package, then query the NewProducts table to verify that the data was added. Table 1 shows part of the result set.

As this solution demonstrates, the Script component provides a rich environment for extracting, transforming, and loading data. Although my example solution focuses on manipulating text, you can use the Script component for a variety of tasks, such as aggregating data or separating data into multiple outputs. You can use the Script component to extend any SSIS package’s data flow to fit your solution’s requirements.

Comments

Plain text