Rev Up Performance with Custom Connectors

Downloads
45801.zip

One of the most dramatic changes in SQL Server 2005 is the redesign of Data Transformation Services (DTS); renamed SQL Server 2005 Integration Services. SSIS is a strong extraction, transformation, and loading (ETL) product that offers insane performance, a large catalog of dynamic components, a sound deployment model, and flexibility and extensibility. Extensibility has been a strong suit for Microsoft products, and the company has concentrated on building robust platforms that meet common customer needs but that customers can extend to meet uncommon needs as well. As you consider the role SQL Server 2005—and particularly SSIS—will fulfill in your environment, you need to understand what platform extensibility will let you do.

Let's walk through how to create, install, and test a sample custom source component for SSIS that can read and parse log files in an Internet Information Server (IIS) Web log. This custom component lets an SSIS package consume an IIS log file as a data source, then transform and direct the file to a destination component. You'll see how easily you can create a simple custom source component.

Note that we developed and tested this example on SQL Server 2005 Beta 2 October Community Technology Preview (IDW 9) release. At this point in the product's development, the major pieces should be pretty well baked, but some details, such as object names that include the old DTS name, might still change before the product is released. Although we expect that this example will still compile and run on future betas and the released product, Microsoft made some significant changes in SSIS between Beta 1 and Beta 2, and those kinds of large changes could still show up. For background information about the kinds of changes you can expect when you move from DTS to SSIS, see Kirk Haselden's articles in Related Reading.

What Is a Custom Source Component?

A source component is an SSIS adapter that feeds information into a data flow. SSIS lets you develop custom components that you can connect to unique sources or destinations or use to accomplish specific transformation tasks. You can develop a custom source component to connect to data sources that you can't access by using one of the existing source adapters or to consolidate parsing or script logic into the data extraction.

Microsoft has recently made a point of enhancing products by letting users and third parties extend the products with their own innovations and customizations. Although you could create custom components in DTS, it wasn't easy. Instead, most packages relied on extensive script tasks for complex operations. With SSIS, custom components are easier to create, drastically reducing the need for script and improving the performance and robustness of SSIS solutions.

Creating the Component

The example source component that we'll create parses a default IIS Web log into columns for an SSIS output buffer, allowing a package to input the log into a database, an Excel spreadsheet, or any destination. The default settings for an IIS Web log put the various field names—request time, requesting IP address, method, Uniform Resource Indicator (URI), and Status Code—on the fourth line of the file. If you've ever attempted to parse one of these Web logs by using a standard text adapter, you know that the format can be cumbersome; it contains space-delimited information and unpredictable information in the URI string. Although you can use other solutions to parse Web logs, this file type works well for our example.

You create a custom component by developing a .NET assembly that inherits the Microsoft.SqlServer.DTS.Pipeline.PipelineComponent base class. This base class defines the methods that SSIS will call in your component to drive the Data Flow task. Your custom component can safely override (or ignore) any methods it needs to accomplish its task. If you don't provide an override for a method, the default implementations in the PipelineComponent base class will handle the calls.

The PipelineComponent interface creates source, transformation, and destination components. Although they have different roles in the Data Flow, the components look similar; only the features they choose to implement differ. A source component has an output, a destination has an input, and a transform has both, with data-modifying logic between the input and output. SSIS asks only that you let it know which section of the toolbox the component belongs in.

Setup. Let's begin by creating a new Class Library project in Visual Studio 2005. The name you choose for the project will be the default name of the assembly you create, so choose a name that describes the component you're building. An unwritten convention is to use Src or Dest at the end of your component name to indicate its function. The project we create in this example is named IisLogFileSrc.

The next step is to add references to your project to tell Visual Studio where to find the SQL Server objects you'll be working with. Select the Add Reference item from the Visual Studio Project menu. On the .NET tab of the resulting dialog box, select four components: Microsoft.SqlServer.DTSPipelineWrap, Microsoft.SQLServer.DTSRuntimeWrap, Microsoft.SQLServer.ManagedDTS, and Microsoft.SqlServer.PipelineHost.

When you start your project, Visual Studio automatically creates an initial class for you (probably named Class1—the name of the class isn't important). First, we need to tell the compiler which references we'll be using, which we do by employing the easy-to-remember using statement that the C# code in Listing 1 shows. We need to create using statements for the Pipeline and Runtime components we referenced above, and because we'll be reading from a file, we need to list System.IO as well.

We also need to slightly modify the AssemblyInfo file. As the name suggests, this file gives the compiler additional information about your assembly through attributes. By default, Visual Studio creates a new version of your component every time you build it. But because the registration of your component in SSIS is version-specific, we want to prevent this behavior, so we replace the default value of the AssemblyVersion attribute (1.0.×) with a specific version, such as 1.0.0.0.

Eventually, we want to add this custom source component to the Global Assembly Cache (GAC), so we need to give it a strong name. The easiest way to do this is by using the .NET Framework 2.0 SDK Strong Name Utility (sn.exe). The –k option lets you create a keyfile that becomes part of your project and should be copied into your project folder. Use the AssemblyKeyFile attribute to specify the keyfile for the compiler.

Back in Class1, we next need to create a class attribute that tells the Business Intelligence Development Studio about our component. In Listing 1, the attribute contains minimal information: the name of the component and what type of component it is. You can supply additional information, including an icon that will appear in the Business Intelligence Development Studio toolbox.

Finally, Listing 1's code shows that we want our class to inherit from the PipelineComponent class. This relieves us of having to implement every method the interface requires because the base class will receive any calls we don't handle. The base class also lets Visual Studio help us out with IntelliSense, which automatically creates the signatures for methods that we choose to implement.

Discovery. Next, we begin adding to our class methods that will intercept calls from the SSIS Data Flow. First, we use the ProvideComponentProperties method that Listing 2 shows to describe the component to the SSIS package designer. Listing 2 shows a call from the designer to ask the component what inputs, outputs, connections, and other properties it needs to do its job. In the designer, SSIS adds to the component's property pages fields that let a user provide that information. Because our source component is simple, we need only a connection and an output.

Listing 2's code tells SSIS to forget what it already knows about the component. This safety precaution removes any previously added outputs or connections and prevents duplicate information if the component was previously initialized. The code then adds an output object to the output collection to tell SSIS that we intend to produce output. Because our example has only one output object, we can get away with naming it Output. To tell SSIS what our output will look like, the code adds columns to the output object to represent the default columns of the IIS log file we'll provide. For simplicity, this example uses strings for all fields.

Finally, the code tells SSIS that the component needs a connection. We want a file connection to the log we'll be reading but strangely, SSIS doesn't have a way to specify what type of connection the component requires. So we'll just name it File Connection. For a more robust implementation, we'd need to check the connection that the component receives later in the process to make sure that it's the correct type.

Start your engines. The second step in creating our component is to process the SSIS requests to acquire our connections before execution and to release the connections during cleanup after execution. As Listing 3 shows, we acquire the connection by accessing the connection manager and receiving the file connection we requested. The code includes a couple of lines that let it find the object we need, so we get a simple filename that a user has specified in the designer. We use the AcquireConnections method to open the file, and the ReleaseConnections method to close the file, storing the file handle in a private variable so that it's available for the execution step.

Execute. The third and final step is to provide the output we want by implementing the PrimeOutput method that Listing 4 shows. SSIS has created a buffer for each of the outputs we said we'd provide, and it passes those buffers to the PrimeOutput method as an array. Because this example produces only one output, our array has only one buffer element. The PrimeOutput method fills that buffer with data and closes it.

Because we opened the file handle in the previous step, the component has only to read each line of the file and parse the needed information. An IIS log file has a header section at the top, so we'll skip any lines that start with the # character that identifies that header. The data fields of the log-entry lines are separated by spaces, which makes for simple parsing. If the data doesn't contain any spaces, the parsing won't work, but the test files we used for this example didn't have any problems.

Once the component has parsed the data and is ready to add it to the buffer, the component calls the AddRow method and sets each of the values. Listing 4 includes a precautionary step to truncate the field data at the size we defined for the output column. This step prevents an overflow that would result in an error when the package runs. Because IIS administrators can customize IIS log files to include extra information (and URIs can get pretty long), this is a good safety feature even if you're reasonably confident about your file formats.

Finally, when Listing 4's code reaches the end of the log file and has written all the data to the buffer, the code calls the SetEndOfRowset method on the buffer to tell SSIS that it's finished. The code can close the component and clean up the buffer.

Other considerations. Although this is a simple example, making the implementation production-ready requires a little bit more work. Microsoft says that you can't assume at the PrimeOutput step that your columns are in the same order that you put them in the ProvideComponentProperties step. The buffer manager reserves the right to add placeholders for extra columns, which might get merged later into the data flow; your component wouldn't necessarily know about these extra placeholders. The buffer manager might also rearrange your component's columns to better fit on your memory pages. Microsoft suggests that instead of relying on the column order being what you defined, you should implement the PreExecute method and use the buffer manager's FindColumnByLineageID method to locate your columns and build an array that you can use in PrimeOutput to look up column indexes by name. We condone this practice, but in the interest of space and simplicity, we haven't shown it in the example.

Installing the Component

Installing your component so that the Business Intelligence Development Studio can use it is easy. First, you must copy the assembly you created to the folder where SSIS pipeline components reside. Then, you need to add the assembly to the Global Assembly Cache (GAC). Finally, you can add the component to the Development Studio toolbox, where a user can select it. Every time you change the component you have to copy the file, uninstall the previous version from the GAC, and install the new version of the component to the GAC, so we suggest you create a batch file to do this for you. Listing 5 shows a sample batch file that we use on our development system.

To add the component to the toolbox, open the Tools menu and select Choose Toolbox Items to open the dialog box that Figure 1 shows. Your component should appear on the Data Flow Items tab of the dialog, as Figure 2 shows. Select your component and click OK, and it appears in the toolbox in the Data Flow Sources section.

To upgrade your component, simply close any solution using the component, run the installation batch file again, and reopen the solution. The solution should automatically receive the new version of the component. The only time this process didn't work for us was when we changed the component's class name. This change caused the Business Intelligence Development Studio to recognize the component as a different component (albeit with the same name). We had to delete the old component from the toolbox and add the new one. We also had to delete the component from any projects and add it again.

Building a Test Package

To test the sample component, create a new Data Transformation Project in the Business Intelligence Development Studio. You'll start with a blank Control Flow surface to which you need to add a Data Flow Task. Double-clicking this task will bring you to a blank Data Flow surface. Drag your new source component from the toolbox and drop it on the surface.

Next, you need a File Connection to tell your component where to find the log file you want to import. Right-click the Connections section at the bottom of the Data Flow surface, and select New File Connection. Be sure not to select New Flat File Connection, which is a different connection intended for the Flat File Source component. For the example file connection, the usage type should be set to Existing File; then you can add the IIS log file you want to import. IIS log files typically reside in the C:\WINDOWS\system32\Logfiles path on a system that has IIS installed. Obviously, your Web server would need to have some content and traffic to generate logs. Alternatively, you could easily mock up some sample files based on the file format descriptions above.

Now you need to add your new connection to your component. Right-click the example component and select Edit from the context menu. The editor will appear, and the first tab you see should be called Connection Managers. The connection your component requested in ProvideComponentProperties should be there with a drop-down box that lets you select your newly created File Connection.

Finally you need a destination. When testing source components, we love using the DataReader Destination. This no-fuss destination lets you quickly and easily test your component. Of course, you could also send the data to SQL Server or Excel or anywhere else by selecting a different destination, but other destinations typically need additional configuration. To use the DataReader Destination, simply drag the output from your source and drop it on the destination. Then, edit the destination and select the columns from your output so SSIS doesn't complain about unused columns. We always put a Data Viewer on the connection, as Figure 3 shows, so that we can see that the source is working; simply right-click the connection, select Data Viewers, and add one.

Your package should now be ready to run. Click the green arrow to compile and execute in the debugger, and your Data Viewer should pop up, displaying data from your IIS log file in its grid. For more information about developing custom source components, see the "Creating a Source Component" topic in the SQL Server 2005 Books Online.

Build Your Own

Perhaps you have a unique data extraction or transformation task that would benefit from a custom data connector. As you dig into SQL Server 2005 and try your hand at building new SSIS packages to perform simple or complex data movement, look for a scenario that leaves you scratching your head trying to find the right SSIS source adapter or an area in which you'd like to get a performance gain by consolidating transformation logic into your extraction. Such a situation could be the perfect chance to build a custom source component.

Related Reading

KIRK HASELDEN
"What's New in DTS?" May 2004, InstantDoc ID 42141
"Making Package Magic," October 2004, InstantDoc ID 43805
"Editing a Package With SQL Server 2005 Integration Services Designer," March 2005, InstantDoc ID 45092

Comments

Plain text