Using an SSIS Script Component to Import Data to a Wide Table

Wide tables are a feature introduced in SQL Server 2008. Their primary advantage is implied by their name: A wide table can have 30,000 columns, as opposed to the 1,024 columns you can have in a non-wide table.

A wide table is defined as a table that has sparse columns and a column set defined. Sparse columns are columns that are optimized for the case when many rows will have NULL for a value for that column. Column sets are special untyped XML columns that represent all the populated sparse columns in a row as XML. Updating the sparse column causes the column set XML to be automatically updated, and updating the column set (through XML methods or some other means) causes the sparse columns it represents to be updated accordingly.

Having 30,000 or even 1,024 columns in a table is a lot of columns, so thoughtful review should be done of any design that requires that many in one table. However, there are at least two circumstances in which that many columns can be called for:

Sometimes information from other systems can come in a wide format, over which you have no control. I've personally worked with comma-separated value (CSV) files that had more than 7,800 columns.
Data warehouses often use denormalized versions of tables to improve performance, where querying a normalized table of many rows would take unacceptably long. This is, incidentally, a situation in which sparse columns could save considerable space.

There's at least one drawback to using a column set: SQL Server Integration Services (SSIS) won't let you target sparse columns directly when your destination table is a wide table. You can target the column set and populate it with XML representing the populated sparse columns, but you can't directly target the sparse columns that compose it.

You also can't add a column set to a table that already has sparse columns but not a column set. An error will result. Thus, it isn't possible to populate your sparse columns through SSIS, then add a column set. In any case, unless you have a column set, your table isn't a wide table and therefore can only have 1,024 columns.

Let's suppose you need to target a wide table in SSIS. Since you can't target the sparse columns, your only option is to populate the column set XML with an XML fragment representing the populated columns. How can you do that? An SSIS Script component will let you build an XML fragment, at the cost of some light programming. I'll walk you through a simple example, which involves the following steps:

Create a wide table.
Add a Script component.
Add a column set.
Map the column set.
Create an XML mapping document.
Make the XML mapping document an assembly resource.
Modify the source code.

Step 1: Create a Wide Table

You first need to create the wide table. Listing 1 shows the CREATE TABLE statement you can use to create the wide table I'll be discussing here.

CREATE TABLE [dbo].[WideTable](
  [ID] [int] NOT NULL,
  [Comment1] [nvarchar](255) SPARSE  NULL,
  [Comment2] [nvarchar](255) SPARSE  NULL,
  [Comment3] [nvarchar](255) SPARSE  NULL,
  [Comment4] [nvarchar](255) SPARSE  NULL,
  [ColumnSet] [xml] COLUMN_SET FOR ALL_SPARSE_COLUMNS  NULL
) ON [PRIMARY]

This very small wide table has only six columns, including the ID column and the column set. Figure 1 shows the source CSV file that contains the data for the wide table.

ID,Comment_Source_1, Comment_Source_2, Commnet_Source_3, Comment_Source_4
1,"We the People of the United States","in Order to form a more perfect Union","establish Justice","ensure domestic Tranquility"
2,"provide for the common defense","promote the general Welfare, and","secure the Blessings of Liberty",
3,,"to ourselves and our Posterity","do ordain and establish this Constitution",
4,,,,"for the United States of America."

Step 2: Add a Script Component

The next step is to create an SSIS package and add a Script component to it. I'll assume that you already know how to open SQL Server Business Intelligence Development Studio (BIDS), create a new SSIS package, create a Data Flow task, and set up source and destination connections on the Data Flow tab.

After you create your SSIS package, set up a Flat File connection manager to connect to the CSV file. Afterward, open up the Flat File connection manager and select the Advanced option. Check that your input columns' data types are appropriate for each column and that the OutputColumnWidth values are large enough to accommodate the data. You'll probably find that the columns are all being imported as text. In that case, you might want to edit them manually or use the Suggest Types option to have the package analyze the data and set data types and column widths. If you use Suggest Types, be sure afterward to set the data type for the ID column to four-byte signed integer [DT_I4]. Otherwise, you might get error messages about incompatible data types later on.

Next, you need to set up an OLE DB connection manager and data flow destination for the destination table. Once that's done, add a SQL Script component to the Data Flow tab by dragging the Script Component icon from the toolbox to the Data Flow tab design area. When you're asked how the script will be used in the data flow, select Transformation. Connect the green arrow from the Flat File Source icon to the Script Component icon. Connect another green arrow from the Script Component to the OLE DB Destination icon. Your screen should look like that in Figure 2.

Step 3: Add a Column Set

In this step, you need to tell the Script component what input columns it's going to use, including the column set. Begin by right-clicking the Script Component icon, then click Edit. In the Script Transformation Editor, click Input Columns in the left pane. Your screen should display a single box showing the input columns, with check boxes by each one. Select the Name check box, which will select all the columns because it's the master check box. Your screen should now look like the one in Figure 3.

Figure 3: Selecting the input columns for the Script component

You need to create content for a column named ColumnSet, but as you can see in Figure 3, the Script component doesn't know it exists. The only columns it knows about are those that come from the source CSV file. So, you need to modify the Script component's properties to tell it that it's also going to be working with the ColumnSet column.

In the Script Transformation Editor, click Inputs and Outputs in the left pane. Expand the Output 0 node, and select the Output Columns node. This will enable the Add Column button.

Click Add Column, and name the new column ColumnSet. In the Data Type Properties section, change the DataType entry to Unicode text stream (DT_NTEXT), as Figure 4 shows.

Figure 4: Setting the ColumnSet column's DataType entry

This is important because SSIS doesn't have a property type for XML columns. Unicode text stream is what you can use instead. Click OK to exit out of the Script Transformation Editor.

Step 4: Map the Column Set

Now that you've told the Script component that it's going to be working with the ColumnSet column, you need to map it to the destination table. Right-click the OLE DB Destination icon and select Show Advanced Editor.

In the Advanced Editor for OLE DB Destination dialog box, click the Column Mappings tab. The component might already have mapped your ColumnSet column to the target column, but if it hasn't, you can map it manually by dragging your ColumnSet column in the source data over to the ColumnSet column in the destination data.

Figure 5 shows what this mapping looks like.

If you're wondering why the destination table looks so empty, it's because SSIS doesn't allow you to map sparse columns directly. So, only the ID and ColumnSet columns appear in the destination table.

Step 5: Create an XML Mapping Document

At this point, the Script component is hooked up, but it doesn't do anything yet. Therefore, you need to edit it. Right-click the Script component, select Edit, and click the Edit Script button. A new instance of the Visual Studio editor will open, with a lot of code already loaded.

A fresh Script component project can be a little alarming. Where did all this code come from? What should you do to it? Fortunately, most of the code is boilerplate code, auto-generated by SSIS. For the most part, you shouldn't change this code because it gets freshly generated when the source or destination components change their definitions. The only code you have to change is in the main.cs file in the ScriptMain class, which Visual Studio conveniently loads and places right in front of you.

ScriptMain has three methods that you can modify according to your needs: PreExecute and PostExecute (which run before and after the transformation, respectively), and the less clearly named Input0_ProcessInputRow. The latter method does the real work. Input0 is the input from the flat file. The parameter Row, an instance of Input0Buffer, contains read-only properties representing the values for each input field in the input buffer, plus an editable field representing the ColumnSet column you added. What you need to do is create some XML code for ColumnSet. The package will then execute Input0_ProcessInputRow for every input record it reads.

For a small table, building the XML code for ColumnSet is straightforward. Because you have four sparse columns going into ColumnSet, you can write four if statements that check the source column for Null and write an element string with the correct column name and source data. Listing 2 shows this code. But what if you had a really wide table, with a hundred or more sparse columns? That would be a lot of if statements to write.

public override
  void Input0_ProcessInputRow(Input0Buffer Row)
  {
    System.IO.MemoryStream ms =
  new System.IO.MemoryStream();

// Note that the encoding has to be Unicode.
  System.Xml.XmlTextWriter xw =
  new System.Xml.XmlTextWriter(ms,
  System.Text.Encoding.Unicode);

if (false == Row.CommentSource1_IsNull)
  xw.WriteElementString("Comment1", Row.CommentSource1);
    if (false == Row.CommentSource2_IsNull)
  xw.WriteElementString("Comment2", Row.CommentSource2);
    if (false == Row.CommentSource3_IsNull)
  xw.WriteElementString("Comment3", Row.CommentSource3);
    if (false == Row.CommentSource4_IsNull)
  xw.WriteElementString("Comment4", Row.CommentSource4);

xw.Close();
    Row.ColumnSet.AddBlobData(ms.ToArray());
  }

Fortunately, there's a fairly simple way to set up your script component to automatically convert all your sparse columns to an XML fragment, without needing to reference specific columns individually. The mapping of input columns is done with the help of an XML mapping document that tells the component which columns to use and how they map to the sparse columns.

You can easily create the XML mapping document in SQL Server Management Studio (SSMS). You just need to open up a new query window, set the current database to the database containing the target table, and execute the code in Listing 3, changing the table names as necessary.

SELECT "@sourcePropertyName" = ''
    ,   "@targetColumnName" = name
FROM sys.all_columns
WHERE object_id = object_id(N'dbo.WideTable')
  AND is_sparse = 1
FOR XML path('Mapping'), root('Mappings')

As Figure 6 shows, the query produces a hyperlink.

Figure 6: Clicking the hyperlink created by the query

Clicking the hyperlink opens a new window that contains an XML mapping document like the one in Figure 7.

The targetColumnName attributes specify the names of the sparse columns in the wide table. The sourcePropertyName attributes specify the property names used in Input0Buffer. The sourcePropertyName attributes' values are blank because you can't assume that they have the same name as the target columns. Source column names are often different from target column names. For example, in this case, the first source column is named Comment_Source_1, but the target column is named Comment1. In addition, the SSIS process might transform the column name to get the property name it uses for Input0Buffer in BufferWrapper.cs. For example, the first source column is named Comment_Source_1, but in Input0Buffer, the corresponding property is CommentSource1 (no underscores). So, it's necessary for you to fill in the sourcePropertyName attribute values in the XML mapping document, which you'll do in the next step.

Step 6: Make the XML Mapping Document an Assembly Resource

After you create the XML mapping document, save it in a convenient place and return to the Integration Services Script Component window in Visual Studio. Right-click the Project node in Project Explorer, select the option to add an existing item, and browse to your saved XML mapping document. Select it and click Open.

Once the XML mapping document has been added, click its filename in Project Explorer to select it, and look at its properties. Typically, the properties are displayed in a box under Project Explorer, but they might be displayed in a separate tab. Find the Build Action property and change it to Embedded Resource, as shown in Figure 8.

Figure 8: Changing the Build Action property

Next, open the XML mapping document. Enter the source column names in the Mapping elements, as shown in Figure 9. Use the column names as they appear in Input0Buffer.

Figure 9: Entering the source column names

Step 7: Modify the Source Code

Now it's time to change the source code. Three modifications need to be made.

Modification 1. In main.cs, replace the first two using statements at the top of the class with this list:

using System.Collections.Generic;
using System.IO;
using System.Reflection;
using System.Text;
using System.Xml;

Be careful not to remove these two lines:

using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using Microsoft.SqlServer.Dts.Runtime.Wrapper;

Next, add the following class-level variables to the class:

Dictionary propsByName = null;
Dictionary mappingsByName = null;

They should go after the opening bracket of the ScriptMain class, right before the definition of the PreExecute method.

Modification 2. Replace the existing PreExecute definition with the code in Listing 4. This code makes two preparations. First, it retrieves and caches property definitions for Input0Buffer (BufferWrapper.cs). Second, it retrieves the XML mapping document you stored as an assembly resource. The full name of the resource is needed to retrieve this document. Fortunately, the Assembly class provides an array that contains the names of all the assembly's resources.

public override void PreExecute()
{
  base.PreExecute();

  // Get the PropertyInfo objects for
  // all the properties of Input0Buffer.
  var props = typeof(Input0Buffer).GetProperties();

  // Cache the PropertyInfo objects by name
  // to save subsequent GetProperty calls.
  propsByName = new Dictionary();
  foreach (PropertyInfo pi in props)
  {
    propsByName.Add(pi.Name, pi);
  }

  // Load the mappings from the assembly resource
  // into an XDocument.
  string resourceName = Assembly.GetExecutingAssembly()
    .GetManifestResourceNames()[0];
  Stream stream = Assembly.GetExecutingAssembly()
    .GetManifestResourceStream(resourceName);
  XmlReader reader = new XmlTextReader(stream);

  XmlDocument doc = new XmlDocument();
  doc.Load(reader);

  // Load the mappings from the XML mapping document
  // into a dictionary.
  mappingsByName = new Dictionary();
  foreach(XmlNode node in
    doc.GetElementsByTagName("Mapping"))
  {
    mappingsByName.Add(
node.Attributes["sourcePropertyName"].Value,
node.Attributes["targetColumnName"].Value);
  }
  reader.Close();
}

Usually you have only one resource in a project, but it's possible to accidentally create another one when manipulating a Script component. In that case, you don't know which resource will be listed first. Thus, you need to check for additional resources. Look under the Properties node in Project Explorer for a file with the name resources.resx. If it exists, delete it.

Because there's only one resource, the resource name array will contain only one name, which the code uses to retrieve the resource. The code then creates a dictionary object from the XML mapping document, which it uses to look up the target column's name by means of the buffer's property name.

Modification 3. Replace the existing Input0_ProcessInputRow definition with the code in Listing 5. This code uses a dictionary table made from the XML mappings to loop through the names of all the buffer columns you want to include in the ColumnSet row values. Because a sparse column can be null, it will have an IsNull property associated with it. This property's name is always the name of the sparse column with _IsNull appended to the end. For each sparse column, the code builds the name of the associated IsNull property, checks that property's value, and skips further processing of that column if the value is null.

public override
  void Input0_ProcessInputRow(Input0Buffer Row)
{
  // Create an XmlTextWriter that writes to a memory stream.
  // Note that the encoding must be Unicode.
  MemoryStream ms = new MemoryStream();
  System.Xml.XmlTextWriter tw =
    new System.Xml.XmlTextWriter(ms, Encoding.Unicode);

  foreach (string bufferColumnName in mappingsByName.Keys)
  {
    // Create a property name representing the
    // _IsNull property for the buffer column.
    var nullPropName = bufferColumnName + "_IsNull";

// Check each Sparse column's IsNull property.
    var nullProp = propsByName[nullPropName];
    bool propIsNull = (bool)nullProp.GetValue(Row, null);

// If the property is not populated, skip it.
    if (true == propIsNull)
  continue;

// Get the PropertyInfo for the data property.
    string targetColumnName =
  mappingsByName[bufferColumnName];
    PropertyInfo dataProp = propsByName[bufferColumnName];

// Get the value for that property.
    object value = dataProp.GetValue(Row, null);

// If there's a null value, skip the column.
    if (null == value)
  continue;

// Write the value as an element to the textwriter.
    tw.WriteStartElement(targetColumnName);
    tw.WriteValue(value);
    tw.WriteEndElement();
  }

  tw.Close();

  // Add the textwriter data to the ColumnSet column.
  Row.ColumnSet.AddBlobData(ms.ToArray());
}

If the value isn't null, the code retrieves the PropertyInfo object for the data property (i.e., the sparse column) from the dictionary of PropertyInfo objects and fetches its value. The code then checks it for a null value (it shouldn't have one, but does so just in case). If it's not null, the code writes the value as an element to the XmlTextWriter object.

You might have noticed that the code in Listing 5 uses a different method to write to XmlTextWriter than was used in Listing 2. The code in Listing 5 uses the WriteValue method because it has many overloads and can handle input of almost any primitive type. The method will format the input to the correct XML representation for that type and write it to the memory stream. In this case, the code passes the value variable, which is of type object. WriteValue then determines the correct format for the value variable when writing it to the memory stream.

After you've entered the new Input0_ProcessInputRow code, close the Visual Studio Tools for Applications (VSTA) instance and click the OK button of the Script Transformation Editor. You're now ready to run your transformation. It should run quickly (there's very little input data) and without error.

An Easy Yet Effective Workaround

As you've seen, it's easy to create a Script component that builds a column set value. Creating a generic version that will work for any number of sparse columns of different types is also fairly straightforward. Sparse columns have the potential to save considerable space in large, diffuse tables -- and wide tables make a good staging ground for imported tables. They can also provide a permanent storage space for denormalized tables. With the Script component you created, you can now import data to wide tables almost as easily as to traditional non-wide tables.

Comments

Plain text