Using XML Bulk Load with Identity Columns

Downloads
40239.zip

Suppose your company has just established a purchasing relationship with a new computer-books supplier. The supplier periodically sends you a large XML document (between one and several hundred megabytes) that contains a catalog of publications. As a database professional, you have to load the XML catalog into your company's online catalog application, which uses a SQL Server database, so that employees can search the data.

Scenarios like this one are becoming more commonplace. Microsoft anticipated the need for loading large XML documents into SQL Server and provided an excellent utility, XML Bulk Load, that accomplishes that job. XML Bulk Load is a standalone COM object that reads data from an XML document by using a stream rather than reading the entire XML document into memory, which is impractical, if not impossible, for large documents. Using a stream conserves memory and increases the processing speed for large documents. But streams can present a challenge when you're loading data into a normalized database that uses identity columns for primary keys.

Identity columns contain values that SQL Server automatically generates when a row is inserted into a table. The values in an identity column uniquely identify the row, so they're good candidates for foreign keys that reference the row's data. But say you're using XML Bulk Load to store data in two related tables; an identity column is the primary key in the first table, and a foreign key in the second table references that column. In that case, you need to write a program to match rows from the two tables and correctly set the foreign key values.

Besides adding more work to your task, writing a program also proves problematic for cases when a secondary key doesn't exist in the first table because you have no way other than the identity column to uniquely identify rows. But now you can avoid programming altogether. With the release of SQLXML 3.0 Service Pack 2 (SP2) Beta 1, which you can download at http://www.microsoft.com/downloads/details.aspx?FamilyId=4C8033A9-CF10-4E22-8004-477098A407AC&displaylang=en, there's a solution to this problem in XML Bulk Load. Let's look at how it works.

XML Bulk Load and Identity Columns

Before jumping into an example that illustrates how to use XML Bulk Load to properly store the value of a foreign key, let me explain how XML Bulk Load operates. The XML Bulk Load COM object provides properties that you use to set parameters and options. The available parameters include the connection string, the ADO Command object, and the location of an error log file. Options include the ability to lock tables and execute within the context of a transaction. XML Bulk Load uses an XML View to load data into SQL Server. A mapping schema defines the XML View and specifies how data from the XML document is stored in the tables in your database. The mapping schema also specifies how the relationships between elements in the XML document—most commonly expressed by nesting one element inside another—are modeled in your normalized relational database. Correctly constructing the mapping schema is the crucial part of successfully inserting foreign keys that use identity-column values.

To understand how you construct an appropriate mapping schema, first consider the XML document that Figure 1, page 33, shows. This simple XML document represents the supplier's catalog I described earlier. Each publication listed in the document includes the publication's author. The goal is to bulk-load the data from Figure 1's document into the tables that the T-SQL code in Listing 1 defines. Note that each table contains an identity column as the primary key (PubId and AuthId), and Titles contains a foreign key column that references the author of the publication in Authors.

Listing 2 shows a mapping schema that maps publications and authors from the XML document into the Titles and Authors tables, respectively. The mapping schema is an XML Schema Definition (XSD) schema augmented with annotations. In Listing 2, the annotations appear in red. I won't explain the purpose of each annotation here, but let's examine the sql:relationship annotation at callout A in Listing 2 because that annotation specifies the foreign key relationship between the tables.

The sql:relationship annotation in the code at callout A uses the parent, parent-key, child, and child-key attributes to specify the primary key (AuthId) and foreign key (AuthRef) columns in the Authors and Titles tables. The innovation in SQLXML 3.0 SP2 is that XML Bulk Load detects the identity columns, buffers the value from the identity column (AuthId), then stores the value in the foreign key column (AuthRef). But for this particular example, you first need to do one more thing: reverse the order in which XML Bulk Load creates rows in the Titles and Authors tables.

XML Bulk Load first creates the row in the table specified by the parent attribute, then creates the row in the table specified by the child attribute. In this example, XML Bulk Load would create the row in the Titles table before creating the row in the Authors table. But creating the rows in this order is problematic because the row in the Titles table must store the value from the Authors table's identity column, which is generated when the row is inserted into the Authors table. To solve this problem, you need to add the inverse attribute with the value true to the sql:relationship annotation, as callout A shows. The inverse attribute tells XML Bulk Load to reverse the order it creates the rows in. After you add the inverse attribute, the mapping schema is complete.

Pulling the Pieces Together

The remaining task is to create a program that uses the XML Bulk Load COM to reference the mapping schema and load the data into the database. The VBScript code that Listing 3 shows is an example of one such program. The script creates the XML Bulk Load COM object, sets the connection string, specifies the file location for an error log, and sets the KeepIdentity property to false. KeepIdentity influences how XML Bulk Load stores identity values. If KeepIdentity is true (the default), XML Bulk Load stores the values specified in the XML document. If KeepIdentity is false, SQL Server generates the values when new rows are inserted. Although Figure 1's document doesn't include identity values, you need to set the value to false so that SQL Server will generate the value for the AuthId column in the Authors table. The final step in the script is to call the Execute method, which takes the paths to the mapping schema and data file as parameters. XML Bulk Load takes it from here and loads the data into your database.

To use the code for this example, first create a SQL Server database called Nov2003 and run the T-SQL code from Listing 1 to create the Titles and Authors tables. Then, save the data file from Figure 1 to a file called publications.xml, save the mapping schema from Listing 2 to a file called pubs.xsd, and save the VBScript code from Listing 3 to a file called XBL.vbs on your computer. You need to have Visual Basic Scripting Edition installed on your machine to execute the script. Before running the script, you need to change the connection string to match your local environment. Then, bring up a command prompt and run the VBScript code to load the data into the tables. You can verify the successful operation of XML Bulk Load by using Query Analyzer to query the rows from the Titles and Authors tables. Note that the values in the Titles table's AuthRef column contain the correct references to rows in the Authors table.

Loading very large XML documents into a normalized database that contains identity columns used to require you to write a potentially complicated program. As this example demonstrates, Microsoft has enhanced the value of XML Bulk Load by adding the ability to automatically store values from identity columns in foreign keys.

Comments

Plain text