Editor's Note: Send your XML questions to Rich Rollman at [email protected]
Suppose you have an XML document that contains data you need to load into SQL Server. Do you use XML Bulk Load to import the data, or do you write a stored procedure that uses OpenXML to load the data?
This month, I outline some factors you need to consider when you're making these kinds of decisions and show you how to efficiently use OpenXML to load XML from a file into SQL Server. In addition, a question about how to constrain an attribute's value opens the subject of XML Schemas, which is soon to be a very hot topic. The World Wide Web Consortium (W3C) expects to approve XML Schemas as a W3C recommendation by the time you read this column.
I have an XML document in a file, and I need to extract information from that document and store the information in several related tables in my SQL Server database. How do I import the data from my XML document?
The XML features in SQL Server 2000 provide two possible answers to this question: You can use XML Bulk Load, which is part of XML for SQL Server 2000 Web Release 1, or you can use OpenXML. Your choice depends on several factors. If the XML document has an unstructured format (i.e., it contains interleaved markup, an irregular format, or a mix of formats in an aggregated document), you might prefer to use OpenXML.
In conjunction with the capabilities of the T-SQL language in a stored procedure, OpenXML provides greater flexibility than XML Bulk Load, which requires that you build a mapping schema for unstructured XML documents—a real challenge. However, if the XML document is large enough (i.e., greater than 50KB), XML Bulk Load is probably a better solution because it provides optimization for loading large documents.
Another factor that you should consider is how frequently anyone will use the code you develop. You should weigh the cost of development, the ease of implementation, and the developer's familiarity with specific coding methodologies (e.g., T-SQL versus XML-Data Schemas) against runtime performance. XML Bulk Load requires more XML knowledge than OpenXML requires to construct the mapping schema. OpenXML leverages T-SQL programming, treating the XML document link as a rowset against which you can run SQL queries.
Unfortunately, no magic algorithm determines the correct choice for you. You should carefully consider each factor in relation to the project and the skill set of the person implementing the solution.
In my June 2001 column, I showed you how to use XML Bulk Load to load data into SQL Server. Now, let's use OpenXML and ADO to pass an XML document to a stored procedure that loads the data into SQL Server.
A T-SQL stored procedure must use OpenXML instead of XML Bulk Load because T-SQL can't load a file into a string or stored procedure. Generally, a stored procedure that uses XML receives an XML document as a string. The stored procedure then parses the XML document and runs one or more OpenXML queries on the document to perform the required task. The trick to calling the stored procedure lies in how you pass the XML document as a parameter.
In my February 2001 column, I showed you how to use OpenXML. But I assumed that you already had the XML document in a string variable, so I didn't delve into the details of how to call the stored procedure.
The most straightforward approach you can take to passing an XML document as a parameter is to allocate a huge string to hold the file's contents—obviously inefficient. In addition, writing code that inadvertently copies parts of a file or the entire file multiple times while building a string and passing it to SQL Server is easy to do.
An easier and more efficient approach to passing an XML document as a parameter is to use the ADO 2.6 Stream object's CommandStream property to pass a command to SQL Server through the SQL Server OLE DB driver. The Stream object contains the XML document, which OpenXML passes to the stored procedure and processes. Listing 1 contains a sample XML document to pass to the stored procedure. To keep the example simple, the XML document is very structured and straightforward. Having a more complex XML document would complicate the OpenXML code required to extract the data from the XML document.
Listing 2 contains the T-SQL code for the runOpenXML stored procedure. Note that the stored procedure declares an ntext variable, @xmlDoc, which contains the XML document's contents. The code then passes the XML document as a parameter to the sp_xml_preparedocument stored procedure. This stored procedure then parses the XML document and prepares it for OpenXML's use. The OpenXML statement stores the name and state from the XML document's author data into the Authors table. Listing 3 shows the appropriate schema for the Authors table.
Microsoft designed ADO's CommandStream property specifically to work with SQL Server 2000's XML support. CommandStream accepts an XML template, which is an XML document with embedded queries. In this example, the XML template contains the stored-procedure call that Listing 2 shows.
Using a template in this way is similar to how the SQL XML Internet Server API (ISAPI) DLL executes an XML template requested through Microsoft IIS. In fact, both of these methods use the same services from the SQL Server OLE DB provider. Listing 4 shows the Visual Basic (VB) code that dynamically constructs the XML template. This template executes the stored procedure.
The VB module declares three ADO Stream objects. StrmIn reads the XML document, strmCmd contains the XML template that the ADO CommandStream property receives, and strmOut stores the stored procedure's output (i.e., the XML document that the stored procedure returns).
After opening all the Stream objects, you use the LoadFromFile method on strmIn to load the XML document from disk. Then, strmCmd builds the template, which consists of one <sql:query ...> tag to execute the stored procedure. Note that strmCmd builds the template only as far as the inclusion of the XML document. Then, you copy the contents of the XML document from strmIn to strmCmd.
Note the use of CDATA at callout A in Listing 4. The CDATA section is necessary to avoid substituting character entities for the XML tags (e.g., substituting < for '<') in the document. After you copy the XML document into the Stream, you close the CDATA section and provide the end tag to complete the template. Finally, you execute the ADO Command object to process the template and obtain the results.
Although the ADO Stream example avoids explicitly allocating strings, it still buffers the XML document in memory within the ADO Stream. Although the ADO Stream can load data into SQL Server more efficiently than VB code can, be aware of the memory consumption on both the middle tier and the SQL Server machine when you use this approach. I recommend this solution only for small documents or when you need to load data from an XML document only infrequently. In these cases, the simplicity of this solution is ideal.
Using XML Schemas, how do you constrain the range of values that an attribute value can store?
XML Schemas let you describe the structure of an XML document, such as what elements are allowed in the document, how to nest the elements, and the attributes that are allowed on each element among a slew of other properties. You can also use XML Schemas to describe other meta-information about the elements' content and attributes in a document (e.g., data type, cardinality, sequence).
An XML Schema is an XML document that uses a specific vocabulary—a set of elements and attributes—to specify the vocabulary of another XML document. An XML Schema is an XML document that describes the structure and the data contained within an XML document.
Just as a relational schema describes your database structure (for example, tables, columns, and data types), an XML Schema describes an XML document's data structure (such as elements, sub-elements, attributes, and data types). And just as you can use CHECK constraints to limit the values that the columns in a table can contain, you can use the restrictions in XML Schemas to limit the values in XML elements and attributes.
To limit the range of attribute values, XML Schemas provide a user-defined data-type mechanism, which lets you define a data type that has a set of restrictions such as the maximum and minimum values for numeric types or the maximum length of a string value. XML Schemas provide different ways to specify a user-defined data type.
The example that Listing 5 shows uses an inline simpleType definition (which is contained directly within the attribute) to restrict the daysinyear attribute's value to 365 or 366. Note that the <restriction> tag specifies a base type of positive integers that the XML Schema further refines to have maximum and minimum values.
As XML Schemas become integrated with applications and start replacing the use of Document Type Definitions (DTDs), a thorough understanding of XML Schemas' capabilities and their usage within Web applications and SQL Server will be important. I look forward to answering your questions about XML Schemas in future editions of "Exploring XML."