Using XML Schemas

DataStream

LANGUAGES: XML, C#

ASP.NET VERSIONS: All

Using XML Schemas

XSD Files Can Save You Time & Frustration

By Brian Noyes

The DataSet class makes it very easy to load XML data into an instance of the class, and then work with it as relational data. However, many questions arise when developers start loading nested XML elements into a data set. That's when a little understanding of what's going on under the covers goes a long way toward obtaining the results you expect.

When you read XML data into a data set, the resulting relational schema that's created will depend on two things: the shape of the XML data and whether you provide an XML Schema describing that shape before loading the data itself. To understand the different approaches and the resulting outcome, let's start with the basics.

Loading XML Schema and Data into a DataSet

The DataSet class has two methods you must consider when loading data: ReadXmlSchema and ReadXml (their names make the purpose of each method fairly easy to understand). If you provide an XML Schema Definition (XSD) file to the ReadXmlSchema method before you try to load the XML data with ReadXml, the data set will create a relational schema for itself that corresponds to the XML schema provided. When you then load the XML data with the ReadXml method, the data for elements and attributes corresponding to the provided schema will be loaded into the tables and columns that were created. If you read any XML elements or attributes that aren't part of the XML schema, they will be ignored.

In contrast, if you call ReadXml on an XML document without first calling ReadXmlSchema with a schema file, the data set will do its best to infer an appropriate relational schema based on the shape or structure of the XML it finds in the XML file that you tell it to read. The results you get may or may not match your expectations. There are also many forms of XML that the data set will be unable to load, particularly if a given element is present at multiple levels in the hierarchy of nodes.

When the data set creates a relational schema for XML data, it maps elements with child elements or attributes to tables, and elements containing text or attributes to fields on the containing table. Consider the example XML document shown in Figure 1, which contains a subset of data from the Customers table of the Northwind database.

<CustomerID>ALFKI</CustomerID>

<CompanyName>Alfreds Futterkiste</CompanyName>

</Orders>

</Orders>

</Customers>

<CustomerID>ANATR</CustomerID>

<CompanyName>Ana Trujillo</CompanyName>

</Orders>

</Orders>

</Customers>

</CustomersDataSet>

Figure 1: Sample data from the Customers table of the Northwind database as XML.

If you simply read this data in to a data set with ReadXml, the data set will contain two tables, Customers and Orders, corresponding to the elements it finds that contain other child elements. The XML in Figure 2 results in exactly the same relational schema.

<Customers CustomerID="ALFKI"

CompanyName="Alfreds Futterkiste">

</Customers>

<Customers CustomerID="ANATR"

CompanyName="Ana Trujillo">

</Customers>

</CustomersDataSet>

Figure 2: Sample data from the Customers table of the Northwind database as XML with attribute content instead of elements.

The other thing the data set will do while reading in data like this is look at the nested relationships of elements with other elements. When it sees that the Orders elements are child elements of the Customers elements, it will create a parent-child relation between the two tables with a corresponding foreign key constraint to maintain the nested nature of the data while it's being used as relational data.

Foreign Keys Are the Key

The thing that usually trips up developers when they're loading nested XML data into a data set is not understanding what the data set is using as a foreign key between parent and child tables. In the case of either of the XML documents shown in Figures 1 and 2, the data set doesn't have enough information to form a foreign key, so it adds a field to each of the two tables it creates to form that foreign key constraint. What it adds in that case is an integer field to each of the Customers and Orders tables that it names Customers_Id. It uses this field to create a primary key in the Customers table and a foreign key constraint from the Orders table to the Customers table using the corresponding field in the Orders table.

This has a couple of undesirable effects. One is that the relational data now has another field that you must manage. If you are inserting new data into the child table, you will have to determine what the corresponding parent row's Id column value is, and set that before adding the row. Otherwise, it won't be added as a child row of the parent. For example, to add an order to the customer whose CustomerID field is ANATR, you would need the following code:

DataSet ds = new DataSet();

ds.ReadXml("CustomersOrders.xml");

DataRow dr = ds.Tables["Orders"].NewRow();

dr["OrderID"] = 99999;

dr["Customers_Id"] = 1;

ds.Tables["Orders"].Rows.Add(dr);

The key extra step here that eludes many people is the need to set that additional Customers_Id field that was fabricated by the data set to the appropriate foreign key to make it a child of the appropriate parent Customers row.

The other undesirable effect is that the data may already contain a field that represents the real primary key for the parent table to which a foreign key should be set. If that is the case, then the solution lies in ensuring to first load an XML schema with that information embedded using the ReadXmlSchema method before loading the XML data with ReadXml.

You can easily create an XML schema for a given XML document using Visual Studio .NET. Simply load the XML document into the editor, and select Create Schema from the XML menu. Once you've created an XML schema for the document, open the XSD file in the editor and drag a Key component from the Toolbox onto the appropriate table (Customers in the sample XML above) and set the primary key on the appropriate field. Once you've done that, if you load the XSD with ReadXmlSchema, followed by loading the XML with ReadXml, the data set will no longer infer an extra field on the parent table; it will simply add the foreign key field in the child table to the specified primary key field in the parent, and you should get exactly the results you expect.

If you're loading the XML data into an XmlDataDocument instead of a data set, you'll have to call ReadXmlSchema on the DataSet property before loading data with the Load method. Otherwise, none of the loaded data will get added to the data set relational schema. The code below shows how to load a schema for the data set of an XmlDataDocument, before loading the XML document:

// Create the document.

XmlDataDocument doc = new XmlDataDocument();

// Get the data set reference.

DataSet ds = doc.DataSet;

// Read schema to use for relational data in document.

ds.ReadXmlSchema(@"..\..\NestedData.xsd");

// Load it in.

doc.Load(@"..\..\NestedData.xml");

If you don't provide the schema, the data would still be loaded, and accessible through the methods and properties of the base XmlDocument class, but the DataSet property would contain no tables. Even if you provide the schema, you must keep in mind the foreign key issues described earlier when adding rows to the generated tables. If you add a new row to the Orders table without setting the foreign key column to an existing row in the Customers table, the resulting Order element is created as a child of the root element of the XML document.

Brian Noyes is a software architect with IDesign, Inc. (http://www.idesign.net), a .NET-focused architecture and design consulting firm. Brian specializes in designing and building data-driven distributed Windows and Web applications. Brian writes for a variety of publications and is working on a book for Addison-Wesley on building Windows Forms Data Applications with .NET 2.0. Contact him at mailto:[email protected].

Comments

Plain text