Programming XPath

In last week's column, I concluded my XML Path Language (XPath) discussion by suggesting that XPath is the first significant step toward a standard and powerful way to navigate through XML forests of nodes and attributes. In this week's column, I examine the XPath syntax as Microsoft has implemented it in Microsoft XML Parser (MSXML) 3.0 and the .NET platform.

An XPath string identifies a group of XML nodes. From this point of view, an XPath string is much the same as the argument of the MS-DOS dir command. For example, the following dir command returns all the files with a .doc extension in the current directory:

dir *.doc

Similarly, the XPath statement

customers/*/name

returns all the customers' names, given the following XML document:

<customers>
  <customer>
    <name>ACME Corp</name>
    <address>1234 One Way</address>
    <orders>
      <order>6</order>
    </orders>
  </customer>
  <customer preferred="yes">
    <name>Nothing Hill Ltd</name>
    <address>321 First Ave</address>
    <orders>
      <order>1</order>
      <order>2</order>
      <order>5</order>
    </orders>
  </customer>
  <customer>
    <name>eYou</name>
    <address>111 Second Ave</address>
  </customer>
  <customer preferred="yes">
    <name>EarthWindFire Corp</name>
    <address>1 Their Way</address>
    <orders>
      <order>3</order>
      <order>4</order>
    </orders>
  </customer>
</customers>

In VBScript, you include the XPath statements as follows:

' Load the document synchronously and select XPath as the query language
Set xml = CreateObject("MSXML2.DOMDocument")
xml.async = False
xml.setProperty "SelectionLanguage", "XPath"
xml.load "data.xml"

' Run the query; the result is a selection of the matching nodes
Set xpath = xml.selectNodes("customers/*/name")
Set xpath.context = xml    ' optional; assigning an object requires Set

' Walk the selection until nextNode returns Nothing
buf = ""
Set node = xpath.nextNode()
Do While Not (node Is Nothing)
    buf = buf & node.Text & vbCrLf
    Set node = xpath.nextNode()
Loop
MsgBox buf

A few things in the preceding code need further explanation. After you create the MSXML instance and set the async property to false so that the Document Object Model (DOM) loads synchronously, you set the SelectionLanguage property to XPath. In MSXML 3.0 the default is XSLPattern, a language similar to XPath and based on a very early draft of today's XPath.
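As a rough sketch of those first steps, the following script reads the selection language back before and after changing it and checks the result of the synchronous load. The explicit MSXML2.DOMDocument.3.0 ProgID, the one-customer inline document, and the use of WScript.Echo (which assumes the script runs under Windows Script Host) are choices made for illustration only.

```vbscript
Option Explicit
Dim doc, langBefore, langAfter

' Assumption: MSXML 3.0 is installed and registered under this ProgID.
Set doc = CreateObject("MSXML2.DOMDocument.3.0")
doc.async = False    ' loadXML returns only after parsing has finished

' Read the selection language before and after switching it.
' MSXML 3.0 is documented to default to XSLPattern.
langBefore = doc.getProperty("SelectionLanguage")
doc.setProperty "SelectionLanguage", "XPath"
langAfter = doc.getProperty("SelectionLanguage")

' loadXML returns False on a parse failure; parseError explains why.
If Not doc.loadXML("<customers><customer><name>ACME Corp</name></customer></customers>") Then
    WScript.Echo "Load failed: " & doc.parseError.reason
    WScript.Quit 1
End If

WScript.Echo "Selection language before: " & langBefore
WScript.Echo "Selection language after:  " & langAfter
```

Checking the return value of the load matters because a query against a document that failed to parse simply returns no nodes, which is easy to mistake for a wrong XPath expression.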

The second step is to retrieve the desired nodes by passing the XPath query string to the selectNodes method. If you need only the first matching node, you use the selectSingleNode method instead of selectNodes. You can also set the context for the query. In the code above, I've used the statement

Set xpath.context = xml

for illustration; it isn't necessary, however, because the context defaults to the node on which you call selectNodes, which here is the document itself.
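The following sketch contrasts selectSingleNode with a query issued from an element rather than from the document. The two-customer inline document and the variable names are assumptions for illustration; the script assumes Windows Script Host and MSXML 3.0.

```vbscript
Option Explicit
Dim doc, firstName, secondCustomer, localName

Set doc = CreateObject("MSXML2.DOMDocument.3.0")
doc.async = False
doc.setProperty "SelectionLanguage", "XPath"
doc.loadXML "<customers>" & _
    "<customer><name>ACME Corp</name></customer>" & _
    "<customer preferred='yes'><name>Nothing Hill Ltd</name></customer>" & _
    "</customers>"

' selectSingleNode returns only the first match, or Nothing if there is none.
Set firstName = doc.selectSingleNode("customers/*/name")
If Not (firstName Is Nothing) Then WScript.Echo firstName.text    ' ACME Corp

' A query called on an element is evaluated relative to that element,
' so the path no longer starts at customers.
Set secondCustomer = doc.selectSingleNode("customers/*[2]")
Set localName = secondCustomer.selectSingleNode("name")
WScript.Echo localName.text                                       ' Nothing Hill Ltd
```

Because selectSingleNode can return Nothing, it is worth testing the result before touching its properties.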

The nextNode method lets you walk through the forest of retrieved nodes. Note that the Text property returns the whole text of the subtree, which means that if you select a leaf node, such as name, node.Text produces a meaningful string. Otherwise, node.Text produces a concatenation of the text of all the node's descendants.

If you replace

customers/*/name

with

customers/*[name and address]

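node.Text returns, for each customer, a single string that runs together the name, the address, and any order numbers in that customer's subtree. To separate them, you have to go back to the DOM, for example by reading the node.xml property or by querying the children of each customer node.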
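One way to pull the pieces apart, sketched under the assumption that every selected customer has exactly one name and one address, is to run a second, relative query on each customer node. The inline document and variable names are illustrative only.

```vbscript
Option Explicit
Dim doc, customers, cust, buf

Set doc = CreateObject("MSXML2.DOMDocument.3.0")
doc.async = False
doc.setProperty "SelectionLanguage", "XPath"
doc.loadXML "<customers>" & _
    "<customer><name>ACME Corp</name><address>1234 One Way</address>" & _
    "<orders><order>6</order></orders></customer>" & _
    "<customer><name>eYou</name><address>111 Second Ave</address></customer>" & _
    "</customers>"

' Select the customer elements themselves, not their text.
Set customers = doc.selectNodes("customers/*[name and address]")

buf = ""
For Each cust In customers
    ' Query each field relative to the customer instead of reading cust.text.
    buf = buf & cust.selectSingleNode("name").text & ", " & _
                cust.selectSingleNode("address").text & vbCrLf
Next
WScript.Echo buf
```

The predicate guarantees that both children exist, so the two selectSingleNode calls inside the loop cannot return Nothing for this query.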

When you search for nodes, you can use a number of predefined functions and syntax shortcuts. The [] operator, for instance, filters nodes by a condition, such as the presence of children with the specified names. For example,

customers/*[name and address]

selects all the children of customers that have both a name and an address subnode. You can combine tests with the and and or operators and negate them with the not() function. XPath is case sensitive, so all three must be lowercase.
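To illustrate how these Boolean tests combine, the sketch below runs three filters against the sample customers document, loaded from an inline string so that the script needs no data file. LoadSample and JoinText are hypothetical helpers written for this example, not part of MSXML; the script assumes Windows Script Host and MSXML 3.0.

```vbscript
Option Explicit

' Hypothetical helper: builds the column's sample document in memory.
Function LoadSample()
    Dim d
    Set d = CreateObject("MSXML2.DOMDocument.3.0")
    d.async = False
    d.setProperty "SelectionLanguage", "XPath"
    d.loadXML "<customers>" & _
        "<customer><name>ACME Corp</name><address>1234 One Way</address>" & _
        "<orders><order>6</order></orders></customer>" & _
        "<customer preferred='yes'><name>Nothing Hill Ltd</name><address>321 First Ave</address>" & _
        "<orders><order>1</order><order>2</order><order>5</order></orders></customer>" & _
        "<customer><name>eYou</name><address>111 Second Ave</address></customer>" & _
        "<customer preferred='yes'><name>EarthWindFire Corp</name><address>1 Their Way</address>" & _
        "<orders><order>3</order><order>4</order></orders></customer>" & _
        "</customers>"
    Set LoadSample = d
End Function

' Hypothetical helper: joins the text of every node in a list with "; ".
Function JoinText(nodes)
    Dim n, out
    out = ""
    For Each n In nodes
        If out <> "" Then out = out & "; "
        out = out & n.text
    Next
    JoinText = out
End Function

Dim doc
Set doc = LoadSample()

' Customers with no orders element
WScript.Echo JoinText(doc.selectNodes("customers/*[not(orders)]/name"))
' -> eYou

' Customers that have orders but no preferred attribute
WScript.Echo JoinText(doc.selectNodes("customers/*[orders and not(@preferred)]/name"))
' -> ACME Corp

' Customers that are preferred or have no orders
WScript.Echo JoinText(doc.selectNodes("customers/*[@preferred or not(orders)]/name"))
' -> Nothing Hill Ltd; eYou; EarthWindFire Corp
```

The results come back in document order, regardless of which half of an or test each customer satisfied.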

The [] operator is also useful when you want to select nodes based on the presence of one or more attributes. You prefix the attribute's name with the at (@) character. For example,

customers/*[@preferred]/name

selects the names of all the children of customers that have a preferred attribute, whatever its value (from our sample XML document above, this statement returns Nothing Hill Ltd and EarthWindFire Corp). Within the [] operator, you can also use the less than (<), greater than (>), and equal (=) operators; for instance, [@preferred = 'yes'] tests the attribute's value rather than its presence. You can address nodes by number (e.g., the second) and by logical position (e.g., the last). The following expressions show the syntax:

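customers/*[@preferred][2]/name
customers/*[@preferred][last()]/name
customers/*[@preferred][position() < 3]/name

You compare the position() function against a constant to select the first (or, combined with last(), the last) specified number of nodes; the third expression selects the first two preferred customers. The index() function belongs to the older XSLPattern language, where it is zero-based, and isn't available once the selection language is XPath. If you want to know about the Nth customer's name, you must specify the index at the customer level. If you put the index after name instead, it applies to the name children of each selected customer, that is, the second or the last name element within a customer. Bear in mind that when making comparisons, you can also specify an XPath string as an operand to compare, say, one node with a child or a relative.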
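The effect of where the position test sits, and of using a path as a comparison operand, can be sketched as follows. The script uses standard XPath 1.0 positions, which start at 1, and loads the sample customers document from an inline string. LoadSample and JoinText are hypothetical helpers written for this example; the script assumes Windows Script Host and MSXML 3.0.

```vbscript
Option Explicit

' Hypothetical helper: builds the column's sample document in memory.
Function LoadSample()
    Dim d
    Set d = CreateObject("MSXML2.DOMDocument.3.0")
    d.async = False
    d.setProperty "SelectionLanguage", "XPath"
    d.loadXML "<customers>" & _
        "<customer><name>ACME Corp</name><address>1234 One Way</address>" & _
        "<orders><order>6</order></orders></customer>" & _
        "<customer preferred='yes'><name>Nothing Hill Ltd</name><address>321 First Ave</address>" & _
        "<orders><order>1</order><order>2</order><order>5</order></orders></customer>" & _
        "<customer><name>eYou</name><address>111 Second Ave</address></customer>" & _
        "<customer preferred='yes'><name>EarthWindFire Corp</name><address>1 Their Way</address>" & _
        "<orders><order>3</order><order>4</order></orders></customer>" & _
        "</customers>"
    Set LoadSample = d
End Function

' Hypothetical helper: joins the text of every node in a list with "; ".
Function JoinText(nodes)
    Dim n, out
    out = ""
    For Each n In nodes
        If out <> "" Then out = out & "; "
        out = out & n.text
    Next
    JoinText = out
End Function

Dim doc
Set doc = LoadSample()

' Position at the customer level: the second customer overall
WScript.Echo JoinText(doc.selectNodes("customers/*[2]/name"))
' -> Nothing Hill Ltd

' Position applied after the attribute filter: the second preferred customer
WScript.Echo JoinText(doc.selectNodes("customers/*[@preferred][2]/name"))
' -> EarthWindFire Corp

' Position at the name level: the second name child of each preferred customer.
' No customer has two names, so the list is empty.
WScript.Echo doc.selectNodes("customers/*[@preferred]/name[2]").length
' -> 0

' A path as a comparison operand: customers with at least one order above 4
WScript.Echo JoinText(doc.selectNodes("customers/*[orders/order > 4]/name"))
' -> ACME Corp; Nothing Hill Ltd
```

In the last query the comparison is true for a customer as soon as any one of its order values exceeds 4, which is how XPath 1.0 compares a node set with a number.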

I've only scratched the surface of XPath syntax. However, at least a couple of things should be clear. One is that XPath syntax is still awkward and considerably less friendly than MS-DOS wildcard syntax. The other is that, despite appearances, XPath is easy to understand and powerful. It isn't the final step, though. The definitive XML query language is still some distance off.
