How SAX Works

My two previous columns introduced the Simple API for XML (SAX), a programming interface for applications that need to parse XML documents. Recall that SAX is a lightweight alternative to XML Document Object Model (XMLDOM) parsers. SAX is particularly suited to very large documents and to situations in which you need to extract some information and discard the rest. Because a SAX parser pushes data to the client application as it reads the document, the SAX model is called a "push model." By contrast, the XMLDOM model is called a "pull model" because the parser loads the whole document into memory and the application then pulls out the data it needs. In this column, I look briefly at how a SAX parser works on XML documents.

The SAX parser reads through an XML file and fires events based on the markup it finds. It fires events for opening and closing tags and for the characters that form the text of each element. SAX also lets your application know when it finds a processing instruction (which can occur inside the main document element as well as before or after it), when the document begins, and when the document ends.
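To make the event sequence concrete, here is a minimal sketch that records each event in the order the parser fires it. Note that it uses Python's standard xml.sax module rather than MSXML, purely as a stand-in for the same event model; the EventLogger class and the sample document are my own assumptions for illustration.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class EventLogger(ContentHandler):
    """Hypothetical handler that records each SAX event as a readable string."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startDocument(self):
        self.events.append("startDocument")

    def endDocument(self):
        self.events.append("endDocument")

    def startElement(self, name, attrs):
        self.events.append("startElement " + name)

    def endElement(self, name):
        self.events.append("endElement " + name)

    def characters(self, content):
        # Text can arrive in more than one chunk; ignore whitespace-only chunks.
        if content.strip():
            self.events.append("characters " + content.strip())

    def processingInstruction(self, target, data):
        self.events.append("processingInstruction " + target)


# Made-up sample: a processing instruction followed by a tiny document.
SAMPLE = b"<?xml version='1.0'?><?app hint?><note><to>Ann</to></note>"

logger = EventLogger()
xml.sax.parseString(SAMPLE, logger)
for event in logger.events:
    print(event)
```

The XML declaration itself isn't reported as a processing instruction, so the only such event in this sketch comes from the app instruction.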

With Microsoft XML Parser (MSXML) 3.0, Microsoft provides a COM-based programming interface for implementing the SAX model with both Visual Basic (VB) and Visual C++. MSXML 3.0's SAX implementation doesn't perform any validation, so with MSXML, SAX skips any external Document Type Definition (DTD) you refer to in your XML source. The parser does, however, fire an event to signal each entity it skips.

The SAX architecture comprises a number of software modules called handlers, the most important of which is the content handler. Applications interact with the SAX parser by registering their own handlers. The underlying platform determines the implementation of those handlers. In Windows, for example, such handlers are implemented as COM objects, as the following code snippet illustrates:

Dim reader As New SAXXMLReader
Dim contentHandler As New ContentHandlerImpl
Set reader.contentHandler = contentHandler

The first line creates an instance of the SAX parser object. The second line instantiates a VB class (i.e., a COM object) that implements the required interface (IVBSAXContentHandler in VB) for handling the XML document's content as the parser encounters it. The third line registers that object with the parser so that it receives the content events. You could choose not to implement the content handler, but believe me, your application wouldn't be particularly useful. The content handler is the key to manipulating the content of the XML document you're processing; without one, you can do very little with the document you're parsing.
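The same create-and-register pattern appears in other SAX implementations. As a rough sketch, and again using Python's standard xml.sax module instead of MSXML, the following code creates a reader, registers a content handler, and keeps only the data it cares about. The TitleCollector class and the catalog, book, and title element names are hypothetical, chosen only for illustration.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler


class TitleCollector(ContentHandler):
    """Hypothetical handler: keeps the text of <title> elements, ignores the rest."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._buffer = None  # a list while inside <title>, otherwise None

    def startElement(self, name, attrs):
        if name == "title":
            self._buffer = []

    def characters(self, content):
        # Collect text only while a <title> element is open.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "title" and self._buffer is not None:
            self.titles.append("".join(self._buffer))
            self._buffer = None


# Made-up sample document.
CATALOG = (
    b"<catalog>"
    b"<book><title>SAX Basics</title><price>10</price></book>"
    b"<book><title>DOM Basics</title><price>12</price></book>"
    b"</catalog>"
)

reader = xml.sax.make_parser()      # create the parser object
handler = TitleCollector()          # create the content handler object
reader.setContentHandler(handler)   # register the handler with the parser
reader.parse(io.BytesIO(CATALOG))   # events now flow to the handler

print(handler.titles)
```

Nothing but the collected titles stays in memory once parsing ends, which is the extract-and-discard scenario SAX suits best.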

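Other handlers that you should learn about are the error handler and the DTD handler. The error handler receives the errors and warnings that the parser raises, and the DTD handler receives the notation and unparsed-entity declarations found in the DTD. You'll most probably need this kind of information while processing an XML document.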
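To give an idea of what these two handlers receive, here is a small sketch based on Python's standard xml.sax module (not MSXML, whose interfaces differ in detail). The class names, the parse_with_report helper, and both sample documents are assumptions made up for this illustration.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, DTDHandler, ErrorHandler


class RecordingErrorHandler(ErrorHandler):
    """Hypothetical handler that records every problem the parser reports."""

    def __init__(self):
        self.problems = []

    def warning(self, exception):
        self.problems.append(("warning", exception.getMessage()))

    def error(self, exception):
        self.problems.append(("error", exception.getMessage()))

    def fatalError(self, exception):
        self.problems.append(("fatal", exception.getMessage()))
        raise exception  # a fatal error ends the parse


class RecordingDTDHandler(DTDHandler):
    """Hypothetical handler that records notation declarations from the DTD."""

    def __init__(self):
        self.notations = []

    def notationDecl(self, name, publicId, systemId):
        self.notations.append((name, systemId))


def parse_with_report(data):
    """Parse bytes and return (succeeded, problems, notations)."""
    reader = xml.sax.make_parser()
    errors = RecordingErrorHandler()
    dtd = RecordingDTDHandler()
    reader.setContentHandler(ContentHandler())  # default handler does nothing
    reader.setErrorHandler(errors)
    reader.setDTDHandler(dtd)
    try:
        reader.parse(io.BytesIO(data))
        succeeded = True
    except xml.sax.SAXParseException:
        succeeded = False
    return succeeded, errors.problems, dtd.notations


# Made-up samples: one well-formed document with a notation, one malformed.
GOOD = b'<!DOCTYPE doc [<!NOTATION gif SYSTEM "viewer.exe">]><doc/>'
BAD = b"<doc><item></doc>"

print(parse_with_report(GOOD))
print(parse_with_report(BAD))
```

Re-raising inside fatalError is a design choice for this sketch: it stops the parse at the first well-formedness error while still leaving a record of what went wrong.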

A SAX parser never creates a tree structure for the document in memory. However, your application can build one programmatically. Because the application that calls SAX receives an event each time the parser encounters a significant XML symbol, it can use this information to create a tree structure in memory.
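One simple way to build such a tree is to keep a stack of open elements: push a node on each start-element event and pop it on the matching end-element event. The sketch below shows the idea with Python's standard xml.sax module rather than MSXML; the Node and TreeBuilder classes and the sample document are hypothetical names of my own.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class Node:
    """Hypothetical in-memory node: name, attributes, text, and children."""

    def __init__(self, name, attributes):
        self.name = name
        self.attributes = attributes
        self.children = []
        self.text = ""


class TreeBuilder(ContentHandler):
    """Builds a Node tree from SAX events by tracking the open elements."""

    def __init__(self):
        super().__init__()
        self.root = None
        self._stack = []  # elements that are currently open, innermost last

    def startElement(self, name, attrs):
        node = Node(name, dict(attrs.items()))
        if self._stack:
            self._stack[-1].children.append(node)  # attach to the parent
        else:
            self.root = node                       # first element is the root
        self._stack.append(node)

    def characters(self, content):
        # Text belongs to the innermost open element.
        if self._stack:
            self._stack[-1].text += content

    def endElement(self, name):
        self._stack.pop()


# Made-up sample document.
ORDER = b'<order id="7"><item>pen</item><item>ink</item></order>'

builder = TreeBuilder()
xml.sax.parseString(ORDER, builder)

print(builder.root.name, builder.root.attributes)
for child in builder.root.children:
    print(" ", child.name, child.text)
```

Of course, building the complete tree gives up the memory advantage of SAX, so in practice you would usually keep only the branches you need.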

SAX is relatively new to Windows, and MSXML represents the first serious attempt to endow Windows with a powerful and effective SAX. If you're interested, you can check out the original SAX specification on the Megginson Technologies Ltd. Web site.

See also, "Writing SAX Applications."
