11 - XML CSE 413 Lecture Notes

Introduction

A common theme this quarter has been separating the layers of an architecture by defining abstractions and then manipulating information at the level of those abstractions, rather than at the level of implementation details. This can happen on many levels and is almost always a good idea. For example, we've seen that both Scheme and PostScript let us manipulate functions as just another data item, which has given us more flexibility in the way they are defined, stored, and retrieved.

Another area where this kind of abstracted thinking is being applied in modern systems is in the representation of data. It turns out that there are numerous uses for hierarchical data structures that can be shown to be well formed (syntax) and valid (semantics). One of the benefits of well formed, valid data structures is that new languages can be written to manipulate these abstract data structures. In the past, data structures were generally ad-hoc in nature, and so the programs that manipulated them were also ad-hoc, custom designed for each application.

An important task in describing data in a well-defined fashion is separating the meaning of the data (structure and content) from the presentation of the data (fonts, styles, layout, etc.). When you actually start to think about this issue, you often realize that what you have been thinking of as content is actually an artifact of the particular presentation that you anticipate. The thinking required to separate meaning from presentation is very similar to the thinking needed to separate architecture from implementation, an important aspect of good object-oriented design.

When the information content of a data set is clearly defined, information subsets and summaries can be more easily created. Also, if the meaning of the information is recorded in a logical fashion, much of the processing can be done in an automated fashion by new programming languages relying on keywords and templates, instead of requiring human intervention. This enables new applications that would not have been practical previously.

The Extensible Markup Language (XML) is designed for representing structured data. It is used to define special-purpose markup languages, which in turn are used to describe some data domain.

One example of an XML based language is XHTML. XHTML is a restatement of HTML, using the structural rules of XML. A document that is properly organized is said to be well-formed. In addition to following the XML structuring rules, there may also be a rigorous definition of how the language is actually constructed (what are the element names, what are the valid attributes, what types of values can the attributes take, etc). In the case of XHTML, this definition is provided in the Document Type Definition (DTD). The definition can be referenced in the XML document, in which case the document can be validated against the definition. Such a document is said to be valid.

Note that an XML document can be well formed and very useful, even if there is no definition against which to validate it. A document which is not well formed is not acceptable to any XML parser and will be rejected with an error. This is an important difference from HTML, where the parser attempts to make sense out of the markup, no matter how bad it is.
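The strictness described above is easy to see with any XML parser. The following is a minimal sketch using the parser in Python's standard library; the two sample documents and the helper name is_well_formed are made up for illustration and are not part of the course materials.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample documents, invented for this sketch.
WELL_FORMED = "<note><to>Class</to><body>Hello</body></note>"
# The <to> element is never closed, so the elements do not nest properly.
MALFORMED = "<note><to>Class<body>Hello</body></note>"


def is_well_formed(text):
    """Return True if the parser accepts text, False if it rejects it."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        # A conforming XML parser reports an error instead of guessing
        # what the author meant, unlike a typical HTML parser.
        return False


print(is_well_formed(WELL_FORMED))  # True
print(is_well_formed(MALFORMED))    # False
```

Note that this only checks well-formedness. No DTD is consulted, so the check says nothing about whether either document is valid.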

Rules and Structures

One of the reasons that XML has become so popular is that it is relatively easy to understand the rules for making a well-formed document.

A document that follows these rules is well-formed and can be parsed by any XML parser. If it also includes a DTD reference, it can be validated against the DTD to confirm that it only includes data with known meaning.

Tools

Following from the fact that well-formed XML can be parsed by any XML parser, there are numerous application tools that can be used with XML. The tools do not have to be specifically designed for a particular application, and in fact they can be modularized and included as features of any program.

Since XML is all text, any editor that can produce raw text can be used to write XML. The language is a little verbose, so it can get tedious, but especially for small files and simple applications it's easy to do.

Most modern programming editors have specific modes for working with XML and offer useful features like syntax coloring and element tag completion. jEdit is an example of this sort of editor. Most large Integrated Development Environments (IDEs) like Dreamweaver and MS Visual Studio also provide XML-oriented editing environments.

There are also tools designed explicitly for working with XML. For example, XMLSpy is a very capable XML-oriented editor with good display and analysis capabilities. You can download a free copy of XMLSpy from the vendor's website for use in the class. Lantern is an open source application that lets users load XML documents and then test XPath expressions against them.

Since XML structure is well defined, separate parsers can be written and incorporated in the development of other tools. An XML parser (or processor) reads an XML document and verifies that it is well formed. It may also check that it is valid, but this is not required and is often not done. Once the document is parsed, it is converted into a tree of elements (the DOM) that can be processed by any application program that can talk to the parser. The beauty of this is that the application program doesn't need to include any parsing code of its own. In fact, a program can often be built by combining a couple of files that are read by the parser and no coding is necessary at all. In other cases, some coding is needed, but just enough to tell the parser what to do. This makes rapid development of applications much easier and less error-prone.
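As a rough illustration of this division of labor, the sketch below hands a document to the DOM parser in Python's standard library (xml.dom.minidom) and then walks the resulting tree. The sample document and the function name element_outline are assumptions made up for this example; the application code does no parsing of its own.

```python
from xml.dom import minidom

# Hypothetical sample document, invented for this sketch.
DOC = """<course>
  <lecture number="11"><title>XML</title></lecture>
  <lecture number="12"><title>Example</title></lecture>
</course>"""


def element_outline(node, depth=0):
    """Return (depth, tag name) pairs for every element below node."""
    outline = []
    if node.nodeType == node.ELEMENT_NODE:
        outline.append((depth, node.tagName))
        depth += 1
    # Text nodes (including whitespace) are in the tree too; they have
    # no children and are not elements, so they add nothing here.
    for child in node.childNodes:
        outline.extend(element_outline(child, depth))
    return outline


dom = minidom.parseString(DOC)  # the parser builds the whole tree
for depth, name in element_outline(dom):
    print("  " * depth + name)
```

The printed outline shows course at the top level, each lecture indented beneath it, and each title indented beneath its lecture.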

There are numerous XML parsers available, including

In addition to the standalone XML parsers listed above, modern browsers and other "user agents" now include the ability to parse and manipulate XML directly. For example, Firefox and Internet Explorer both display this simple example date.xml with syntax coloring and full knowledge of the tree structure of the file. water.xml is a slightly more complex example from the XML Bible by Elliotte Rusty Harold. This example is written in a particular XML dialect (or application) called Chemical Markup Language (CML), which defines particular tags to describe complex chemical objects.

It is now quite common for applications to store their setup and configuration data in XML structured files. (Perform a search of C:\Documents and Settings for files with the .xml extension to see many examples of this.)
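A minimal sketch of how an application might read such a configuration file follows. The element names, attribute names, and file names in it are invented for illustration and do not correspond to any real application's settings format.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration document; every name here is made up.
CONFIG = """<settings>
  <window width="800" height="600"/>
  <recent>
    <file>a.xml</file>
    <file>b.xml</file>
  </recent>
</settings>"""


def load_settings(text):
    """Parse the configuration text into a plain dictionary."""
    root = ET.fromstring(text)
    window = root.find("window")
    return {
        # Attribute values are always strings, so convert explicitly.
        "width": int(window.get("width")),
        "height": int(window.get("height")),
        "recent": [f.text for f in root.findall("recent/file")],
    }


print(load_settings(CONFIG))
```

Because the parser handles all of the syntax, the application code only has to say which elements and attributes it cares about.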

Many standard desktop applications also use XML for their basic data, either as an import/export data format or as a fundamental means of tracking their data. Microsoft, for instance, has made significant additions to their MS Office products to support XML as a basic data exchange tool. For example, this simple spreadsheet (excel-snap.png) can be saved in XML format (sample.xml) and then read and parsed by any XML parser (e.g., XMLSpy snapshot).

Related technologies

As you can easily imagine, once it became clear that XML was easy to create and parse, many applications of the language sprang up. With them came the need to process the elements in the document tree using standard tools and techniques. Two technologies of particular interest to us are:

XPath

The primary purpose of XPath is to address parts of an XML document. It also provides basic facilities for manipulation of strings, numbers and booleans. XPath operates on the abstract, logical structure of an XML document, rather than its surface syntax. XPath gets its name from its use of a path notation as in URLs for navigating through the hierarchical structure of an XML document.

XPath expressions can be used to identify a particular node or set of nodes in a document tree model. The basic concept is similar to that used in identifying directories in a standard hierarchical file system, but there are lots of extra doo-dads that let you modify how the nodes are selected and manipulated.
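To make the file-system analogy concrete, here is a small sketch using Python's xml.etree.ElementTree, which supports only a limited subset of XPath. That subset is enough to show path steps, an attribute test, and a descendant search. The sample document is invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for this sketch.
DOC = """<catalog>
  <section name="fiction">
    <item id="a1"><title>First</title></item>
    <item id="a2"><title>Second</title></item>
  </section>
  <section name="reference">
    <item id="b1"><title>Third</title></item>
  </section>
</catalog>"""

root = ET.fromstring(DOC)

# Path steps separated by "/" work like directory names in a file path.
all_titles = [t.text for t in root.findall("section/item/title")]

# A predicate in square brackets narrows the selection, here by attribute.
reference_titles = [
    t.text for t in root.findall("section[@name='reference']/item/title")
]

# ".//" searches all descendants, however deeply they are nested.
item_ids = [i.get("id") for i in root.findall(".//item")]

print(all_titles)        # ['First', 'Second', 'Third']
print(reference_titles)  # ['Third']
print(item_ids)          # ['a1', 'a2', 'b1']
```

A full XPath processor accepts many more forms than these, so treat this as a taste of the notation rather than a tour of the language.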

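The primary syntactic construct in XPath is the expression. An expression is evaluated to yield an object, which has one of the following four basic types: node-set (an unordered collection of nodes without duplicates), boolean (true or false), number (a floating-point number), and string (a sequence of characters).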

The following discussion relates all the paths shown to the twoelements.xml example taken from Listing 15-1 in the XML Bible. (A much longer example is also available: allelements.xml.) We can use the Firefox DOM Inspector to view the tree, and Lantern to evaluate XPath expressions and see which nodes are selected.

The simplest XPath expression starts at the root of the document and identifies that node (the root node). Children of a node are matched by specifying their name or type.
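The sketch below shows child matching by name and by type on a made-up document (not twoelements.xml), again using Python's xml.etree.ElementTree. One caveat: ElementTree evaluates paths relative to an element, so the starting point here is the object returned by getroot(), which is the document element rather than XPath's root node proper.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for this sketch.
DOC = """<inventory>
  <part>bolt</part>
  <part>nut</part>
  <note>count weekly</note>
</inventory>"""

tree = ET.ElementTree(ET.fromstring(DOC))
top = tree.getroot()  # the document element, <inventory>

# Match children by name: only the <part> children are selected.
parts = [p.text for p in top.findall("part")]

# Match children by type: "*" selects every child element,
# whatever its name.
child_tags = [c.tag for c in top.findall("*")]

print(top.tag)      # inventory
print(parts)        # ['bolt', 'nut']
print(child_tags)   # ['part', 'part', 'note']
```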

See the W3C XPath Standard and the XML Bible, chapter 15, XPath Expressions for Selecting Nodes, for more information and details on constructing more elaborate XPath expressions. Other resources are listed on the index page for this topic.