From: Neva Cherniavsky (nchernia@cs.washington.edu)
Date: Mon Apr 19 2004 - 07:57:47 PDT

This paper explores the use of relational databases for querying XML
documents. XML is a common standard for representing data on the web. It
is self-describing: the tags describe the data they enclose. Because XML
data is semi-structured, one approach to extracting data from it has been
to use semi-structured query languages. However, this ignores all the work
that has been done on relational databases (and the solid theory that
backs them).

The authors use relational databases to query XML by relying on the DTD,
which describes the structure of the data. First, they process a DTD to
generate a relational schema. To do this, they simplify the DTD with
flattening, simplification, and grouping transformations. They then use
inlining to convert the simplified DTD to relations. To deal with
set-valued attributes and recursion, they create a DTD graph and show how
to traverse it to obtain the relations.
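
To make the inlining idea concrete, here is a rough sketch in Python. It is
not the paper's algorithm: the DTD encoding, the function names, and the
exact rules for when an element gets its own relation are my assumptions,
chosen only to illustrate how set-valued children and recursion force
separate tables while everything else is inlined into the parent.

```python
from collections import defaultdict

# Hypothetical simplified DTD: element -> list of (child, occurrence),
# where occurrence is "1", "?" or "*" (set-valued).
DTD = {
    "book": [("booktitle", "1"), ("author", "*")],
    "article": [("title", "1"), ("author", "*")],
    "author": [("name", "1"), ("address", "?")],
    "address": [("city", "1")],
    "booktitle": [], "title": [], "name": [], "city": [],
}

def on_cycle(dtd, start):
    """True if `start` can reach itself in the DTD graph (recursion)."""
    stack = [child for child, _ in dtd[start]]
    seen = set()
    while stack:
        elem = stack.pop()
        if elem == start:
            return True
        if elem not in seen:
            seen.add(elem)
            stack.extend(child for child, _ in dtd[elem])
    return False

def needs_relation(dtd):
    """Assumed rule: roots, shared elements, set-valued children and
    recursive elements get their own table; the rest are inlined."""
    indegree = defaultdict(int)
    starred = set()
    for children in dtd.values():
        for child, occurrence in children:
            indegree[child] += 1
            if occurrence == "*":
                starred.add(child)
    return {
        elem for elem in dtd
        if indegree[elem] != 1 or elem in starred or on_cycle(dtd, elem)
    }

def inlined_columns(dtd, elem, own, path):
    """Columns contributed by `elem` and its inlined descendants."""
    cols = [path] if not dtd[elem] else []  # leaves hold a text value
    for child, _ in dtd[elem]:
        if child in own:
            continue  # stored in its own table, linked through parentID
        cols.extend(inlined_columns(dtd, child, own, f"{path}_{child}"))
    return cols

def build_schema(dtd):
    own = needs_relation(dtd)
    return {
        elem: [f"{elem}ID", "parentID"] + inlined_columns(dtd, elem, own, elem)
        for elem in sorted(own)
    }

if __name__ == "__main__":
    for table, cols in build_schema(DTD).items():
        print(table, cols)
```

In this toy DTD, author ends up in its own table because it is set-valued
and shared by book and article, while name and address.city are inlined
into it as columns. parentID would simply be NULL for root elements.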

The authors then parse XML documents conforming to the DTDs and load them
as tuples into relational tables in a standard DBMS. Next, they translate
semi-structured queries over the XML documents into SQL queries over the
relational data. First, the relations corresponding to the start of the
root path expression are identified and added to the FROM clause of the
SQL query. Then, if necessary, the path expressions are translated to
joins among relations. A complication here is handling wildcard queries,
which specify "all reachable from any path". These queries must be
translated to a union of two SQL fragments within a least fixpoint
operator.
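
As an illustration of both translations, here is a small sketch using
Python's sqlite3 module. The tables, column names, and sample rows are made
up for the example, and SQLite's WITH RECURSIVE stands in for the least
fixpoint operator; the paper's actual target SQL may look different.

```python
import sqlite3

def make_db():
    """Build a tiny in-memory database with hypothetical inlined tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE book (bookID INTEGER PRIMARY KEY, book_booktitle TEXT);
        CREATE TABLE author (authorID INTEGER PRIMARY KEY,
                             parentID INTEGER, author_name TEXT);
        CREATE TABLE section (sectionID INTEGER PRIMARY KEY,
                              parentID INTEGER, section_title TEXT);
        INSERT INTO book VALUES (1, 'XML and Databases'), (2, 'Query Processing');
        INSERT INTO author VALUES (10, 1, 'Ann'), (11, 1, 'Bob'), (12, 2, 'Cy');
        INSERT INTO section VALUES (1, NULL, 'Intro'), (2, 1, 'Motivation'),
                                   (3, 2, 'Example'), (4, NULL, 'Related Work');
    """)
    return conn

def authors_of(conn, booktitle):
    """Path expression book.author.name: the root relation goes in FROM,
    and the book -> author step becomes a join on parentID."""
    rows = conn.execute("""
        SELECT a.author_name
        FROM book b JOIN author a ON a.parentID = b.bookID
        WHERE b.book_booktitle = ?
        ORDER BY a.author_name
    """, (booktitle,)).fetchall()
    return [r[0] for r in rows]

def descendant_sections(conn, root_id):
    """Wildcard-style query: every section reachable from `root_id` by any
    path. The recursive CTE is a union of a base fragment and a recursive
    fragment, evaluated to a fixpoint."""
    rows = conn.execute("""
        WITH RECURSIVE reach(sectionID, section_title) AS (
            SELECT sectionID, section_title FROM section WHERE parentID = ?
            UNION
            SELECT s.sectionID, s.section_title
            FROM section s JOIN reach r ON s.parentID = r.sectionID
        )
        SELECT section_title FROM reach ORDER BY sectionID
    """, (root_id,)).fetchall()
    return [r[0] for r in rows]

if __name__ == "__main__":
    conn = make_db()
    print(authors_of(conn, "XML and Databases"))
    print(descendant_sections(conn, 1))
```

The first query shows why simple path expressions are cheap (one join per
step that crosses a table boundary), while the second shows why wildcards
need recursion: the depth of the path is not known when the query is
written.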

Finally, they convert the results back to XML. Simple structuring is easy
and comes naturally from the relational database. Grouping results cannot
be done in the same way as in a DBMS (because it's not possible to choose
the appropriate item to group on), so combining XML with a DBMS means
losing some of the DBMS's helpful methods. Complex element construction is
difficult and inefficient, requiring either replicating outside the
database what the database does best, or an inefficient series of joins.
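
A small sketch of what "doing it outside the database" can look like, in
Python. The row layout and the function name rows_to_xml are hypothetical;
the point is only that the nesting (grouping authors under each book) is
rebuilt in application code from flat, sorted tuples.

```python
import xml.etree.ElementTree as ET
from itertools import groupby

def rows_to_xml(rows):
    """Turn flat (booktitle, author_name) tuples into nested XML.

    Assumes `rows` is already sorted by booktitle, e.g. by an ORDER BY in
    the SQL query, since groupby only merges adjacent equal keys.
    """
    root = ET.Element("results")
    for title, group in groupby(rows, key=lambda row: row[0]):
        book = ET.SubElement(root, "book")
        ET.SubElement(book, "booktitle").text = title
        for _, name in group:
            ET.SubElement(book, "author").text = name
    return ET.tostring(root, encoding="unicode")

if __name__ == "__main__":
    flat = [("XML and Databases", "Ann"),
            ("XML and Databases", "Bob"),
            ("Query Processing", "Cy")]
    print(rows_to_xml(flat))
```

Note that the grouping step here repeats work the DBMS could in principle
do itself, which is the inefficiency described above.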

Because of this last problem, the authors propose extensions to the
relational model for XML. These extensions would include support for
sets, untyped or variable-typed references, information retrieval style
indices, flexible comparison operators, more powerful recursion, and
multiple-query optimization and execution.

I thought this paper was quite well-written, and dealt with an important
current issue for DBMS. XML seems to be here to stay, even though it
isn't the best way of representing data for a database. I agree with the
authors that the best choice is to extend current DBMS technology to deal
with some of the issues they raise; databases are extremely good at
querying and retrieval, and have lots of theory backing them. Certainly
the authors raise important problems, but the solution should not be to
reinvent the wheel in the form of semi-structured database management
systems.