Review of Relational Databases for Querying XML Documents

From: Neva Cherniavsky (nchernia@cs.washington.edu)
Date: Mon Apr 19 2004 - 07:57:47 PDT

    This paper explores the use of relational databases for querying XML
    documents. XML is a common standard for representing data on the web.
    XML is self-describing, so the data within the tags is described by the
    tags themselves. It is semi-structured data, so one approach to
    extracting data from XML has been to use semi-structured query languages.
    However, this ignores all the work that has been done on relational
    databases (and the solid theory that backs them).

    The authors use relational databases to query XML by relying on the DTD,
    which describes the data. First, they process a DTD to generate a
    relational schema. To do this, they first simplify the DTD by applying
    flattening, simplification, and grouping transformations. They then use
    inlining to convert the simplified DTD to relations. To deal with
    set-valued attributes and recursion, they create a DTD graph and show how
    to traverse it to obtain the relations.
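
    As a rough illustration of the inlining idea, here is a minimal Python
    sketch. It is not the authors' algorithm: the DTD encoding, the example
    elements, and the function names are all hypothetical, and the rule used
    (an element gets its own relation if it is the root, is set-valued, is
    shared by several parents, or is recursive) is a simplification.

```python
from collections import defaultdict

# Hypothetical, already-simplified DTD: element -> [(child, occurrence)],
# where occurrence is "1" (one), "?" (optional) or "*" (set-valued).
DTD = {
    "book": [("title", "1"), ("author", "*")],
    "author": [("name", "1"), ("address", "?")],
    "title": [],
    "name": [],
    "address": [],
}


def find_recursive(dtd, root):
    """Return elements that are the target of a back edge in the DTD graph."""
    recursive, on_path, done = set(), [], set()

    def visit(elem):
        on_path.append(elem)
        for child, _ in dtd[elem]:
            if child in on_path:
                recursive.add(child)  # child reaches itself: recursion
            elif child not in done:
                visit(child)
        on_path.pop()
        done.add(elem)

    visit(root)
    return recursive


def _collect(dtd, own, elem, prefix, columns):
    """Inline every descendant of elem that does not get its own relation."""
    for child, _ in dtd[elem]:
        if child in own:
            continue  # stored separately and linked back through parentID
        name = f"{prefix}_{child}"
        if dtd[child]:
            _collect(dtd, own, child, name, columns)
        else:
            columns.append(name)


def inline_schema(dtd, root):
    """Return {relation: [columns]} under the simplified rule described above."""
    in_degree, starred = defaultdict(int), set()
    for children in dtd.values():
        for child, occurrence in children:
            in_degree[child] += 1
            if occurrence == "*":
                starred.add(child)
    shared = {elem for elem, degree in in_degree.items() if degree > 1}
    own = {root} | starred | shared | find_recursive(dtd, root)

    schema = {}
    for elem in sorted(own):
        columns = [f"{elem}ID"]
        if elem != root:
            columns.append("parentID")
        if not dtd[elem]:
            columns.append(elem)  # a leaf relation stores its own text value
        _collect(dtd, own, elem, elem, columns)
        schema[elem] = columns
    return schema


if __name__ == "__main__":
    for relation, columns in inline_schema(DTD, "book").items():
        print(relation, columns)
```

    Here the set-valued author element becomes its own relation with a
    parentID column, while title, name, and address are inlined as columns.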

    The authors then parse XML documents conforming to the DTDs and load them
    as tuples into relational tables in a standard DBMS. Next, they translate
    semi-structured queries over XML documents into SQL queries over the
    relational data. First, the relations corresponding to the start of the
    root path expression are identified and added to the FROM clause of the
    SQL query. Then, if necessary, the path expressions are translated to
    joins among relations. A complication here is handling wildcard queries,
    which specify "all reachable from any path". These queries must be
    translated to a union of two SQL fragments within a least fixpoint
    operator.
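
    The two translation cases can be sketched with Python's built-in sqlite3
    module. This is only an illustration under assumptions: the tables,
    columns, and data are hypothetical, and SQLite's WITH RECURSIVE stands in
    for the least fixpoint operator (a base fragment unioned with a recursive
    fragment); the exact SQL in the paper may differ.

```python
import sqlite3


def build_db():
    """Create a small in-memory database with hypothetical shredded XML data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE book    (bookID INTEGER PRIMARY KEY, title TEXT);
        CREATE TABLE author  (authorID INTEGER PRIMARY KEY, parentID INTEGER,
                              name TEXT);
        CREATE TABLE section (sectionID INTEGER PRIMARY KEY, parentID INTEGER,
                              heading TEXT);
    """)
    conn.executemany("INSERT INTO book VALUES (?, ?)",
                     [(1, "XML Basics"), (2, "SQL Primer")])
    conn.executemany("INSERT INTO author VALUES (?, ?, ?)",
                     [(1, 1, "Ann"), (2, 1, "Bob"), (3, 2, "Cy")])
    conn.executemany("INSERT INTO section VALUES (?, ?, ?)",
                     [(1, None, "Intro"), (2, 1, "Motivation"),
                      (3, 2, "History"), (4, None, "Appendix")])
    return conn


def authors_of(conn, title):
    """Path expression book.author.name: the path step becomes a join."""
    rows = conn.execute(
        """
        SELECT a.name
        FROM book b JOIN author a ON a.parentID = b.bookID
        WHERE b.title = ?
        ORDER BY a.authorID
        """,
        (title,),
    )
    return [name for (name,) in rows]


def sections_below(conn, section_id):
    """Wildcard-style query: every section reachable below section_id."""
    rows = conn.execute(
        """
        WITH RECURSIVE reachable(sectionID, heading) AS (
            -- base fragment: direct children of the starting section
            SELECT sectionID, heading FROM section WHERE parentID = ?
            UNION
            -- recursive fragment: children of anything already reached
            SELECT s.sectionID, s.heading
            FROM section s JOIN reachable r ON s.parentID = r.sectionID
        )
        SELECT heading FROM reachable ORDER BY sectionID
        """,
        (section_id,),
    )
    return [heading for (heading,) in rows]


if __name__ == "__main__":
    db = build_db()
    print(authors_of(db, "XML Basics"))
    print(sections_below(db, 1))
```

    The recursive query keeps applying the second fragment until no new rows
    appear, which is the fixpoint behaviour the wildcard translation needs.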

    Finally, they convert the results back to XML. Simple structuring is easy
    and comes naturally from the relational database. Grouping results cannot
    be done in the same way as in a DBMS (because it is not possible to choose
    the appropriate item to group on), so querying XML through a DBMS gives up
    some of the DBMS's helpful features. Complex element construction is
    difficult and inefficient: it requires either replicating outside the
    database what a database does best, or an inefficient series of joins.
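
    To make the contrast concrete, here is a small Python sketch using the
    standard library. The tag names and rows are hypothetical. Simple
    structuring wraps each result tuple in tags, while the grouped version has
    to sort and group the rows in application code, which is the kind of work
    a database would normally do itself.

```python
from itertools import groupby
from operator import itemgetter
import xml.etree.ElementTree as ET

# Hypothetical (title, author) tuples as returned by a SQL query.
ROWS = [("XML Basics", "Ann"), ("XML Basics", "Bob"), ("SQL Primer", "Cy")]


def simple_structuring(rows):
    """Wrap each tuple in tags: one <result> element per row."""
    root = ET.Element("results")
    for title, name in rows:
        item = ET.SubElement(root, "result")
        ET.SubElement(item, "title").text = title
        ET.SubElement(item, "author").text = name
    return ET.tostring(root, encoding="unicode")


def grouped_structuring(rows):
    """Nest authors under their book; the grouping happens outside the DBMS."""
    root = ET.Element("results")
    by_title = sorted(rows, key=itemgetter(0))  # stable sort keeps author order
    for title, group in groupby(by_title, key=itemgetter(0)):
        book = ET.SubElement(root, "book", title=title)
        for _, name in group:
            ET.SubElement(book, "author").text = name
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(simple_structuring(ROWS))
    print(grouped_structuring(ROWS))
```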

    Because of this last problem, the authors propose extensions to the
    relational model for XML. These extensions would include support for
    sets, untyped or variable-typed references, information retrieval style
    indices, flexible comparison operators, more powerful recursion, and
    multiple query optimization and execution.

    I thought this paper was quite well written and dealt with an important
    current issue for DBMS. XML seems to be here to stay, even though it
    isn't the best way of representing data for a database. I agree with the
    authors that the best choice is to extend current DBMS technology to deal
    with some of the issues they raise; databases are extremely good at
    querying and retrieval, and have lots of theory backing them. Certainly the
    authors raise important problems, but the solution should not be to
    reinvent the wheel in the form of semi-structured database management
    systems.
