review

From: Li Yan (lanti@u.washington.edu)
Date: Sun Apr 18 2004 - 13:38:09 PDT

  • Next message: Atri Rudra: "Review #2"

    Relational Database for Querying XML Documents: Limitations
    and Oppportunities

    This paper describes the approach of Relational database for
    queries on XML documents.

    Given the fact that XML has becoming the pervasive format of
    data exchange between heterogeneous systems, and the wide
    accessibility of World Wide Web, an efficient implementation
    of querying XML documents is in great need. There, as the
    authors pointed, two general ways, one is to design a query
    system based on the semi-structure nature of XML documents,
    which has resulted in some good systems, the other is to
    translate XML documents into a relational database and hence
    to take advantage of the well-developed technology on SQL
    query and RDBMS. The latter is very attractive, because we
    have a matured technology in treating relational database
    tables and a well developed and implemented query language,
    SQL. However, the major difference between XML and
    relational database, the semi-structured documents and the
    rigid flat table structured presents great challenge to come
    up with a system that interoperates between the two.

    The authors' presents a practical and carefully devised
    approach:
    1. Generate table schema from DTD
    2. Parse XML documents then load valid XML documents into tuples.
    3. Translate queries on XML documents to their SQL
    equivalent
    4. Translate the result of SQL query to form a valid XML
    document.
    All four steps were discussed in great details with
    quantitative statistics available to validate the Hybrid
    approach of generating table schema from DTD.

    In the first step, simplifying DTD is a preprocessing to
    make the sometimes complicated DTD to a simpler format more
    compatible with relational database. The Basic Inline
    technique is a natural solution. It creates one node each
    for each element and includes as many desendants as possible
    into the table. It solves the fragmentation problem but has
    a significant deficiency to make it impractical in real
    life, too many relations were created and repeated with this
    scheme. To work around this problem, the Shared Inlining
    technique was presented. It solves the repeating relations
    problem by noticing the relations get repeated are precisely
    those that has in degree more than 1 in the graph. Thus they
    create a table for each, and each element will be converted
    to one relation in this scheme. But a drawback of Shared
    technique is the consequently many joins when querying the
    relational database. The Hybrid technique was then
    introduced to include some redundancies while trying to
    avoid the excessive joins.

    A quantitative study using the number of SQL queries and the
    number of joins in each were used to compare the performance
    of the above techniques. Basic generates too many relations
    for even moderately complex XML documents and thus were not
    included in the comparison. Shared and Hybrid have different
    performance on different XML structures. However, the group
    division is a result after conversion and
    queries. Therefore, it might not be clear beforehand about
    which one to choose. If the performance of both are close,
    which is the most frequent case in the authors samples,
    Shared technique should be the favorable one since it has a
    smaller size and the implementation of Shared and
    Hybrid is very similar.

    Converting semi-structures queries to SQL is generally
    forward, starting at the root's table and then proceeds by
    translating queries into joins. Complication arises when
    arbitrary complex path expressions are allowed, which is the
    case in practice.

    The conversion of SQL results back to XML is again straight
    forward in the simple structuring cases where we simply tag
    the data. Again XML with complex structures needs special
    treatment.

    In conclusion, the authors summarized a few recommendations
    to DTD and relational database implmentation to make this conversion
    perform better, stronger support of various types other than
    string, better support of set values and recursion, multiple
    query execution and optimization, here we might employ the
    parallel algorithms to speed up execution and dynamic
    programming technique to avoid repeated computation.

    While translating XML queries into RDBMS tables and SQL
    queries incures some overhead and has difficulties when
    dealing with complex XML documents or queries, the
    well-developed RDBMS theory and implementation makes it
    worth the effort. The future of this approach will also
    heavily depends on the progress of its opponent, the
    implementation that based on the semi-structure data model
    and query themselves.


  • Next message: Atri Rudra: "Review #2"

    This archive was generated by hypermail 2.1.6 : Sun Apr 18 2004 - 13:38:11 PDT