From: Li Yan (lanti@u.washington.edu)
Date: Sun Apr 18 2004 - 13:38:09 PDT
Relational Database for Querying XML Documents: Limitations
and Oppportunities
This paper describes the approach of Relational database for
queries on XML documents.
Given the fact that XML has becoming the pervasive format of
data exchange between heterogeneous systems, and the wide
accessibility of World Wide Web, an efficient implementation
of querying XML documents is in great need. There, as the
authors pointed, two general ways, one is to design a query
system based on the semi-structure nature of XML documents,
which has resulted in some good systems, the other is to
translate XML documents into a relational database and hence
to take advantage of the well-developed technology on SQL
query and RDBMS. The latter is very attractive, because we
have a matured technology in treating relational database
tables and a well developed and implemented query language,
SQL. However, the major difference between XML and
relational database, the semi-structured documents and the
rigid flat table structured presents great challenge to come
up with a system that interoperates between the two.
The authors' presents a practical and carefully devised
approach:
1. Generate table schema from DTD
2. Parse XML documents then load valid XML documents into tuples.
3. Translate queries on XML documents to their SQL
equivalent
4. Translate the result of SQL query to form a valid XML
document.
All four steps were discussed in great details with
quantitative statistics available to validate the Hybrid
approach of generating table schema from DTD.
In the first step, simplifying DTD is a preprocessing to
make the sometimes complicated DTD to a simpler format more
compatible with relational database. The Basic Inline
technique is a natural solution. It creates one node each
for each element and includes as many desendants as possible
into the table. It solves the fragmentation problem but has
a significant deficiency to make it impractical in real
life, too many relations were created and repeated with this
scheme. To work around this problem, the Shared Inlining
technique was presented. It solves the repeating relations
problem by noticing the relations get repeated are precisely
those that has in degree more than 1 in the graph. Thus they
create a table for each, and each element will be converted
to one relation in this scheme. But a drawback of Shared
technique is the consequently many joins when querying the
relational database. The Hybrid technique was then
introduced to include some redundancies while trying to
avoid the excessive joins.
A quantitative study using the number of SQL queries and the
number of joins in each were used to compare the performance
of the above techniques. Basic generates too many relations
for even moderately complex XML documents and thus were not
included in the comparison. Shared and Hybrid have different
performance on different XML structures. However, the group
division is a result after conversion and
queries. Therefore, it might not be clear beforehand about
which one to choose. If the performance of both are close,
which is the most frequent case in the authors samples,
Shared technique should be the favorable one since it has a
smaller size and the implementation of Shared and
Hybrid is very similar.
Converting semi-structures queries to SQL is generally
forward, starting at the root's table and then proceeds by
translating queries into joins. Complication arises when
arbitrary complex path expressions are allowed, which is the
case in practice.
The conversion of SQL results back to XML is again straight
forward in the simple structuring cases where we simply tag
the data. Again XML with complex structures needs special
treatment.
In conclusion, the authors summarized a few recommendations
to DTD and relational database implmentation to make this conversion
perform better, stronger support of various types other than
string, better support of set values and recursion, multiple
query execution and optimization, here we might employ the
parallel algorithms to speed up execution and dynamic
programming technique to avoid repeated computation.
While translating XML queries into RDBMS tables and SQL
queries incures some overhead and has difficulties when
dealing with complex XML documents or queries, the
well-developed RDBMS theory and implementation makes it
worth the effort. The future of this approach will also
heavily depends on the progress of its opponent, the
implementation that based on the semi-structure data model
and query themselves.
This archive was generated by hypermail 2.1.6 : Sun Apr 18 2004 - 13:38:11 PDT