From: Alexander Moshchuk (anm@cs.washington.edu)
Date: Wed May 05 2004 - 10:43:48 PDT
The paper describes a way to use DFAs to evaluate a large number of
XPath queries on an XML stream. This is basically a way to
progressively process a large amount of XML data before the full
document is received; as an example, this is useful if the data is
sent to multiple consumers who subscribe to certain parts of the data,
and thus the data must be filtered in real time.
The basic approach is to convert all of the XPath expressions into a
single DFA and then run through the DFA based on the input XML stream.
Processing with a DFA is simple and efficient - all that's required is
a simple statically-allocated stack, and each new input is processed
in O(1) time. The major question to be addressed is keeping the DFA
size manageable, as the number of states could easily blow up
exponentially with the number of XPath expressions. The key idea here
was to use a lazily-constructed DFA - the DFA states would be
constructed dynamically at run-time based on transitions into missing
states. The paper theoretically shows that in this case, the
worst-case number of states doesn't depend on the number of XPath
expressions; it only depends on their depth, which is usually not big.
In practice, the number of states is also lower than the theoretical
upper bound, and overall lazy DFAs presents tremendous space savings
for typical sets of XPath expressions.
The paper provides some performance evaluation, which verifies that a
lazy DFA obtains a constant throughput for varying numbers of XPath
expressions. The throughput is up to 10,000 times better than
existing systems on big XML datasets. It would have been nice to see
some more numbers here. For example, the publish-subscribe scenario
briefly described in the beginning of the paper seems like a
particularly interesting application that could be implemented and
evaluated, comparing this paper's method with others. Overall,
though, I was impressed by the results that were provided.
This archive was generated by hypermail 2.1.6 : Wed May 05 2004 - 10:43:54 PDT