From: Bhushan Mandhani (bhushan@cs.washington.edu)
Date: Wed May 05 2004 - 11:22:33 PDT
Processing XML Streams with Deterministic Automata
The paper proposes a solution to the problem of applying a large number of
XPath expressions to a XML stream. A single DFA is constructed, and then
applied to the XML stream. The main issue here is, this was not thought
possible because of an exponential blowup in the number of DFA states with
the number of expressions.
The paper starts out by showing upper bounds for the number of states. For
the eager DFA, the bound is exponential in the number of expressions
containing the descendant axis, and thus eager DFA's are ruled out as
unfeasible. For the lazy DFA's, bounds are separately given for the cases
when the data is characterized by a simple graph schema or a data guide.
These bounds convincingly establish the feasibility of using lazy DFA's.
The best thing about these bounds is that they are independent of the
number of XPath expressions, and instead depend on parameters which
characterize the data itself.
This is then backed up by solid experiments, which verify that the number
of states generated in a lazy DFA is very manageable, for a variety of
synthetic and real datasets. The number of states is particularly small
for real datasets. Further, the data throughput is found to be independent
of the number of XPath expressions, which is again remarkable. The bounds
above hinted this might be the case, and the experiments verified this.
Further, the throughput is orders of magnitude better than that of the
other method, XFilter.
Overall, the paper has interesting theoretical results which establish the
feasibility of using lazy DFA's for efficiently processing XML streams,
and this is convincingly established by the experiments.
This archive was generated by hypermail 2.1.6 : Wed May 05 2004 - 11:22:34 PDT