From: Atri Rudra (atri@cs.washington.edu)
Date: Wed May 05 2004 - 11:47:25 PDT
Processing XML Streams with Deterministic Automata
--------------------------------------------------
Todd J. Green et al.
The authors consider the problem of evaluating a large number of XPath
expressions on an XML stream. The basic idea is to represent all the
XPath expressions as a single deterministic finite automata (DFA).
The basic procedure is to create an NFA from the queries and then
to convert it to a DFA. Storing the whole DFA turns out to be infeasible
because of the large number of states. The authors propose doing the NFA
to DFA conversion on the fly: this has the advantage that only the states
required to be expanded are created. The authors provided theoretical
bounds on the number of states in the eager DFA
(the bound is exponential and the authors also show a matching lower bound)
and the lazy DFA (the bound depends on the maximum depth of any
XPath query and not on the total number of such queries). The authors
also provide experimental validation of their results.
Here are my questions/comments:
(*) The authors do not consider filters/predicates in the XPath queries
and mention they would slow down the throughput: are there studies
done to characterize this slowdown ?
(*) The whole DFA model assumes that the XML documents are well formed
(that is atleast the tags match). Are there ways to check this
assumption in a streaming model ?
(*) The authors also only consider query sets: can general queries be
stated equivalently in terms of query sets. If not how would the
results presented in this paper change ?
(*) It seems to me that the whole eager vs lazy DFA things is like a
space-time tradeoff (the eager DFA would take more space
but should be faster as it has everything precomputed while
the lazy DFA uses up lesser space but is slower as it has to
do the computations on the fly). Is there any merit to some
hybrid scheme: for example expand certain portions of the DFA
and expand the rest of it on the fly: maybe use some sort of
caching ?
This archive was generated by hypermail 2.1.6 : Wed May 05 2004 - 11:47:26 PDT