From: Steven Balensiefer (alaska@cs.washington.edu)
Date: Wed May 05 2004 - 10:12:30 PDT
Processing XML Streams with Deterministic Automata
There are two main ideas in this paper. The first is that XPath queries
over XML can be evaluated with a DFA. This idea makes intuitive sense,
since matching a query against a stream of tags is similar to string
matching. However, traditional approaches to DFA construction can produce
DFAs of exponential size.
At that point, the authors introduce the idea of the lazy DFA, which is
like a standard DFA, except that states are generated "on demand" when
they are first referenced. Computing only the states for transitions that
actually appear in the data naturally limits the size of the DFA. Consider
a DTD where an element could contain any of five sub-elements: in an eager
DFA all five transitions might require their own states, even if four of
them never occur in the input.
The authors use graph schemas (nodes are tags, edges are inclusion
relationships) to prove that the maximum size of a lazy DFA depends only
on the schema and the depth of the XPath expressions. As they say, this is
truly a surprising result! It means that a given lazy DFA has an
essentially constant size regardless of the number of XPath expressions.
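As a rough illustration of the string-matching analogy (my own sketch in
Python, not the authors' code), a linear XPath expression can be compiled
into a small NFA: each location step becomes a state, and a '//' step adds
a wildcard self-loop. The xpath_to_nfa helper and the '*' wildcard
convention below are my own assumptions.

    # Sketch: compile a linear XPath such as "/book//title" into an NFA.
    # Transitions map (state, tag) -> set of next states; '*' is my own
    # convention for "any tag", used to encode the descendant ('//') axis.
    def xpath_to_nfa(xpath):
        transitions = {}
        state = 0
        steps = xpath.split('/')   # '//' shows up as an empty step
        for step in steps[1:]:     # skip the empty step from the leading '/'
            if step == '':         # descendant axis: loop on any tag
                transitions.setdefault((state, '*'), set()).add(state)
            else:                  # child axis: advance one state on this tag
                transitions.setdefault((state, step), set()).add(state + 1)
                state += 1
        return transitions, state  # the final state is the accepting state

    nfa, accept = xpath_to_nfa("/book//title")
    # nfa == {(0, 'book'): {1}, (1, '*'): {1}, (1, 'title'): {2}}, accept == 2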
They go on to discuss details of implementing the lazy DFA, such as the
need to maintain "NFA tables" at every state so that new states can be
computed when a new transition is attempted. They also mention that
allowing "text()" or other predicates on constant values in XPath
complicates the process.
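To make the "NFA tables" point concrete, here is a hedged sketch (again my
own Python with made-up names, not the paper's implementation): each DFA
state is identified by the set of NFA states it represents, and a
transition is computed by subset construction only the first time that
(state, tag) pair is seen on the stream, then cached.

    # Sketch of a lazy DFA: DFA states are frozensets of NFA states (the
    # "NFA table" kept per state); transitions are filled in lazily, only
    # when a (dfa_state, tag) pair actually appears in the XML stream.
    class LazyDFA:
        def __init__(self, nfa_transitions, accept_state):
            self.nfa = nfa_transitions       # (nfa_state, tag) -> set of states
            self.accept = accept_state
            self.start = frozenset({0})
            self.delta = {}                  # cache: (dfa_state, tag) -> dfa_state

        def step(self, dfa_state, tag):
            key = (dfa_state, tag)
            if key not in self.delta:        # first time: subset construction
                nxt = set()
                for s in dfa_state:
                    nxt |= self.nfa.get((s, tag), set())
                    nxt |= self.nfa.get((s, '*'), set())
                self.delta[key] = frozenset(nxt)
            return self.delta[key]

        def run(self, events):
            # events: ('start', tag) / ('end', tag) pairs from a SAX-style parser
            stack = [self.start]             # one DFA state per open element
            for kind, tag in events:
                if kind == 'start':
                    state = self.step(stack[-1], tag)
                    if self.accept in state:
                        print("match at <%s>" % tag)
                    stack.append(state)
                else:
                    stack.pop()

    # Usage, with the NFA for "/book//title" from the sketch above:
    nfa = {(0, 'book'): {1}, (1, '*'): {1}, (1, 'title'): {2}}
    LazyDFA(nfa, accept_state=2).run(
        [('start', 'book'), ('start', 'chapter'), ('start', 'title'),
         ('end', 'title'), ('end', 'chapter'), ('end', 'book')])

Only the DFA states actually reached by the document ever get materialized,
which is exactly what keeps the lazy DFA small.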
I was very impressed by this paper, both because of the 10^4 speedups and
the accessibility of the writing. I don't claim to have much intuition
about the details of the proofs, but the ramifications were still well
explained. I think the most appealing part of this idea is that the entire
construction occurs online as the stream is processed, so there is no
requirement to precompute the full (potentially exponential) DFA ahead of
time.