From: Charles Giefer (cgiefer@comcast.net)
Date: Wed May 05 2004 - 07:39:51 PDT
Processing XML Streams with Deterministic Automata
Green, Miklau, Onizuka, and Suciu
This paper contributes a significant improvement to the efficiency of XML
stream processing. It achieves this improvement through several clever
techniques and some general observations about the common case.
First, an arbitrary number of XPath expressions are reduced to a single
finite automaton. This has two benefits. One is that evaluation time no
longer depends on the number of expressions, but only on the size of the
automaton. The other is that the automaton can be navigated directly from
SAX events using a simple stack: the handler pushes the next state on each
start-element event and pops it on the matching end-element event.
Therefore, each event is processed in constant time, giving constant
throughput.
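To make the mechanism concrete, here is a minimal sketch (in Python, not
the authors' code) of a SAX handler driving a hand-built DFA for the
hypothetical filters //book/title and //book/author; the element names and
sample document are made up for illustration:

    import xml.sax

    # DFA states for the combined filters //book/title and //book/author:
    #   0 = no progress, 1 = just entered a <book>,
    #   2 = matched //book/title, 3 = matched //book/author
    ACCEPTING = {2: "//book/title", 3: "//book/author"}

    def step(state, name):
        # The real system precomputes transitions into per-state hash
        # tables; a small function keeps this sketch short.
        if name == "book":
            return 1
        if state == 1 and name == "title":
            return 2
        if state == 1 and name == "author":
            return 3
        return 0

    class DFAFilter(xml.sax.ContentHandler):
        def __init__(self):
            self.stack = [0]  # one DFA state per open element

        def startElement(self, name, attrs):
            nxt = step(self.stack[-1], name)  # one lookup: constant time
            self.stack.append(nxt)
            if nxt in ACCEPTING:
                print("matched", ACCEPTING[nxt])

        def endElement(self, name):
            self.stack.pop()  # constant-time backtrack at the close tag

    doc = "<lib><book><title>T</title><author>A</author></book></lib>"
    xml.sax.parseString(doc.encode("utf-8"), DFAFilter())

No matter how many filters are folded into the automaton, each event costs
one transition lookup and one stack operation.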
Second, many of the deterministic finite automaton (DFA) states are never
reached in the common case. Real data rarely has deeply recursive
structure (it's impractical to have //book/author/book/author/book), so
much of the flexibility allowed by the language goes unused. The paper
exploits this by constructing the DFA lazily: states and transitions are
generated on the fly, as the stream first needs them. This gives a lower
startup cost and, on average, far fewer states in memory. However, a
slightly higher cost is paid during execution for generating parts of the
DFA on demand.
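The lazy construction amounts to running the subset construction over the
underlying NFA one transition at a time, caching each result. A rough
sketch of the idea, again with hypothetical filters and a hand-coded NFA:

    WILDCARD = "*"

    # NFA transitions: state -> {label: set of next states}; the
    # self-loop on WILDCARD encodes the '//' descendant axis.
    NFA = {
        0: {WILDCARD: {0}, "book": {1}},
        1: {"title": {2}, "author": {3}},
        2: {}, 3: {},
    }
    NFA_ACCEPT = {2: "//book/title", 3: "//book/author"}

    START = frozenset({0})
    dfa_cache = {}  # (dfa_state, label) -> dfa_state, filled lazily

    def lazy_step(dstate, label):
        key = (dstate, label)
        if key not in dfa_cache:             # first time this edge is
            nxt = set()                      # needed: do the subset
            for n in dstate:                 # construction for it...
                nxt |= NFA[n].get(label, set())
                nxt |= NFA[n].get(WILDCARD, set())
            dfa_cache[key] = frozenset(nxt)  # ...then cache it forever
        return dfa_cache[key]

    # Drive it with a stack of DFA states, as in the eager version.
    stack = [START]
    for event, name in [("open", "book"), ("open", "title"),
                        ("close", "title"), ("close", "book")]:
        if event == "open":
            stack.append(lazy_step(stack[-1], name))
            for n in stack[-1]:
                if n in NFA_ACCEPT:
                    print("matched", NFA_ACCEPT[n])
        else:
            stack.pop()

    print(len(dfa_cache), "DFA transitions materialized")

Only the DFA states that the stream actually visits are ever materialized,
which is why shallow, non-recursive data keeps the automaton small.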
Those are the two major contributions of the paper. The goal was to take
something exponential in the worst case and optimize it so that it is
roughly linear or polynomial in the common case. Their bounds are still
exponential, but only in the depth of the data and the amount of recursion
in it, both of which are small in the common case. It would have been
beneficial for me to see more results in this paper. Since the paper is
mostly about a performance optimization, it would have been nice to see
more performance analysis over a wider range of datasets and queries. It
would also have been useful to get a better snapshot of how many resources
the process consumes (memory, processor, network) to see where the
bottlenecks are.
As stated in the paper, this technique can be used as a filter that serves
content to a variety of hosts from a single data source, which could have
many real-world uses. It would be interesting to see whether the
techniques here could be converted from a CPU/software design to a
hardware filter design. The filter could then be added to routers and
incorporated into the functionality of packet routing and filtering. This
may be an interesting aspect to explore.