From: CR (chrisre@cs.washington.edu)
Date: Wed May 05 2004 - 08:21:49 PDT
This paper gives analytical justification for the use of DFAs in
processing XPath expressions. It is clear that you would want DFAs to
process such expressions, since there is such a great mapping and they
are such a primitive unit of computation (which is usually fast). Prior
to this work, it seems that people believed the exponential blow-up of
going from an NFA to a DFA would kill the use of these in practice. The
authors recognized that only a small fraction of these exponentially
many paths could appear in practice.
This is a very nice insight for databases, while there are
exponentially potentially satisfying instances of a path expression -
you are dealing over a necessarily finite and fixed instance. This
instance will have some regularity, and exploiting only those features
which are actually found will provide huge savings. They were able to
get bounds on these through the use of data guides. These data guides
are not necessary in the runtime because they can be inferred during
execution.
The authors were also able to quantify where the exponential
blow-ups resulted and make convincing arguments that such queries are
unlikely in practice. Further they show that in practice their bounds
are not tight - which is nice from an implementation stand-point (huge
data-guides like treebank result in much smaller NFAs.
One thing that confused me is why the treebank run ran out of
memory. The authors mention the ability to prune unused states. This
immediately brings a caching like trade-off into the implementation.
Clearly the 44k representation of the DFA can fit in memory (there were
2Gs of memory) and it is a minuscule space loss to put in reference
counting to implement a bare-bones replacement policy. I would have
liked some more convincing evidence that this was just implementation
and not a problem of the algorithm - such as a quantification of what
paths caused the blow up.
This archive was generated by hypermail 2.1.6 : Wed May 05 2004 - 08:21:52 PDT