Lazy DFAs

From: CR (chrisre@cs.washington.edu)
Date: Wed May 05 2004 - 08:21:49 PDT

  • Next message: Neva Cherniavsky: "Processing XML Streams with Deterministic Automata"

        This paper gives analytical justification for the use of DFAs in
    processing XPath expressions. It is clear that you would want DFAs to
    process such expressions, since there is such a great mapping and they
    are such a primitive unit of computation (which is usually fast). Prior
    to this work, it seems that people believed the exponential blow-up of
    going from an NFA to a DFA would kill the use of these in practice. The
    authors recognized that only a small fraction of these exponentially
    many paths could appear in practice.

        This is a very nice insight for databases, while there are
    exponentially potentially satisfying instances of a path expression -
    you are dealing over a necessarily finite and fixed instance. This
    instance will have some regularity, and exploiting only those features
    which are actually found will provide huge savings. They were able to
    get bounds on these through the use of data guides. These data guides
    are not necessary in the runtime because they can be inferred during
    execution.

        The authors were also able to quantify where the exponential
    blow-ups resulted and make convincing arguments that such queries are
    unlikely in practice. Further they show that in practice their bounds
    are not tight - which is nice from an implementation stand-point (huge
    data-guides like treebank result in much smaller NFAs.
       
        One thing that confused me is why the treebank run ran out of
    memory. The authors mention the ability to prune unused states. This
    immediately brings a caching like trade-off into the implementation.
    Clearly the 44k representation of the DFA can fit in memory (there were
    2Gs of memory) and it is a minuscule space loss to put in reference
    counting to implement a bare-bones replacement policy. I would have
    liked some more convincing evidence that this was just implementation
    and not a problem of the algorithm - such as a quantification of what
    paths caused the blow up.


  • Next message: Neva Cherniavsky: "Processing XML Streams with Deterministic Automata"

    This archive was generated by hypermail 2.1.6 : Wed May 05 2004 - 08:21:52 PDT