Lazy DFAs for XML processing

From: Steven Balensiefer (alaska@cs.washington.edu)
Date: Wed May 05 2004 - 10:12:30 PDT

  • Next message: Joe Xavier: "Review : Processing XML Streams with Deterministic Automata"

    Processing XML Streams with Deterministic Automata

     Clearly there were 2 main ideas in this paper. The first was that queries
    over XML can be implemented with a DFA. This idea makes intuitive sense,
    since querying XML is similar to string matching. However, the traditional
    approaches for DFA construction could result in DFAs of exponential size.

     At that point, the authors introduce the idea of the Lazy-DFA, which is
    like a standard DFA, except that states are generated "on-demand" when
    they are referenced. Computing only the states for transitions that
    actually appear will naturally limit the size of the DFA. Consider an DTD
    where an element could be one of 5 things. In an Eager DFA all 5
    transitions might require their own states, even if 4 were never used.

     The authors use graph schema (nodes are tags,edges inclusions) to prove
    that the maximum size for a lazy-dfa is dependent only on the depth of the
    Xpath expressions. As they say, this is truly a surprising result! This
    means that a given lazy-dfa will have a constant size regardless of the
    number of Xpath expressions.

     They go on to discuss details of implementing the Lazy-DFA, such as the
    need to maintain "NFA tables" at every state to allow new states to be
    computed when new transitions are attempted. They also mention that
    allowing "text()" or other constant functions in Xpath complicates the
    process.

     I was very impressed by this paper, both because of the 10^4 speedups and
    the accessibility of the description. I don't claim to have much intuition
    about the details in the proofs, but the ramifications were still well
    explained. I think the most appealing part of this idea is the fact that
    the entire process occurs online, so there is no requirement to know what
    queries you want to run ahead of time.


  • Next message: Joe Xavier: "Review : Processing XML Streams with Deterministic Automata"

    This archive was generated by hypermail 2.1.6 : Wed May 05 2004 - 10:12:31 PDT