Processing XML Streams with Deterministic Automata

From: Alexander Moshchuk (anm@cs.washington.edu)
Date: Wed May 05 2004 - 10:43:48 PDT

  • Next message: Bhushan Mandhani: "Paper 6 review"

    The paper describes a way to use DFAs to evaluate a large number of
    XPath queries on an XML stream. This is basically a way to
    progressively process a large amount of XML data before the full
    document is received; as an example, this is useful if the data is
    sent to multiple consumers who subscribe to certain parts of the data,
    and thus the data must be filtered in real time.

    The basic approach is to convert all of the XPath expressions into a
    single DFA and then run through the DFA based on the input XML stream.
    Processing with a DFA is simple and efficient - all that's required is
    a simple statically-allocated stack, and each new input is processed
    in O(1) time. The major question to be addressed is keeping the DFA
    size manageable, as the number of states could easily blow up
    exponentially with the number of XPath expressions. The key idea here
    was to use a lazily-constructed DFA - the DFA states would be
    constructed dynamically at run-time based on transitions into missing
    states. The paper theoretically shows that in this case, the
    worst-case number of states doesn't depend on the number of XPath
    expressions; it only depends on their depth, which is usually not big.
    In practice, the number of states is also lower than the theoretical
    upper bound, and overall lazy DFAs presents tremendous space savings
    for typical sets of XPath expressions.

    The paper provides some performance evaluation, which verifies that a
    lazy DFA obtains a constant throughput for varying numbers of XPath
    expressions. The throughput is up to 10,000 times better than
    existing systems on big XML datasets. It would have been nice to see
    some more numbers here. For example, the publish-subscribe scenario
    briefly described in the beginning of the paper seems like a
    particularly interesting application that could be implemented and
    evaluated, comparing this paper's method with others. Overall,
    though, I was impressed by the results that were provided.


  • Next message: Bhushan Mandhani: "Paper 6 review"

    This archive was generated by hypermail 2.1.6 : Wed May 05 2004 - 10:43:54 PDT