Processing XML Streams

From: Charles Giefer (cgiefer@comcast.net)
Date: Wed May 05 2004 - 07:39:51 PDT

  • Next message: L. Yan: "review"

    Processing XML Streams with Deterministic Automata
    Green, Miklau, Onizuka, and Suciu

    This paper contributes a significant improvement in the efficient processing
    of XML streams. It gets this improvement by using several clever techniques
    and making some general observations about the common case.

    First, an arbitrary number of XPath expressions are reduced to one finite
    automata. This allows for two benefits. One is that the time to evaluate
    the expressions is no longer dependant on the number of expressions but
    rather on the size of the finite automata. The other benefit is that the
    finite automatas can be easily navigated using SAX events and a simple
    stack. Therefore, events can be processed in constant time giving a fixed
    throughput.

    Second, many of the deterministic finite automata (DFA) states are not used
    in the common case. The real data that is represented doesn't really have
    infinite structures (it's impractical to have
    //book/author/book/author/book.). Therefore, much of the flexibility
    allowed by the language is not necessary. This paper takes this into
    account by doing lazy evaluation of the DFA states. Therefore, the DFA is
    generated on the fly. This leads to a lower startup cost and on average
    much fewer structures in memory. However, a little higher cost is paid
    during execution for the generation of parts of the DFA.

    Those are the two major contributions of the paper. The goal was to take
    something that was highly exponential and optimize it so that it is roughly
    linear/polynomial in the common case. Their results are still exponential,
    but with regard to the depth of the DFA and the recursion of the DFA, which
    are small in the common case. It would have been beneficial for me to see
    some more results in this paper. Since the paper is mostly about a
    performance optimization, it would have been nice to see a bit more
    performance analysis over more datasets and queries. Also, it would have
    been beneficial to get a better snapshot of how many resources the process
    takes up (memory, processor, network) to see where the bottlenecks are.

    As stated by the paper, this can be used as a filter to provide content to a
    variety of hosts from the same data source. This could have many uses in
    the real world. It would be interesting to see if the techniques here could
    be converted from a CPU/software design to a hardware filter design.
    Therefore, it could be added to routers and incorporated into the
    functionality of packet routing and filtering. This may be an interesting
    aspect to explore.


  • Next message: L. Yan: "review"

    This archive was generated by hypermail 2.1.6 : Wed May 05 2004 - 07:57:44 PDT