review

From: Alexander Moshchuk (anm@cs.washington.edu)
Date: Wed May 19 2004 - 11:48:50 PDT

  • Next message: Michael Gubanov: "QEval"

    This paper presents a survey of efficient algorithms and software
    architectures used to execute queries. The sections were pretty much
    self-contained, but also filled with references to the 20+ years of
    work that the paper tries to summarize. As we also saw in class, many
    techniques discussed here are fundamentally based on nested loops,
    sorting, and hashing, adapting these simple ideas to handle a wide
    array of problems while minimizing CPU and I/O cost. For some
    reason, I failed to see the beauty or elegance behind many techniques
    - to me they seemed like big hacks to make things work reasonably
    fast, and all the math for computing I/O costs seemed very messy.

    Aside from the sorting and hashing sections, some remaining sections
    that I found interesting were the discussion of the iterator model,
    Bloom filters, and complex query plans. Operators that support three
    simple iterator functions (open, next, close) simplify coordination
    of plan execution: communication between operators is cheap, the
    complexity of the overall execution plan is factored out when
    designing individual operators, the entire plan can execute in a
    single process, and intermediate items don't have to wait in
    temporary files or buffers.
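    The iterator model can be sketched in a few lines; the operator and
    helper names below are illustrative, not from the paper.

```python
class Scan:
    """Leaf operator: iterates over an in-memory list standing in for a file."""
    def __init__(self, rows):
        self.rows = rows

    def open(self):
        self.pos = 0

    def next(self):
        if self.pos >= len(self.rows):
            return None          # end of input
        row = self.rows[self.pos]
        self.pos += 1
        return row

    def close(self):
        self.pos = 0


class Select:
    """Filter operator: pulls rows from its child, passing only matches."""
    def __init__(self, child, predicate):
        self.child = child
        self.predicate = predicate

    def open(self):
        self.child.open()

    def next(self):
        # Demand-driven: each call pulls just enough from the child,
        # so no intermediate result is ever materialized.
        while (row := self.child.next()) is not None:
            if self.predicate(row):
                return row
        return None

    def close(self):
        self.child.close()


# The whole plan runs in one process; the root's next() drives everything.
plan = Select(Scan([1, 5, 12, 7, 20]), lambda r: r > 6)
plan.open()
out = []
while (row := plan.next()) is not None:
    out.append(row)
plan.close()
print(out)  # [12, 7, 20]
```

    Because each operator only ever sees the same three calls, new
    operators compose with existing plans without any central scheduler.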
    Bloom filters, one of the more elegant ideas to come out of hashing,
    can reduce communication effort in a distributed system executing a
    join. The first input is hashed into individual bits in a bit vector;
    the second input's items can only participate in the join if they hash
    to a bit that's set in this vector. The vector can be distributed to
    the sites scanning the second input, and a large fraction of probes
    can be eliminated before ever going to the network, saving bandwidth. In
    complex query execution, multiple operators execute concurrently and
    must share memory and disk bandwidth. The issues of optimal
    scheduling and resource allocation here are very similar to those in
    operating systems. Stop points allow switching between subplans when
    no intermediate information is in memory, just as an OS switches
    among processes, though there the OS does all the work of saving
    state. In another section, on disk accesses, it was interesting to
    learn that there are so many extensions to the basic B-tree
    structure besides B+-trees.
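    The Bloom-filter reduction described above can be sketched as
    follows; the hash scheme, vector size, and function names are
    illustrative assumptions, not taken from the paper. False positives
    are possible (a surviving tuple may still fail the real join), but
    false negatives are not, so no join result is lost.

```python
import hashlib

BITS = 64  # illustrative bit-vector size

def bit_positions(key, k=3):
    """Derive k bit positions for a key (simple illustrative scheme
    slicing a SHA-256 digest; real systems use cheaper hashes)."""
    digest = hashlib.sha256(str(key).encode()).digest()
    return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % BITS
            for i in range(k)]

def build_filter(keys):
    """First site: hash the first input's join keys into a bit vector."""
    vector = 0
    for key in keys:
        for bit in bit_positions(key):
            vector |= 1 << bit
    return vector

def might_join(vector, key):
    """Second site: a tuple can participate only if all its bits are set."""
    return all(vector >> bit & 1 for bit in bit_positions(key))

# Build the filter over the first input's keys and ship only the small vector.
vector = build_filter([10, 20, 30])
# The second input probes locally; filtered-out tuples never hit the network.
survivors = [k for k in [10, 15, 20, 99] if might_join(vector, k)]
```

    Here keys 10 and 20 are guaranteed to survive; 15 and 99 are dropped
    unless they happen to collide with set bits, which is exactly the
    trade-off that makes the vector so much cheaper to ship than the
    input itself.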

    Overall, this was a very thorough survey of query execution techniques
    with a straightforward presentation. There were even some performance
    comparisons given for the algorithms, which helped in understanding
    their relative merits. In the join section, the paper mentions that
    most commercial database systems at the time used only nested loops
    and merge-join, because one of the two often gets very close to the
    best performance. I wonder if this is still true today, and in
    general, what do today's high-end DBMSs implement as compared to the
    algorithms surveyed here?



    This archive was generated by hypermail 2.1.6 : Wed May 19 2004 - 11:48:24 PDT