From: Alexander Moshchuk (anm@cs.washington.edu)
Date: Wed May 19 2004 - 11:48:50 PDT
This paper presents a survey of efficient algorithms and software
architectures used to execute queries. The sections were pretty much
self-contained, but also filled with references to the 20+ years of
work that the paper tries to summarize. As we also saw in class, many
techniques discussed here are fundamentally based on nested loops,
sorting, and hashing, adapting these simple ideas to handle a wide
array of problems while minimizing the speed and I/O cost. For some
reason, I failed to see the beauty or elegance behind many techniques
- to me it seemed like big hacks to make things work reasonably fast,
and all the math with computing I/O cost seemed very messy.
Aside from the sorting and hashing sections, some remaining sections
that I found interesting were the discussion of the iterator model,
Bloom filters, and complex query plans. Operators that support three
simple iterator functions simplify coordination of plan execution: we
get cheap communication, factoring out complexity of overall execution
plan when designing operators, entire plan can execute in a single
process, and items don't have to wait in temporary files or buffers.
Bloom filters, one of the more elegant ideas to come out of hashing,
can reduce communication effort in a distributed system executing a
join. The first input is hashed into individual bits in a bit vector;
the second input's items can only participate in the join if they hash
to a bit that's set in this vector. This vector can be distributed to
the scanning sites of the input, and a large fraction of probes can be
eliminated before ever going to the network, saving bandwidth. In
complex query execution, multiple operators execute concurrently and
must share the memory and disk bandwidth. The issues of optimal
scheduling and resource allocation here are very similar to similar
issues in operating systems. Stop points allow switching between
subplans when no intermediate information is in memory, just like OSes
switch among processes, though there the OS will do all the work of
saving state. In another section on disk accesses, it was interesting
to learn that there were so many extensions to the basic structure
of B-trees besides the B+-trees.
Overall, this was a very thorough survey of query execution techniques
with a straightforward presentation. There were even some performance
comparisons given for the algorithms, which helped in understanding
their relative merits. In the join section, the paper mentions that
most of commercial database systems at the time used only nested loops
and merge-join, because one of the two often gets very close to the
best performance. I wonder if this is still true today, and in
general, what do today's high-end DBMSs implement as compared to the
algorithms surveyed here?
This archive was generated by hypermail 2.1.6 : Wed May 19 2004 - 11:48:24 PDT