From: Neva Cherniavsky (nchernia@cs.washington.edu)
Date: Tue May 18 2004 - 16:23:45 PDT
This paper surveys query evaluation techniques for large databases,
specifically exploring algorithms and architectures of database query
execution engines. In general, the paper looks at complex queries (those
that require a number of different algorithms to work together) on files
too large to fit into main memory.
The sections we cover are section 1, which explores the architecture of
query execution engines; section 2, which discusses sorting and hashing;
section 3, which quantifies the cost of disk accesses for large databases;
section 4, which describes methods for removing duplicates and
implementing aggregate functions; section 5, which discusses algorithms
for the join operation; section 8, which explores execution of very
complex plans with many operators; section 11.1, which is about nested
relations; and section 12.4, which is about bit vector filters.
Section 1 explores the physical algebra of a database query engine, as
opposed to the logical algebra of the database system. The authors note
that it is impossible to tell from reading a logical expression what the
execution time would be, without mapping it to a physical system. They
note that logical operations usually do not map directly to physical ones,
making this process tricky. One solution is to separate the functions for
the operators into prepare, produce, and cleanup, giving each operator an
iterator interface. This leads to quite an elegant implementation of a join
operator: the entire query plan can be executed within a single process,
items never have to wait in a temporary buffer, and each operator is
unaffected by the complexity of the whole plan.
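
To make this concrete for myself, here is a rough Python sketch of such an
iterator-style operator interface. The class and method names (Scan, Select,
prepare/produce/cleanup) are my own, not from the paper, and I used a simple
filter operator instead of a join to keep it short:

    class Scan:
        """Leaf operator: produces rows from an in-memory list standing in
        for a file scan."""
        def __init__(self, rows):
            self.rows = rows
        def prepare(self):
            self.pos = 0
        def produce(self):
            if self.pos < len(self.rows):
                row = self.rows[self.pos]
                self.pos += 1
                return row
            return None                     # end of input
        def cleanup(self):
            pass

    class Select:
        """Filter operator: pulls from its child one tuple at a time, so no
        intermediate result is ever buffered."""
        def __init__(self, child, predicate):
            self.child, self.predicate = child, predicate
        def prepare(self):
            self.child.prepare()
        def produce(self):
            while True:
                row = self.child.produce()
                if row is None or self.predicate(row):
                    return row
        def cleanup(self):
            self.child.cleanup()

    # The whole plan runs in one process by repeatedly calling produce()
    # on the root operator.
    plan = Select(Scan([1, 5, 12, 7]), lambda x: x > 4)
    plan.prepare()
    while (row := plan.produce()) is not None:
        print(row)
    plan.cleanup()

Because each produce() call pulls exactly one tuple from its child, nothing
is materialized between operators, which I take to be the point the authors
are emphasizing.
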
Section 2 details sorting and hashing algorithms in database systems.
Sorting is quite common, and external merge sort is the approach used. The authors
note that sorting should be implemented the same way as above, i.e. with
prepare, produce, and cleanup procedures. We went over much of what they
discuss here in class. The authors describe a nice algorithm for producing
runs for merge sort that uses a priority heap (replacement selection). Keys
are read in and stored in the heap; the smallest is written out to the
current run and its slot is immediately refilled by a newly read key, so
memory stays fully utilized. When a newly read key is smaller than the last
key written to the current run, it cannot extend that run and is marked for
the next one; when every key in memory is marked for the next run, the
current run ends and the next run starts. Though I found this algorithm
quite appealing, apparently there is a trade-off: it produces fewer, longer
runs, but at the cost of a different I/O pattern and more complex memory
management.
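
To convince myself I understood the run generation, here is a small Python
sketch of replacement selection. The function name and the way I tag keys
with a run number are my own choices, not taken from the paper:

    import heapq

    def replacement_selection(keys, memory_size):
        """Run generation with a priority heap. Heap entries are
        (run_number, key); a key smaller than the last key written to the
        current run is tagged for the next run."""
        it = iter(keys)
        heap = [(0, k) for _, k in zip(range(memory_size), it)]
        heapq.heapify(heap)
        runs, current_run, run_no = [], [], 0
        while heap:
            r, k = heapq.heappop(heap)
            if r != run_no:              # every in-memory key belongs to the next run
                runs.append(current_run)
                current_run, run_no = [], r
            current_run.append(k)        # "write" k to the current run
            nxt = next(it, None)
            if nxt is not None:
                # refill the freed slot; tag the new key for the next run
                # if it cannot extend the current one
                heapq.heappush(heap, (run_no if nxt >= k else run_no + 1, nxt))
        runs.append(current_run)
        return runs

    # Example: with memory for 3 keys, two sorted runs come out.
    print(replacement_selection([9, 3, 7, 1, 8, 2, 6, 5, 4], 3))
    # -> [[3, 7, 8, 9], [1, 2, 4, 5, 6]]

On random input this produces runs averaging about twice the memory size,
which is where the "fewer runs" benefit comes from.
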
Hashing seems to me to be a very good way of accomplishing matching tasks
in a database management system. The authors note that hashing is easy
when the entire hash table fits into memory; otherwise, you must deal with
hash table overflow. The two methods they describe for this are avoidance
and resolution, though it seems that hybrid-hash algorithms work the best.
These use all available memory for in-memory processing but are also able
to process large input files. According to the authors, a major problem
with hash-based algorithms is that they require a good hash function; all
the benefits of hashing diminish considerably when the hash value
distribution is skewed.
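
As a sanity check on the overflow-avoidance idea, here is a toy Python
sketch that recursively hash-partitions an input until each partition fits a
memory budget; the memory limit, fanout, and depth cap are all my own
illustrative choices, not numbers from the paper:

    import collections

    def partition_until_fits(rows, key, memory_limit, fanout=4, depth=0):
        # Base case: this chunk fits the (pretend) memory budget and could be
        # handled with an ordinary in-memory hash table.  The depth cap guards
        # against duplicate-heavy inputs whose buckets never shrink.
        if len(rows) <= memory_limit or depth > 10:
            return [rows]
        # Partition on a depth-dependent hash so repartitioning at the next
        # level actually separates the rows differently.
        buckets = collections.defaultdict(list)
        for row in rows:
            buckets[hash((depth, key(row))) % fanout].append(row)
        partitions = []
        for bucket in buckets.values():
            partitions.extend(
                partition_until_fits(bucket, key, memory_limit, fanout, depth + 1))
        return partitions

The skew problem shows up directly here: if one key value dominates the
input, no amount of repartitioning will shrink its bucket, hence the depth
cap.
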
Section 3 discusses disk access in database systems, and observes that
read-ahead is a good way for a database to increase efficiency; this
requires contiguous file allocation, which is not supported by UNIX. I
was interested to read that some database designers on UNIX have
essentially implemented their own file system so that they can read and
write contiguous files, and I wondered whether Postgres does this. The
authors then
describe data structures, such as the B-trees which we've gone over in
class. They also talk about index-only scans and note when they can be
more efficient. I thought the newer systems that implement dynamic index
scans (i.e., deciding at run time which indices to scan) sounded neat
(though they must be quite complex).
Section 4 describes aggregation and duplicate removal. These are closely
related and are discussed interchangeably. According to the authors, in
most commercial systems aggregation and duplicate removal algorithms are
based on sorting, using temporary files for output. The authors note that
with today's larger main memories, iterators could instead be used to
exploit the available memory. I wondered whether any commercial system has
switched to iterators since this survey; it does seem rather inefficient to
continue using temporary files for this task.
The authors go on to describe nested loops, removing duplicates as early
as possible, and hashing. I thought early duplicate removal was a nice
improvement, and easy to implement via either sorting or hashing. From the
graphs, it looks
like hybrid hashing is the best way to do aggregation, though of course it
is highly dependent on how good the hash function is.
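
For reference, the in-memory core of hash-based aggregation is tiny. This
Python sketch (with made-up example data) does a SUM ... GROUP BY; duplicate
removal is the same loop with no value accumulated:

    import collections

    def hash_aggregate(rows, group_key, value, agg=sum):
        # Build a hash table keyed on the grouping attribute, collecting the
        # values to aggregate.  Assumes the groups fit in memory; otherwise
        # the input would be partitioned first, as in hybrid hashing.
        groups = collections.defaultdict(list)
        for row in rows:
            groups[group_key(row)].append(value(row))
        return {k: agg(v) for k, v in groups.items()}

    # Example: total sales per customer (made-up data).
    sales = [("alice", 10), ("bob", 5), ("alice", 7)]
    print(hash_aggregate(sales, group_key=lambda r: r[0], value=lambda r: r[1]))
    # -> {'alice': 17, 'bob': 5}
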
Joins are discussed in the next section. There are many techniques,
though the authors note that newer ones sometimes fail the test posed by the
question, "Does this new technique apply to joining three inputs without
interrupting data flow between the join operations?" I would like to know
more about this question; is it widely considered an important property of
join techniques? What kinds of techniques fail this test that have been
seriously examined by the database community?
Nested loop join, merge-join, and hash-join algorithms are described; we
talked about these in class. The heap-filter merge join sounds like a
good way to save I/O operations over merge join while being more efficient
than nested-loops join. Again, it was emphasized that skew in the hash
function messes things up for hash join.
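
Since the hash-join family comes up repeatedly, here is a minimal in-memory
hash join in Python for my own reference. The relation names and keys are
made up; the survey's hybrid hash join adds partitioning and spilling on top
of this basic build-and-probe pattern:

    import collections

    def hash_join(build_rows, probe_rows, build_key, probe_key):
        """Build a hash table on the (smaller) build input, then probe it
        with the other input.  Assumes the build side fits in memory."""
        table = collections.defaultdict(list)
        for b in build_rows:                         # build phase
            table[build_key(b)].append(b)
        for p in probe_rows:                         # probe phase
            for b in table.get(probe_key(p), []):
                yield (b, p)

    # Example: join departments (build) with employees (probe) on dept id.
    depts = [(1, "Eng"), (2, "Sales")]
    emps  = [("ann", 1), ("bo", 2), ("cy", 1)]
    print(list(hash_join(depts, emps, lambda d: d[0], lambda e: e[1])))
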
Section 8 discusses execution of complex query plans. Optimal scheduling
of multiple operators and the allocation of resources are the main issues.
One way to do this is through the use of "stop points", which allow the
engine to switch between subplans. It is also important to let the engine
decide if it must force a stop point due to overflow. Proportional memory
division is suggested as the best way to handle multiple operators being
active concurrently. For recursion, it is suggested that the algorithms
proceed in phases, with one phase completed before another begins. It is
noted that disk bandwidth and disk-arm contention for seeks are important
issues that must be dealt with; I wondered whether these issues have been
discussed by the database community beyond this survey. The
authors describe several optimizations that should be made by database
designers, but note that the optimizations are complex and will require
further research.
Nested relations are discussed in section 11.1. I thought nested
relations were quite a natural construct for the database system, though I
can see how it adds complexity to allow this feature in a system.
However, I also thought this was a big difference between XML and
relational DBMSs; so if a DBMS can support nested relations, it would seem
that the answer would be to use XML as a data transfer language but keep the
data on disk in an RDBMS. Joins over nested relations seem to be the biggest
complication, and one that is not really solved yet (though they do
discuss a system that uses parallel partitioned nested-hashed loops that
sounds promising).
Finally, in section 12.4, bit vector filters are discussed. These filters
are used to decide when a tuple is important to the operation and when it
can be discarded. A hash filter determines whether a tuple will be examined;
some tuples that don't actually need to be looked at still are (false
positives), but no tuple that should have been looked at is ever missed. In
this way, a large fraction of the probe tuples can be discarded before
incurring network costs.
Also, during recursive partitioning, filters can be used at increasingly
small granularities to remove items from the probe input that do not
contribute to the join result. Bit vectors seem like a very elegant
solution to the low bandwidth problem (better than semi-joins) and I don't
really see any disadvantages, just that they may not always help very
much.
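
To see why there are no false negatives, here is a bare-bones Python sketch
of a bit vector filter; the vector size and the single hash function are
arbitrary choices of mine, not from the paper:

    class BitVectorFilter:
        def __init__(self, num_bits=1 << 16):
            self.num_bits = num_bits
            self.bits = bytearray(num_bits // 8)

        def _pos(self, key):
            h = hash(key) % self.num_bits
            return h // 8, 1 << (h % 8)

        def add(self, key):
            # Build side: set the bit for each join key it contains.
            byte, mask = self._pos(key)
            self.bits[byte] |= mask

        def might_contain(self, key):
            # Probe side: an unset bit means the key definitely has no match;
            # a set bit only means "might match".
            byte, mask = self._pos(key)
            return bool(self.bits[byte] & mask)

    # Build-side keys set bits; probe tuples are filtered before the real join.
    bv = BitVectorFilter()
    for key in [3, 17, 42]:
        bv.add(key)
    probe = [1, 3, 99, 42]
    print([k for k in probe if bv.might_contain(k)])
    # False positives are possible, but 3 and 42 are never dropped.

Real systems would presumably size the vector from the build input's
cardinality and possibly use several hash functions, but that does not
change the no-false-negative property.
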
Long paper = long review...