From: Neva Cherniavsky (nchernia@cs.washington.edu)
Date: Mon May 24 2004 - 09:16:33 PDT
This paper is a survey of query optimization techniques. First, the
contributions of System R from IBM is discussed (I looked at this in the
last review). The author then describes the search space for
optimization. He briefly discusses bushy trees and deferring Cartesian
products, which he then says are computationally expensive and hard to
quantify. I wondered if this is still true today; for instance, it seems
like bushy trees still receive a lot of attention. He then discusses
outerjoins and group-bys. There doesn't seem to be a lot of interesting
material here.
The next area of discussion is about "flattening" queries, which will
obviously speed things up by not creating temporary tables that need to be
written and read (in the usual case that they are larger than main
memory). This seems to be an important area of research, but also a very
complex one; there are lots of cases when it is simply too difficult to
flatten the query. It seems like it would be hard to automate this
process, since there are lots of special cases (DISTINCT keyword,
correlated queries, aggregates). A solution is the use of partial views;
it seems from the paper that this is a very recent research area, and
sounds promising (though there is a tradeoff between the cost of computing
the views and the use of the views to reduce computation).
The author then describes different ways of determining the cost of
operations. This is very important to database optimizers, because an
estimation of the cost is what determines which actions take place and in
what order. Different statistical methods are described and there is a
brief description of the overall cost computation.
Next, two specific enumeration architectures for extensible optimizers are
discussed, Starburst and Volcano/Cascades. My favoroite was
Volcano/Cascades, because it used algebraic expressions and the top-down
dynamic programming technique sounded neat. However, I would have liked
more detail about both of these systems (if they were to be discussed at
all, more information would have been better).
Lastly, the author discussed some interesting new directions of database
optimizer research, including parallel databases, user-defined function,
and materialized views. Parallelization seems especially useful for
databases, and I'm curious if they've become widespread commercially. I'd
also like to see a study of the quantifiable advantages of parallel vs.
single processor databases - do the advantages of reduced response time
override the disadvantages of processor communication?
This archive was generated by hypermail 2.1.6 : Mon May 24 2004 - 09:16:33 PDT