Overview of query optimization

From: Alexander Moshchuk (anm@cs.washington.edu)
Date: Mon May 24 2004 - 10:35:29 PDT

  • Next message: Michael Gubanov: "Qpt"

    The paper surveys work on query optimization in relational databases
    and outlines directions for future work. Most of the ideas are only
    introduced, without much detail, which makes the paper fairly easy to
    read.

    The paper reviews the System R optimizer in a more accessible fashion
    and reaffirms its significance. It then focuses on three critical
    components of the search for efficient executions: the search space of
    execution plans, the cost estimation technique used to determine which
    plans are cheapest, and the enumeration algorithm that searches
    through that space.

    The search space depends on the set of transformations that preserve
    the equivalence of the original query tree. Among these are
    transforming outerjoins into joins, pushing group-by down or
    performing it in stages, and unfolding a multi-block SQL query into a
    single-block one.

    Cost estimation techniques include the use of histograms to capture
    the data distribution within a column. Multi-column correlations are
    usually described only by summary information. In large databases,
    sampling can be used to estimate the required statistics. It is also
    important to propagate statistical information through operators.
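    To make the histogram idea concrete, here is a minimal sketch (my own
    illustration, not code from the paper) of an equi-width histogram and
    how an optimizer might use it to estimate the selectivity of a range
    predicate, assuming values are spread uniformly within each bucket:

    ```python
    def build_histogram(values, num_buckets=4):
        """Equi-width histogram: a list of [low, high, count] buckets."""
        lo, hi = min(values), max(values)
        width = (hi - lo) / num_buckets or 1
        buckets = [[lo + i * width, lo + (i + 1) * width, 0]
                   for i in range(num_buckets)]
        for v in values:
            i = min(int((v - lo) / width), num_buckets - 1)
            buckets[i][2] += 1
        return buckets

    def estimate_selectivity(buckets, bound):
        """Estimated fraction of rows with value < bound, assuming a
        uniform distribution inside each bucket."""
        total = sum(count for _, _, count in buckets)
        selected = 0.0
        for lo, hi, count in buckets:
            if hi <= bound:
                selected += count          # bucket fully below the bound
            elif lo < bound:
                selected += count * (bound - lo) / (hi - lo)  # partial
        return selected / total
    ```

    Real systems keep histograms in the catalog rather than building them
    per query, and use more refined variants (e.g. equi-depth), but the
    estimation step follows the same interpolation logic.
    
    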

    It's a good idea to make enumeration algorithms extensible: they need
    to adapt to changes in the search space, such as new operators or
    transformations. The paper describes two extensible optimizers,
    Starburst and Volcano/Cascades. The latter was interesting because it
    uses memoization in its dynamic programming-based search algorithm.
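    The memoization idea can be sketched as follows. This is a toy
    illustration of my own (the cost model and cardinalities are made up),
    not the actual Volcano/Cascades implementation, which memoizes groups
    of logically equivalent expressions rather than raw table sets:

    ```python
    from itertools import combinations

    CARD = {"A": 1000, "B": 100, "C": 10}   # hypothetical table sizes
    SEL = 0.01                              # flat join selectivity assumption

    def cardinality(tables):
        """Estimated output size: product of sizes, one selectivity per join."""
        card = 1.0
        for t in tables:
            card *= CARD[t]
        return card * SEL ** (len(tables) - 1)

    def best_plan(tables, memo=None):
        """Return (cost, plan) for joining a set of tables.
        The memo table ensures each subset is optimized only once."""
        if memo is None:
            memo = {}
        tables = frozenset(tables)
        if tables in memo:
            return memo[tables]
        if len(tables) == 1:
            result = (0.0, next(iter(tables)))   # base case: a single scan
        else:
            best = None
            for k in range(1, len(tables)):
                for left in map(frozenset, combinations(tables, k)):
                    right = tables - left
                    lc, lp = best_plan(left, memo)
                    rc, rp = best_plan(right, memo)
                    # cost = children's cost + size of this join's output
                    cost = lc + rc + cardinality(tables)
                    if best is None or cost < best[0]:
                        best = (cost, (lp, rp))
            result = best
        memo[tables] = result
        return result
    ```

    Without the memo dictionary the recursion would re-optimize the same
    subsets exponentially many times; with it, each of the 2^n subsets is
    solved once, which is the essence of the dynamic-programming search.
    
    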

    In some sense, the last section on advanced and open issues was the
    most interesting one. Distributed databases must factor communication
    costs into their cost models and deal with an expanded search space,
    since data can move between sites and remote execution sites must be
    chosen. Parallel databases aim to reduce the response time of queries
    by splitting the work among processors, though they must also account
    for the additional communication and scheduling decisions. Determining
    cost models for user-defined predicates remains a difficult problem,
    as does exploiting cached query results in the optimizer (though the
    latter seems especially useful in optimization). I wonder whether the
    state of these optimizations has changed today.

    Overall, the paper gives a nice overview of query optimization and is
    good as an introduction to the area.


    This archive was generated by hypermail 2.1.6 : Mon May 24 2004 - 10:35:35 PDT