Overview of query optimization

From: Alexander Moshchuk (anm@cs.washington.edu)
Date: Mon May 24 2004 - 10:35:29 PDT

  • Next message: Michael Gubanov: "Qpt"

    The paper surveys work on query optimization in relational databases
    and outlines directions for future work. Most of the ideas are only
    introduced, without much detail, which makes the paper fairly easy to
    read.

    The paper reviews the System R optimizer in a more accessible fashion
    and reaffirms its significance. It then focuses on three critical
    components of the search for efficient executions: the search space of
    execution plans, the cost estimation technique used to determine which
    plans are cheapest, and the enumeration algorithm that searches
    through that space.

    The search space depends on the set of transformations that preserve
    the equivalence of the original query tree. Among these are
    transforming outerjoins into joins, pushing group-by down or
    performing it in stages, and unfolding a multi-block SQL query into a
    single-block one.

    Cost estimation techniques include the use of histograms to capture
    the data distribution within a column. Multi-column correlations are
    usually described only by summary information. In large databases,
    sampling can be used to estimate the required statistics. It is also
    important to propagate statistical information through operators.
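    To make the histogram idea concrete, here is a minimal sketch (my own
    illustration, not code from the paper) of an equi-width histogram and
    how an optimizer might use it to estimate the selectivity of a range
    predicate, assuming values are spread uniformly within each bucket:

    ```python
    def build_histogram(values, num_buckets=4):
        """Equi-width histogram: a list of [low, high, count] buckets."""
        lo, hi = min(values), max(values)
        width = (hi - lo) / num_buckets or 1
        buckets = [[lo + i * width, lo + (i + 1) * width, 0]
                   for i in range(num_buckets)]
        for v in values:
            i = min(int((v - lo) / width), num_buckets - 1)
            buckets[i][2] += 1
        return buckets

    def estimate_selectivity(buckets, bound):
        """Estimated fraction of rows with value < bound, assuming a
        uniform distribution inside each bucket."""
        total = sum(count for _, _, count in buckets)
        selected = 0.0
        for lo, hi, count in buckets:
            if hi <= bound:
                selected += count          # bucket fully below the bound
            elif lo < bound:
                selected += count * (bound - lo) / (hi - lo)  # partial
        return selected / total
    ```

    Real systems keep histograms in the catalog rather than building them
    per query, and use more refined variants (e.g. equi-depth), but the
    estimation step follows the same interpolation logic.
    
    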

    It's a good idea to make enumeration algorithms extensible: they need
    to adapt to changes in the search space, such as new operators or
    transformations. The paper describes two extensible optimizers,
    Starburst and Volcano/Cascades. The latter was interesting because it
    uses memoization in its dynamic programming-based search algorithm.
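    The memoization idea can be sketched as follows. This is a toy
    illustration of my own (the cost model and cardinalities are made up),
    not the actual Volcano/Cascades implementation, which memoizes groups
    of logically equivalent expressions rather than raw table sets:

    ```python
    from itertools import combinations

    CARD = {"A": 1000, "B": 100, "C": 10}   # hypothetical table sizes
    SEL = 0.01                              # flat join selectivity assumption

    def cardinality(tables):
        """Estimated output size: product of sizes, one selectivity per join."""
        card = 1.0
        for t in tables:
            card *= CARD[t]
        return card * SEL ** (len(tables) - 1)

    def best_plan(tables, memo=None):
        """Return (cost, plan) for joining a set of tables.
        The memo table ensures each subset is optimized only once."""
        if memo is None:
            memo = {}
        tables = frozenset(tables)
        if tables in memo:
            return memo[tables]
        if len(tables) == 1:
            result = (0.0, next(iter(tables)))   # base case: a single scan
        else:
            best = None
            for k in range(1, len(tables)):
                for left in map(frozenset, combinations(tables, k)):
                    right = tables - left
                    lc, lp = best_plan(left, memo)
                    rc, rp = best_plan(right, memo)
                    # cost = children's cost + size of this join's output
                    cost = lc + rc + cardinality(tables)
                    if best is None or cost < best[0]:
                        best = (cost, (lp, rp))
            result = best
        memo[tables] = result
        return result
    ```

    Without the memo dictionary the recursion would re-optimize the same
    subsets exponentially many times; with it, each of the 2^n subsets is
    solved once, which is the essence of the dynamic-programming search.
    
    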

    In some sense, the last section on advanced and open issues was the
    most interesting one. Distributed databases must factor communication
    costs into their cost models and deal with an expanded search space,
    since data can move between sites and remote execution sites must be
    chosen. Parallel databases aim to reduce the response time of queries
    by splitting the work among processors, though they must also account
    for the additional communication and scheduling decisions. Determining
    cost models for user-defined predicates remains a difficult problem,
    as does exploiting cached query results in the optimizer (though the
    latter seems especially useful in optimization). I wonder whether the
    state of these optimizations has changed today.

    Overall, the paper gives a nice overview of query optimization and is
    good as an introduction to the area.


    This archive was generated by hypermail 2.1.6 : Mon May 24 2004 - 10:35:35 PDT