From: Alexander Moshchuk (anm@cs.washington.edu)
Date: Mon May 24 2004 - 10:28:10 PDT
The paper describes how System R implements query optimization. I
guess System R is to DBMSs what Unix is to OSes; its basic
architecture and many of its ideas, especially those concerning the
optimizer described here, can still be found in today's systems.
The paper focuses on choosing access paths (i.e. finding good plans)
for efficient query evaluation. This optimization phase comes after
parsing and is followed by the actual code generation and execution
steps. The authors note that optimization doesn't introduce
significant overhead compared to the other stages. The paper presents
several important optimization-related concepts, including the cost
model, the use of dynamic programming for join ordering, and
interesting orders.
The cost model in System R relies on several components, namely (1)
maintaining statistics about relations and indexes, such as the
number of pages in a relation, (2) using these statistics to estimate
the selectivity of predicates, and (3) applying formulas that
estimate the actual execution cost in terms of both I/O cost (pages
fetched) and CPU utilization. In the latter, a weighting factor W is
meant to adjust the balance between CPU and I/O cost, though the
authors don't specify how to pick W. It would have been nice to at
least know what they picked for their own configuration. My guess is
that, in general, the CPU utilization cost should not be as relevant
in the cost formula today as it was back then, seeing how CPU speeds
have outpaced disk speeds by huge margins.
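To make the arithmetic concrete, here is a minimal sketch of the kind
of estimate such a cost model produces. The cost formula follows the
paper's "page fetches plus W times CPU (RSI call) count" shape, but
the helper names, the uniform-distribution selectivity rule, the
sample relation, and the value of W are my own illustration, not
taken from the paper.

# Rough sketch of a System-R-style cost estimate (my own
# simplification; names, sample numbers, and W are hypothetical).

def eq_selectivity(distinct_values):
    """Selectivity of 'column = constant', assuming uniformly
    distributed values: roughly 1 / (number of distinct values)."""
    return 1.0 / distinct_values

def plan_cost(pages_fetched, rsi_calls, w=0.05):
    """Cost = I/O (pages fetched) + W * CPU (tuple-retrieval calls).
    W is an arbitrary placeholder; the paper does not say what value
    they used."""
    return pages_fetched + w * rsi_calls

# Hypothetical relation: 50,000 tuples on 1,000 pages, with an
# equality predicate on a column having 200 distinct values.
sel = eq_selectivity(200)
expected_result_tuples = 50_000 * sel          # cardinality estimate

full_scan  = plan_cost(pages_fetched=1_000, rsi_calls=50_000)
index_scan = plan_cost(pages_fetched=expected_result_tuples + 3,  # rough: index + data pages
                       rsi_calls=expected_result_tuples)
print(full_scan, index_scan)

The optimizer would then simply keep whichever access path has the
lower estimated cost.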
As we've seen in the other paper we read, System R introduced the
dynamic programming approach to join ordering: the search for the
best plan is organized by finding the best join order for
successively larger subsets of tables. A heuristic based on join
predicates is used to reduce the number of permutations that are
considered, and only two join methods are considered - nested loops
and merge join. System R also considers "interesting orders", meaning
that at each stage of the algorithm, in addition to the lowest-cost
option, it also identifies and keeps options that produce tuple
orderings which could reduce the cost of later stages in the
execution plan.
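A toy sketch of that bottom-up enumeration, under my own simplifying
assumptions (left-deep plans only, caller-supplied cost and plan
functions, and an "interesting order" represented as just a sort
column), might look like this:

# Toy sketch of System-R-style bottom-up join enumeration.
# All inputs are supplied by the caller; nothing here is the paper's
# actual code, just the shape of the dynamic program.

from itertools import combinations

def optimize(tables, single_table_plans, join_cost, interesting_orders):
    """
    single_table_plans: {table: [(cost, order, plan), ...]}
    join_cost(plan, table, order) -> (cost, resulting_order, new_plan)
    For each subset of tables, keep the cheapest plan overall plus the
    cheapest plan that produces each interesting order.
    """
    best = {frozenset([t]): plans for t, plans in single_table_plans.items()}
    for size in range(2, len(tables) + 1):
        for subset in map(frozenset, combinations(tables, size)):
            candidates = []
            for t in subset:                          # t is the last table joined in
                for (cost, order, plan) in best.get(subset - {t}, []):
                    candidates.append(join_cost(plan, t, order))
            if not candidates:
                continue
            keep = [min(candidates, key=lambda c: c[0])]      # cheapest overall
            for o in interesting_orders:
                ordered = [c for c in candidates if c[1] == o]
                if ordered:
                    keep.append(min(ordered, key=lambda c: c[0]))
            best[subset] = keep
    return best[frozenset(tables)]

Keeping those extra entries per subset is exactly the
interesting-orders idea: a slightly more expensive subplan whose
output is already sorted on a join or grouping column can win later
by making a merge join or a final sort unnecessary.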
Overall, the optimization strategies were presented well, though I
would have liked to see some performance measurements comparing the
various optimization strategies (and the unoptimized case). It's
always nice to be able to quantitatively evaluate the benefits of
newly described optimization methods.