Proposed Projects
Query Optimization
Before you get started, it is recommended you read the following papers
to get a better feel for query optimization and the Tukwila system:
- J. Ullman, J. Widom, H. Garcia-Molina, Database System Implementation, Chapter 7, Query Optimization.
- S. Chaudhuri, An Overview of Query Optimization in Relational Systems, PODS 1998.
- Z. Ives, D. Florescu, M. Friedman, A. Levy, D. S. Weld, An Adaptive Query Execution System for Data Integration, SIGMOD 99.
In constructing a query optimizer, you will need to choose a basic
architecture from the three possible methods. The first two methods
are a basic framework that you will construct to investigate a
particular area of interest (from the choices below); the third will
be a stand-alone project.
- Dynamic programming
- The dynamic programming,
or "bottom-up" approach to query optimization was first described in
the Selinger et al paper on System-R in 1979. Most modern database
systems use some variation of this approach, which constructs
"optimal" plans by combining "optimal" subplans using dynamic
programming.
- Patricia G. Selinger, Morton M. Astrahan, Donald D. Chamberlin,
Raymond A. Lorie, Thomas G. Price: Access Path Selection in a
Relational Database Management System. SIGMOD 1979: 23-34
- Laura M. Haas, Donald Kossmann, Edward L. Wimmers, Jun Yang: Optimizing Queries Across Diverse Data Sources. VLDB 1997: 276-285
- Top-down
- The top-down query optimization model
uses some sort of search algorithm (e.g. best-first) and various
pruning heuristics to find the "optimal" plan.
- Transformational
- See below for more details. This
optimizer architecture is sufficiently complex that it will count as a project
on its own.
Optimization Project Areas of Interest
- Choosing fragmentation points for performance
- The
Tukwila system supports interleaving of planning and execution, which
can result in significant performance improvements if any of the
estimated statistics (cardinalities, selectivities, etc.) are wrong.
There are, however, trade-offs: interleaving requires materialization
of results to disk at the end of each fragment, and this increases
time-to-first-tuple. We would like to find the optimal breakdown
here, perhaps in relation to confidence in our statistics.
- Collectors
- Tukwila includes a collector operator, which works like a
union of some dynamic number of sources. The collector includes a
number of children, each of which has an identical schema; it uses a
set of rules to enable or disable some of these children based on
current conditions. This project would involve using data source
coverage and expected data transfer rate information to choose a
policy for contacting and switching among data sources.
- Plans as graphs
- In general, query execution plans are tree shaped, where a given
node in the graph feeds its data to a single parent. There are cases
where a graph-shaped plan, where a single node feeds multiple parents,
is perhaps more efficient. Tukwila supports graph plans through
materialization points; the same output data can be read in multiple
places. This project would involve creating an optimizer which takes
the costs of graph-structured plans into effect, as well as standard
tree-structured plans.
- Time to first tuple
- Most conventional database systems make their cost estimates based
on total running time. For an online query answering system, this is
probably undesirable - instead we would like to minimize the time to
produce a first answer. For this project, you would attempt to develop
a cost model that primarily considers this factor, perhaps even trading
away some of the total running time.
- Transformational Optimizer
- For this style of
optimizer, various rules are applied to transform the query plan in a
particular way (e.g. to "push down" a selection, to apply join
commutativity or associativity, etc.). If the transformation results
in a lower cost, the new plan is selected.
- Operator memory allocation, join type selection
- An issue that often arises in query optimization is determining how
much memory each join operator will need - this, of course, depends
on which join algorithm was chosen. Using some sort of selectivity
estimate, plus estimated cardinality, we want to find an optimal
set of operators and memory allocations.
Algorithms and Performance Analysis
- Optimal fan-out
- A very important, and
apparently little-understood, factor in hash-based join algorithms is
related to choosing how many hash buckets to use (and/or how many
overflow files to create, aka the "fan-out") for the hash join
algorithm; we would like to minimize the number of overflows if the
join exceeds memory. There seems to be a dearth of literature in this
area. This project would include a method for deciding on an optimal
fan-out value based on estimated cardinality and a given memory size,
for both the hybrid hash join and the double pipelined join; and would
also include a performance study of the algorithm.
- Goetz Graefe: Query Evaluation Techniques for Large
Databases. Computing Surveys 25(2): 73-170 (1993)
Other
You are free to propose your own project in a different area.