Proposed Projects

Query Optimization

Before you get started, it is recommended you read the following papers to get a better feel for query optimization and the Tukwila system:

J. Ullman, J. Widom, H. Garcia-Molina, Database System Implementation, Chapter 7, Query Optimization.
S. Chaudhuri, An Overview of Query Optimization in Relational Systems, PODS 1998.
Z. Ives, D. Florescu, M. Friedman, A. Levy, D. S. Weld, An Adaptive Query Execution System for Data Integration, SIGMOD 99.

In constructing a query optimizer, you will need to choose a basic architecture from the three possible methods. The first two methods are a basic framework that you will construct to investigate a particular area of interest (from the choices below); the third will be a stand-alone project.

Dynamic programming

The dynamic programming, or "bottom-up" approach to query optimization was first described in the Selinger et al paper on System-R in 1979. Most modern database systems use some variation of this approach, which constructs "optimal" plans by combining "optimal" subplans using dynamic programming.

Patricia G. Selinger, Morton M. Astrahan, Donald D. Chamberlin, Raymond A. Lorie, Thomas G. Price: Access Path Selection in a Relational Database Management System. SIGMOD 1979: 23-34
Laura M. Haas, Donald Kossmann, Edward L. Wimmers, Jun Yang: Optimizing Queries Across Diverse Data Sources. VLDB 1997: 276-285

Top-down

The top-down query optimization model uses some sort of search algorithm (e.g. best-first) and various pruning heuristics to find the "optimal" plan.

Craig A. Knoblock. Planning, Executing, Sensing, and Replanning for Information Gathering. Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, Montreal, Canada, 1995.

Transformational: See below for more details. This optimizer architecture is sufficiently complex that it will count as a project on its own.

Optimization Project Areas of Interest

Choosing fragmentation points for performance

The Tukwila system supports interleaving of planning and execution, which can result in significant performance improvements if any of the estimated statistics (cardinalities, selectivities, etc.) are wrong. There are, however, trade-offs: interleaving requires materialization of results to disk at the end of each fragment, and this increases time-to-first-tuple. We would like to find the optimal breakdown here, perhaps in relation to confidence in our statistics.

Luc Bouganim, Olga Kapitskaia, Patrick Valduriez: Memory-Adaptive Scheduling for Large Query Execution. CIKM 1998: 105-115
Tolga Urhan, Michael J. Franklin, Laurent Amsaleg: Cost Based Query Scrambling for Initial Delays. SIGMOD Conference 1998: 130-141

Collectors

Tukwila includes a collector operator, which works like a union of some dynamic number of sources. The collector includes a number of children, each of which has an identical schema; it uses a set of rules to enable or disable some of these children based on current conditions. This project would involve using data source coverage and expected data transfer rate information to choose a policy for contacting and switching among data sources.

Ramana Yerneni, Yannis Papakonstantinou, Serge Abiteboul, Hector Garcia-Molina: Fusion Queries over Internet Databases. EDBT 1998: 57-71

Plans as graphs: In general, query execution plans are tree shaped, where a given node in the graph feeds its data to a single parent. There are cases where a graph-shaped plan, where a single node feeds multiple parents, is perhaps more efficient. Tukwila supports graph plans through materialization points; the same output data can be read in multiple places. This project would involve creating an optimizer which takes the costs of graph-structured plans into effect, as well as standard tree-structured plans.

Time to first tuple: Most conventional database systems make their cost estimates based on total running time. For an online query answering system, this is probably undesirable - instead we would like to minimize the time to produce a first answer. For this project, you would attempt to develop a cost model that primarily considers this factor, perhaps even trading away some of the total running time.

Transformational Optimizer

For this style of optimizer, various rules are applied to transform the query plan in a particular way (e.g. to "push down" a selection, to apply join commutativity or associativity, etc.). If the transformation results in a lower cost, the new plan is selected.

Goetz Graefe, William J. McKenna: The Volcano Optimizer Generator: Extensibility and Efficient Search. ICDE 1993: 209-218
Jose Luis Ambite and Craig A. Knoblock: Planning by Rewriting: Efficiently Generating High-Quality Plans. AAAI 1997: 706-713.
L.M. Haas, J.C. Freytag, G.M. Lohman, H. Pirahesh: Extensible Query Processing in Starburst. SIGMOD 1989: 377-388. (Not really a transformational optimizer at the same level; does transformations on the logical plan and then dynamic programming on the physical plan.)

Operator memory allocation, join type selection: An issue that often arises in query optimization is determining how much memory each join operator will need - this, of course, depends on which join algorithm was chosen. Using some sort of selectivity estimate, plus estimated cardinality, we want to find an optimal set of operators and memory allocations.

Algorithms and Performance Analysis

Optimal fan-out

A very important, and apparently little-understood, factor in hash-based join algorithms is related to choosing how many hash buckets to use (and/or how many overflow files to create, aka the "fan-out") for the hash join algorithm; we would like to minimize the number of overflows if the join exceeds memory. There seems to be a dearth of literature in this area. This project would include a method for deciding on an optimal fan-out value based on estimated cardinality and a given memory size, for both the hybrid hash join and the double pipelined join; and would also include a performance study of the algorithm.

Goetz Graefe: Query Evaluation Techniques for Large Databases. Computing Surveys 25(2): 73-170 (1993)

Other

You are free to propose your own project in a different area.