Proposed Projects

Below are a few ideas for class projects. This is only a suggested list. You are encouraged to come up with your own ideas. Some of these are more specific than others.

Groups should be assembled by the end of the second week.
Project proposals should be in by the end of the third week.
Every group should be in touch with me at least once every two weeks.
Projects are due on June 2nd.
Proposals should explain: what problem is being addressed; why previous work is not appropriate; sketch of the approach; plan for validation.

In the projects you are required to:

Define the problem you are addressing.
Explain the related work and why it does not adequately address the problem.
Explain your solution.
Explain how you're going to validate your solution.
Validate.
Explain lessons learned and shortcomings of your approach.

Storing XML in relational databases

XML is going to be the major model for sharing data on the WWW. In many cases, organizations will need to store and manipulate XML data. Several researchers have proposed approaches to storing XML in relational databases. In each of these proposals, the XML is stored as a set of relations in the database. In a sense, the problem of storing XML in a relational database can be viewed as the problem of selecting a set of views to store in a database. In this case, the result of the view is a table.

Build a system that explores this approach. I.e., pose the problem of storing XML as a problem of view selection in data warehouse design. Show that it is more flexible than other approaches and yields better storage forms.

Background reading:

Daniela Florescu, Donald Kossmann: A Performance Evaluation of Alternative Mapping Schemes for Storing XML Data in a Relational Database, INRIA Technical Report, May 1999.
Jayavel Shanmugasundaram, Kristin Tufte, Chun Zhang, Gang He, David J. DeWitt, Jeffrey F. Naughton: Relational Databases for Querying XML Documents: Limitations and Opportunities. VLDB 1999: 302-314.
Dimitri Theodoratos, Timos K. Sellis: Data Warehouse Configuration. VLDB 1997: 126-135
Alin Deutsch, Mary F. Fernandez, Daniela Florescu, Alon Y. Levy, Dan Suciu: Xml-ql: A Query Language for XML. W3C Submission, August 1998.
Alin Deutsch, Mary F. Fernandez, Daniela Florescu, Alon Y. Levy, Dan Suciu: A Query Language for XML. WWW8 / Computer Networks 31(11-16): 1155-1169 (1999).

Large scale query server

Database systems usually optimize for the case of large amounts of data and one or few queries. In modern applications (e.g., portals that provide personal pages), the requirement is that there are many queries. Build a system that accepts a large (and possibly changing) number of queries, and is required to update the results of these queries as data streams in. Several technical problems lie here; identify the key ones and propose solutions to them.

Background reading:

Jianjun Chen, David J. DeWitt, Feng Tian, Yuan Wang: NiagraCQ: A Scalable Continuous Query System for Internet Databases. To appear, SIGMOD Conference 2000.

Model Management

Bernstein et al. propose a vision for model management systems. Choose one of the problems they describe, and make a first stab at it.

Background reading:

Philip Bernstein, Alon Levy, Rachel Pottinger, submitted to VLDB 2000. See Rachel for a copy.

Type-in databases

One of the key problems with database systems is that they require you to define a schema before you can enter any data. Build a system (possibly on top of an existing DBMS) in which you can start typing in your data immediately. The system should help you in organizing your data as you go along, but in a non-intrusive fashion.

This is fresh territory, with little to no background reading. There is a limit of two people working on this.

Description logics and databases

Description logics are knowledge representation formalisms that provide inference capabilities over sets of objects. However, the systems built in the knowledge representation community do not scale up to large number of objects (that has never been the focus). Build a system that provides some of the inferences in a description logic system, but scales up to many objects. This requires you to make a compromise on the expressive power of the logic.

Background reading:

Ronald J. Brachman, Deborah L. McGuinness, Peter F. Patel-Schneider, Lori Alperin Resnick, and Alex Borgida, ``Living with CLASSIC: When and How to Use a KL-ONE-Like Language,'' in John Sowa, ed., Principles of Semantic Networks: Explorations in the representation of knowledge, Morgan-Kaufmann: San Mateo, California, 1991, pages 401--456. (See Alon for a copy)
CLASSIC
Alexander Borgida: Description Logics in Data Management. TKDE 7(5): 671-682 (1995)

MySQL or PostgresQL

MySQL and PostgresQL are freely available systems for processing SQL queries (including source code). Both have many limitations though. Choose one of these systems and resolve a few of its limitations. Show experimentally that your extensions are robust and scale up.

Some example limitations:

MySQL (TODO List)	PostgresQL (TODO List)
Views	Referential integrity
Transactions	Outer joins
Subqueries
Object-Relational extensions (collection types, etc)

Web sites:

XML Search engine

Build a system that crawls the web for XML files, stores them locally and enables interesting queries on them.

Background reading:

Jeffrey Naughton, David DeWitt, David Maier et. al.. The Niagara Internet Query System. Submitted for publication
goxml.com

Answering queries using views for XML-QL

The problem of answering queries using views is to reformulate a query posed over a database schema to refer to a set of precomputed views. The problem has been considered for many variants of SQL, but very little for XML query languages. Develop an algorithm for answering queries using views for XML-QL, implement it and show its scalability.

Background reading:

Alin Deutsch, Mary F. Fernandez, Daniela Florescu, Alon Y. Levy, Dan Suciu: Xml-ql: A Query Language for XML. W3C Submission, August 1998.
Alon Levy, Answering queries using views: a survey. Submitted for publication 1999.
Yannis Papakonstantinou, Vasilis Vassalos: Query Rewriting for Semistructured Data. SIGMOD Conference 1999: 455-466.
Diego Calvanese, Giuseppe De Giacomo, Maurizio Lenzerini, Moshe Y. Vardi: Rewriting of Regular Expressions and Regular Path Queries. PODS 1999: 194-204.
Diego Calvanese, Giuseppe De Giacomo, Maurizio Lenzerini, Moshe Y. Vardi: Query Answering Using Views for Data Integration over the Web. WebDB (Informal Proceedings) 1999: 73-78.

Build a Database management system

Many people have built database management systems, but have you? Design and build such a system from the ground up. This should include a buffer manager, an optimizer, and a runtime component. Clearly the level of sophistication achievable in a single quarter is limited, so there should be an emphasis on a clean, scalable and robust design, rather than on building a commercially competitive system.

Background reading:

The textbook has sections on each component.

Fuzzy Joins for Dirty Data

With the increasing availability and use of data integration systems we are seeing more and more situations where traditional exact-match or range joins do not work. Papers such as William Cohen's in Sigmod 1998 on integration using "Textual Similarity" are starting to appear, with some useful suggestions for ways to deal with some of these issues.

Traditionally DBMSs have had a relatively limited number of join algorithms to choose from. One algorithm would be chosen over another simply based on the cost estimates; the basic comparisons are indentical, its just the order in which they are executed which changes. In a data integration environment, the type of join needed is application or domain specific. Joining data from two phone books (the example from Cohen's paper) requires one type of comparison, while joining gene sequences with weighted accuracies requires another.

It doesnt make sense to simply keep adding more and more rigid joins to a traditional system. Rather, traditional systems need to be made more flexible. Design and implement a modification to some existing system (Tukwila, MySQL, PostgresQL) which allows for a wide variety of join techniques to be employed. Explore the impacts such a modification would have on optimization and runtime.

Background reading:

William W. Cohen: Integration of Heterogeneous Databases Without Common Domains Using Queries Based on Textual Similarity. SIGMOD Conference 1998: 201-212.
The textbook has a good section on System R style optimizers.

Transactions and Recovery

There are several interesting projects in this domain, depending on particular interest. Some ideas include:

Add transactions and an ARIES style recovery subsystem to MySQL or PostgresQL.
Explore transactions (and potentially recovery) in integration systems such as Tukwila. Design and implement based on your findings.

Background reading:

The textbook has a good section on transactions.