Proposed Projects

Below are a few ideas for class projects. This is only a suggested list. You are encouraged to come up with your own ideas. Some of these are more specific than others.

In the projects you are required to:

Storing XML in relational databases

XML is going to be the major model for sharing data on the WWW. In many cases, organizations will need to store and manipulate XML data. Several researchers have proposed approaches to storing XML in relational databases. In each of these proposals, the XML is stored as a set of relations in the database. In a sense, the problem of storing XML in a relational database can be viewed as the problem of selecting a set of views to store in a database. In this case, the result of the view is a table.

Build a system that explores this approach. I.e., pose the problem of storing XML as a problem of view selection in data warehouse design. Show that it is more flexible than other approaches and yields better storage forms.

Background reading:

Large scale query server

Database systems usually optimize for the case of large amounts of data and one or few queries. In modern applications (e.g., portals that provide personal pages), the requirement is that there are many queries. Build a system that accepts a large (and possibly changing) number of queries, and is required to update the results of these queries as data streams in. Several technical problems lie here; identify the key ones and propose solutions to them.

Background reading:

Model Management

Bernstein et al. propose a vision for model management systems. Choose one of the problems they describe, and make a first stab at it.

Background reading:

Type-in databases

One of the key problems with database systems is that they require you to define a schema before you can enter any data. Build a system (possibly on top of an existing DBMS) in which you can start typing in your data immediately. The system should help you in organizing your data as you go along, but in a non-intrusive fashion.

This is fresh territory, with little to no background reading. There is a limit of two people working on this.

Description logics and databases

Description logics are knowledge representation formalisms that provide inference capabilities over sets of objects. However, the systems built in the knowledge representation community do not scale up to large number of objects (that has never been the focus). Build a system that provides some of the inferences in a description logic system, but scales up to many objects. This requires you to make a compromise on the expressive power of the logic.

Background reading:

MySQL or PostgresQL

MySQL and PostgresQL are freely available systems for processing SQL queries (including source code). Both have many limitations though. Choose one of these systems and resolve a few of its limitations. Show experimentally that your extensions are robust and scale up.

Some example limitations:

Web sites:

XML Search engine

Build a system that crawls the web for XML files, stores them locally and enables interesting queries on them.

Background reading:

Answering queries using views for XML-QL

The problem of answering queries using views is to reformulate a query posed over a database schema to refer to a set of precomputed views. The problem has been considered for many variants of SQL, but very little for XML query languages. Develop an algorithm for answering queries using views for XML-QL, implement it and show its scalability.

Background reading:

Build a Database management system

Many people have built database management systems, but have you? Design and build such a system from the ground up. This should include a buffer manager, an optimizer, and a runtime component. Clearly the level of sophistication achievable in a single quarter is limited, so there should be an emphasis on a clean, scalable and robust design, rather than on building a commercially competitive system.

Background reading:

Fuzzy Joins for Dirty Data

With the increasing availability and use of data integration systems we are seeing more and more situations where traditional exact-match or range joins do not work. Papers such as William Cohen's in Sigmod 1998 on integration using "Textual Similarity" are starting to appear, with some useful suggestions for ways to deal with some of these issues.

Traditionally DBMSs have had a relatively limited number of join algorithms to choose from. One algorithm would be chosen over another simply based on the cost estimates; the basic comparisons are indentical, its just the order in which they are executed which changes. In a data integration environment, the type of join needed is application or domain specific. Joining data from two phone books (the example from Cohen's paper) requires one type of comparison, while joining gene sequences with weighted accuracies requires another.

It doesnt make sense to simply keep adding more and more rigid joins to a traditional system. Rather, traditional systems need to be made more flexible. Design and implement a modification to some existing system (Tukwila, MySQL, PostgresQL) which allows for a wide variety of join techniques to be employed. Explore the impacts such a modification would have on optimization and runtime.

Background reading:

Transactions and Recovery

There are several interesting projects in this domain, depending on particular interest. Some ideas include:

Background reading: