CSE 544 Class Project Instructions and Suggestions
Final presentations
Schedule for final presentations
Grading guidelines for final presentations
Overview
A large portion (35%) of your grade in 544 consists of a final
project. This project is meant to be a substantial independent
research or engineering effort related to material we have studied in
class. Your project may involve a comparison of systems we have read
about, an application of database techniques to a problem you are
familiar with, or a database-related project in your research area.
This document describes what is expected of a final project and
proposes some possible project ideas.
What is expected
Good class projects can vary dramatically in complexity, scope, and
topic. The only requirement is that they be related to something we
have studied in this class and that they contain some element of
research or involve a significant engineering effort. In the latter
case, there should still be an element of novelty in your work
(otherwise, there is not much point in doing it). To help you
determine whether your idea is appropriate and of reasonable scope, we
will arrange to meet with each group several times throughout the quarter.
Project schedule:
- January 19th 2009: Project teams formed. You should have formed your 2- to 4-person team and emailed us the names of all partners.
- Week of January 26th 2009: Please schedule an appointment with Prof. Balazinska to discuss the project you are planning to undertake.
- February 2nd 2009: Project proposal is due. We will email you feedback on your project proposal within a few days.
- Week of February 16th 2009: Please schedule an appointment with Prof. Balazinska to discuss your project.
- February 20th 2009: Project milestone report is due. We will email you feedback within a few days.
- EXTENDED: March 16th or 19th 2009: Project presentation.
- EXTENDED: March 20th 2009: Final project report is due.
Hand-in details
As part of the project, you need to hand in the following three documents.
Project proposals
Format: up to 2 pages in length, two columns, 10pt font,
single-spaced. You can use any standard conference proceedings
format.
The proposal should include:
- A description of the problem you are trying to solve and the motivation behind it: why is this an important problem?
- The approach you plan to take to solve this problem.
- Your plan for the quarter: what do you plan to achieve by what date?
- A sketch of how you plan to evaluate your solution.
- A list of at least two papers related to your problem.
- A list of resources you will need that you do not already have.
Project milestone
Format: up to 4 pages in length, two columns, 10pt font, single-spaced.
The report should include:
- The description of the problem you are trying to solve.
- The description of the approach you are taking to solve the problem.
- A description of related work. How does your approach
compare with what has been done before? You must cite at least 3 papers.
- A description of what you have accomplished so far and
any problems or unexpected issues that you encountered.
- A plan for the rest of the quarter. This plan must include
the experiments you will conduct to evaluate your solution.
Project final report
Format: up to 8 pages in length, two columns, 10pt font, single-spaced.
You should model the content of your report on the papers we read this
quarter. The report should include at least the following information:
- The motivation and description of the problem you solved.
- The description of your solution, its merits, and its limitations.
Provide both a short overview and a more detailed description of
each main component of your solution.
- An overview of related work. How does your approach
compare with what has been done before?
- An evaluation section showing the functionality, properties, and performance
of your scheme.
- Your conclusions.
Project ideas
The following is a list of possible project ideas; you are not
required to choose from this list.
When you look for related work, the main database conferences are
SIGMOD/PODS, VLDB, ICDE, and CIDR. The main database journals are
TODS, VLDBJ, DEB, and SIGMOD Record.
Cloud computing: analyzing massive data sets
- Reducing the latency of MapReduce DAGs: MapReduce is a programming model and execution environment for processing massively parallel data sets. Several front-ends (Hive, Pig, Sawzall) enable developers to express their MapReduce computations in the form of declarative queries. Queries are converted into directed acyclic graphs (DAGs) of MapReduce jobs. To avoid re-processing entire queries when failures occur, MapReduce materializes the output of each Map and Reduce task on disk. This approach, however, adds significant overhead and slows down query execution. In contrast, in parallel databases, data flows directly from one operator to the next. This approach achieves very good performance but, when a failure occurs, the entire query must be restarted. The goal of this project is to explore the trade-off between these two extremes. The first step will be to modify Hadoop, an open-source implementation of MapReduce, to enable a choice among (a) streaming data directly between operators, (b) materializing the output, or (c) using a hybrid technique. The second step will be to evaluate the properties of these alternatives.
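As a toy illustration of this trade-off, the sketch below (plain Python, not Hadoop code) contrasts pipelined execution, where tuples flow between operators via generators, with materialized execution, where each stage's output is buffered before the next stage starts; the operator names are hypothetical:

```python
# Hypothetical sketch (not Hadoop code): generators stream tuples between
# operators; a list materializes an intermediate result.

def scan(rows):
    for r in rows:
        yield r

def map_op(rows, f):
    # pipelined: each tuple flows onward as soon as it is produced
    for r in rows:
        yield f(r)

data = range(5)

# (a) fully pipelined: no intermediate storage, but a failure anywhere
# would force the whole chain to restart
pipelined = list(map_op(map_op(scan(data), lambda x: x + 1), lambda x: x * 2))

# (b) materialized: the intermediate result is buffered (on disk, in Hadoop),
# so a failed second stage can restart from stage1 alone
stage1 = list(map_op(scan(data), lambda x: x + 1))
materialized = list(map_op(stage1, lambda x: x * 2))
```

Both strategies produce the same answer; they differ only in latency and in how much work a failure throws away.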
- Performance debugging.
When a user wants to execute a complex query over a massively parallel data set, they can debug the query logic on their local machine using a small data sample. However, once the user runs the query in a large cluster, if the query exhibits unexpected performance problems, there is no good way to figure out what is going on. Is an operator outputting more data than expected? Do some input tuples take much longer to process than others? Maybe there simply is bad data in the dataset that causes some machines to fail repeatedly?
The goal of this project is to develop a framework to facilitate the monitoring and debugging of runtime query performance in a cloud-computing environment.
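One possible starting point, sketched below under the assumption of an iterator-based execution model (the class and operator names are hypothetical), is to wrap each operator so it reports how many tuples it emits and how long its upstream source takes to produce them:

```python
import time

class Monitored:
    """Hypothetical wrapper: counts the tuples an operator emits and times
    how long its upstream source takes to produce them."""
    def __init__(self, name, source):
        self.name = name
        self.source = source
        self.tuples_out = 0
        self.seconds = 0.0

    def __iter__(self):
        it = iter(self.source)
        while True:
            start = time.perf_counter()
            try:
                t = next(it)
            except StopIteration:
                return
            self.seconds += time.perf_counter() - start
            self.tuples_out += 1
            yield t

scan = Monitored("scan", [1, 2, 3, 4])
mapper = Monitored("map", (x * 2 for x in scan))
result = list(mapper)

# Counters like these could expose, e.g., an operator emitting far more
# data than expected, or tuples that are unusually slow to process.
stats = {op.name: op.tuples_out for op in (scan, mapper)}
```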
- Admission control, scheduling, and operator placement. Today's systems for processing large datasets operate in batch mode: all requests go into a queue, the system processes them in turn, and when a job finishes, the user can retrieve the results from a file. The goal of this project is to study different admission control, scheduling, and/or operator placement algorithms that either maximize system performance in this batch-oriented world or offer some desirable properties to users, such as incremental results for concurrently executing queries, flexible job priorities, etc.
- Runtime prediction. The goal of this project is to build a predictor that estimates the completion time for a massively parallel query before the query starts executing. With this tool, when a user submits a query, the system will indicate whether the expected run time for the query is on the order of seconds, minutes, hours, or days. Ideally, the predictor should try to be even more precise. This project will investigate the possibility of building an accurate predictor and will study the factors that affect the predictor's accuracy.
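A minimal sketch of such a predictor, assuming made-up historical runs and that runtime grows polynomially with input size (so it is linear in log-log space), fits log runtime against log input size and reports an order-of-magnitude bucket:

```python
import math

# Hypothetical sketch: made-up (input size, runtime) pairs from past runs.
history = [(10**6, 12.0), (10**7, 95.0), (10**8, 870.0)]  # (bytes, seconds)

# Least-squares fit of log10(runtime) against log10(input size).
xs = [math.log10(n) for n, _ in history]
ys = [math.log10(t) for _, t in history]
n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

def predict_magnitude(input_bytes):
    # Report only the order of magnitude, as suggested above.
    seconds = 10 ** (intercept + slope * math.log10(input_bytes))
    for limit, label in [(60, "seconds"), (3600, "minutes"), (86400, "hours")]:
        if seconds < limit:
            return label
    return "days"
```

A real predictor would of course need many more features (number of operators, cluster load, data skew), which is exactly what the project would study.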
Cloud computing: databases as a service
- Warehousing. Many companies have recently started to offer highly-scalable, distributed database services, where applications store their data inside a service provider's data centers and access it by issuing queries. Examples include Amazon SimpleDB, Microsoft SQL Data Services, and the Google App Engine Datastore. To enable scalability, however, query capabilities are currently restricted to selection and ordering. Although sufficient for the daily operations of many web services, such query capabilities are too limited for decision support activities. This project will expand the "database as a service in the cloud" concept by enabling decision support capabilities. Decision support queries are typically resource intensive and thus difficult to support in this highly-scalable, shared, and interactive environment. The goal of the project is to study whether one could execute decision support queries as incremental background tasks in these environments.
- Safe data sharing. In databases as a service, multiple web services store their data inside data centers that belong to the same administrative entity. Such data co-location can enable interesting optimizations when multiple web-services are combined together into other web services or mash-ups. However, one has to be careful to maintain the security of the data. The goal of this project is to investigate how to facilitate and leverage such data sharing safely.
Peer-to-peer data management
- Seattle is a platform developed in the CSE department at UW for deploying distributed systems. It provides a lightweight sandbox for running arbitrary code written in a subset of Python while constraining the code to use a well-defined fraction of the available resources on a machine. Seattle makes it easy to build and deploy distributed services that may run over thousands of nodes spread across the Internet.
- P2P Data Streaming: One of the uses for Seattle is as a measurement platform. Because Seattle nodes are spread throughout the Internet, each node offers a valuable vantage point from which to observe Internet activity such as congestion, routing instabilities, outages, and much more. One way to think about Seattle nodes, then, is as sensors. The nodes generate a constant stream of data about what they observe from their vantage point, and it is up to the researcher to collect this data and process it in a meaningful manner. The database project for this measurement use case of Seattle involves designing and implementing an infrastructure capable of organizing potentially millions of data streams coming from Seattle nodes in a way that allows the researcher to query the node stream data without needing to aggregate all the data across all time in a centralized manner. This may be done, for example, by delegating stream aggregation to Seattle nodes that have more resources at their disposal, and by constructing a tree of Seattle nodes in which the sensing nodes are the leaves and the researcher is at the root.
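The tree idea can be sketched in a few lines of plain Python; the `Node` class and the readings below are hypothetical stand-ins for Seattle sensor nodes:

```python
# Hypothetical sketch: sensing nodes are leaves, internal nodes aggregate
# their children, and the researcher queries only the root.

class Node:
    def __init__(self, readings=None, children=None):
        self.readings = readings or []   # only leaf (sensor) nodes hold raw data
        self.children = children or []

    def query_max(self):
        # each node aggregates locally, so no single node sees all raw data
        return max([c.query_max() for c in self.children] + self.readings)

# made-up latency readings observed at four sensor nodes
leaves = [Node([12, 40]), Node([7]), Node([55, 3]), Node([21])]
mid = [Node(children=leaves[:2]), Node(children=leaves[2:])]
root = Node(children=mid)
```

The same shape works for any decomposable aggregate (max, count, sum); the research questions are how to build and maintain such a tree over unreliable, heterogeneous nodes.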
- P2P ACID Databases: A Seattle service may use millions of nodes, but it may also use hundreds or just a dozen nodes. Distributed Hash Tables (DHTs) are typically used to provide data storage to systems with thousands to millions of nodes. However, for systems comprising just a dozen nodes, it is simpler and much more efficient to use a database. Traditional databases, however, do not cope well with peer-to-peer environments (such as the Seattle platform). In this project, your aim will be to explore, design, and implement a data store organization that combines the advantages of databases with the resilience of DHTs. You will design a database that supports ACID properties, but that can also thrive in a peer-to-peer setting in which nodes join and leave the network unpredictably and in which nodes have disparate resources at their disposal.
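One DHT building block such a design might borrow is consistent hashing with replication: each key is owned by the first node clockwise on a hash ring, with replicas on the next nodes, so data survives a node leaving. The sketch below is illustrative (node names and replica count are made up):

```python
import hashlib

def h(s):
    # position a string on the hash ring
    return int(hashlib.sha1(s.encode()).hexdigest(), 16)

class Ring:
    """Hypothetical consistent-hashing ring with simple replication."""
    def __init__(self, nodes, replicas=2):
        self.replicas = replicas
        self.ring = sorted((h(n), n) for n in nodes)

    def owners(self, key):
        # first node clockwise from the key's position, wrapping around,
        # plus the following nodes as replicas
        i = next((j for j, (pos, _) in enumerate(self.ring) if pos >= h(key)), 0)
        return [self.ring[(i + k) % len(self.ring)][1]
                for k in range(self.replicas)]

ring = Ring(["nodeA", "nodeB", "nodeC", "nodeD"])
```

The open problem the project targets is layering ACID transactions on top of such placement, which plain DHTs do not provide.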
Scientific data management
Over the past several years, the sciences have moved from being data poor to data rich. Increasingly, scientists collect, curate, share, and analyze large data sets. The data volumes and data processing requirements of different sciences raise interesting data management challenges. Some of these challenges include:
- Annotations and discussions in a DBMS. The NatureMapping Program asks the public and schools to contribute scientific data (information about animal sightings) for scientists conducting biodiversity studies. The goal of the project is to go beyond data collection and monitoring and actively engage citizens in both understanding and implementing the entire process of science. A national public biodiversity database accessible to scientists and the broader public is being developed. As can be expected with volunteers, the reported data contains a variety of errors. The goal of this project is to investigate how to extend a relational DBMS with support for data annotations and explicit discussions. These features will enable volunteers not only to contribute data but also to discuss and clean it.
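A minimal schema sketch for such annotation support, using an in-memory SQLite database for illustration (the table and column names are hypothetical), attaches threaded notes to individual sighting rows without modifying the underlying data:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE sighting(id INTEGER PRIMARY KEY, species TEXT, county TEXT);
CREATE TABLE annotation(
    id INTEGER PRIMARY KEY,
    sighting_id INTEGER REFERENCES sighting(id),
    author TEXT,
    body TEXT,
    reply_to INTEGER REFERENCES annotation(id)  -- NULL for a top-level note
);
""")

# a volunteer's sighting, a question about it, and a scientist's reply
db.execute("INSERT INTO sighting VALUES (1, 'cougar', 'King')")
db.execute("INSERT INTO annotation VALUES "
           "(1, 1, 'volunteer7', 'Possibly a bobcat?', NULL)")
db.execute("INSERT INTO annotation VALUES "
           "(2, 1, 'biologist2', 'Track size says cougar.', 1)")

notes = db.execute(
    "SELECT author, body FROM annotation WHERE sighting_id = 1 ORDER BY id"
).fetchall()
```

The research part of the project is pushing such support inside the DBMS (query semantics over annotated data, propagation of annotations through views) rather than bolting tables on at the application level.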
- Tracking user sessions across applications. The Incorporated Research Institutions for Seismology (IRIS) runs a large data center for earth scientists. It holds over 70TB of time series data (seismograms), increasing at a rate of 12TB annually. Scientists access this large repository through a variety of web-based applications. An important challenge is that these applications are all independent. If a user needs to access multiple tools to retrieve related data, the system provides no support for the user to keep track of the queries she issued through the different tools. The goal of this project is to investigate how best to enable some kind of "persistent sessions" for users that would help them keep track of their work across different web-based applications.
- Fixed-point or recursive queries in MapReduce. Today's data management systems for processing massively parallel datasets provide effective support for a large class of relational queries. In some fields, such as astronomy, however, users often need to execute queries that require recursion or iteration until a fixed point is reached. In particular, users often need to run different types of data clustering algorithms. The goal of this project is to investigate how best to provide support for such queries in Hadoop, an open-source implementation of MapReduce.
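To make the fixed-point pattern concrete, the sketch below expresses connected components as repeated map/reduce rounds in plain Python (not Hadoop), iterating label propagation until no label changes; the graph is made up:

```python
from collections import defaultdict

# made-up graph: two components, {1, 2, 3} and {4, 5}
edges = [(1, 2), (2, 3), (4, 5)]
labels = {n: n for e in edges for n in e}  # start: every node labels itself

while True:
    # "map": each edge proposes the smaller endpoint label to both endpoints
    proposals = defaultdict(list)
    for u, v in edges:
        m = min(labels[u], labels[v])
        proposals[u].append(m)
        proposals[v].append(m)
    # "reduce": each node keeps the smallest label it has seen so far
    new_labels = {n: min([labels[n]] + proposals[n]) for n in labels}
    if new_labels == labels:   # fixed point reached: stop iterating
        break
    labels = new_labels
```

On Hadoop, each round would be one MapReduce job; the project's question is how to run such loops efficiently (e.g., without re-reading unchanged data every round) and how to detect the fixed point cheaply.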
Miscellaneous
- User-defined functions: Most DBMSs today enable users to write user-defined functions. However, query optimizers often do a poor job optimizing queries that involve such functions because of lack of information about the function properties, such as their cost. In some systems, the user is expected to input these properties manually. The goal of this project is to use static and runtime code analysis to figure out important properties of user-defined functions and feed the information back to the query optimizer in order to improve its performance for queries that use such functions.
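A first step in that direction could be simple runtime profiling: time a UDF on a sample of inputs and hand the optimizer an average per-tuple cost. The sketch below is plain Python with a made-up UDF; a real system would combine this with static analysis of the function body:

```python
import time

def profile_udf(f, sample):
    """Hypothetical profiler: average per-tuple cost of f over a sample."""
    start = time.perf_counter()
    for x in sample:
        f(x)
    return (time.perf_counter() - start) / len(sample)

def expensive_udf(x):
    # stand-in for a user's function with nontrivial per-tuple cost
    return sum(i * i for i in range(200)) + x

per_tuple_cost = profile_udf(expensive_udf, range(1000))

# A cost-based optimizer could use per_tuple_cost to decide, e.g., whether
# to evaluate the UDF before or after a selective filter.
```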
- How students learn SQL : As part of a research project on query management, we have collected a log of queries from the undergraduate database class. The log contains thousands of queries. The goal of this project is to use this log to study how students learn SQL and how we could improve their experience. For example, why is there an order of magnitude difference between the number of queries that different students submit in the process of working on the assignment? What is the strategy of the students who learn the fastest? etc.
Other.
There are many other possible projects for the class. Below are pointers to a couple of database workshops and conferences that may help you find inspiration (the list below is not exhaustive):