CSE 544 Class Project Instructions and Suggestions
Final presentations and final reports
Schedule for final presentations
Grading guidelines for final presentations
Final reports
Overview
A large portion (45%) of your grade in 544 consists of a final
project. This project is meant to be a substantial independent
research or engineering effort related to material we have studied in
class. Your project may involve a comparison of systems we have read
about, an application of database techniques to a problem you are
familiar with, or be a database-related project in your research area.
This document describes what is expected of a final project and
proposes some possible project ideas.
What is expected
Good class projects can vary dramatically in complexity, scope, and
topic. The only requirement is that they be related to something we
have studied in this class and that they contain some element of
research or that they involve a significant
engineering effort. In the latter case, there should still be an
element of novelty in your work (otherwise, there is not much point in
doing it). To help you determine if your idea is appropriate and of
reasonable scope, we will arrange to meet with each group several
times throughout the semester.
Project schedule:
- October 1st 2007: Project teams formed: You should
have formed your 2-person team and emailed us the names of both partners.
- Week of October 1st 2007: Please schedule an appointment
with Prof. Balazinska to discuss the project you are planning on undertaking (either
at the end of this week or at the beginning of next week).
- October 12th 2006: Project proposal is due.
We will email you feedback on your project proposals within a few days.
- Week of November 5th 2007: Please schedule an appointment
with Prof. Balazinska to discuss your project.
- November 12th 2007: Project milestone report is due.
We will email you feedback within a few days.
- Week of November 26th 2006: Please schedule an appointment
with Prof. Balazinska to discuss your project.
- December 5th 2006: Final
project report is due and project presentation. This is a hard deadline. We will not be able
to accommodate any extensions.
Hand-in details
As part of the project, you need to hand-in the following three documents.
Project proposals
Format: up-to 2 pages in length, 2-columns, 10pt font,
single-spaced. You can use one of many standard conference proceedings
formats.
The proposal should include
- The description of the problem you are trying to solve. The
motivation for this problem: why is this an important problem?
- The approach you plan to take to solve this problem.
- Your plan for this quarter. What do you plan to achieve by what date.
- A sketch of how you plan to evaluate your solution
- A list of at least two papers related to your problem.
- A list of resources you will need that you do not already have.
Project milestone
Format: up-to 4 pages in length, 2-columns, 10pt font, single-spaced
The
report should include:
- The description of the problem you are trying to solve.
- The description of the approach you are taking to solve the problem.
- A description of related work. How does your approach
compare with what has been done before. You must cite at least 3 papers.
- A description of what you have accomplished so far and
any problems or unexpected issues that you encountered.
- A plan for the rest of the quarter. This plan must include
the experiments you will conduct to evaluate your solution.
Project final report
Format: up-to 8 pages in length, 2-columns, 10pt font, single-spaced
You
should model the content of your report on the papers we were reading this
quarter. The report should include at least the following information:
- The motivation and description of the problem you solved.
- The description of your solution, its merits, and its limitations.
Provide both a short overview and a more detailed description of
each main component of your solution.
- An overview of related work. How does your approach
compare with what has been done before.
- An evaluation section showing the functionality, properties, and performance
of your scheme.
- Your conclusions.
Project ideas
The following is a list of possible project ideas; you are not
required to choose from this list.
When you look for related work, the main database conferences are:
SIGMOD/PODS,
VLDB, ICDE, and CIDR. The main database
journals are TODS, VLDBJ,
DEB, and
SIGMOD Record.
Theme 1. Moirae system: combining live data streams with stream data archives.
Overview: At the University of Washington, we are
building an exciting new type of data management system that combines
continuous, live stream processing with the processing of stream data archives.
The goal of the system, called Moirae, is to improve the quality of
real-time observations about a monitored environment by enhancing that
information with historical data. The following set of three projects
will investigate, design, implement, and evaluate different components
of the Moirae system. The following papers provide background for all
three projects below.
References
- Project 1.1: Access method for data streams. An
important component of Moirae is an access method for effectively
writing data streams to disk and reading them from disk. Rather than
storing and retrieving data sequentially, this access method should
prioritize different data chunks and provide more effective access to
higher priority data. The goal of this project is to investigate and
implement such an access method. The project should include a clear
evaluation of the performance of the access method. This project will
involve significant C/C++ programming and low-level hacking with
PostgreSQL.
Additional references
- Project 1.2: Partitioned stream archive
processing. In the Moirae system, the user includes
special Recall operators into stream processing queries
to access historical
data. Because a stream archive is typically large, executing these historical queries on the entire data archive would take
a long time. By the time the result would return, the live event would
have long passed. Instead, Recall operators execute historical
queries incrementally. The key idea is for Recall to split each query
and execute it on one chunk of historical data at the time, from
the most relevant chunk to the least relevant one. As more chunks
of history are processed, the historical data becomes more
accurate. One way to prioritize history is simply by using time:
the recent past is typically more relevant to a new event than
older information. The goal of this project is to extend the
Moirae Recall operator with the ability to execute historical
queries in such a partitioned manner. The project should focus only
on select-project-join-aggregate queries and should measure the
performance of the resulting system. The project can also
investigate different strategies for prioritizing historical data. This project is
flexible in terms of the balance between algorithm design and system implementation. A complete implementation
will involve significant C/C++ programming and working with the
Moirae system, but initial algorithms and experiments can be done
outside of Moirae.
Additional references
- Project 1.3: Scheduling concurrent queries over stream
data archives. A stream processing engine typically
executes multiple queries at the same time. In Moirae, all queries
can include one or more Recall operators that issue requests for
historical data when a new event is detected on the live
streams. Hence, at any given time, there can be many concurrent
requests for historical data. In Moirae, all these historical
requests are executed incrementally: each request is split into
multiple queries over small chunks of historical data. This
project will investigate how best to schedule such concurrent
historical queries in a manner that ensures all live events are
enhanced with at least some historical data and resources are
fairly shared between the different queries. This project can also
investigate the use of user-feedback in query scheduling. Through
this feedback, a user can indicate which live events are more
important than others or which events need more historical data,
etc. This project will only require light programming as the
algorithms and the scheduler can be developed outside of Moirae
and on top of PostgreSQL (using the latter as a black box).
Additional references
Theme 2. Mobile sensor networks.
Overview: Recently, many systems have been built
to process sensor data. These systems, however, typically assume
that sensors are static and that they produce data continuously. In
practice, a broad class of sensors are mobile and may not always be
connected. One concrete example are cell phones that people carry
around with them and use occasionally to take pictures. If a system
tries to use cell phones as streams of pictures for monitoring
purposes, not all parts of the environment will be covered uniformly
or continuously. The system may not have any information about
certain locations for long periods of time. In collaboration with
Intel Research Seattle, we are building a new data management system
for such mobile sensor networks. In this system, users carry mobile
computing devices and a variety of sensors (i.e., temperature
sensors, gyroscopes, cameras, etc.). Users have the ability to
publish their data either continuously or periodically. Other users
have the ability to query the published data. The two projects below
will investigate different aspects of this system.
References
- TinyDB: An Acquisitional Query Processing System for Sensor Networks.
Samuel Madden, Michael Franklin, Joseph Hellerstein, and Wei Hong. In TODS 30(1), 2005.
- CarTel: A Distributed Mobile Sensor Computing System.
Bret Hull, Vladimir Bychkovskiy, Kevin Chen, Michel Goraczko, Eugene Shih, Yang Zhang, Hari Balakrishnan, Samuel Madden. In Proc. of SenSys 2006.
- The BikeNet Mobile Sensing System for Cyclist Experience Mapping
S. B. Eisenman, E. Miluzzo, N. D. Lane, R. A. Peterson, G-S. Ahn, A. T. Campbell. In Proc. of Sensys'07.
- Participatory sensing
J. Burke, D. Estrin, M. Hansen, A. Parker, N. Ramanathan, S. Reddy, M. B. Srivastava. In Proc. of World-Sensor-Web (WSW'2006).
- Implementing Software on Resource-Constrained Mobile Sensors: Experiences with Impala and ZebraNet.
Ting Liu, Christopher Sadler, Pei Zhang, and Margaret Martonosi. In Proc. of MobiSys'04.
- MetroSense Project: People-Centric Sensing at Scale
Shane B. Eisenman, Gahng-Seop Ahn, Nicholas D. Lane, Emiliano Miluzzo, Ronald A. Peterson, Andrew T. Campbell. In Proc. of World-Sensor-Web (WSW'2006):
- Workshop on World-Sensor-Web (WSW'2006): Mobile Device Centric Sensory Networks and Applications.
You may find other papers from this workshop relevant, although the two papers above are the main ones to take a look at.
- Other potentially relevant workshops and conferences:
- Project 2.1. Data model and query language for mobile sensor networks .
This project will try to answer the question of query semantics
in mobile sensor networks. What kinds of queries should the
system support? What should the system return if there is no
information for a location for a long period of time?
More specifically, the project will develop (1) a new data
model, (2) an algebra, and (3) a query language for mobile
sensor networks. The result of the project should include a
clear description of the data model and algebra, a prototype
implementation of the proposed operators, a query parser that
generates simple query plans (no optimization), and a simple
query execution engine. The evaluation should demonstrate the
result of executing a set of sample queries on synthetic
data.
Additional references
- Project 2.2. Distributed system infrastructure for a
mobile sensor data management system. This project will
investigate the overall system design for a data management system
for mobile sensor networks. Given a set of mobile sensors
intermittently producing data and given a set of
static users submitting queries about that data, what should the
overall data management system look like? The goal of the project
is to design an overall system architecture and produce a simple
prototype implementation. The performance of the prototype should
be evaluated using a combination of synthetic sensor data and
sensor data traces that we will provide. In this project, you can
assume that queries are simple selections over the sensor data.
Additional references
- Yang Zhang, Bret Hull, Hari Balakrishnan, Samuel Madden.
ICEDB: Intermittently Connected Continuous Query Processing. In Proceedings of ICDE, 2007.
Focus on the overall system architecture. Don't worry about the details of the queries. Take a close
look at the related work section for more references about prioritizing the data transmitted from the sensors.
- M. H. Ali, Mohamed F. Mokbel, Walid G. Aref. Phenomenon-aware Stream Query Processing.
In Proceedings of MDM 2007. Some of the cited papers might also be useful.
- The following paper gives an overview of the architecture of a distributed stream processing engine. This is not
exactly the same as what you need, but it can still give you some ideas. The relevant sections are Sections 3 and 4.
Mitch Cherniack,, Hari Balakrishnan, Magdalena Balazinska, Don Carney, Ugur Cetintemel, Ying Xing, and Stan Zdonik.
Scalable Distributed Stream Processing. In Proc. of CIDR 2003.
- Franklin et. al.
Design Considerations for High Fan-In Systems: The HiFi Approach. In Proc of CIDR 2005.
- Amol Deshpande, Suman Nath, Phillip B. Gibbons, Srinivasan Seshan.
Cache-and-Query for Wide Area Sensor Databases
In Proc. of ACM SIGMOD 2003.
Theme 3. Networking meets databases.
- Project 3.1. Database for network forensic
analysis. Network intrusion detection systems (NIDSs) can
identify potential network intruders in real-time. The problem is
that they produce many false positives. For this reason, an
administrator must perform a forensic analysis on each alert to
determine if it is a real threat or a false alarm. In case of a
real threat, the administrator must investigate the root cause of
the problem. To support these types of analysis, an NIDS must log
the network activity in a historical database. Today, most NIDSs
use special-purpose databases. They believe that custom solutions
are necessary to achieve the high input-rate and fast query
execution required by their application. The goal of this project
is to build a historical database for network flow information
using a standard relational database. The key idea is to use
various forms of data partitioning techniques to achieve the
required performance both during data data insert and query
execution.
References
Theme 4. RFID data management.
Overview: In the Paul Allen Center, we have
deployed an RFID-based infrastructure that allows us to track the
movements of equipment and people in the building. This is done as
part of the RFID
Ecosystem project. The deployment currently consists of over 100
antennas spread over floors 2 through 5 of the building.
The goal of the project is to provide useful services such as
alerting people when they forget their things or help them find the
current location of the book they lent someone a few weeks earlier.
- Project 4.1. Complex event processing on
streams. Operators in a stream processing engine are
analogous to relational operators: join, aggregate, select,
etc. However, in many monitoring applications, in particular
RFID-based systems, users are interested in detecting and
processing complex sequences of events. For example, a sequence of
events of the form: "Detected person P at location A, detected
person P at location B, and detected person P at location A again"
might mean that person P just bought a cup of coffee and went back to her office. Extracting
these types of events from streams requires new types of
operators. The challenge is even greater when users want to
express complex events not directly on the raw data but on
previously extracted events. The goal of this project is to extend
a stream processing engine with the capability to identify and
extract such complex hierarchies of events.
This project will involve significant C/C++ programming and
working with both the Borealis stream processing engine and the
Cayuga publish/subscribe system.
References
- Towards Expressive Publish/Subscribe Systems. Alan Demers, Johannes Gehrke, Mingsheng Hong, Mirek Riedewald, and Walker White. EDBT 2006.
- High-performance complex event processing over streams. Eugene Wu, Yanlei Diao, Shariq Rizvi. SIGMOD 2006.
- Composite Events for Active Databases: Semantics, Contexts and Detection. Sharma Chakravarthy, V. Krishnaprasad, Eman Anwar,
and S.-K. Kim. VLDB 1994.
- On the Semantics of Complex Events in Active Database Management Systems. D. Zimmer. ICDE 1999.
- Abadi et al. Aurora: A New Model and Architecture for Data Stream Management. VLDB Journal 2003. Primarily sections 2 and 5.
- Abadi et al. The Design of the Borealis Stream Processing Engine. Second Biennial Conference on Innovative Data Systems Research (CIDR 2005).
- Project 4.2. RFID security and privacy.
Another important problem with RFID deployments is their threat to
user privacy. The goal of this project is to investigate
techniques for ensuring privacy and security of the RFID data
while offering useful services. This problem is very broad. For your project, you
should define and investigate a more specific problem. Possible
interesting questions include: (a) What access
control rules or mechanisms should the system provide to enable a good
balance between privacy and system utility? (b) How can one
anonymize RFID data in a manner that protects users' privacy but
enables other researchers to use the anonymized traces for their own
experiments and studies? (c) A user can decide at any time
to stop participating in the RFID Ecosystem and remove their data
from the system. How should the system track and remove all the
data derived from the user's raw RFID tag reads. etc.
References
Theme 5. Other.
There are many other possible projects for the class. Below are pointers to a couple of db workshops and conferences that may help you find inspiration (the
list below is not exhaustive):