CSE 544 Class Project Instructions and Suggestions

Final project presentation schedule

Date: Wednesday, December 6th, 9:30am to 11:45am

Location: CSE 403

Please try to attend as many talks as you can.

Time	Team	Project
9:30	Travis Kriplean and Evan Welbourne	Assessing Fine Grained Access Control Techniques for Peer-Based Privacy Concerns in the RFID Ecosystem
9:45	Roxana Geambasu and Tanya Bragin	Exploiting History in a Network Intrusion Detection System
10:00	Bao Nguyen and Thach Nguyen	Selective Sharing of Personal Files
10:15	Abhay Kumar Jha	Rewriting queries to handle Self-Joins in MystiQ
10:30	Natalie Linnell and Sandra Fan	Searching Classroom Lecture Videos Augmented with Slides and Ink
10:45	Ben Lerner and Stephen Friedman	Identifying Musical Themes using Streaming Databases
11:00	Zeinab Abbassi	A Weblog Recommender System Based on Random Walks and Spectral Methods
11:15	Brian DeRenzi and Yaw Anokwa	Flakey: An Archival System for Unreliable Sensor Streams in Stream Processing Engines

Overview

A large portion (35%) of your grade in 544 consists of a final project. This project is meant to be a substantial independent research or engineering effort related to material we have studied in class. Your project may involve a comparison of systems we have read about, an application of database techniques to a system you are familiar with, or be a database-related project in your research area.

This document describes what is expected of a final project and proposes some possible project ideas.

What is expected

Good class projects can vary dramatically in complexity, scope, and topic. The only requirement is that they be related to something we have studied in this class and that they contain some element of research -- e.g., that you do more than simply engineer a piece of software that someone else has described or architected. To help you determine if your idea is of reasonable scope, we will arrange to meet with each group several times throughout the quarter.

Project schedule:

October 4th 2006: Project teams formed: You should have formed your 2-person team and emailed us the names of both partners.
Week of October 9th 2006: Please schedule an appointment with Prof. Balazinska to discuss the project you are planning on undertaking.
October 18th 2006: Project proposal is due.
Week of October 23rd 2006: Feedback on your proposal by email.
Week of November 6th 2006: Please schedule an appointment with Prof. Balazinska to discuss your project.
November 8th 2006: Project milestone report.
Week of November 13th 2006: Feedback on your report by email.
Week of November 27th 2006: Please schedule an appointment with Prof. Balazinska to discuss your project.
December 4th 2006: Final project report due.
December 6th 2006: 15-minute project presentations.
Please refer to the grading guidelines when preparing your presentation.

Hand-in details

As part of the project, you need to hand in the following three documents.

Project proposals

Format: up to 2 pages in length, 2-columns, 10pt font, single-spaced. You can use one of many standard conference proceedings formats.

The proposal should include:

The description of the problem you are trying to solve. The motivation for this problem: why is this an important problem?
The approach you plan to take to solve this problem.
Your plan for this quarter. What do you plan to achieve, and by what date?
A sketch of how you plan to evaluate your solution.
A list of at least two papers related to your problem.
A list of resources you will need that you do not already have.

Project milestone

Format: up to 4 pages in length, 2-columns, 10pt font, single-spaced

The report should include:

The description of the problem you are trying to solve.
The description of the approach you are taking to solve the problem.
A description of related work. How does your approach compare with what has been done before? You must cite at least 3 papers.
A description of what you have accomplished so far and any problems or unexpected issues that you encountered.
A plan for the rest of the quarter. This plan must include the experiments you will conduct to evaluate your solution.

Project final report

Format: up to 8 pages in length, 2-columns, 10pt font, single-spaced

You should model the content of your report on the papers we have read this quarter. The report should include at least the following information:

The motivation and description of the problem you solved.
The description of your solution, its merits, and its limitations. Provide both a short overview and a more detailed description of each main component of your solution.
An overview of related work. How does your approach compare with what has been done before?
An evaluation section showing the functionality, properties, and performance of your scheme.
Your conclusions.

Project ideas

The following is a list of possible project ideas; you are not required to choose from this list -- in fact, we encourage you to try to solve a problem of your own choosing! If you are interested in working on one of these projects, contact Prof. Balazinska.

When you look for related work, the main database conferences are: SIGMOD/PODS, VLDB, ICDE, and CIDR. The main database journals are TODS, VLDBJ, DEB, and SIGMOD Record.

Data management for mobile sensors (cell phones). Stream processing engines (SPEs) are data management systems for sensor data. SPEs assume that sensors are static and produce data continuously. In practice, however, a broad class of sensors is mobile and may not always be connected. One concrete example is the cell phones that people carry around and occasionally use to take pictures. If a system tries to use cell phones as streams of pictures for monitoring purposes, not all parts of the environment will be covered uniformly or continuously. The system may not have any information about certain locations for long periods of time. This project will try to answer the question of query semantics in these types of environments, where sensors produce data intermittently. What kinds of queries can the system support? What should the system return if there is no information for a location for a long period of time? The project will also investigate how to extend the Borealis stream processing engine to support intermittently connected data sources. Resources: We will provide you with a couple of cell phones and the Borealis stream processing engine.

References
- TinyDB: An Acquisitional Query Processing System for Sensor Networks. Samuel Madden, Michael Franklin, Joseph Hellerstein, and Wei Hong. In TODS 30(1), 2005.
- CarTel: A Distributed Mobile Sensor Computing System. Bret Hull, Vladimir Bychkovskiy, Kevin Chen, Michel Goraczko, Eugene Shih, Yang Zhang, Hari Balakrishnan, Samuel Madden. In Proc. of SenSys 2006.
- Workshop on World-Sensor-Web (WSW'2006): Mobile Device Centric Sensory Networks and Applications
- MobiDE 2006. Fifth International ACM Workshop on Data Engineering for Wireless and Mobile Access, and earlier editions of this workshop.

Exploiting history in a network intrusion detection system. Network intrusion detection systems can identify potential network intruders in real time. The problem is that they produce many false positives. For this reason, an administrator must regularly examine the alerts that are produced. For each potential intruder, the administrator must study the historical activity of the intruder on the network to determine whether the alert is a real threat. Because the stream of network information arrives at a high rate, the raw data barely has time to be stored on disk and is thus not indexed. As a result, extracting the historical activity of a potential intruder takes a long time. The goal of this project is to study techniques to speed up these queries by either building partial indexes on the raw data when an intrusion is detected, or materializing the historical information for each intrusion before the user requests it, or using some combination of these techniques. Resources: We will provide you with anonymized network connection traces to use as input data.
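
The partial-index option can be illustrated with a minimal sketch. Everything here is hypothetical and only meant to make the idea concrete: the `Connection` record, the `ConnectionLog` class, and the choice of the source address as the indexed key are assumptions, and an in-memory list stands in for the raw data on disk. Only sources that have triggered an alert are indexed, so the high-rate append path stays cheap for everyone else.

```python
from typing import Dict, List, NamedTuple


class Connection(NamedTuple):
    time: int   # hypothetical timestamp
    src: str    # source address of the connection
    dst: str    # destination address of the connection


class ConnectionLog:
    """Append-only raw log with an index covering flagged sources only."""

    def __init__(self) -> None:
        self.records: List[Connection] = []          # raw data, unindexed
        self.partial_index: Dict[str, List[int]] = {}  # suspect src -> positions

    def append(self, conn: Connection) -> None:
        self.records.append(conn)
        positions = self.partial_index.get(conn.src)
        if positions is not None:
            # Only already-flagged sources pay the index-maintenance cost.
            positions.append(len(self.records) - 1)

    def flag(self, src: str) -> None:
        """Called when an alert fires: scan once and index the suspect's past."""
        if src not in self.partial_index:
            self.partial_index[src] = [
                i for i, c in enumerate(self.records) if c.src == src
            ]

    def history(self, src: str) -> List[Connection]:
        positions = self.partial_index.get(src)
        if positions is None:
            # Source was never flagged: fall back to a full scan.
            return [c for c in self.records if c.src == src]
        return [self.records[i] for i in positions]
```

In this sketch the cost of the full scan is paid once, in `flag`, rather than each time the administrator asks for the suspect's history. Materializing the history ahead of time would instead store the records themselves rather than their positions.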

References

- The case for partial indexes. M. Stonebraker. ACM SIGMOD Record, Volume 18, Issue 4, December 1989.
- Partial Indexing for Nonuniform Data Distributions in Relational DBMS's. C. Sartori and M. R. Scalas. IEEE Transactions on Knowledge and Data Engineering, Volume 6, Issue 3, June 1994.
- Generalized Partial Indexes. P. Seshadri and A. N. Swami. ICDE 1995.
- Optimizing queries using materialized views: a practical, scalable solution. Jonathan Goldstein and Per-Åke Larson. SIGMOD 2001.

Complex event processing on streams. Operators in a stream processing engine are analogous to relational operators: join, aggregate, select, etc. However, in many monitoring applications, users are interested in detecting and processing complex sequences of events. For example, a sequence of events of the form: "Detected person P at location A, detected person P at location B, and detected person P at location A again" might mean that person P just bought a cup of coffee. Extracting these types of events from streams requires new types of operators. The challenge is even greater when users want to express complex events not directly on the raw data but on previously extracted events. The goal of this project is to extend a stream processing engine with the capability to identify and extract such complex hierarchies of events.
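
The coffee example above can be sketched as a small sequence-detection operator. This is an illustration under stated assumptions, not the design of any particular engine: the `Sighting` record and the `detect_sequence` function are hypothetical names, matching is per person, matches do not overlap, and any other sighting of the same person in between resets the partial match.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, NamedTuple, Sequence, Tuple


class Sighting(NamedTuple):
    time: int       # hypothetical timestamp
    person: str
    location: str


def detect_sequence(
    stream: Iterable[Sighting], pattern: Sequence[str]
) -> Iterator[Tuple[str, int, int]]:
    """Yield (person, start_time, end_time) for each match of `pattern`.

    `pattern` is a non-empty list of locations that one person must
    visit in consecutive sightings, for example ["A", "B", "A"].
    """
    progress: Dict[str, List[Sighting]] = defaultdict(list)
    for s in stream:
        matched = progress[s.person]
        if s.location == pattern[len(matched)]:
            matched.append(s)            # extends the partial match
        elif s.location == pattern[0]:
            matched[:] = [s]             # restart from this sighting
        else:
            matched.clear()              # partial match is broken
        if len(matched) == len(pattern):
            yield (s.person, matched[0].time, s.time)
            matched.clear()              # matches do not overlap


sightings = [
    Sighting(1, "P", "A"),
    Sighting(2, "Q", "A"),
    Sighting(3, "P", "B"),
    Sighting(4, "P", "A"),
    Sighting(5, "Q", "C"),
]
coffee_events = list(detect_sequence(sightings, ["A", "B", "A"]))
```

The output is itself a stream of events, so in principle it could feed another detector, which is one way to read the hierarchy of events the project asks for.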

References
- High-performance complex event processing over streams. Eugene Wu, Yanlei Diao, Shariq Rizvi. SIGMOD 2006.
- Composite Events for Active Databases: Semantics, Contexts and Detection. Sharma Chakravarthy, V. Krishnaprasad, Eman Anwar, and S.-K. Kim. VLDB 1994.
- On the Semantics of Complex Events in Active Database Management Systems. D. Zimmer. ICDE 1999.
- Towards Expressive Publish/Subscribe Systems. Alan Demers, Johannes Gehrke, Mingsheng Hong, Mirek Riedewald, and Walker White. EDBT 2006.

RFID security and privacy. In the Paul Allen Center, we are currently deploying an RFID-based infrastructure that will allow us to track the movements of equipment and people in the building. This is done as part of the RFID Ecosystem project. We already have a 3-floor deployment and several weeks of data. The goal of the RFID Ecosystem project is to provide useful services, such as alerting people when they forget their things or helping them find the current location of a book they lent someone a few weeks earlier. The problem is that building such an infrastructure raises several privacy and security concerns. The goal of this class project is to investigate techniques for ensuring the privacy and security of the RFID data while still offering useful services.

References
- Two can keep a secret: a distributed architecture for secure database services. Aggarwal et al. CIDR 2005.
- k-Anonymity: a model for protecting privacy. L. Sweeney. (there are a few relevant publications listed on this page)
- Towards Robustness in Query Auditing. S. Nabar et al. VLDB 2006.

Historical statistical indexes. In a system that monitors an environment continuously, data archives can quickly grow to gigabytes or even terabytes in size. Aggregate queries over such large collections are often needed but are slow to execute. To speed up these queries, one approach is to build a summary structure such as a histogram or a wavelet and use that synopsis for answering queries. We observe, however, that input data often follows a distribution and that events cause that distribution to change. For instance, the load on a server can oscillate slightly around some level until a user starts a resource-intensive task, causing the load to suddenly spike. Once the task completes, the load will go back to its previous level. Techniques exist to identify such changes in data distribution automatically. The goal of this project is to exploit information about data distribution and changes in distribution to build not a single summary structure but a sequence of such structures, one for each period of time with a given data distribution. The goal is to use this sequence of summary structures to answer historical queries more accurately than with a single global synopsis.
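
The server-load example can be made concrete with a small sketch. It is only an illustration under simplifying assumptions: the change points are taken as given (in practice a change-detection technique would supply them), a per-segment mean stands in for a richer synopsis such as a histogram or wavelet, and the names `SegmentSummary`, `build_summaries`, and `estimate_average` are hypothetical.

```python
from typing import List, NamedTuple, Sequence


class SegmentSummary(NamedTuple):
    start: int    # first time step covered (inclusive)
    end: int      # end of the covered range (exclusive)
    mean: float   # stand-in for a richer synopsis such as a histogram


def build_summaries(
    values: Sequence[float], change_points: Sequence[int]
) -> List[SegmentSummary]:
    """Build one summary per period between consecutive change points."""
    bounds = [0, *change_points, len(values)]
    return [
        SegmentSummary(lo, hi, sum(values[lo:hi]) / (hi - lo))
        for lo, hi in zip(bounds, bounds[1:])
        if hi > lo
    ]


def estimate_average(summaries: Sequence[SegmentSummary], t1: int, t2: int) -> float:
    """Estimate the average value over time steps [t1, t2) from summaries only."""
    total = 0.0
    covered = 0
    for s in summaries:
        overlap = min(s.end, t2) - max(s.start, t1)
        if overlap > 0:
            total += overlap * s.mean   # assumes values are flat within a segment
            covered += overlap
    if covered == 0:
        raise ValueError("query range does not overlap any summary")
    return total / covered


# Load is low, spikes while a heavy task runs (steps 4 to 6), then drops back.
load = [1, 1, 1, 1, 9, 9, 9, 1, 1, 1]
per_period = build_summaries(load, change_points=[4, 7])
single_global = build_summaries(load, change_points=[])

spike_from_sequence = estimate_average(per_period, 4, 7)    # uses the spike's own summary
spike_from_global = estimate_average(single_global, 4, 7)   # smears the spike over all data
```

For the query over the spike, the sequence of summaries recovers the true average of 9, while the single global summary can only report the overall mean of about 3.4.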

References
- Models and issues in data stream systems. Babcock et al. PODS 2002. (Section 6).
- Detecting Change in Data Streams. Dan Kifer, Shai Ben-David, and Johannes Gehrke. VLDB 2004.

Selective sharing of personal files. The SharedViews system enables users to organize and share their files selectively with other users on the Internet. Executing queries in this environment translates into executing possibly complex distributed queries in a federated system. As users execute queries, the system can cache query results at different nodes. Such caches can improve query execution performance and system availability. The challenge, however, is that each node in the system corresponds to a personal computer. Ideally, each user should be able to set a cap on the amount of resources used to process other users' queries. This may be hard to guarantee because queries are optimized and executed independently of one another. The goal of the project is to investigate protocols that will enable query execution to exploit the presence of caches while respecting the user-defined cap on resource utilization.

References
- SharedViews technical report.
- XXX