CSE 544 Class Project Instructions and Suggestions
Final project presentation schedule
Date: Wednesday, December 6th, 9:30am to 11:45am
Location: CSE 403
Please try to attend as many talks as you can.
Time | Team | Title
9:30 | Travis Kriplean and Evan Welbourne | Assessing Fine Grained Access Control Techniques for Peer-Based Privacy Concerns in the RFID Ecosystem
9:45 | Roxana Geambasu and Tanya Bragin | Exploiting History in a Network Intrusion Detection System
10:00 | Bao Nguyen and Thach Nguyen | Selective Sharing of Personal Files
10:15 | Abhay Kumar Jha | Rewriting queries to handle Self-Joins in MystiQ
10:30 | Natalie Linnell and Sandra Fan | Searching Classroom Lecture Videos Augmented with Slides and Ink
10:45 | Ben Lerner and Stephen Friedman | Identifying Musical Themes using Streaming Databases
11:00 | Zeinab Abbassi | A Weblog Recommender System Based on Random Walks and Spectral Methods
11:15 | Brian DeRenzi and Yaw Anokwa | Flakey: An Archival System for Unreliable Sensor Streams in Stream Processing Engines
Overview
A large portion (35%) of your grade in 544 consists of a final
project. This project is meant to be a substantial independent
research or engineering effort related to material we have studied in
class. Your project may involve a comparison of systems we have read
about, an application of database techniques to a system you are
familiar with, or be a database-related project in your research area.
This document describes what is expected of a final project and
proposes some possible project ideas.
What is expected
Good class projects can vary dramatically in complexity, scope, and
topic. The only requirement is that they be related to something we
have studied in this class and that they contain some element of
research -- e.g., that you do more than simply engineer a piece of
software that someone else has described or architected. To help you
determine if your idea is of reasonable scope, we will arrange to meet
with each group several times throughout the quarter.
Project schedule:
- October 4th 2006: Project teams formed: You should
have formed your 2-person team and emailed us the names of both partners.
- Week of October 9th 2006: Please schedule an appointment
with Prof. Balazinska to discuss the project you are planning on undertaking.
- October 18th 2006: Project proposal is due.
- Week of October 23rd 2006: Feedback on your proposal by email.
- Week of November 6th 2006: Please schedule an appointment
with Prof. Balazinska to discuss your project.
- November 8th 2006: Project milestone report.
- Week of November 13th 2006: Feedback on your report by email.
- Week of November 27th 2006: Please schedule an appointment
with Prof. Balazinska to discuss your project.
- December 4th 2006: Final project report due.
- December 6th 2006: 15-minute project presentations.
Please refer to the grading guidelines when preparing your presentation.
Hand-in details
As part of the project, you need to hand in the following three documents.
Project proposals
Format: up to 2 pages in length, 2-column, 10pt font, single-spaced. You can use one of the many standard conference proceedings formats.
The proposal should include:
- The description of the problem you are trying to solve. The
motivation for this problem: why is this an important problem?
- The approach you plan to take to solve this problem.
- Your plan for this quarter. What do you plan to achieve, and by what date?
- A sketch of how you plan to evaluate your solution.
- A list of at least two papers related to your problem.
- A list of resources you will need that you do not already have.
Project milestone
Format: up to 4 pages in length, 2-column, 10pt font, single-spaced.
The report should include:
- The description of the problem you are trying to solve.
- The description of the approach you are taking to solve the problem.
- A description of related work. How does your approach
compare with what has been done before? You must cite at least 3 papers.
- A description of what you have accomplished so far and
any problems or unexpected issues that you encountered.
- A plan for the rest of the quarter. This plan must include
the experiments you will conduct to evaluate your solution.
Project final report
Format: up to 8 pages in length, 2-column, 10pt font, single-spaced.
You should model the content of your report on the papers we have read this
quarter. The report should include at least the following information:
- The motivation and description of the problem you solved.
- The description of your solution, its merits, and its limitations.
Provide both a short overview and a more detailed description of
each main component of your solution.
- An overview of related work. How does your approach
compare with what has been done before?
- An evaluation section showing the functionality, properties, and performance
of your scheme.
- Your conclusions.
Project ideas
The following is a list of possible project ideas; you are not
required to choose from this list -- in fact, we encourage you to try
to solve a problem of your own choosing! If you are interested in
working on one of these projects, contact Prof. Balazinska.
When you look for related work, the main database conferences are: SIGMOD/PODS, VLDB, ICDE, and CIDR. The main database journals are TODS, VLDBJ, DEB, and SIGMOD Record.
- Data management for mobile sensors (cell
phones). Stream processing engines (SPEs) are data management
systems for sensor data. SPEs assume that
sensors are static and produce data continuously. In
practice, however, a broad class of sensors is
mobile and may not always be connected. One concrete example is the
cell phones that people carry around with them and use
occasionally to take pictures. If a system tries to use cell phones
as streams of pictures for monitoring purposes, not all parts of the
environment will be covered uniformly or continuously. The system may not
have any information about certain locations for long periods of time.
This project will try to answer the question of query semantics in these types of
environments, where sensors produce data intermittently. What kinds of queries
can the system support? What should the system return if there is no
information for a location for a long period of time? The project will also
investigate how to extend the Borealis stream processing engine
to support intermittently connected data sources. Resources:
We will provide you with a couple of cell phones and the
Borealis stream processing engine.
References
- TinyDB: An Acquisitional Query Processing System for Sensor Networks. Samuel Madden, Michael Franklin, Joseph Hellerstein, and Wei Hong. In TODS 30(1), 2005.
- CarTel: A Distributed Mobile Sensor Computing System. Bret Hull, Vladimir Bychkovskiy, Kevin Chen, Michel Goraczko, Eugene Shih, Yang Zhang, Hari Balakrishnan, Samuel Madden. In Proc. of SenSys 2006.
- Workshop on World-Sensor-Web (WSW'2006): Mobile Device Centric Sensory Networks and Applications
- MobiDE 2006. Fifth International ACM Workshop on Data Engineering for Wireless and Mobile Access and earlier versions of this workshop
- Exploiting history in a network intrusion detection
system. Network intrusion detection systems can identify
potential network intruders in real-time. The problem is that they produce many false
positives. For this reason, an administrator must regularly examine the
alerts that are produced. For each potential intruder, the administrator
must study the historical activity of the intruder on the network
to determine if the alert is a real threat or not. Because network
information arrives at a high rate, there is barely enough time to store the raw data on disk,
so the data is not indexed. As a result, extracting the historical activity of a
potential intruder
takes a long time. The goal of this project is to study
techniques to speed up these queries by building partial
indexes on the raw data when an intrusion is detected, materializing the historical information for each intrusion before the
user requests it, or using some combination of these techniques. Resources: We will provide you with anonymized network connection
traces to use as input data.
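The partial-index option described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not a proposed design: the `ConnectionLog` class, its method names, and the `(timestamp, source, destination)` record layout are hypothetical, and an in-memory list stands in for the unindexed on-disk trace.

```python
class ConnectionLog:
    """Append-only connection log with per-suspect partial indexes (hypothetical sketch)."""

    def __init__(self):
        self.records = []        # raw (ts, src, dst) records, appended without indexing
        self.partial_index = {}  # suspect source -> positions of its records

    def append(self, ts, src, dst):
        pos = len(self.records)
        self.records.append((ts, src, dst))
        # Only sources that already triggered an alert pay the indexing cost.
        if src in self.partial_index:
            self.partial_index[src].append(pos)

    def on_alert(self, src):
        """Build the partial index for a suspect with one scan at alert time."""
        if src not in self.partial_index:
            self.partial_index[src] = [
                i for i, rec in enumerate(self.records) if rec[1] == src
            ]

    def history(self, src):
        """Return all records for src, using the partial index when one exists."""
        if src in self.partial_index:
            return [self.records[i] for i in self.partial_index[src]]
        return [rec for rec in self.records if rec[1] == src]  # fallback: full scan


log = ConnectionLog()
log.append(1, "192.0.2.7", "10.0.0.1")
log.append(2, "198.51.100.3", "10.0.0.1")
log.on_alert("192.0.2.7")               # alert raised: index this source only
log.append(3, "192.0.2.7", "10.0.0.2")  # later traffic keeps the index current
print(log.history("192.0.2.7"))
```

The design choice here is to move the scan from query time to alert time, so that the administrator's later history request does not wait on a full pass over the raw data.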
- Complex event processing on streams. Operators in a stream processing engine are analogous to relational
operators: join, aggregate, select, etc. However, in many monitoring
applications, users are interested in detecting and processing
complex sequences of events. For example, a sequence of events of the form:
"Detected person P at location A, detected person P at location B, and
detected person P at location A again" might mean that person P just
bought a cup of coffee. Extracting these types of events from streams
requires new types of operators. The challenge is even greater when users want to express complex events not directly on the raw data but on previously extracted events. The goal of this project is to extend a stream processing engine with the
capability to identify and extract such complex hierarchies of events.
References
- High-performance complex event processing over streams. Eugene Wu, Yanlei Diao, Shariq Rizvi. SIGMOD 2006.
- Composite Events for Active Databases: Semantics, Contexts and Detection. Sharma Chakravarthy, V. Krishnaprasad, Eman Anwar,
and S.-K. Kim. VLDB 1994.
- On the Semantics of Complex Events in Active Database Management Systems. D. Zimmer. ICDE 1999.
- Towards Expressive Publish/Subscribe Systems. Alan Demers, Johannes Gehrke, Mingsheng Hong, Mirek Riedewald, and Walker White. EDBT 2006.
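To make the sequence example from the complex event processing idea concrete (person P detected at A, then B, then A again), here is a minimal sketch of a sequence operator. It is not how any particular stream processing engine implements this. The function name `detect_sequences`, the `(timestamp, person, location)` event layout, and the choice to match only strictly consecutive detections per person are all assumptions made for illustration.

```python
from collections import defaultdict, deque


def detect_sequences(events, pattern):
    """Yield (person, timestamps) whenever a person's most recent
    consecutive detections match `pattern` exactly.

    events:  iterable of (timestamp, person, location), in arrival order
    pattern: tuple of locations, e.g. ("A", "B", "A")
    """
    n = len(pattern)
    # Per-person sliding window over the last n detections.
    recent = defaultdict(lambda: deque(maxlen=n))
    for ts, person, location in events:
        window = recent[person]
        window.append((ts, location))
        if len(window) == n and all(
            loc == want for (_, loc), want in zip(window, pattern)
        ):
            yield person, tuple(t for t, _ in window)


stream = [
    (1, "P", "A"), (2, "Q", "A"), (3, "P", "B"),
    (4, "Q", "A"), (5, "P", "A"), (6, "Q", "B"),
]
for person, times in detect_sequences(stream, ("A", "B", "A")):
    # Each match could itself be emitted as a higher-level event
    # (for instance a hypothetical "bought coffee" event) for further matching.
    print(person, times)
```

Because the output is again a stream of events, a second operator could consume it. That is one simple way to think about the hierarchies of events mentioned above, although real systems support much richer semantics, such as time windows, negation, and non-contiguous matches.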
- RFID security and privacy. In the Paul Allen Center, we are currently deploying an RFID-based infrastructure that will allow us to track the movements of equipment and people in the building. This is done as part of the RFID Ecosystem project. We already have a 3-floor deployment and several weeks of data. The goal of the RFID Ecosystem is to provide useful services such as alerting people when they forget their things or helping them find the current location of a book they lent someone a few weeks earlier. The problem is that building such an infrastructure
raises several privacy and security concerns. The goal of this class project is to investigate techniques for ensuring the privacy and security of the RFID data while still offering useful services.
- Historical statistical indexes. In a system that monitors an environment continuously, data archives can quickly grow to gigabytes or even terabytes in size. Aggregate queries over such large collections are often needed but are slow to execute. To speed up these queries, one approach is to build a summary structure such as a histogram or a wavelet and use that synopsis for answering queries. We observe, however, that input data often follows a distribution and that events cause that distribution to change. For instance, the load on a server can oscillate slightly around some threshold until a user starts a resource-intensive task, causing the load to suddenly spike. Once the task completes, the load goes back to its previous level. Techniques exist to identify such changes in data distribution automatically. The goal of this project is to exploit information about data distribution and changes in distribution to build not a single summary structure but a sequence of such structures, one for each period of time with a given data distribution. The goal is to use this sequence of summary structures to answer historical queries more accurately than with a single global synopsis.
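The server-load example above can be illustrated with a small sketch that compares one global summary against a sequence of per-period summaries. Several simplifications are assumed here. A count-and-mean pair stands in for a real histogram or wavelet. The period boundaries are passed in by hand rather than found by a change-detection technique. Readings are assumed to be evenly spread in time within each period. The names `SegmentSummary`, `build_summaries`, and `estimate_average` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SegmentSummary:
    start: float  # inclusive start time of the period
    end: float    # exclusive end time of the period
    count: int    # number of readings in the period
    mean: float   # mean value of those readings


def build_summaries(readings, boundaries):
    """Build one summary per period [boundaries[i], boundaries[i+1])."""
    summaries = []
    for start, end in zip(boundaries, boundaries[1:]):
        values = [v for t, v in readings if start <= t < end]
        mean = sum(values) / len(values) if values else 0.0
        summaries.append(SegmentSummary(start, end, len(values), mean))
    return summaries


def estimate_average(summaries, lo, hi):
    """Estimate the average reading for lo <= t < hi from summaries alone."""
    total = weight = 0.0
    for s in summaries:
        overlap = max(0.0, min(hi, s.end) - max(lo, s.start))
        w = s.count * overlap / (s.end - s.start)  # estimated readings in range
        total += w * s.mean
        weight += w
    return total / weight if weight else None


# Load is 1.0, spikes to 9.0 during t in [10, 15), then returns to 1.0.
readings = [(t, 9.0 if 10 <= t < 15 else 1.0) for t in range(20)]
per_period = build_summaries(readings, [0, 10, 15, 20])
single = build_summaries(readings, [0, 20])
print(estimate_average(per_period, 10, 15))  # estimate from per-period summaries
print(estimate_average(single, 10, 15))      # estimate from one global summary
```

In this toy setting the per-period summaries recover the average during the spike, while the single global summary blends the spike with the quiet periods. How well this carries over to real synopses and real workloads is the question the project would investigate.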
- Selective sharing of personal files. The SharedViews system enables users to organize and share their files selectively with other users on the Internet.
Executing queries in this environment translates into executing possibly complex distributed queries in a federated system. As users execute queries, the system can cache query results at different nodes. Such caches can improve query execution performance and system availability. The challenge, however, is that each node in the system corresponds to a personal computer. Ideally, each user should be able to set a cap on the amount of resources used to process other users' queries. This may be hard to guarantee because queries are optimized and executed independently of one another. The goal of the project is to investigate protocols that will enable query execution to exploit the presence of caches while respecting the user-defined limits on resource utilization.
References