The schedule is available here.
A large portion of your grade in 544 consists of a final project. This project is meant to be a piece of independent research or engineering effort related to material we have studied in class. Your project may be a comparison of systems we have read about, an application of database techniques to a problem you are familiar with, or a database-related project in your research area. You can either define your own project (e.g. it can be the data management component of your research project), or you can choose one of the project suggestions below, and possibly adapt it. You will work in groups of 1-3 members.
Deadline | Milestone | |
---|---|---|
Jan. 14th | M1: Form groups | |
Jan. 28th | M2: Proposals due | Drop Box |
Feb. 18th | M3: Project Milestone due | Drop Box |
March 16th | M4: Project presentations and Final Project Due | Drop Box |
March 18th | M5: Final Project Due | Drop Box |
M1: Groups | Send email to all course instructors (Jingjing, Jennifer, and Magda): |
---|---|
M2: Proposal | Your proposal should be about 1 page in length. Suggested content: |
M3: Milestone report | Start from the project proposal, and extend it to a report of about 3-4 pages in length. Suggested content: |
M4: Presentations/Posters | Details to be announced. |
M5: Final report | Start from the project milestone, and extend it to a 6-7 page report. Suggested additional content: |
You have two options: you can start from one of the project ideas listed below, or you can define your own database project that is related to research in your main research area.
Good class projects can vary dramatically in complexity, scope, and topic. The only requirement is that they be related to something we have studied in this class and that they contain some element of research or involve a significant engineering effort. To help you determine whether your idea is appropriate and of reasonable scope, we will provide you with feedback throughout the semester.
Here are examples of successful projects from the previous offerings of this course.
Note. It is OK if several teams choose the same project; quite likely different teams will come up with quite different approaches, which will make the entire process even more interesting.
1 | Project with the Myria parallel data management system and service: In the database group, together with the UW eScience Institute, we recently developed a new parallel big data management system and service called Myria. Myria is a great platform for research in data management. There are several open problems that would make great class projects using that system.
|
|
---|---|---|
2 | Big data stream processing: The goal of this project is to integrate S-Store (http://www.vldb.org/pvldb/vol7/p1633-cetintemel.pdf), a streaming NewSQL system, with the Myria parallel data management system to enable analyses that combine both streaming transaction data and archival transaction data.
|
|
3 | Parallel data management on heterogeneous hardware: Explore the challenges and opportunities related to running a parallel data management system on highly heterogeneous machines. These machines can have vastly different amounts of memory, CPU, and network bandwidth resources. |
|
4 | System benchmarking: The goal of this project is to compare your choice of big data management systems or cloud services on a fixed workload. The project will define the workload and will pick the set of systems to compare.
|
|
5 | Big data processing with applications in astronomy: We recently developed a new service called MyMergerTree, which enables the analysis of galactic merger trees from astronomy simulations. The goal of this project will be to make the MyMergerTree service much more efficient. Can all queries be made interactive?
|
|
6 | Social data analytics with applications in orthodontics: This project will involve working with a resident in the Department of Orthodontics to analyze social media posts (specifically on Twitter and Google Plus) by Invisalign, a company that provides a clear appliance for moving teeth. The goal is to determine the essential themes of these posts (mapping the keywords) and potentially to identify scientifically correct and incorrect information that the company provides directly to patients. The data management challenges will relate to trying different technologies to solve this specific data management problem efficiently. The team would work with the resident to define the problem more precisely in the context of the class project.
|
|
7 | Data cleaning and analytics with applications in health metrics and evaluation: The Global Burden of Disease (GBD) Study quantifies factors that affect health and health risks for the world’s largest populations, a critical component in developing policies and prioritizing responses to health threats. The study is based on rigorous analysis of more than 30,000 data sources obtained from nearly 200 countries around the world, including survey and census data, disease registries, clinical and claims records, demographic data, expenditure data, behavioral data, and environmental data.
Producing the Study is a big data problem. The data volumes are significant but not overwhelming (low-order TBs); the variety of data sources, however, creates huge challenges. There are 300+ conditions (diseases, injuries, and risk factors) for which disease parameters such as prevalence, mortality, or exposure are estimated, and these estimates are made for 200+ countries/non-country geographic units, for 30+ time periods, for male and female sexes, and for 20 age groups. Each data source has its own error characteristics. The data are large enough that even seemingly simple tasks become computationally difficult. Consider a data cleansing task: adjusting deaths by country by cause (which come from many unreliable data sources) so that the totals equal deaths by country (which is more reliable data). It is expected that this use case can be handled by an appropriately structured parallel database, but there is some complexity in doing so, especially to achieve response times on the order of a few seconds.
The project is to implement the above use case in the Myria parallel data management system and explore adding novel functionality in support of data cleansing, error detection in newly added data, and use of machine learning to mine the data. |
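To make the data cleansing task in project 7 concrete, here is a minimal single-machine sketch in plain Python. It assumes the adjustment is a simple proportional rescaling of each country's per-cause counts; the project description does not specify the adjustment method, so this is only one plausible reading. The function name, the dictionary layout, and the sample numbers are all hypothetical, and nothing here uses Myria itself.

```python
from collections import defaultdict


def rescale_cause_deaths(cause_deaths, country_totals):
    """Scale per-cause death counts so each country's sum matches its trusted total.

    cause_deaths:   dict mapping (country, cause) -> estimated deaths (unreliable)
    country_totals: dict mapping country -> total deaths (more reliable)

    Assumption: proportional rescaling; the real study may use another method.
    """
    # First pass: sum the unreliable per-cause estimates for each country.
    raw_sums = defaultdict(float)
    for (country, _cause), deaths in cause_deaths.items():
        raw_sums[country] += deaths

    # Second pass: multiply each estimate by (trusted total / raw sum).
    adjusted = {}
    for (country, cause), deaths in cause_deaths.items():
        raw = raw_sums[country]
        if country not in country_totals or raw == 0:
            # No trusted total, or nothing to scale: leave the value unchanged.
            adjusted[(country, cause)] = deaths
        else:
            adjusted[(country, cause)] = deaths * country_totals[country] / raw
    return adjusted


# Hypothetical sample data, for illustration only.
sample_cause_deaths = {
    ("A", "injury"): 30,
    ("A", "disease"): 10,
    ("B", "injury"): 5,
}
sample_country_totals = {"A": 80, "B": 5}

adjusted = rescale_cause_deaths(sample_cause_deaths, sample_country_totals)
for key in sorted(adjusted):
    print(key, adjusted[key])
```

In a parallel database the same two passes correspond to a group-by aggregate on country followed by a join back to the per-cause table, which is where partitioning choices start to affect response time.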