CSE 544: Project

Overview

A large portion of your grade in 544 consists of a final project. This project is meant to be a piece of independent research or engineering effort related to material we have studied in class. Your project may be a comparison of systems we have read about, an application of database techniques to a problem you are familiar with, or a database-related project in your research area. You can either define your own project (e.g., the data management component of your research project), or you can choose one of the project suggestions below, and possibly adapt it. You will work in groups of 1-3 members.

Project Schedule

Deliverables

M1: Groups	Send email to all course instructors (Jingjing, Jennifer, and Magda) with:
	- The team members' names
	- A ranked list of projects that you are tentatively considering
	Your email should cc all your team members.
M2: Proposal	Your proposal should be about 1 page in length. Suggested content:
	- Project title, team members
	- A short description of the project
	- A short list of references (at least 1 paper, but definitely not more than 5)
	- Important: what tools, datasets, and systems you are planning to use
M3: Milestone report	Start from the project proposal, and extend it to a report of about 3-4 pages in length. Suggested content:
	- The description of the problem you are trying to solve
	- The description of your approach
	- A description of related work: how does your approach compare with what has been done before? Suggestion: cite about 3 papers. Don't worry about finding all the existing related work. Our worry here is not the novelty of your project. The goal is for you to learn that placing your work in perspective of what has been done before is a crucial step in research. For the purpose of the class, we expect your related work to be incomplete.
	- A description of what you have accomplished so far and any problems or unexpected issues that you encountered
	- A brief plan for the rest of the quarter. This plan must include the experiments you will conduct to evaluate your solution.
M4: Presentations/Posters	Details to be announced.
M5: Final report	Start from the project milestone, and extend it to a report of 6-7 pages. Suggested additional content:
	- Improve and expand the presentation of your approach
	- Improve and expand the presentation of the related work, by giving more technical detail on how your approach is similar to, or different from, related work
	- Include an evaluation section, with graphs
	- Conclusion

Choosing a project

You have two options: you can start from one of the project ideas listed below, or you can define your own database project that is related to research in your main research area.

What is expected

Good class projects can vary dramatically in complexity, scope, and topic. The only requirement is that they be related to something we have studied in this class and that they contain some element of research or involve a significant engineering effort. To help you determine if your idea is appropriate and of reasonable scope, we will provide you with feedback throughout the quarter.

Example projects

Here are examples of successful projects from the previous offerings of this course.

Suggested Projects

Note. It is OK if several teams choose the same project; quite likely different teams will come up with quite different approaches, which will make the entire process even more interesting.

1	Project with the Myria parallel data management system and service
2	Big data stream processing
3	Parallel data management on heterogeneous hardware
4	System benchmarking
5	Big data processing with applications in astronomy
6	Social data analytics with applications in orthodontics
7	Data cleaning and analytics with applications in health metrics and evaluation

Project with the Myria parallel data management system and service

In the database group together with the UW eScience Institute, we recently developed a new parallel big data management system and service called Myria. Myria is a great platform for research in data management. There are several open problems that would make great class projects using that system.

Query scheduler: Design, implement, and evaluate a query scheduler in Myria. Query scheduling remains a very hard problem. One of the challenges is that users today are frustrated by how difficult it is to predict how long their queries will take on a shared big data system. The goal is to make query times more predictable to users. If the service is lightly loaded, my query should take the same time as it usually does. If the system is heavily loaded, can we compute and post a clear slowdown factor for all queries in the system? How can we enforce that slowdown factor?
Developer languages for big data systems: In Myria, users write high-level declarative scripts in the MyriaL language. Developers can also directly submit query plans in JSON to run experiments in the system. JSON query plans let developers have more control over the query that is generated and how that query is executed. The goal of this project is to extend MyriaL to create a language that is declarative but supports annotations that put constraints on the query plan that should be generated. This is similar to optimizer hints but the goal is not performance. The goal is to let developers run experiments that control where operators are scheduled, the number of parallel instances of each operation, etc. without having to manually write JSON scripts.
Query time estimation: Build the next-generation query time predictor for big data queries. Can we build models to predict query times but also models that predict the sensitivity of a query to cardinality estimation errors and changes in runtime conditions?
Guaranteed query time execution: Can we predict and guarantee query times in big data systems? How can we leverage the elasticity features of the cloud to achieve this goal?
User-defined functions: Extend Myria with support for user-defined functions. What would be an easy and powerful method to let users extend a parallel data management system with extra operators that perform specialized, domain-specific operations? What if we wanted to add support for a library of existing functions?
Statistics and cost-based estimation: Add statistics to Myria and an initial cost-based optimizer. Or explore the overhead and benefits of collecting statistics opportunistically during query execution.
Text data: Extend Myria with support for text data processing.
Multi-system analytics: Extend Myria to run on top of the YARN resource manager. Benchmark the performance of a cluster and individual services when using YARN to run multiple systems at the same time in the same cluster such as GraphLab, Spark, and Myria.
Operation-independent performance explanations: The goal of this project is to build new query performance debugging tools without showing any information about the system internals. If the user does not know anything about query plans and parallel processing, can we still help the user understand why some queries take much longer to process than others? Can we illustrate performance differences by showing a variety of similar SQL queries and their performance?
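
As an illustration of the slowdown factor mentioned in the query scheduler project, here is a minimal sketch in Python. It assumes one possible definition, which is ours and not something Myria prescribes: a query's slowdown is its observed run time under load divided by its usual run time on a lightly loaded service, and the number posted for the whole system is the worst per-query slowdown. The function names and the sample timings are hypothetical.

```python
def slowdown_factors(baseline_seconds, observed_seconds):
    """Return the per-query slowdown: observed time / baseline time.

    baseline_seconds: dict mapping query id -> usual run time (lightly loaded)
    observed_seconds: dict mapping query id -> run time under the current load
    Queries without a positive baseline are skipped, since no ratio can be formed.
    """
    factors = {}
    for query_id, observed in observed_seconds.items():
        baseline = baseline_seconds.get(query_id)
        if baseline is None or baseline <= 0:
            continue
        factors[query_id] = observed / baseline
    return factors


def posted_slowdown(factors):
    """Summarize per-query factors as one number to post (assumption: the maximum).

    With no measurements we report 1.0, meaning "no slowdown observed".
    """
    return max(factors.values(), default=1.0)


# Hypothetical timings, in seconds.
baseline = {"q1": 10.0, "q2": 4.0}
observed = {"q1": 20.0, "q2": 6.0}
factors = slowdown_factors(baseline, observed)
print(factors)                   # {'q1': 2.0, 'q2': 1.5}
print(posted_slowdown(factors))  # 2.0
```

Posting the maximum is only one choice; a median or a percentile would give a different trade-off between simplicity and pessimism, and deciding how to enforce the posted factor is the actual research question of the project.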

Big data stream processing

The goal of this project is to integrate S-Store (http://www.vldb.org/pvldb/vol7/p1633-cetintemel.pdf), a streaming NewSQL system, with the Myria parallel data management system to enable analyses that combine both streaming transaction data and archival transaction data.

Parallel data management on heterogeneous hardware

Explore the challenges and opportunities related to running a parallel data management system on highly heterogeneous machines. These machines can have vastly different amounts of memory, CPU, and network bandwidth resources.

System benchmarking

The goal of this project is to compare your choice of big data management systems or cloud services on a fixed workload. The project will define the workload and will pick the set of systems to compare.

Big data processing with applications in astronomy

We recently developed a new service called MyMergerTree, which enables the analysis of galactic merger trees from astronomy simulations. The goal of this project will be to make the MyMergerTree service much more efficient. Can all queries be made interactive?

Social data analytics with applications in orthodontics

This project will involve working with a resident in the Department of Orthodontics on analyzing social media posts (specifically on Twitter and Google Plus) by Invisalign, a company that provides a clear appliance to move teeth around. The goal is to determine the essential themes of these posts (mapping the keywords) and potentially identify scientifically correct and incorrect information that this company provides directly to patients. The data management challenges will relate to trying different technologies to efficiently solve this specific data management problem. The team would work with the resident to define the problem more precisely in the context of the class project.

Data cleaning and analytics with applications in health metrics and evaluation

The Global Burden of Disease (GBD) Study quantifies factors that affect health and health risks for the world’s largest populations, a critical component in developing policies and prioritizing responses to health threats. The study is based on rigorous analysis of more than 30,000 data sources obtained from nearly 200 countries around the world, including survey and census data, disease registries, clinical and claims records, demographic data, expenditure data, behavioral data, and environmental data.

Producing the Study is a big data problem. The data volumes are significant, but not overwhelming (low order TBs). But the variety of data sources creates huge challenges. There are 300+ conditions (diseases, injuries, and risk factors) for which disease parameters such as prevalence, mortality, or exposure are estimated, and these estimates are made for 200+ countries/non-country geographic units, for 30+ time periods, for male and female sexes, and for 20 age groups. Each data source has its own error characteristics.

The data are sufficiently large so that even seemingly simple tasks become computationally difficult. Consider a data cleansing task -- adjusting deaths by country by cause (which come from many unreliable data sources) so that the totals equal deaths by country (which is more reliable data). It is expected that this use case can be handled by an appropriately structured parallel database, but there is some complexity in doing so, especially to achieve response times on the order of a few seconds.
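
To make the cleansing task concrete, here is a minimal single-machine sketch in Python. It assumes the simplest possible adjustment, proportional rescaling, in which every cause-specific count for a country is multiplied by the same factor so that the counts sum to that country's more reliable total. The actual GBD adjustment method may differ, and the function name, data layout, and sample numbers below are hypothetical.

```python
from collections import defaultdict


def rescale_deaths_by_cause(cause_deaths, country_totals):
    """Scale per-cause deaths so that each country's causes sum to its trusted total.

    cause_deaths: dict mapping (country, cause) -> estimated deaths (less reliable)
    country_totals: dict mapping country -> total deaths (more reliable)
    Returns a new dict with the same keys and adjusted values.
    """
    # Step 1: sum the unadjusted per-cause counts for each country.
    raw_sums = defaultdict(float)
    for (country, _cause), deaths in cause_deaths.items():
        raw_sums[country] += deaths

    # Step 2: multiply each count by (trusted total / unadjusted sum).
    adjusted = {}
    for (country, cause), deaths in cause_deaths.items():
        raw = raw_sums[country]
        if raw == 0:
            # Nothing to distribute proportionally; keep the count at zero.
            adjusted[(country, cause)] = 0.0
        else:
            adjusted[(country, cause)] = deaths * country_totals[country] / raw
    return adjusted


# Hypothetical data: country A's causes sum to 120 but the trusted total is 100.
cause_deaths = {
    ("A", "injury"): 30.0,
    ("A", "disease"): 90.0,
    ("B", "injury"): 10.0,
    ("B", "disease"): 10.0,
}
country_totals = {"A": 100.0, "B": 30.0}
print(rescale_deaths_by_cause(cause_deaths, country_totals))
```

In a parallel database the same computation is a group-by aggregate on country followed by a join back to the per-cause rows, which hints at where the cost comes from once the data are also broken down by time period, sex, and age group.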

The project is to implement the above use-case in the Myria parallel data management system and explore adding novel functionality in support of data cleansing, error detection in newly added data, and use of machine learning to mine the data.

Deadline	Milestone	Submit via
Jan 14th	M1: Form groups	email
Jan 28th	M2: Proposals due	Drop Box
Feb 18th	M3: Project Milestone due	Drop Box
March 16th	M4: Project presentations and Final Project Due	Drop Box
March 18th	M5: Final Project Due	Drop Box