CSE 544: Project

Overview

A portion of your grade in 544 consists of a final project. This project is meant to be a piece of independent research or engineering effort related to material we have studied in class. Your project may involve a comparison of systems we have read about, an application of database techniques to a problem you are familiar with, or be a database-related project in your research area. You can either define your own project (e.g. it can be the data management component of your research project), or you can choose one of the project suggestions below, and possibly adapt it. You will work in groups of 1-3 members.

Project Schedule

Deadline     Milestone                    Where
April 1st    M1: Form groups              email
April 22     M2: Proposals due            Drop Box
May 13       M3: Project Milestone due    Drop Box
May 29       M4: Project presentations    Tuesday, May 29, 8:30am-12:30pm, CSE 405 (Database Lab)
June 3       M5: Final Project Due        Drop Box

Deliverables

M1: Groups
Send email to both Koutris and Suciu with:
  • The team members' names
  • The team name
  • A tentative, ranked list of the projects you are considering
Your email should cc all team members.
M2: Proposal
Your proposal should be about 1 page in length. Suggested content:
  • Project title, team members
  • A short description of the project
  • A short list of references (at least 1 paper, but definitely not more than 5)
  • Important: what tools, datasets, and systems you are planning to use.
M3: Milestone report
Start from the project proposal, and extend it to a report of about 3-4 pages in length. Suggested content:
  • The description of the problem you are trying to solve

  • The description of your approach

  • A description of related work: how does your approach compare with what has been done before? Suggestion: cite about 3 papers.

  • A description of what you have accomplished so far and any problems or unexpected issues that you encountered.

  • A brief plan for the rest of the quarter. This plan must include the experiments you will conduct to evaluate your solution.

M4: Presentation
Tuesday, May 29, 8:30-12:30 in CSE 405 (the Database Lab). See details below.

Presentation order: TBA

Format: PowerPoint slides, 10 minutes per project

What to include in the presentation:

  • a description of the problem: why is it important, why is it non-trivial

  • an overview of prior approaches, and related work

  • your approach

  • your results (theoretical, empirical, experimental)

  • a brief discussion of the significance of the results (do they work? do they solve the problem you set out to solve? do they improve over existing work?)

  • conclusions

Rule of thumb for the number of slides: at most 1 slide per minute; that is, for 10', prepare about 8-9 slides. Use animations only sparingly: they always slow down the presentation, and often give the audience less time to read important parts of the slides.

Multi-team projects: no, you do not get extra time! You can either delegate one of you to give the entire presentation, or take turns. Just make sure you stay within 10’.

Voting: each student will vote on all the other projects' presentations, and we will establish a ranking based on the votes. Details to follow.

M5: Final report
Start from the project milestone, and extend it to a 6-7 page report. Suggestions for the additional content:
  • Improve and expand the presentation of your approach

  • Improve and expand the presentation of the related work, giving more technical detail on how your approach is similar to, or differs from, related work

  • Include an evaluation section, with graphs.

  • Conclusions

Choosing a project

You have two options: you can start from one of the project ideas listed below, or you can define your own database project that is related to research in your main research area.

What is expected

Good class projects can vary dramatically in complexity, scope, and topic. The only requirement is that they be related to something we have studied in this class, and that they contain some element of research or involve a significant engineering effort. In the latter case, there should still be an element of novelty in your work (otherwise, there is not much point in doing it). To help you determine if your idea is appropriate and of reasonable scope, we will arrange to meet with each group several times throughout the quarter.

Example projects

Here are examples of successful projects from the previous offerings of this course.

Suggested Projects (this list is still being updated and expanded)

Note. It is OK if several teams choose the same project; quite likely different teams will come up with quite different approaches, which will make the entire process even more interesting.

1
Data disambiguation
DBLP is a highly curated database. Every entry is checked manually by humans, who perform deduplication, i.e. they check whether an author X is the same as author Y. But even DBLP has lots of errors: entries often identify X with Y even when X and Y are distinct. Consider Chen Li's entry http://dblp.uni-trier.de/db/indices/a-tree/l/Li:Chen.html. The database researcher is a Professor at UC Irvine, but his DBLP entry has lots of papers that are not his. Or consider Bogdan Alexe: http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/a/Alexe:Bogdan.html There are two researchers with this name: one at USC working on schema mappings (publishes with Wang-Chiew Tan, Kolaitis, etc.) and one at ETH Zurich working in computer vision (publishes with Vittorio Ferrari, Thomas Desselaers, etc.).

In this project you will disambiguate authors in DBLP. For example, one approach is to use co-authorships: Chen Li the database researcher publishes papers with one group of coauthors, while Chen Li the electrical engineer publishes with a disjoint set of authors.

What I can provide: code for importing the DBLP data into postgres.

Deliverable: find a list of ambiguous authors in the DBLP data.
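
To make the co-authorship idea concrete, here is a minimal SQL sketch of a first-cut ambiguity signal (not part of the provided import code; the table authorship(pubid, author) and the threshold are purely illustrative):

        -- Crude ambiguity signal over a hypothetical table authorship(pubid, author):
        -- flag authors whose set of distinct coauthors is unusually large relative
        -- to their number of papers; such entries often merge several distinct
        -- people and deserve a closer look.
        select a.author,
               count(distinct a.pubid)  as num_papers,
               count(distinct b.author) as num_coauthors
        from authorship a
        join authorship b
          on a.pubid = b.pubid and a.author <> b.author
        group by a.author
        having count(distinct b.author) > 5 * count(distinct a.pubid)
        order by num_coauthors desc;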

2
Visgres
Say you have a graph in postgres in a table R(source, dest), and you want to compute the histogram of out-degrees:
        select y.cnt, count(*) as freq
        from (select x.source, count(*) as cnt
              from R x
              group by x.source) y
        group by y.cnt
        order by y.cnt
      

The result is a boring table with entries of the form (cnt, freq). Instead, users would rather see the result as a graph: a histogram, a scatterplot, or what have you. Develop a system that integrates postgres with a graph visualization tool, such as D3 (http://mbostock.github.com/d3/) or Google Charts (http://code.google.com/apis/chart/): call the system visgres. In visgres you will say:

        #scatterplot (1,1000), (1,1000000,log)
        select y.cnt, count(*) as freq
        from (select x.source, count(*) as cnt
              from R x
              group by x.source) y
        group by y.cnt
        order by y.cnt
        
and it will produce the scatterplot, where the cnt axis goes from 1 to 1000, and the freq axis is on a log scale and goes from 1 to 1000000. This is an implementation project.

Deliverable: a tool that should be easy and fun to use. Ideally, you should modify the postgres source code (it is not too hard), but a middleware solution may also be acceptable.

3
Computing Determinacy for TweeQL Queries on Twitter Data
A query V "determines" a query Q if one can answer Q from the answer of V alone, without accessing the database. Determinacy is important in many applications, e.g. in "query answering using views". More recently it has been applied to data pricing: if V "determines" Q, then we should set the prices such that Price(Q) <= Price(V). It also helps in computing prices: if the data seller has defined prices for V1, V2, ..., Vk and the buyer wants to buy a query Q, then Price(Q) should be the cheapest price of any set of views that determines Q.

In this project you will compute the determinacy relation for queries on Twitter data. The challenge is that queries on Twitter data have UDFs, which are impossible to analyze theoretically; instead, you will use some form of data mining to infer approximate determinacy. Use the query language TweeQL (see "Processing and visualizing the data in tweets" by Marcus et al., SIGMOD Record 2011), and consider a repository of tweets (say 10k tweets at a minimum). You will use the repository to determine whether a TweeQL query V determines a query Q, with high probability.
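
To illustrate determinacy on tweet data, here is a small SQL-style sketch (not actual TweeQL syntax; the table tweet(id, userid, text, created_at) is a hypothetical repository schema). The view V returns enough information to answer Q without touching the repository again, so V determines Q and Price(Q) <= Price(V) should hold:

        -- V: the seller's priced view (all tweets mentioning 'olympics').
        create view V as
          select id, userid, created_at
          from tweet
          where text like '%olympics%';

        -- Q: per-user count of such tweets; answerable from V's output alone,
        -- so V determines Q.
        select userid, count(*) as num_tweets
        from V
        group by userid;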

4
Experimental Evaluation of One-step MapReduce Algorithm for Conjunctive Queries.
This is an experimental evaluation project, requiring the implementation of some TPC-H queries over MapReduce.

Consider the join: J(x,y) :- R(x),S(x,y). This query can be computed "efficiently" in one MapReduce job, using a hash join on the variable x, in a rather standard fashion.

Now consider the conjunctive query Q(x,y) :- R(x),S(x,y),T(y). It does not seem to be computable in a single step: one needs two steps, first computing V(x,y) :- R(x),S(x,y) with a hash join on x, then computing Q(x,y) :- V(x,y),T(y) with a hash join on y. In fact, Koutris and Suciu have proven that this query requires two parallel steps, under reasonable assumptions.

Surprisingly, Afrati and Ullman (in EDBT 2010) gave an algorithm that computes Q in one MR step. In fact, they give a more general algorithm that computes any conjunctive query in one single MR step! This does not contradict K&S, because this algorithm replicates the data more than the K&S model allows.

In this project you will implement Afrati and Ullman's algorithm in MR, and will evaluate its performance compared to a more traditional query execution plan, such as the two-step plan sketched below. Use the TPC-H data.
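
For reference, here is the two-step baseline written in SQL over the relations R(x), S(x,y), T(y) from above (a sketch of the traditional plan to compare against; in the project these two steps would run as two MapReduce jobs, not as SQL):

        -- Step 1: hash join on x.
        create temp table V as
          select S.x, S.y
          from R join S on R.x = S.x;

        -- Step 2: hash join on y.
        select V.x, V.y
        from V join T on V.y = T.y;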

5
Data import tool.
Most data is stored in text files, in ad-hoc formats. This is true for scientific data, Web logs, network logs, etc. Design a simple-to-use tool to import data into a database system. Start from (some small subset of) these data sets:

http://www.quora.com/Data/Where-can-I-get-large-datasets-open-to-the-public

Experiment with a couple of data import tools:

1. PADS system, and its extension LearnPADS, http://www.padsproj.org/index.html

2. Google's "Refine" tool http://code.google.com/p/google-refine/

3. Data Wrangler http://vis.stanford.edu/wrangler/

PADS is aimed at developers: it is general and scalable, but programming is required. Refine and Data Wrangler are GUIs: easy to use on simple data sets, but limited. In this project you will use PADS, and design modest extensions that improve its usability by borrowing some of the functionality of Refine or Data Wrangler.
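
For a sense of what such a tool automates, here is the kind of hand-written import one would otherwise do in plain postgres (the file path, table name, and column types below are hypothetical, chosen only for illustration):

        -- Manually declare a schema and load a delimited text file;
        -- an import tool would infer or help construct both steps.
        create table weblog (
          ts     timestamp,
          ip     text,
          url    text,
          status int
        );

        copy weblog from '/data/access_log.csv'
        with (format csv, header true, null '');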

6
Efficient Price Computation for Query-based Pricing
In query-based pricing, the data seller defines the price of its data by specifying a number of price points: (V1,p1), (V2,p2), ... If a user wants to compute a query over the data, the system needs to compute the price of the data returned by that query. If the query happens to be one of the views Vi, then it's easy: the price is pi. Otherwise, the price is computed by interpolating the views, using the notion of "determinacy". This computation is currently done using a standard max-flow/min-cut algorithm, and is relatively slow: it scales to a few thousand tuples, not much more. In this project you will improve this algorithm to scale to larger databases. Start by optimizing the interface to the max-flow algorithm (which is typically the bottleneck). If that is insufficient, consider specializing the max-flow algorithm to the types of graphs occurring in data pricing. And if that is insufficient or not possible, consider an approximation algorithm (which you would have to design).

This project requires both implementation and algorithm design.

7
Provenance-based access control in data sharing.
Data security in relational databases is based on a simple access control mechanism, which consists of granting permissions (SELECT, UPDATE, ...) to users and/or roles. Extend data security to an environment where users contribute their own data instances, then create views over other users' data. Typical question to ponder: If Alice joins her table with a table belonging to Bob, then gives the result to Carol who deletes all entries that do not occur in a reference dataset owned by Daisy, then what permissions does Erol need in order to read the final view? You will implement access control by using the provenance of the data. You will use the PERM data provenance system, http://permdbms.sourceforge.net/, described here: "Perm: Processing provenance and data on the same data model through query rewriting", by Boris Glavic and Gustavo Alonso, ICDE 2009.
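
The Alice/Bob/Carol/Daisy scenario above, written as hypothetical SQL views (the table and column names are made up for illustration); the project's job is to answer, using PERM-style provenance, which base-table permissions Erol needs in order to SELECT from the final view:

        -- Alice joins her table with Bob's:
        create view alice_bob as
          select a.patient_id, a.diagnosis, b.treatment
          from alice_records a
          join bob_records b on a.patient_id = b.patient_id;

        -- Carol keeps only the entries that occur in Daisy's reference dataset:
        create view carol_view as
          select v.*
          from alice_bob v
          where v.patient_id in (select patient_id from daisy_reference);

        -- Question: which permissions on alice_records, bob_records, and
        -- daisy_reference does Erol need to read carol_view?
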
8
Using SAT Solvers for Static Query Analysis
There are several decision problems for conjunctive queries that are NP-hard: for example, checking whether a query Q1 is contained in Q2, whether Q1 and Q2 are equivalent, or rewriting a query Q in terms of some views V1, ..., Vk. The latter problem is especially important in database applications, and there are heuristic-based algorithms in the database literature for solving it (the Bucket Algorithm and MiniCon). However, in recent years SAT solvers have become incredibly effective: they are easy to set up and use, and scale to tens of thousands of Boolean variables. In this project you will implement some classical static analysis tasks by reducing them to a SAT problem, then using a SAT solver. You will start with query answering using views, and will compare your implementation to the Bucket Algorithm and to MiniCon; time permitting, you may extend your implementation to query containment. The challenge in this project is to perform the reduction efficiently: SAT solvers accept only CNF formulas, so you must convert the problem to CNF, and do this efficiently.

9
A project related to astronomy, designed by Magda Balazinska
Scientists are able to generate data at an increasing scale and rate. Their ability to analyze that data is falling behind their ability to generate it. The database group is currently collaborating with several researchers in the astronomy department to help them address this data management challenge. Several of these astronomers need to operate on data in the form of series of sky images, together with metadata about the celestial structures that can be found in these images. Existing relational DBMSs are not well-suited for managing image data.

The goal of this project will be to take a concrete use-case and data from our astronomy colleagues, implement the use-case in the SciDB parallel multidimensional array processing engine, and measure the resulting performance. Following this performance evaluation, the project is to propose and implement an extension to SciDB (e.g., array indexes, materialized views, physical tunings, etc.) to improve the performance of the workload. Given the initial performance results, we will help the team identify which SciDB extension would be most interesting to design, implement, and evaluate. SciDB is written in C++. Thus familiarity with C++ is a requirement for a successful project.

10
Evaluation of MADlib
MADlib (http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-38.pdf) is a recently released extension of postgres with in-database analytic methods, providing a suite of SQL-based algorithms for machine learning, data mining, and statistics. In this project you will evaluate MADlib both in terms of performance and in terms of usability.

Suggested evaluation: select a small number (say, three) of machine learning methods; implement them in MADlib (you may already find them implemented as library functions) and also in some other ML tool (e.g. Weka, http://www.cs.waikato.ac.nz/~ml/weka/); compare the two implementations both in terms of performance and in terms of usability. Draw lessons for the design of future big data analytics engines.

Project Presentations: Tuesday May 29 (new!)

You have 10' to present your project, using PowerPoint slides. Please email your slides to Paris before the day of your presentation. If you don't like your slot for whatever reason, please ask a colleague to swap. I ask you to attend all presentations.

During the presentations, but no later than the end of the talks on Tuesday, we will all vote on our favorite project, based both on the project's content and on the presentation. The winner receives a diploma and a small gift certificate. Bring your laptop to vote.

Marching order. Note: times are approximate! We will take only 10' breaks: an extra 10' is reserved just in case we run late. If we stay on time, we will start the next group of talks early.

Tuesday, May 29

8:30-8:40      Querying Player Data in Refraction (Yun-En Liu)
8:40-8:50      Medical Record Databases on a Mobile Device (Dan Butler, Sam Sudar; team "Cols 'n Rowses")
8:50-9:00      IGO NO JOSEKI (Daniel Poore)
9:00-9:10      Data Disambiguation in DBLP (Ezgi Mercan)
9:10-9:20      DocJoin (Lydia Chilton; team "HCI is Core Computer Science")
9:20-9:30      Collaborative WebDB (Raymond Cheng, Umar Javed, Adam Lerner, Will Scott)
9:30-9:40      Evaluation of MADlib (Mitchell Koch)
               BREAK
10:00-10:10    Investigating Applications of MadLIB to NLP Problems (Mark Yatskar, Nicholas FitzGerald; team "Timberwolves")
10:10-10:20    Producing an American Sign Language to English Dictionary with Crowdsourcing (Kyle Rector, Supasorn Suwajanakorn; team "KS")
10:20-10:30    Synthesizing SQL Queries from Input-Output Examples (Sai Zhang, Yuyin Sun)
10:30-10:40    Visgres (Muico Uldarico)
10:40-10:50    Scheduling and Optimization of Iterative Array Operations in Distributed Database System (Jingjing Wang)
10:50-11:00    Collaborative Text Manipulation for Fine-Grained Edit Data Collection (Katie Kuksenok)
11:00-11:10    DBLP Author disambiguation through collaboration and article topics (Bjorn Lofroth; team "dbjorn")
               BREAK
11:30-11:40    Efficient 2-D Spatial Matching (Yusra AlSayyad)
11:40-11:50    Project: Parallel Query Evaluation (David Rosenbaum; team "Qubits")
11:50-12:00    CrudB: Crowdsourcing Uncertain Databases (Christopher Lin)
12:00-12:10    Visgres (Brandon Blakeley)
12:10-12:30    DISCUSSIONS, PIZZA, VOTING, AWARD CEREMONY

Please vote here once for the content, and once for the presentation. Do not vote for your own project.

 

And the winners are...

Best Project awards:

Best presentation award: