CSE 544: Mini Research Project

Project Reports for Winter 2011

Bloom Filters in Distributed Query Execution
- Paraschos Koutris
Improv: Flexible Data Provenance for Relational Databases
- Nicholas Hunt and Peter Hornyack
More Like This: Query Recommendations for SQL
- Christopher Miles
Pricing for Data Markets
- Avanish Kushal, Sharmadha Moorthy, and Vikash Kumar
Sample Query Generation: Joins and Projection
- Jason Ganzhorn
Sensor Data Exploration: Making Stream Data Management Systems More Usable in Monitoring Applications
- Shengliang Xu

Overview

A portion of your grade in 544 consists of a final project. This project is meant to be a piece of independent research or engineering effort related to material we have studied in class. Your project may involve a comparison of systems we have read about, an application of database techniques to a problem you are familiar with, or be a database-related project in your research area.

This document describes what is expected of a final project and proposes some possible project ideas.

Project schedule

Deadline	Milestone
Jan 14th	Project teams formed: You should have formed your 1 to 3-person team and emailed us the names of all partners.
Jan 19th-21st	Please schedule an appointment with Prof. Suciu to discuss the project you are planning on undertaking.
Feb 1st	Project proposal is due. We will email you feedback on your project proposals within a few days.
Feb 21st-25th	If needed, please schedule anappointment with Prof. Suciu to discuss your project either before orafter the project milestone.
Mar 1st	Project milestone report is due. We will email you feedback within a few days.
Mar 9th	Project presentation.
Mar 11th, `12pm`	Final project report is due.

Deliverables

Milestone Reports

Start from the project proposal, and extend it to a report of about 3-4 pages in length. Suggested content:

The description of the problem you are trying to solve
The description of your approach
A description of related work. How does your approach compare with what has been done before. Suggestion: cite about 3 papers.
A description of what you have accomplished so far and any problems or unexpected issues that you encountered.
A brief plan for the rest of the quarter. This plan must include the experiments you will conduct to evaluate your solution.

Final Report

Start from the project milestone, and extend it to a 6-7 pages report. Suggestion for the additional content:

Improve and expand the presentation of your approach
Improve and expand the presentation of the related work, by giving ore technical detail on how your approach is similar to, or different from related work
Include an evaluation section, with graphs.
Conclusions

Your final report should be structured like a small research paper: introduction, motivation, technical sections, related work, conclusions, bibliography.

Presentation

Wednesday, March 11, 9:30-11

Presentation order: TBA

Format: power point slides, 15 minutes per project

What to include in the presentation:

a description of the problem: why is it important, why is it non-trivial
an overview of prior approaches, and related work
your approach
your results (theoretical, empirical, experimental)
a brief discussion on the significance of the results (do they work ? do they solve the problem you set out to do ? do they improve over existing work ?)
conclusions

Rule of thumb for the number of slides: maximum 1 slide / minute. That is, for 15’, prepare about 12-15 slides. Use animation only sparingly: they always slow down the presentation, and often have the effect that they give the audience less time to read important parts of the slides.

Multi-team projects: no, you do not get extra time! You can either delegate one of you to give the entire presentation, or take turns. Just make sure you stay within 15’.

Voting: each student will vote on all other project presentations, and we will establish a ranking of all the votes. Details to follow.

Choosing a project

You have two options: you can start from one of the project ideas listed below, or you can define your own database project that is related to research in your main research area.

What is expected

Good class projects can vary dramatically in complexity, scope, and topic. The only requirement is that they be related to something we have studied in this class and that they contain some element of research or that they involve a significant engineering effort. In the latter case, there should still be an element of novelty in your work (otherwise, there is not much point in doing it). To help you determine if your idea is appropriate and of reasonable scope, we will arrange to meet with each group several times throughout the semester.

Example projects

Here are examples of successful projects from the previous offerings of this course.

Winter 2009
- KAMD: A Progress Estimator for MapReduce Pipelines by Kristi Morton and Abe Friesen
- Towards Real-Time Data Stream Processing by Jessica Chang and Andrey Kolobov
Fall 2007
- Relational Data Markets by Ben Birnbaum and Alex Jaffe.
- Machine Learning for Automatic Physical DBMS Tuning by Ivan Beschastnikh and Andrew Guillory.
- An Image-based Positioning System by Ankit Gupta, Rahul Garg, and Ryan Kaminsky.
- Data Partitioning and Indexing for Network Forensic Analysis by Cherie Cheung Jue Wang
- Knowledge-Guided Schema Discovery from Semi-Structured Text by Thomas Lin.

Suggested Project Ideas

SQL-SHARE Project Ideas

The project suggestions below are extensions, or new features, of SQLShare, a database service and data sharing website developed at the eScience Institute, by Dr. Bill Howe. The goal of the service is to allow scientists to start using SQL on their datasets right away (without defining a schema, or even installing a database system) and to share their datasets and queries.

Try SQLShare here.

Powerful, yet simple-to-use data import tool. Most data is stored in text files, in some ad-hoc formats. This is true for scientific data, Web logs, network logs, etc. Design a simple-to-use, yet powerful import tool for general data formats, to allow users to upload their data into a database in one click. The PADS system, and its extension LearnPADS, is a general-purpose compiler for parsing ad-hoc data. In this project you will study how LearnPADS can be extended to a general purpose database import tool. If successful, this project can become an extension of SQLShare, which currently accepts only limited data formats in CSV files with a very simple structure.
Excel data import tool. Continuing the previous idea, often data instances are stored in Excel files: in this project you will design a tool for importing automatically excel data into a relational database. An excel file may typically stores one, or several relations, and the schema may or may not be given in the first row. Design a set of simple heuristics that capture the typical cases. Can machine learning be leveraged in such a tool?
Data auto-complete. Imagine this: you are an environmental scientist and are inserting (manually) data from water quality measurements in the field. You type in (say in Excel) your typical readings: geographical location, time, ppm for Total Organic Carobon, Fluoride, Nitrate, etc. Then SQLShare matches your data (both schema and data values) to the rich collection of scientific datasets, and suggests extensions. Would you like to add the ppb for Barium? For Haloacetic Acids? Perhaps it automatically informs you of other data values that are relevant: e.g. it finds some related dataset that contains air temperature and pressure readings for the locations you typed in, and informs you about those data values. In this project you would build an Excel AddIn that will connect to SQLShare and automatically suggests schema/data extensions.
The 20 SQL Query Examples project. The typical user of SQL Share is a scientist that has a dataset consisting of (say) 4-6 tables, stored in some format (CSV or Excel). Assume she can easily upload this data in SQL Share (which is true in most cases). But now she needs someone to give a crash course in using SQL to query her data. In this project you will automatically generate 20 SQL queries that are “representative” for a given database instance, say one consisting for 4-6 tables and a few thousand tuples. The queries should illustrate typical usage scenarios. Are two tables joinable? Give a query example that joins them. Is an attribute a good candidate for filtering? Generate a SQL query that illustrates a good filter predicate for that attribute (examples that do NOT make sense:
city > 'Seattle’ (better: city = 'Seattle’)
or
length = 2.345252 (better: length BETWEEN [2.3 .. 2.5] or length < 2.4)).
Does an aggregate makes sense? Show a SQL query with group-by and aggregate.
Note: there is some preliminary work done on this project, talk to me and I will point you in the right direction.
English translation and explanation of SQL queries. Consider a SQL query written by an expert. A non-expert reads it: what does it mean ? Design a tool than can translate a SQL query in plain English, and explain what it means to a person familiar with the database and its domain, but unfamiliar with SQL or the relational calculus. Can your tool benefit from examining more than one SQL query?
Adding general-purpose tools to a SQL store. One advantage of uploading ones data in SQLShare is that users can immediately benefit from SQL. But they cannot benefit yet from other general-purpose tools such as machine learning algorithms (association rules, decision trees, clustering), data visualization tools. How should such tools be integrated in the relational model? In this project you will design and implement a data model that allows arbitrary relational data to be processed with arbitrary machine learning tools (e.g. Weka), visualization tools, or other tools of general interest to scientists.
Query similarity. Amazon-style “More like this”. When experimenting with SQL, users frequently encounter examples that “almost” do what they want. A useful operation is therefore to find queries similar to an existing query. Similarity in this context might be defined in terms of
a) similar idioms (joins, subqueries, aggregates),
b) similar tables (not all tables are disjoint in an ad hoc database),
c) similar results, or something else, or some combination of the above.
This project involves defining and implementing a reasonable similarity function for queries, and using this function to implement a simple “more like this” application.
Access control. Data security in relational databases is based on a simple access control mechanism, which consists of granting permissions (SELECT, UPDATE, …) to users and/or roles. Extend data security to an environment where users contribute their own data instances, then create views over other users’ data. Typical question to ponder: If Alice joins her table with a table belonging to Bob, then gives the result to Carol who deletes all entries that do not occur in a reference dataset owned by Daisy, then what permissions does Erol need in order to read the final view?
Provenance in a data sharing environment. Data provenance is a formalism that tracks the source of the data items as they are processed during query evaluation. Design and implement a provenance mechanism for a data sharing environment.
Database minimization. Given N tables with possibly redundant tuples, find the minimum representation of them using views.
For example, if tables A and B overlap significantly, you might represent them as
A’ = (A - B) U (A ^ B)
B’ = (B - A) U (A ^ B)
so that you can store (A ^ B) only once.
You can also reason about columns this way – find shared columns and “normalize” them.
This problem has application in ad hoc databases to reduce duplication and reason about provenance. For example, if someone uploads the same file 50 times, we can a) store it once, and b) report the duplication to the user.

Theory Projects

In these projects you are expected to make either an interesting conceptual contribution, or prove a non-trivial theorem, or do some more modest amount of either of these and complement it with a small implementation.

View-based differential privacy. Alice has a private database and wants to answer a query Q. A randomized algorithm A for answering Q is called “differentially private” if adding or deleting a single item to/from the database affects the output of the algorithm by only a tiny amount. The standard way to achieve differential privacy is to add noise, e.g. insert/delete fake tuples. There are cases, however, where this definition is too restrictive, because there are parts of the database that can be made public, and only some parts need to be hidden. For example, for a table
MovieRental(customerName, customerCity, movie, rating),
the customerName is private, but the other three attributes can be considered public.
More generally, consider several views V1, V2, … over a database D: the views need to be private, but everything else is considered public. Extend the notion of differential privacy on views. Develop a view-differential private algorithm: is it significantly more useful than a differentially private algorithm that applies to the entire dataset?
For your convenience, here is the formal definition of differential privacy. The algorithm A is called “eps - differentially private” if for all possible outputs “a” of the algorithm, and for every two database instances D1, D2 such that |D1 - D2| <= 1,
EXP(-eps) <= Pr(A(D1)=a)/Pr(A(D2)=a) <= EXP(eps),
where the probability is taken over the choices of the randomized algorithm.
Data pricing in data markets. Imagine a scenario in which data owners SELL items from their dataset. Microsoft Azure already has a “Data Market”, see here and here. Today, you can buy all kind of data on the Web: personal information about any individual (convictions, income, etc – it's amazing what is for sale out there), your personal credit score. More importantly, scientific data that is expensive to produce and has potentially economic value (say, protein data) can be sold. Develop a theory for pricing data relational. For example, considering a single relation R(Key, A,B,C,…), the owner may consider several pricing schemes:
(a) one single item costs X,
(b) any set of K items or more costs Y,
(c) the entire view V(Key, A) :- R(Key, A, B, …) costs Z (that is, a projection).
What properties must such a pricing scheme have? Can any purchase be priced, i.e. if a customer wants to buy the view W(key, C) :- R(Key, A, B, C, …), B>100, is there a reasonable way to compute a price for it?
Tractable Queries over Probabilistic databases. Consider probabilistic databases where each tuple is an independent random variable. This project asks you to characterize the queries (sentences) Q for which P(Q) can be computed in polynomial time in the size of the database. If Q ranges over Unions of Conjunctive Queries then the class of tractable queries is well understood (ask me for references). Study the following classes of queries:
1. Horn clauses. For example what is the complexity of computing the probability of the rule R(x,y),R(y,z) –> R(x,z)? (In other words, given a graph where edges have independent probabilities, compute the probability that the graph is trasitively closed.) What about R(x,y),S(y,z) -> T(y)? (I think that the latter is in PTIME.) You would have to describe a large class of tractable queries, and a large class of queries that is intractable. If every Horn clause sentence falls into one of your two class, then you have proven a dichotomy theorem, which would be quite interesting; but you can settle for less than that.
2. Full First Order Logic. For example, what is the complexity of computing the probability of the sentence forall x.(exists y.R(x,y)–>exists z.S(x,z))? This sentence checks an inclusion constraint (every value occurring in R.x also occurs in S.x).
3. In some cases, all tuples in a relation have exactly the same probability. Study how this can be exploited in a query evaluation algorithm. For example, if probabilities are arbitrary, then computing the of the query exists x.exists y.(R(x), S(x,y), T(y)) is hard for #P. But if all probabilities in R are equal, all probabilities in S are equal, and all probabilities in T are equal, (and “all tuples are present” – ask me what that means) then one can compute the probability of this query in polynomial time in the size of the database. Explore better ways to compute query probabilities when all probabilities are equal.