CSE 544: Mini Research Project
Project Reports for Winter 2011
Overview
A portion of your grade in 544 consists of a final project. This
project is meant to be a piece of independent research or engineering
effort related to material we have studied in class. Your project may
involve a comparison of systems we have read about, an application of
database techniques to a problem you are familiar with, or be a
database-related project in your research area.
This document describes what is expected of a final project and
proposes some possible project ideas.
Project schedule
Deadline | Milestone |
Jan 14th | Project teams formed: You should have formed your 1 to 3-person team and emailed us the names of all partners. |
Jan 19th-21st | Please schedule an appointment with Prof. Suciu to discuss the project you are planning on undertaking. |
Feb 1st | Project proposal is due. We will email you feedback on your project proposals within a few days. |
Feb 21st-25th | If needed, please schedule an appointment with Prof. Suciu to discuss your project either before or after the project milestone. |
Mar 1st | Project milestone report is due. We will email you feedback within a few days. |
Mar 9th | Project presentation. |
Mar 11th, 12pm | Final project report is due. |
Deliverables
Milestone Reports
Start from the project proposal, and extend it to a report of about
3-4 pages in length. Suggested content:
The description of the problem you are trying to solve
The description of your approach
A description of related work. How does your approach compare with what has been done before? Suggestion: cite about 3 papers.
A description of what you have accomplished so far and any problems or unexpected issues that you encountered.
A brief plan for the rest of the quarter. This plan must include the experiments you will conduct to evaluate your solution.
Final Report
Start from the project milestone, and extend it to a 6-7 page
report. Suggestions for the additional content:
Improve and expand the presentation of your approach
Improve and expand the presentation of the related work, by giving more technical detail on how your approach is similar to, or different from, related work
Include an evaluation section, with graphs.
Conclusions
Your final report should be structured like a small research paper:
introduction, motivation, technical sections, related work,
conclusions, bibliography.
Presentation
Wednesday, March 9, 9:30-11
Presentation order: TBA
Format: PowerPoint slides, 15 minutes per project
What to include in the presentation:
a description of the problem: why is it important, why is it non-trivial
an overview of prior approaches, and related work
your approach
your results (theoretical, empirical, experimental)
a brief discussion of the significance of the results (do they work? do they solve the problem you set out to solve? do they improve over existing work?)
conclusions
Rule of thumb for the number of slides: maximum 1 slide / minute.
That is, for 15’, prepare about 12-15 slides. Use animations only
sparingly: they always slow down the presentation, and often
give the audience less time to read important
parts of the slides.
Multi-team projects: no, you do not get extra time! You can either
delegate one of you to give the entire presentation, or take turns.
Just make sure you stay within 15’.
Voting: each student will vote on all other project presentations, and
we will establish a ranking of all the votes. Details to follow.
Choosing a project
You have two options: you can start from one of the project ideas
listed below, or you can define your own database project that is
related to research in your main research area.
What is expected
Good class projects can vary dramatically in complexity, scope, and
topic. The only requirement is that they be related to something we
have studied in this class and that they contain some element of
research or that they involve a significant engineering effort. In the
latter case, there should still be an element of novelty in your work
(otherwise, there is not much point in doing it). To help you
determine if your idea is appropriate and of reasonable scope, we will
arrange to meet with each group several times throughout the quarter.
Example projects
Here are examples of successful projects from the previous offerings of this course.
Suggested Project Ideas
SQL-SHARE Project Ideas
The project suggestions below are extensions, or new features, of
SQLShare, a database service and data
sharing website developed at the eScience Institute,
by Dr. Bill Howe. The goal of the
service is to allow scientists to start using SQL on their datasets right away
(without defining a schema, or even installing a database system) and to share
their datasets and queries.
Try SQLShare here.
Powerful, yet simple-to-use data import tool. Most data is stored
in text files, in some ad-hoc formats. This is true for scientific
data, Web logs, network logs, etc. Design a simple-to-use, yet
powerful import tool for general data formats, to allow users to
upload their data into a database in one click. The PADS system, and its extension LearnPADS, is a
general-purpose compiler for parsing ad-hoc data. In this project you
will study how LearnPADS can be extended to a general purpose database
import tool. If successful, this project can become an extension of
SQLShare, which currently accepts only limited data formats in CSV
files with a very simple structure.
Excel data import tool. Continuing the previous idea: data is often
stored in Excel files. In this project you will design a tool for
automatically importing Excel data into a relational database. An
Excel file typically stores one or several relations, and the schema
may or may not be given in the first row.
Design a set of simple heuristics that capture the typical cases; one
such heuristic is sketched below.
Can machine learning be leveraged in such a tool?
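A minimal sketch of one such heuristic, in Python, operating on rows
already extracted from a spreadsheet (the helper names and sample data
are made up): treat the first row as a header if it is entirely
non-numeric while the remaining rows contain numeric cells.

    # Sketch of a header-detection heuristic (hypothetical helper, not part of SQLShare).
    # It works on rows already read from a sheet, e.g. via a library such as openpyxl.

    def looks_numeric(cell):
        """Return True if the cell value parses as a number."""
        try:
            float(cell)
            return True
        except (TypeError, ValueError):
            return False

    def has_header(rows):
        """Guess that the first row is a header if it is entirely non-numeric
        while the following rows contain at least some numeric cells."""
        if len(rows) < 2:
            return False
        first_numeric = any(looks_numeric(c) for c in rows[0])
        rest_numeric = any(looks_numeric(c) for row in rows[1:] for c in row)
        return (not first_numeric) and rest_numeric

    rows = [["site", "date", "nitrate_ppm"],
            ["Lake Union", "2011-01-12", "0.31"],
            ["Green Lake", "2011-01-13", "0.27"]]
    print(has_header(rows))   # True: use the first row as column names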
Data auto-complete. Imagine this: you are an environmental
scientist and are inserting (manually) data from water quality
measurements in the field. You type in (say in Excel) your typical
readings: geographical location, time, ppm for Total Organic Carbon,
Fluoride, Nitrate, etc. Then SQLShare matches your data (both schema
and data values) to the rich collection of scientific datasets, and
suggests extensions. Would you like to add the ppb for Barium? For
Haloacetic Acids? Perhaps it automatically informs you of other data
values that are relevant: e.g. it finds some related dataset that
contains air temperature and pressure readings for the locations you
typed in, and informs you about those data values. In this project
you would build an Excel add-in that connects to SQLShare and
automatically suggests schema/data extensions.
The 20 SQL Query Examples project. The typical user of SQLShare
is a scientist who has a dataset consisting of (say) 4-6 tables,
stored in some format (CSV or Excel). Assume she can easily upload
this data into SQLShare (which is true in most cases). But now she
needs someone to give her a crash course in using SQL to query her data.
In this project you will automatically generate 20 SQL queries that
are “representative” for a given database instance, say one consisting
of 4-6 tables and a few thousand tuples. The queries should
illustrate typical usage scenarios. Are two tables joinable? Give a
query example that joins them. Is an attribute a good candidate for
filtering? Generate a SQL query that illustrates a good filter
predicate for that attribute. Examples that do NOT make sense:
city > 'Seattle' (better: city = 'Seattle'), or
length = 2.345252 (better: length BETWEEN 2.3 AND 2.5, or length < 2.4).
Does an aggregate make sense? Show a SQL query with group-by and an aggregate.
Note: there is some preliminary work done on this project, talk to me
and I will point you in the right direction.
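To make “representative” concrete, here is a minimal Python sketch of
how one might pick a filter predicate for a single column from its
value distribution. The heuristic, threshold, and column values are
made up for illustration, and this is unrelated to the preliminary
work mentioned above.

    from collections import Counter

    def example_predicate(column, values):
        """Suggest an example filter predicate for one column of the uploaded data."""
        distinct = set(values)
        numeric = all(isinstance(v, (int, float)) for v in values)
        if numeric and len(distinct) > 0.5 * len(values):
            # Mostly-distinct numeric values: equality is useless, suggest a range instead.
            lo, hi = min(values), max(values)
            return f"{column} BETWEEN {lo} AND {round((lo + hi) / 2, 2)}"
        # Few distinct values: suggest equality on the most frequent one.
        most_common, _ = Counter(values).most_common(1)[0]
        return f"{column} = '{most_common}'"

    print(example_predicate("city", ["Seattle", "Seattle", "Portland", "Seattle"]))
    print(example_predicate("length", [2.31, 2.47, 2.35, 2.51, 2.40, 2.44]))
    # city = 'Seattle'
    # length BETWEEN 2.31 AND 2.41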
English translation and explanation of SQL queries. Consider a SQL
query written by an expert. A non-expert reads it: what does it mean?
Design a tool that can translate a SQL query into plain English, and
explain what it means to a person familiar with the database and its
domain, but unfamiliar with SQL or the relational calculus. Can your
tool benefit from examining more than one SQL query?
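To get a feel for the problem, here is a toy Python translator that
handles only the simplest SELECT-FROM-WHERE queries via a regular
expression; the table and column names are made up, and a real tool
would need a proper SQL parser.

    import re

    def explain(sql):
        """Very rough English rendering of a single-table SELECT-FROM-WHERE query."""
        m = re.match(r"\s*SELECT\s+(.*?)\s+FROM\s+(\w+)(?:\s+WHERE\s+(.*))?\s*;?\s*$",
                     sql, re.IGNORECASE | re.DOTALL)
        if not m:
            return "Sorry, this query is too complex for the toy translator."
        cols, table, cond = m.groups()
        text = f"For every row of table {table}"
        if cond:
            text += f" satisfying the condition ({cond.strip()})"
        if cols.strip() == "*":
            text += ", return the entire row."
        else:
            text += f", return the value(s) of {cols.strip()}."
        return text

    print(explain("SELECT name, rating FROM MovieRental WHERE rating > 4"))
    # For every row of table MovieRental satisfying the condition (rating > 4),
    # return the value(s) of name, rating.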
Adding general-purpose tools to a SQL store. One advantage of
uploading one's data into SQLShare is that users can immediately benefit
from SQL. But they cannot yet benefit from other general-purpose
tools such as machine learning algorithms (association rules, decision
trees, clustering) or data visualization tools. How should such tools
be integrated in the relational model? In this project you will
design and implement a data model that allows arbitrary relational
data to be processed with arbitrary machine learning tools (e.g. Weka), visualization
tools, or other tools of general interest to scientists.
Query similarity. Amazon-style “More like this”. When experimenting
with SQL, users frequently encounter examples that “almost” do what
they want. A useful operation is therefore to find queries similar to
an existing query. Similarity in this context might be defined in
terms of (a) similar idioms (joins, subqueries, aggregates), (b) similar
tables (not all tables are disjoint in an ad hoc database), (c) similar
results, or something else, or some combination of the above. This
project involves defining and implementing a reasonable similarity
function for queries, and using this function to implement a simple
“more like this” application.
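Here is a minimal sketch of one possible similarity function, just to
make options (a) and (b) concrete: a Jaccard coefficient over a crude
feature set extracted from the SQL text. The queries, table names,
and feature list are made up.

    import re

    def features(sql):
        """Extract a crude feature set: tables mentioned and a few idioms."""
        sql_up = sql.upper()
        feats = set()
        for t in re.findall(r"(?:FROM|JOIN)\s+(\w+)", sql_up):
            feats.add("table:" + t)
        for idiom in ["JOIN", "GROUP BY", "ORDER BY", "DISTINCT", "(SELECT"]:
            if idiom in sql_up:
                feats.add("idiom:" + idiom)
        return feats

    def similarity(q1, q2):
        f1, f2 = features(q1), features(q2)
        if not f1 and not f2:
            return 1.0
        return len(f1 & f2) / len(f1 | f2)    # Jaccard coefficient

    q1 = "SELECT site, AVG(nitrate) FROM readings GROUP BY site"
    q2 = "SELECT site, MAX(nitrate) FROM readings GROUP BY site"
    q3 = "SELECT * FROM stations"
    print(similarity(q1, q2), similarity(q1, q3))

A serious implementation would parse the queries and probably also
compare query results; this only illustrates the shape of the problem.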
Access control. Data security in relational databases is based on
a simple access control mechanism, which consists of granting
permissions (SELECT, UPDATE, …) to users and/or roles. Extend data
security to an environment where users contribute their own data
instances, then create views over other users’ data. Typical question
to ponder: If Alice joins her table with a table belonging to Bob,
then gives the result to Carol who deletes all entries that do not
occur in a reference dataset owned by Daisy, then what permissions
does Erol need in order to read the final view?
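To make the question concrete, here is a tiny Python sketch of one
possible (and deliberately naive) semantics, using the table and user
names from the example above; it is not a proposed answer.

    # One candidate semantics, only to make the question concrete:
    # reading a derived view requires SELECT on every base table in its lineage,
    # unless an owner has been granted the right to re-share the result.

    base_tables = {
        "AliceJoin":   {"Alice.T", "Bob.T"},                      # Alice joins her table with Bob's
        "CarolFilter": {"Alice.T", "Bob.T", "Daisy.Reference"},   # Carol filters using Daisy's data
    }

    grants = {  # user -> base tables on which the user holds SELECT
        "Erol": {"Alice.T", "Bob.T"},
    }

    def can_read(user, view):
        missing = base_tables[view] - grants.get(user, set())
        return (len(missing) == 0, missing)

    print(can_read("Erol", "CarolFilter"))
    # (False, {'Daisy.Reference'}): under this policy Erol still needs a grant from Daisy.

Whether this policy is too strict (or too lax) is exactly the kind of
question the project should answer.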
Provenance in a data sharing environment. Data provenance is a
formalism that tracks the source of the data items as they are
processed during query evaluation. Design and implement a provenance
mechanism for a data sharing environment.
Database minimization. Given N tables with possibly redundant tuples, find the minimum representation of them using views.
For example, if tables A and B overlap significantly, you might
represent them as
A’ = (A - B) ∪ (A ∩ B)
B’ = (B - A) ∪ (A ∩ B)
so that you can store (A ∩ B) only once.
You can also reason about columns this way: find shared columns and
“normalize” them.
This problem has applications in ad hoc databases, to reduce duplication
and reason about provenance. For example, if someone uploads the same
file 50 times, we can a) store it once, and b) report the duplication
to the user.
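To make the decomposition above concrete, here is a tiny worked
example in Python, using sets as stand-ins for tables (the data values
are made up).

    A = {1, 2, 3, 4, 5, 6}
    B = {4, 5, 6, 7, 8}

    only_A = A - B          # stored once
    only_B = B - A          # stored once
    common = A & B          # stored once instead of twice

    # The original tables become views over the three stored fragments:
    A_view = only_A | common
    B_view = only_B | common
    assert A_view == A and B_view == B

    stored = len(only_A) + len(only_B) + len(common)
    print(f"original storage: {len(A) + len(B)} tuples, minimized: {stored} tuples")
    # original storage: 11 tuples, minimized: 8 tuples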
Theory Projects
In these projects you are expected to either make
an interesting conceptual contribution, prove a non-trivial
theorem, or do a more modest amount of either of these and
complement it with a small implementation.
View-based differential privacy. Alice has a private database
and wants to answer a query Q. A randomized algorithm A for answering
Q is called “differentially private” if adding or deleting a single
item to/from the database affects the output of the algorithm by only
a tiny amount. The standard way to achieve differential privacy is to
add noise, e.g. insert/delete fake tuples. There are cases, however,
where this definition is too restrictive, because there are parts of
the database that can be made public, and only some parts need to be
hidden. For example, for a table
MovieRental(customerName,
customerCity, movie, rating),
the customerName is private, but the
other three attributes can be considered public.
More generally,
consider several views V1, V2, … over a database D: the views need
to be private, but everything else is considered public. Extend the
notion of differential privacy to views. Develop a view-differentially
private algorithm: is it significantly more useful than a
differentially private algorithm that applies to the entire dataset?
For your convenience, here is the formal definition of differential
privacy. The algorithm A is called “eps-differentially private” if
for all possible outputs “a” of the algorithm, and for every two
database instances D1, D2 such that |D1 - D2| <= 1,
EXP(-eps) <= Pr(A(D1)=a)/Pr(A(D2)=a) <= EXP(eps),
where the probability is taken
over the choices of the randomized algorithm.
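For intuition, here is a minimal Python sketch of the standard
noise-based approach for a single counting query over a
MovieRental-style table; the rows, predicate, and eps value are made
up. The view-based notion you develop would be evaluated against this
baseline.

    import math
    import random

    # A COUNT query has sensitivity 1 (adding or deleting one tuple changes it by
    # at most 1), so adding Laplace noise with scale 1/eps makes the answer
    # eps-differentially private.

    def laplace_noise(scale):
        """Sample from Laplace(0, scale) by inverse transform sampling."""
        u = random.random() - 0.5
        return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

    def private_count(rows, predicate, eps):
        true_count = sum(1 for r in rows if predicate(r))
        return true_count + laplace_noise(1.0 / eps)

    # Made-up rows: (customerName, customerCity, movie, rating)
    rentals = [("Ann", "Seattle", "Fargo", 5), ("Bob", "Tacoma", "Fargo", 4)]
    print(private_count(rentals, lambda r: r[2] == "Fargo", eps=0.5))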
Data pricing in data markets. Imagine a scenario in which data
owners SELL items from their dataset. Microsoft Azure already has a
“Data Market”, see here
and here. Today, you
can buy all kinds of data on the Web: personal information about any
individual (convictions, income, etc – it's amazing what is for sale
out there), or your personal credit score. More importantly, scientific
data that is expensive to produce and has potential economic value
(say, protein data) can be sold. Develop a theory for pricing
relational data. For example, considering a single relation
R(Key, A, B, C, …), the owner may consider several pricing schemes:
(a) one single item costs X,
(b) any set of K or more items costs Y,
(c) the entire view V(Key, A) :- R(Key, A, B, …) costs Z (that is, a
projection).
What properties must such a pricing scheme have? Can
any purchase be priced, i.e., if a customer wants to buy the view
W(Key, C) :- R(Key, A, B, C, …), B > 100, is there a reasonable way to
compute a price for it?
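To illustrate one natural consistency requirement such a theory must
address (informally: no arbitrage), here is a toy Python check of
whether the posted price of the projection can be undercut by buying
the data some other way. All prices and sizes are made up.

    import math

    N = 1000          # number of tuples in R
    X = 0.10          # scheme (a): one item costs X
    K, Y = 100, 5.00  # scheme (b): any set of K or more items costs Y
    Z = 60.00         # scheme (c): the projection V(Key, A) costs Z

    def cheapest_price_for_items(n_items):
        """Cheapest way to obtain n_items tuples of R under schemes (a) and (b)."""
        if n_items == 0:
            return 0.0
        per_item = n_items * X
        via_bundles = math.ceil(n_items / K) * Y
        return min(per_item, via_bundles)

    whole_relation = cheapest_price_for_items(N)   # 50.00 with these numbers
    print(f"whole relation: {whole_relation:.2f}, posted price of projection: {Z:.2f}")
    # Since the projection is determined by the whole relation, no rational buyer
    # pays Z = 60 when the full relation costs 50: the posted price for V is
    # inconsistent with the other offers.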
Tractable Queries over Probabilistic databases. Consider
probabilistic databases where each tuple is an independent random
variable. This project asks you to characterize the queries
(sentences) Q for which P(Q) can be computed in polynomial time in the
size of the database. If Q ranges over Unions of Conjunctive Queries
then the class of tractable queries is well understood (ask me for
references). Study the following classes of queries:
Horn clauses. For example, what is the complexity of computing the
probability of the rule R(x,y),R(y,z) -> R(x,z)? (In other words,
given a graph where edges have independent probabilities, compute the
probability that the graph is transitively closed.) What about
R(x,y),S(y,z) -> T(y)? (I think that the latter is in PTIME.) You
would have to describe a large class of tractable queries, and a large
class of queries that is intractable. If every Horn clause sentence
falls into one of your two classes, then you have proven a dichotomy
theorem, which would be quite interesting; but you can settle for less
than that.
Full First Order Logic. For example, what is the complexity of
computing the probability of the sentence forall x.(exists
y.R(x,y) -> exists z.S(x,z))? This sentence checks an inclusion
constraint (every value occurring in R.x also occurs in S.x).
In some cases, all tuples in a relation have exactly the same
probability. Study how this can be exploited in a query evaluation
algorithm. For example, if probabilities are arbitrary, then
computing the probability of the query exists x.exists y.(R(x), S(x,y), T(y)) is
hard for #P. But if all probabilities in R are equal, all
probabilities in S are equal, and all probabilities in T are equal,
(and “all tuples are present” – ask me what that means) then one can
compute the probability of this query in polynomial time in the size
of the database. Explore better ways to compute query probabilities
when all probabilities are equal.
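To pin down the semantics that any tractable algorithm must agree
with, here is a brute-force Python sketch that computes the
probability of the query exists x.exists y.(R(x), S(x,y), T(y)) by
enumerating the possible worlds of a tuple-independent database. The
tuples and probabilities are made up, and the running time is
exponential in the number of tuples.

    from itertools import product

    # Made-up tuples with their marginal probabilities.
    R = {("a",): 0.5, ("b",): 0.5}
    S = {("a", "c"): 0.7, ("b", "d"): 0.2}
    T = {("c",): 0.9, ("d",): 0.4}

    def query_true(r, s, t):
        """Does the world (r, s, t) satisfy exists x.exists y.(R(x), S(x,y), T(y))?"""
        return any((x,) in r and (y,) in t for (x, y) in s)

    def prob_query(R, S, T):
        tuples = [(name, tup, p) for name, rel in (("R", R), ("S", S), ("T", T))
                  for tup, p in rel.items()]
        total = 0.0
        for world in product([True, False], repeat=len(tuples)):
            prob, facts = 1.0, {"R": set(), "S": set(), "T": set()}
            for present, (name, tup, p) in zip(world, tuples):
                prob *= p if present else (1 - p)
                if present:
                    facts[name].add(tup)
            if query_true(facts["R"], facts["S"], facts["T"]):
                total += prob
        return total

    print(prob_query(R, S, T))   # approximately 0.3424 for these probabilities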