Project ideas added in 2020:
Universe Sampling:
You want to estimate the size of a join R \Join S by computing a
sample over R and one over S, then joining them. Taking a Bernoulli
sample from each relation is inefficient, because the expected size
of the join of the two samples is tiny. This paper
http://www.vldb.org/pvldb/vol13/p547-huang.pdf discusses "universe
sampling", which works well when the relations are not skewed. When
R and/or S have "heavy hitters", the paper proposes to use Bernoulli
sampling instead. (The paper actually computes a mixture of both.)
Suggestions to explore:
(a) instead of a Bernoulli sample, compute and store the heavy
hitters separately, then use universe sampling only for the light
hitters. Measure the improvement of universe sampling plus heavy
hitters over universe sampling plus Bernoulli sampling.
(b) extend the method to multiple joins, in particular to acyclic
joins.
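For reference, here is a minimal Python sketch of universe sampling
for a single-key join; the hashing and scaling are standard, but all
names are illustrative and the estimator is deliberately naive:

  import hashlib
  from collections import Counter

  def in_universe(key, p):
      # Map the key deterministically to [0,1); keep it if it falls below p.
      d = hashlib.sha256(str(key).encode()).digest()
      return int.from_bytes(d[:8], "big") / 2**64 < p

  def universe_sample(tuples, p):
      # Keep every tuple whose JOIN KEY hashes into the sampled universe.
      # Using the SAME hash on both relations is what makes the samples join.
      return [t for t in tuples if in_universe(t[0], p)]

  def estimate_join_size(R, S, p):
      rs, ss = universe_sample(R, p), universe_sample(S, p)
      cr = Counter(t[0] for t in rs)
      cs = Counter(t[0] for t in ss)
      # Each join result tuple survives with probability p, so scale by 1/p.
      return sum(cr[k] * cs[k] for k in cr) / p

  # Toy usage: R and S as lists of (joinKey, payload) tuples.
  R = [(i % 100, i) for i in range(10000)]
  S = [(i % 100, -i) for i in range(5000)]
  print(estimate_join_size(R, S, p=0.1))   # true join size is 500,000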
--------
Variable order for worst-case-optimal algorithm.
The state-of-the-art worst-case optimal algorithm for computing a
multi-way join is "Generic-Join" from this paper:
https://dl.acm.org/doi/10.1145/2590989.2590991  We will discuss it in
class, in a form simpler than the paper's (and equally powerful).
The algorithm is theoretically optimal for ANY variable order. In
practice the variable order should matter, but it is unknown how to
choose a "good" one. In this project you will explore alternative
variable orders for Generic-Join (a.k.a. the worst-case optimal join
algorithm), and propose a cost-based optimization method for choosing
an optimal variable order.
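To make the variable-order question concrete, here is a minimal
Python sketch of Generic-Join over set-based relations. It follows
the simple recursive form we will see in class, but uses naive scans
for the intersections (a real implementation would use sorted indexes
or tries), and all names here are mine:

  def generic_join(order, relations):
      # relations: list of (vars, tuples); vars is a tuple of variable
      # names, tuples is a set of same-length value tuples.
      def candidates(v, asg):
          # Intersect the possible values of v across all relations that
          # mention v and agree with the partial assignment asg.
          result = None
          for vars_, tuples in relations:
              if v not in vars_:
                  continue
              vals = set()
              for t in tuples:
                  row = dict(zip(vars_, t))
                  if all(asg.get(u, row[u]) == row[u] for u in vars_):
                      vals.add(row[v])
              result = vals if result is None else result & vals
          return result or set()

      def recurse(asg, i):
          if i == len(order):
              yield tuple(asg[v] for v in order)
              return
          for val in candidates(order[i], asg):
              asg[order[i]] = val
              yield from recurse(asg, i + 1)
              del asg[order[i]]

      return set(recurse({}, 0))

  # Triangle query Q(x,y,z) = R(x,y), S(y,z), T(z,x) under order (x,y,z):
  R = {("a","b")}; S = {("b","c")}; T = {("c","a")}
  print(generic_join(("x","y","z"),
                     [(("x","y"), R), (("y","z"), S), (("z","x"), T)]))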
--------
Normalized Schema Discovery for Training Data
An ML model is trained on the "design matrix", which is a table with
features F1, F2, ... and an outcome variable Y. But this table has
often been obtained through a sequence of transformations from a
relational database. In this project you will discover a normalized
schema for training data. Start from some training data, e.g. from
Kaggle, given as a table R(F1, F2, ..., Y), and normalize it into
relations S1(F_11, F_12, ...) \Join S2(F_21, F_22, ...) \Join ...
The decomposition can be approximate. You will need to use existing
tools for this: mining of functional dependencies and approximate
functional dependencies (see the systems from Felix Naumann's group
https://hpi.de/naumann/home.html, and/or the discovery of approximate
acyclic schemas developed here: https://arxiv.org/abs/1911.12933).
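As a starting point, here is a minimal Python sketch (not the HPI
tools) of measuring how well an approximate functional dependency
X -> Y holds, using the standard g3-style error: the fraction of rows
that must be removed for X -> Y to hold exactly:

  from collections import Counter, defaultdict

  def fd_error(rows, x_cols, y_cols):
      groups = defaultdict(Counter)
      for row in rows:
          x = tuple(row[c] for c in x_cols)
          y = tuple(row[c] for c in y_cols)
          groups[x][y] += 1
      # Within each X-group, keep the most frequent Y value; all other
      # rows in the group violate the dependency.
      kept = sum(max(counts.values()) for counts in groups.values())
      return 1 - kept / len(rows)

  rows = [{"F1": 1, "F2": "a"}, {"F1": 1, "F2": "a"},
          {"F1": 1, "F2": "b"}, {"F1": 2, "F2": "c"}]
  print(fd_error(rows, ["F1"], ["F2"]))   # 0.25: one of four rows violates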
================================================================
Project ideas for 2018:
This list contains a few suggestions; more may be added in the
first 2-3 weeks of the quarter.
You can either choose from the list below, or come up with your own
project, or do something in between.
---------
An ML/DB project.
The paper referenced below uses a deep neural network to compress an
index; see also the subsequent blog post and the discussion therein.
In this project you will extend this idea from one dimension to
multiple dimensions; in other words, compress multi-dimensional data.
A multidimensional database is just a table where most attribute
combinations are present and there is one measured variable. For
example:
Sales(productID, storeID, dayOfWeek, quantity)
The table encodes a function f(productID, storeID, dayOfWeek) =
quantity, and does this by storing all N^3 combinations of
productID/storeID/dayOfWeek. Use ML to "learn" a simpler
representation of the function, then combine it with a local search to
find the exact value (see the referenced paper). Experiment with some
simple datasets.
http://learningsys.org/nips17/assets/papers/paper_22.pdf
http://databasearchitects.blogspot.de/2017/12/the-case-for-b-tree-index-structures.html?m=1
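For intuition, here is a minimal one-dimensional sketch of the
learned-index idea from the first reference, with a linear model
standing in for the staged neural network; the multidimensional
extension is the actual project:

  import bisect
  import numpy as np

  rng = np.random.default_rng(0)
  keys = np.sort(rng.integers(0, 10**6, size=10000))
  positions = np.arange(len(keys))

  # "Model": fit position as a function of key, and record the maximum
  # error so that the local search below is guaranteed to find the key.
  a, b = np.polyfit(keys, positions, deg=1)
  err = int(np.max(np.abs(a * keys + b - positions))) + 1

  def lookup(k):
      pos = int(a * k + b)                       # predicted position
      lo = max(0, pos - err)
      hi = min(len(keys), pos + err + 1)
      # Exact local search within the model's maximum error bound.
      i = lo + bisect.bisect_left(keys[lo:hi].tolist(), k)
      return i if i < len(keys) and keys[i] == k else None

  print(lookup(int(keys[123])))                  # finds an existing key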
---------
A theoretical project.
Prove lower bounds for Boolean queries using the Strong Exponential
Time Hypothesis (SETH). Consider a Boolean conjunctive query Q. The
best known algorithm for solving the problem "given D, is Q(D) true?"
runs in time \tilde O(N^{subw(Q)}), where N is the size of the
largest relation and subw(Q) is the submodular width of Q. Prove
that this is also a lower bound. It suffices to restrict the proof to
a small set of queries Q. Suggestions for queries to try (some of
these may already be solved in the literature):
-- triangle: Q() = R(x,y) and S(y,z) and T(z,x). subw(Q)=fhtw(Q)=3/2
-- 4-cycle: Q() = R(x,y) and S(y,z) and T(z,u) and K(u,x). subw(Q)=3/2, fhtw(Q)=2
-- 4-clique: Q() = R(x,y),S(x,z),T(x,u),K(y,z),L(y,u),M(z,u). subw(Q)=6
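For concreteness, the target statement for the triangle query would
look roughly as follows (a hedged formulation; pinning down the exact
hypothesis and reduction is part of the project):

  SETH  ==>  for all \varepsilon > 0: no algorithm decides Q(D) in time
  \tilde O( N^{subw(Q) - \varepsilon} ) = \tilde O( N^{3/2 - \varepsilon} )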
Reference:
Daniel Lokshtanov, Dániel Marx, Saket Saurabh:
Lower bounds based on the Exponential Time Hypothesis. Bulletin of the EATCS 105: 41-72 (2011)
Mahmoud Abo Khamis, Hung Q. Ngo, Dan Suciu:
What Do Shannon-type Inequalities, Submodular Width, and Disjunctive Datalog Have to Do with One Another? PODS 2017: 429-444
--------
A systems implementation (and evaluation) project.
Implement B^\varepsilon-trees and evaluate their performance
empirically. These are write-optimized trees, claimed to be better
than LSM-trees or B+-trees. In this project you will validate their
claims experimentally. You may either use an existing implementation
of B^\varepsilon-trees (if one exists) and perform an extensive
performance evaluation, or implement your own, in which case a more
limited evaluation is acceptable.
Reference:
https://www.usenix.org/publications/login/oct15/bender
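To illustrate the mechanism being evaluated, here is a toy Python
sketch of the buffering idea only: the root buffers writes and
flushes them to leaves in batches, so each insert does O(1) work up
front. A real B^\varepsilon-tree buffers at every internal node and
rebalances; none of that is modeled here:

  import bisect

  class ToyBufferedTree:
      def __init__(self, pivots, capacity=1024):
          self.pivots = pivots                 # pivots partition the key space
          self.leaves = [dict() for _ in range(len(pivots) + 1)]
          self.buffer = {}                     # pending key -> value messages
          self.capacity = capacity

      def _leaf_of(self, key):
          return bisect.bisect_right(self.pivots, key)

      def insert(self, key, value):
          self.buffer[key] = value             # O(1): just buffer the message
          if len(self.buffer) >= self.capacity:
              self._flush()

      def _flush(self):
          # Batched write: all buffered messages reach their leaves in one pass.
          for key, value in self.buffer.items():
              self.leaves[self._leaf_of(key)][key] = value
          self.buffer.clear()

      def lookup(self, key):
          # Reads consult the buffer first, since it holds the newest version.
          if key in self.buffer:
              return self.buffer[key]
          return self.leaves[self._leaf_of(key)].get(key)

  t = ToyBufferedTree(pivots=[100, 200, 300])
  for i in range(5000):
      t.insert(i % 400, i)
  print(t.lookup(42))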
--------
A database project, best fit for someone interested in statistics.
Variant 1. (clear goals and deliverables) Perform cardinality
estimation using "correlated sampling". Extend the correlated
sampling method from the reference below to handle multi-join queries.
Experiment with it on (1) the IMDB dataset and queries (see the links
below), and (2) some synthetic multi-join queries of your choosing.
Variant 2. (a bit more theoretical and higher risk) Generalize the
AMS sketch to multi-join queries.
David Vengerov, Andre Cavalheiro Menck, Mohamed Zaït, Sunil Chakkappen:
Join Size Estimation Subject to Filter Conditions. PVLDB 8(12): 1530-1541 (2015)
https://imdbpy.sourceforge.io/docs/README.sqldb.txt
https://github.com/gregrahn/join-order-benchmark
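For Variant 2, here is a minimal Python sketch of the basic AMS
estimator for a single join, the case the project would generalize;
the per-key random signs below stand in for a 4-universal hash:

  import random

  def make_sign(seed):
      rnd, cache = random.Random(seed), {}
      def sign(key):                    # +/-1 per key, consistent within a copy
          if key not in cache:
              cache[key] = rnd.choice((-1, 1))
          return cache[key]
      return sign

  def ams_estimate(R, S, copies=200):
      # Each copy keeps one counter per relation: X = sum of sign(joinKey).
      # E[X_R * X_S] equals the join size; average copies to cut variance.
      total = 0
      for c in range(copies):
          sign = make_sign(c)
          xr = sum(sign(k) for k, _ in R)
          xs = sum(sign(k) for k, _ in S)
          total += xr * xs
      return total / copies

  R = [(i % 50, i) for i in range(1000)]   # 20 tuples per key
  S = [(i % 50, -i) for i in range(500)]   # 10 tuples per key
  print(ams_estimate(R, S))                # true join size: 50*20*10 = 10000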
--------
A data analysis project; open ended.
Integrate the DBLP data with citation data from AMiner (ArnetMiner).
Do some interesting analysis of the citation graph.
References:
http://dblp.uni-trier.de/xml/
https://aminer.org/data
--------
An information extraction project.
Choose the main conference of your field, and extract from the web
the complete database of all program committee members, from the
first year the conference was held until the present. For example,
we have such an extraction for the VLDB conference, in the Assembla
repository below. In this project you will construct a similar
database for the organization of your preferred conference. The
interesting part of this project is what you learn about automating
the process: build tools that would let you repeat the collection for
another conference, or even for (say) all ACM-sponsored conferences.
svn co https://subversion.assembla.com/svn/vldb-bot/
--------
A systems project, consisting of an experimental evaluation.
While distributed data processing holds the potential for improved
scalability, the observation in the COST paper below is that several
big data systems get this wrong: they introduce additional overhead
to handle distribution, then show that performance improves as the
number of servers increases, even though a competent single-threaded
baseline can outperform them outright. This is a shocking finding.
In this project you will verify that McSherry's observations are
correct. Start by reproducing the findings from the 2017 blog post
(below), then extend your experiments to other systems/datasets.
Frank McSherry, Michael Isard, Derek Gordon Murray:
Scalability! But at what COST? HotOS 2015
https://github.com/frankmcsherry/blog/blob/master/posts/2017-09-23.md
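For a sense of the baseline side, here is a minimal single-threaded
PageRank over an edge list in Python; McSherry's posts use Rust, and
this stand-in is only a sketch of the kind of laptop-scale program a
distributed system must beat:

  from collections import defaultdict

  def pagerank(edges, iters=20, d=0.85):
      out_deg = defaultdict(int)
      nodes = set()
      for u, v in edges:
          out_deg[u] += 1
          nodes.update((u, v))
      rank = dict.fromkeys(nodes, 1.0 / len(nodes))
      for _ in range(iters):
          contrib = defaultdict(float)
          for u, v in edges:                 # one streaming pass per iteration
              contrib[v] += rank[u] / out_deg[u]
          rank = {n: (1 - d) / len(nodes) + d * contrib[n] for n in nodes}
      return rank

  edges = [(0, 1), (1, 2), (2, 0), (2, 1)]   # toy graph; use a real dataset
  print(pagerank(edges))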
--------
A theoretical project.
This project concerns the optimal communication cost for computing a
conjunctive query on a large database over p servers. For the join of
two relations R \Join S, the optimal communication per server is:
L = O( |Input| / p + \sqrt{ |Output| / p } )
Generalize this to more complex conjunctive queries. Two hints:
- most theoretical algorithms studied so far are hash-based. Try
  using a sort-based algorithm instead. The reference below describes
  a simple sort-based algorithm for joins; can it be generalized?
- what is the optimal load in general? This is an open question, but
  my guess is that the formula is:
  L = O( |Input| / p^{1/\tau^*} + ( |Output| / p )^{1/\rho^*} )
  where \rho^*, \tau^* are the fractional edge covering number and the
  fractional edge packing number of the query, respectively.
In this project you will start from simple queries, study
algorithms/lower bounds, then try to extend to more general queries.
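As a sanity check on the guessed formula, instantiating it for the
two-relation join recovers the known bound above, since for
R(x,y) \Join S(y,z) the fractional edge packing number is \tau^* = 1
and the fractional edge covering number is \rho^* = 2:

  L = O( |Input| / p^{1/1} + ( |Output| / p )^{1/2} )
    = O( |Input| / p + \sqrt{ |Output| / p } )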
Reference:
Paraschos Koutris, Semih Salihoglu, Dan Suciu:
Algorithmic Aspects of Parallel Data Processing (in press -- ask me for a copy)
-----
A database / programming language project
Implement incomplete databases and the "choice" operator, by compiling
SQL to Rosette https://emina.github.io/rosette/ or to Alloy
http://alloy.mit.edu/alloy/tutorials/online/
The goal is to extend SQL to support the following:
-- Choice values; think of them as alternate choices for an
attribute. For example:
Product
-------------------------------
| Name | Color |
-------------------------------
| Toy | red or blue |
| Car | red or green |
| Truck| blue |
| Train| red or blue or yellow |
-------------------------------
The meaning is that the color of Toy can be either 'red' or 'blue',
each choice leading to a different possible world. In total there
are 2*2*1*3 = 12 possible worlds, i.e., 12 possible database
instances; hence this is an "incomplete database".
Extend the 'insert' statement to:
insert into Product values ('phone', 'blue' or 'green')
-- Choice operator: extend sql with a choice operator:
select Product.name, choice(Review.score)
from Product, Review
where Product.name = Review.name
group by Product.name
Here choice(...) nondeterministically selects one of the review
scores for each product.
-- Implement functionality over incomplete databases: compute the
"certain answer"; compute "conditional certain answers"; etc.
References:
Jiewen Huang, Lyublena Antova, Christoph Koch, Dan Olteanu:
MayBMS: a probabilistic database management system. SIGMOD Conference 2009: 1071-1074
Todd J. Green, Val Tannen:
Models for Incomplete and Probabilistic Information. IEEE Data Eng. Bull. 29(1): 17-24 (2006)
---------
This is an exploratory project, suitable for someone with some
knowledge of Deep Learning.
Design a query language for DNNs (deep neural networks).
Example: a classifier that decides whether to approve a mortgage based
on (zip, profession, ..., lots of attributes).
Query 1: John was denied. Would John have been approved had his race
been different?
Query 2: Is there any person for whom changing the race would change
the outcome?
Query 3: Check whether, for every zipcode, there exists a person for
whom changing the race would change the decision.
The goal of the project is to design a query language in which such
questions can be formulated, and to implement it. The project is
totally open ended, and quite high risk, since I won't be able to help
on the ML side.
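As a starting point, here is a minimal Python sketch of what Queries
1 and 2 could compile to, treating the classifier as a black box; the
model, attributes, and data below are all hypothetical placeholders:

  def flips_outcome(approve, x, attr, values):
      # Query 1: would changing x[attr] to another value change the outcome?
      base = approve(x)
      return any(approve({**x, attr: v}) != base for v in values if v != x[attr])

  def exists_flip(approve, people, attr, values):
      # Query 2: is there any person whose outcome is sensitive to attr?
      return any(flips_outcome(approve, x, attr, values) for x in people)

  # Hypothetical usage with a stand-in "model":
  approve = lambda x: x["income"] > 50000            # placeholder classifier
  john = {"income": 40000, "race": "A", "zip": "98195"}
  print(flips_outcome(approve, john, "race", ["A", "B"]))   # False here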