Project ideas (list of 2021): [Note: the best 544 class project is one related to your research]

--------

This is an exploratory project, suitable for someone with an interest in machine learning. Design a language to query a model, e.g. a DNN or some other ML model. For example, consider a classifier that decides whether to approve a mortgage based on (zip, profession, .... lots of attributes).
Query 1: John was denied. Would John have been approved had his race been different?
Query 2: Is there a combination of attributes for which changing only the race would change the outcome?
Query 3: Check whether for every zipcode there exists a person for whom changing the race would change the decision.
The goal of the project is to design a query language in which such questions can be formulated, and to implement it. The project is totally open ended, and quite high risk, since I won't be able to help on the ML side.

--------

A systems implementation (and evaluation) project. Implement and perform an empirical evaluation of the performance of B^\varepsilon-trees. These are write-optimized trees, claimed to be better than LSM-trees and B+-trees; in this project you will validate their claims experimentally. You may either use an existing implementation of B^\varepsilon-trees (if one exists) and perform an extensive performance evaluation, or implement your own, in which case a more limited performance evaluation is acceptable.
Reference: https://www.usenix.org/publications/login/oct15/bender

--------

Variable order for the worst-case-optimal join algorithm. The state-of-the-art optimal algorithm for computing a multi-way join is "Generic-Join" from this paper: https://dl.acm.org/doi/10.1145/2590989.2590991
We will discuss it in class, in a form that is simpler than the paper's (and equally powerful). The algorithm is theoretically optimal for ANY variable order. In practice the variable order should matter, but it is unknown how to choose a "good" one. In this project you will explore alternative variable orders for Generic-Join (a.k.a. the worst-case-optimal join algorithm) and propose a cost-based optimization method for choosing an optimal variable order.

--------

Normalized Schema Discovery for Training Data. An ML model is trained on the "design matrix", a table with features F1, F2, ... and outcome variable Y. But this table has often been obtained from a relational database through a sequence of transformations. In this project you will discover a normalized schema for training data. Start from some training data, e.g. from Kaggle, given as a table R(F1, F2, ..., Y), and normalize it into relations S1(F_11, F_12, ...) \Join S2(F_21, F_22, ...) \Join ... The decomposition can be approximate. You will need to use existing tools for this: mining of Functional Dependencies and approximate Functional Dependencies, see the systems by Felix Naumann https://hpi.de/naumann/home.html, and/or discovery of approximate acyclic schemas https://dl.acm.org/doi/10.1145/3318464.3380573

--------

Universe Sampling: You want to estimate the size of a join R \Join S by computing a sample over R and one over S, then joining them. Taking a Bernoulli sample from each relation is inefficient, because the expected size of the join of the two samples is tiny. This paper http://www.vldb.org/pvldb/vol13/p547-huang.pdf discusses "universe sampling", which works well when the relations are not skewed. But when R and/or S have "heavy hitters", the paper proposes to use Bernoulli sampling instead. (The paper actually computes a mixture of both.) A minimal sketch of universe sampling is given below.
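To make the idea concrete, here is a minimal Python sketch of universe sampling (our illustration, not the paper's code; the function names and the parameter p are ours). Both relations hash the join key into the same random "universe" and keep only keys whose hash falls below a fraction p; matching keys therefore survive or die together, and the joined sample is scaled up by 1/p. With independent Bernoulli samples a matching pair survives only with probability p^2, which is why that joined sample is nearly empty.

import hashlib
from collections import Counter

def in_universe(key, p):
    # Deterministically hash the join key to a number in [0, 1).
    h = int(hashlib.md5(str(key).encode()).hexdigest(), 16)
    return (h % 2**32) / 2**32 < p

def universe_sample(tuples, attr, p):
    # Keep a tuple iff its join key falls in the sampled universe.
    # Both relations use the SAME hash, so matching keys survive together.
    return [t for t in tuples if in_universe(t[attr], p)]

def estimate_join_size(R, S, attr, p=0.01):
    # A surviving key contributes its full fan-out on both sides,
    # so scaling by 1/p gives an unbiased estimate of |R join S|.
    r_cnt = Counter(t[attr] for t in universe_sample(R, attr, p))
    s_cnt = Counter(t[attr] for t in universe_sample(S, attr, p))
    return sum(r_cnt[k] * s_cnt[k] for k in r_cnt.keys() & s_cnt.keys()) / p

Note the skew problem this exposes: a heavy hitter that lands in the universe inflates the estimate by its entire fan-out, and one that misses contributes nothing. Suggestion (a) below treats the heavy hitters exactly instead.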
Suggestions to explore: (a) Instead of a Bernoulli sample, compute and store the heavy hitters separately, then use universe sampling only for the light hitters. Measure the improvement of universe sampling plus heavy hitters over universe sampling plus Bernoulli sampling. (b) Extend the method to multiple joins, in particular to acyclic joins.

--------

A theoretical project. Prove lower bounds for Boolean queries using the Strong Exponential Time Hypothesis (SETH). Consider a Boolean Conjunctive Query Q. The best known algorithm for the problem "given D, is Q(D) true?" runs in time \tilde O(N^{subw(Q)}), where N is the size of the largest relation and subw(Q) is the submodular width of Q. Prove that this is also a lower bound. It suffices to restrict the proof to a small set of queries Q. Suggestions for queries to try (some of these may already be solved in the literature):
-- triangle: Q() = R(x,y) and S(y,z) and T(z,x). subw(Q) = fhtw(Q) = 3/2
-- 4-cycle: Q() = R(x,y) and S(y,z) and T(z,u) and K(u,x). subw(Q) = 3/2, fhtw(Q) = 2
-- 4-clique: Q() = R(x,y), S(x,z), T(x,u), K(y,z), L(y,u), M(z,u). subw(Q) = fhtw(Q) = 2
References:
Daniel Lokshtanov, Dániel Marx, Saket Saurabh: Lower bounds based on the Exponential Time Hypothesis. Bulletin of the EATCS 105: 41-72 (2011)
Mahmoud Abo Khamis, Hung Q. Ngo, Dan Suciu: What Do Shannon-type Inequalities, Submodular Width, and Disjunctive Datalog Have to Do with One Another? PODS 2017: 429-444

--------

A database project, best fit for someone interested in statistics.
Variant 1 (clear goals and deliverables). Perform cardinality estimation using "correlated sampling". Extend the correlated sampling method from the reference below to handle multi-join queries. Experiment with it on (1) the IMDB dataset and queries (links below), and (2) some synthetic multi-join queries of your choosing.
Variant 2 (a bit more theoretical and higher risk). Generalize the AKS sketch to multi-join queries.
References:
David Vengerov, Andre Cavalheiro Menck, Mohamed Zaït, Sunil Chakkappen: Join Size Estimation Subject to Filter Conditions. PVLDB 8(12): 1530-1541 (2015)
https://imdbpy.sourceforge.io/docs/README.sqldb.txt
https://github.com/gregrahn/join-order-benchmark

--------

An information extraction project. Choose the main conference of your field and extract from the web the complete database of all program committee members, from the first year the conference was held until the present. For example, we have such an extraction for the VLDB conference, in the assembla repository below. In this project you will construct a similar database for the organization of your preferred conference. The interesting part of this project is the lessons you learn while automating the process: build tools that would let you repeat this collection for another conference, or even for (say) all ACM-sponsored conferences.
svn co https://subversion.assembla.com/svn/vldb-bot/

--------

A systems project, consisting of an experimental evaluation. While distributed data processing holds the potential for improved scalability, the observation of the COST paper below is that several big data systems get this wrong: they introduce additional overhead to handle the distribution, then show that by increasing the number of servers the performance of the system improves. This is a shocking finding. In this project you will verify that McSherry's observations are correct. Start by reproducing the findings from the 2017 blog post (see the references below), then extend your experiments to other systems/datasets. A sketch of the kind of single-threaded baseline the COST methodology compares against follows below.
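For calibration: the COST paper's single-threaded baselines are short Rust programs; the sketch below is our Python rendering of one of them (label-propagation connected components over an edge list), so the file format and names here are ours, not McSherry's code. The measurement of interest is how many cores a distributed system needs before it outperforms a loop like this (that number is the system's "COST").

import sys

def connected_components(edge_file):
    # Read a whitespace-separated edge list: one "src dst" pair per line.
    edges = []
    with open(edge_file) as f:
        for line in f:
            a, b = line.split()
            edges.append((int(a), int(b)))
    # Label propagation to a fixpoint: each node's label converges to the
    # smallest node id in its component, i.e. one label per component.
    label = {}
    for a, b in edges:
        label.setdefault(a, a)
        label.setdefault(b, b)
    changed = True
    while changed:
        changed = False
        for a, b in edges:
            m = min(label[a], label[b])
            if label[a] != m:
                label[a] = m
                changed = True
            if label[b] != m:
                label[b] = m
                changed = True
    return label

if __name__ == "__main__":
    label = connected_components(sys.argv[1])
    print(len(set(label.values())), "components")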
References:
Frank McSherry, Michael Isard, Derek Gordon Murray: Scalability! But at what COST? HotOS 2015
https://github.com/frankmcsherry/blog/blob/master/posts/2017-09-23.md

--------

A database / programming language project. Implement incomplete databases and the "choice" operator by compiling SQL to Rosette https://emina.github.io/rosette/ or to Alloy http://alloy.mit.edu/alloy/tutorials/online/
The goal is to extend SQL to support the following (from the MayBMS paper below):
-- Choice values; think of them as alternative choices for an attribute. For example:

Product
---------------------------------
| Name  | Color                 |
---------------------------------
| Toy   | red or blue           |
| Car   | red or green          |
| Truck | blue                  |
| Train | red or blue or yellow |
---------------------------------

The meaning is that the color of Toy can be either 'red' or 'blue', each choice leading to a different set of worlds. In total there are 2*2*1*3 = 12 possible worlds, i.e. 12 possible database instances; hence this is an "incomplete database".
Extend the 'insert' statement to:

insert into Product values ('phone', 'blue' or 'green')

-- Choice operator: extend SQL with a choice operator:

select Product.name, choice(Review.score)
from Product, Review
where Product.name = Review.name
group by Product.name

Here choice(....) nondeterministically selects one score for each product, among that product's reviews.
-- Implement functionality over incomplete databases: compute the "certain answers"; compute "conditional certain answers"; etc. (A brute-force illustration of possible worlds and certain answers follows after the references below.)
References:
Jiewen Huang, Lyublena Antova, Christoph Koch, Dan Olteanu: MayBMS: a probabilistic database management system. SIGMOD Conference 2009: 1071-1074
Todd J. Green, Val Tannen: Models for Incomplete and Probabilistic Information. IEEE Data Eng. Bull. 29(1): 17-24 (2006)
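To make "possible worlds" and "certain answers" concrete, here is a brute-force Python sketch (plain enumeration, not the Rosette/Alloy compilation the project asks for; all names are illustrative, not MayBMS syntax). It enumerates the 12 worlds of the Product table above and computes the certain answers of the query "names of blue products":

from itertools import product as cartesian_product

# Each tuple carries a set of alternative values for Color.
PRODUCT = [
    ("Toy",   ["red", "blue"]),
    ("Car",   ["red", "green"]),
    ("Truck", ["blue"]),
    ("Train", ["red", "blue", "yellow"]),
]

def worlds(table):
    # One complete database instance per combination of choices:
    # here 2 * 2 * 1 * 3 = 12 worlds, each mapping name -> one color.
    names = [name for name, _ in table]
    for colors in cartesian_product(*[choices for _, choices in table]):
        yield dict(zip(names, colors))

def certain_answers(table, query):
    # A tuple is a certain answer iff the query returns it in EVERY world.
    return set.intersection(*(query(w) for w in worlds(table)))

blue_products = lambda world: {n for n, c in world.items() if c == "blue"}

print(sum(1 for _ in worlds(PRODUCT)))          # 12 possible worlds
print(certain_answers(PRODUCT, blue_products))  # {'Truck'}: blue in every world

The point of the project is precisely that this enumeration explodes exponentially; compiling to a constraint solver such as Rosette or Alloy lets the solver search the worlds symbolically instead.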