Project ideas for 2018: This list contains a few suggestions; more may be added in the first 2-3 weeks of the quarter. You can choose a project from the list below, come up with your own, or do something in between.

---------

An ML/DB project.

The paper referenced below uses a deep neural network to compress an index; see also the subsequent blog post and the discussion therein. In this project you will extend this idea from one dimension to multiple dimensions; in other words, compress multi-dimensional data. A multidimensional database is just a table where most attribute combinations are present and there is one measured variable. For example:

Sales(productID, storeID, dayOfWeek, quantity)

The table encodes a function f(productID, storeID, dayOfWeek) = quantity, and does so by storing all N^3 combinations of productID/storeID/dayOfWeek. Use ML to "learn" a simpler representation of the function, then combine it with a local search to find the exact value (see the referenced paper). Experiment with some simple datasets.

http://learningsys.org/nips17/assets/papers/paper_22.pdf
http://databasearchitects.blogspot.de/2017/12/the-case-for-b-tree-index-structures.html?m=1
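To make the starting point concrete, here is a minimal sketch of the compression idea on a synthetic Sales table. It assumes scikit-learn and replaces the paper's local search with a simpler exception list; the data, model size, and all names are illustrative assumptions, not taken from the referenced paper.

import numpy as np
from sklearn.neural_network import MLPRegressor

N = 20  # domain size per dimension, so the table has N^3 = 8000 rows
keys = np.array([(p, s, d) for p in range(N)
                           for s in range(N)
                           for d in range(N)])
# A synthetic, learnable measure standing in for real sales data.
quantity = (keys[:, 0] + 2 * keys[:, 1] + 5 * np.sin(keys[:, 2])).round()

# "Compress" the table into a small model...
model = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000, random_state=0)
model.fit(keys, quantity)

# ...plus an exception list holding the exact value wherever the rounded
# prediction is wrong (the referenced paper uses a local search instead).
pred = np.rint(model.predict(keys))
exceptions = {tuple(k): q for k, q, p in zip(keys, quantity, pred) if p != q}
print("exact values stored:", len(exceptions), "of", len(keys))

def lookup(p, s, d):
    # Exact lookup: model prediction, corrected by the exception list.
    key = (p, s, d)
    if key in exceptions:
        return exceptions[key]
    return float(np.rint(model.predict([[p, s, d]])[0]))

print("lookup(3, 7, 2) =", lookup(3, 7, 2),
      "  stored value =", quantity[3 * N * N + 7 * N + 2])

The interesting experimental question is the trade-off: how does the size of the model plus the exception list compare to the original N^3 table as the measure becomes less learnable?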
---------

A theoretical project.

Prove lower bounds for Boolean queries under the Strong Exponential Time Hypothesis (SETH). Consider a Boolean conjunctive query Q. The best known algorithm for the problem "given D, is Q(D) true?" runs in time \tilde{O}(N^{subw(Q)}), where N is the size of the largest relation and subw(Q) is the submodular width of Q. Prove that this is also a lower bound. It suffices to restrict the proof to a small set of queries Q. Suggestions for queries to try (some of these may already be solved in the literature):

-- triangle: Q() = R(x,y) and S(y,z) and T(z,x). subw(Q) = fhtw(Q) = 3/2
-- 4-cycle: Q() = R(x,y) and S(y,z) and T(z,u) and K(u,x). subw(Q) = 3/2, fhtw(Q) = 2
-- 4-clique: Q() = R(x,y), S(x,z), T(x,u), K(y,z), L(y,u), M(z,u). subw(Q) = 6

References:
Daniel Lokshtanov, Dániel Marx, Saket Saurabh: Lower bounds based on the Exponential Time Hypothesis. Bulletin of the EATCS 105: 41-72 (2011)
Mahmoud Abo Khamis, Hung Q. Ngo, Dan Suciu: What Do Shannon-type Inequalities, Submodular Width, and Disjunctive Datalog Have to Do with One Another? PODS 2017: 429-444

---------

A systems implementation (and evaluation) project.

Implement and empirically evaluate the performance of B^\varepsilon-trees. These are write-optimized trees, claimed to be better than LSM-trees or B+-trees. In this project you will validate their claims experimentally. You may either use an existing implementation of B^\varepsilon-trees (if one exists) and perform an extensive performance evaluation, or implement your own, in which case it is OK if your performance evaluation is more limited.

Reference: https://www.usenix.org/publications/login/oct15/bender

---------

A database project, best fit for someone interested in statistics.

Variant 1 (clear goals and deliverables): Perform cardinality estimation using "correlated sampling". Extend the correlated sampling method from the reference below to handle multi-join queries. Experiment with it on (1) the IMDB dataset and queries (referenced below), and (2) some synthetic multi-join queries of your choosing.

Variant 2 (a bit more theoretical and higher risk): Generalize the AMS sketch to multi-join queries.

David Vengerov, Andre Cavalheiro Menck, Mohamed Zaït, Sunil Chakkappen: Join Size Estimation Subject to Filter Conditions. PVLDB 8(12): 1530-1541 (2015)
https://imdbpy.sourceforge.io/docs/README.sqldb.txt
https://github.com/gregrahn/join-order-benchmark
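For orientation, here is a minimal sketch of correlated sampling for a single two-way join, which Variant 1 would extend to multi-join queries; the synthetic relations and all parameters are illustrative assumptions.

import random
import hashlib
from collections import Counter

def h(key):
    # Hash a join key to a pseudo-random number in [0, 1).
    digest = hashlib.sha256(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def correlated_sample(tuples, key_of, p):
    # Keep a tuple iff its join key hashes below p. Using the SAME hash on
    # both relations is what makes the samples correlated: matching keys
    # survive on both sides or on neither.
    return [t for t in tuples if h(key_of(t)) < p]

random.seed(1)
R = [(random.randrange(100), i) for i in range(10_000)]  # R(a, x)
S = [(random.randrange(100), i) for i in range(20_000)]  # S(a, y)

p = 0.1
Rs = correlated_sample(R, lambda t: t[0], p)
Ss = correlated_sample(S, lambda t: t[0], p)

# Each joining pair survives with probability p (not p^2), so scale by 1/p.
counts = Counter(t[0] for t in Ss)
estimate = sum(counts[t[0]] for t in Rs) / p

true_counts = Counter(t[0] for t in S)
true_size = sum(true_counts[t[0]] for t in R)
print("estimate:", round(estimate), " true join size:", true_size)

The multi-join extension is where it gets interesting: with a chain of joins the survival events on different keys interact, and the estimator and its variance have to be reworked.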
---------

A data analysis project; open ended.

Integrate the dblp data with citation data from ArnetMiner. Do some interesting analysis of the citation graph.

References:
http://dblp.uni-trier.de/xml/
https://aminer.org/data

---------

An information extraction project.

Choose the main conference of your field and extract from the web the complete database of all program committee members, from the first year the conference was held until the present. For example, we have such an extraction for the VLDB conference, in the Assembla repository below. In this project you will construct a similar database of the organization of your preferred conference. The interesting part of this project is the lesson you learn in automating the process: build tools good enough that you could repeat the collection for another conference, or even for (say) all ACM-sponsored conferences.

svn co https://subversion.assembla.com/svn/vldb-bot/

---------

A systems project, consisting of experimental evaluation.

While distributed data processing holds the potential for improved scalability, the observation in the COST paper below is that several big data systems get this wrong: they introduce additional overhead to handle the distribution, then show that performance improves as the number of servers increases; meanwhile a competent single-threaded implementation can outperform the entire cluster. This is a shocking finding. In this project, you will verify that McSherry's observations are correct. Start by reproducing the findings from the 2017 blog post (below), then extend your experiments to other systems/datasets.

Frank McSherry, Michael Isard, Derek Gordon Murray: Scalability! But at what COST? HotOS 2015
https://github.com/frankmcsherry/blog/blob/master/posts/2017-09-23.md

---------

A theoretical project.

This project concerns the optimal communication cost of computing a conjunctive query on a large database over p servers. For the join of two relations R \Join S, the optimal communication (load) per server is:

L = O( |Input| / p + \sqrt{|Output| / p} )

Generalize this to more complex conjunctive queries. Two hints:

- Most theoretical algorithms studied so far are hash-based. Try using a sort-based algorithm instead. The reference below describes a simple sort-based algorithm for the join; can it be generalized?
- What is the optimal load in general? This is an open question, but my guess is that the formula is:

L = O( |Input| / p^{1/\tau^*} + ( |Output| / p )^{1/\rho^*} )

where \rho^* and \tau^* are the fractional edge covering number and the fractional edge packing number of the query. In this project you will start from simple queries, study algorithms/lower bounds, then try to extend to more general queries.

Reference: Paraschos Koutris, Semih Salihoglu, Dan Suciu: Algorithmic Aspects of Parallel Data Processing (in press -- ask me for a copy)

---------

A database / programming language project.

Implement incomplete databases and the "choice" operator, by compiling SQL to Rosette (https://emina.github.io/rosette/) or to Alloy (http://alloy.mit.edu/alloy/tutorials/online/). The goal is to extend SQL to support the following:

-- Choice values; think of them as alternative choices for an attribute. For example:

Product
----------------------------------
| Name  | Color                  |
----------------------------------
| Toy   | red or blue            |
| Car   | red or green           |
| Truck | blue                   |
| Train | red or blue or yellow  |
----------------------------------

The meaning is that the color of Toy can be either 'red' or 'blue', each choice leading to a different world. In total there are 2*2*1*3 = 12 possible worlds, i.e., 12 possible database instances; hence this is an "incomplete database". Extend the 'insert' statement accordingly:

insert into Product values ('phone', 'blue' or 'green')

-- Choice operator: extend SQL with a choice operator:

select Product.name, choice(Review.score)
from Product, Review
where Product.name = Review.name
group by Product.name

Here choice(...) nondeterministically selects one of the review scores for each product.

-- Implement functionality over incomplete databases: compute the "certain answers", compute "conditional certain answers", etc.

References:
Jiewen Huang, Lyublena Antova, Christoph Koch, Dan Olteanu: MayBMS: a probabilistic database management system. SIGMOD Conference 2009: 1071-1074
Todd J. Green, Val Tannen: Models for Incomplete and Probabilistic Information. IEEE Data Eng. Bull. 29(1): 17-24 (2006)

---------

An exploratory project, suitable for someone with some knowledge of deep learning.

Design a query language for DNNs (deep neural networks). Consider, for example, a classifier that decides whether to approve a mortgage based on (zip, profession, ..., lots of attributes).

Query 1: John was denied. Would John have been approved had his race been different?
Query 2: Is there any person for whom changing the race would change the outcome?
Query 3: Check whether for every zipcode there exists a person for whom changing the race would change the decision.

The goal of the project is to design a query language in which such questions can be formulated, and to implement it. The project is totally open ended, and quite high risk, since I won't be able to help on the ML side.
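To make the queries concrete, here is a minimal sketch of what Queries 1 and 2 might compile to, assuming scikit-learn and a synthetic mortgage dataset; the classifier, the features, and the binary race encoding are all illustrative assumptions.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Hypothetical features: [income, zipcode, race], race coded 0/1 for brevity.
X = rng.random((1000, 3))
X[:, 2] = rng.integers(0, 2, size=1000)
y = (X[:, 0] + 0.2 * X[:, 2] > 0.6).astype(int)  # a deliberately biased rule

clf = LogisticRegression().fit(X, y)

def decision_flips(model, person, attr):
    # Query 1: does flipping one attribute change the model's decision?
    flipped = person.copy()
    flipped[attr] = 1 - flipped[attr]
    return model.predict([person])[0] != model.predict([flipped])[0]

john = np.array([0.55, 0.3, 0.0])  # denied under the rule above
print("Query 1 (John):", decision_flips(clf, john, attr=2))

# Query 2 wraps an existential quantifier over people around the same test;
# Query 3 would add a universal quantifier over zipcodes on top of that.
print("Query 2:", any(decision_flips(clf, x, attr=2) for x in X))

The language-design question is how to express the counterfactual ("had his race been different") and the quantifier structure declaratively, and how to evaluate the quantifiers efficiently against the network rather than by brute-force enumeration as above.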