Project ideas (list of 2021): [Note: the best 544 class project is one related to your research]

--------

This is an exploratory project, suitable for someone with an interest in machine learning. Design a language to query a model, e.g. a DNN or some other ML model. For example, consider a classifier that decides whether to approve a mortgage based on (zip, profession, .... lots of attributes).
Query 1: John was denied. Would John have been approved had his race been different?
Query 2: Is there a combination of attributes for which changing only the race would change the outcome?
Query 3: Check whether for every zipcode there exists a person for whom changing the race would change the decision.
The goal of the project is to design a query language in which such questions can be formulated, and to implement it. The project is totally open ended, and quite high risk, since I won't be able to help on the ML side.

--------

A systems implementation (and evaluation) project. Implement and perform an empirical evaluation of the performance of B^\varepsilon-trees. These are write-optimized trees, claimed to be better than LSM-trees and B+-trees; in this project you will validate their claims experimentally. You may either use an existing implementation of B^\varepsilon-trees (if one exists) and perform an extensive performance evaluation, or implement your own, in which case a more limited performance evaluation is acceptable.
Reference: https://www.usenix.org/publications/login/oct15/bender

--------

Variable order for the worst-case-optimal join algorithm. The state-of-the-art optimal algorithm for computing a multi-way join is "Generic-Join" from this paper: https://dl.acm.org/doi/10.1145/2590989.2590991
We will discuss it in class, in a form that is simpler than the paper's (and equally powerful). The algorithm is theoretically optimal for ANY variable order. In practice the variable order should matter, but it is unknown how to choose a "good" one. In this project you will explore alternative variable orders for Generic-Join (a.k.a. the worst-case-optimal join algorithm) and propose a cost-based optimization method for choosing an optimal variable order.

--------

Normalized Schema Discovery for Training Data. An ML model is trained on the "design matrix", a table with features F1, F2, ... and outcome variable Y. But this table has often been obtained from a relational database through a sequence of transformations. In this project you will discover a normalized schema for training data. Start from some training data, e.g. from Kaggle, given as a table R(F1, F2, ..., Y), and normalize it into relations S1(F_11, F_12, ...) \Join S2(F_21, F_22, ...) \Join ... The decomposition can be approximate. You will need to use existing tools for this: mining of Functional Dependencies and approximate Functional Dependencies, see the systems by Felix Naumann https://hpi.de/naumann/home.html, and/or discovery of approximate acyclic schemas https://dl.acm.org/doi/10.1145/3318464.3380573

--------

Universe Sampling: You want to estimate the size of a join R \Join S by computing a sample over R and one over S, then joining them. Taking a Bernoulli sample from each relation is inefficient, because the expected size of the join of the two samples is tiny. This paper http://www.vldb.org/pvldb/vol13/p547-huang.pdf discusses "universe sampling", which works well when the relations are not skewed. But when R and/or S have "heavy hitters", the paper proposes to use Bernoulli sampling instead. (The paper actually computes a mixture of both.) A minimal sketch of universe sampling is given below.
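To make the idea concrete, here is a minimal Python sketch of universe sampling (our illustration, not the paper's code; the function names and the parameter p are ours). Both relations hash the join key into the same random "universe" and keep only keys whose hash falls below a fraction p; matching keys therefore survive or die together, and the joined sample is scaled up by 1/p. With independent Bernoulli samples a matching pair survives only with probability p^2, which is why that joined sample is nearly empty.

import hashlib
from collections import Counter

def in_universe(key, p):
    # Deterministically hash the join key to a number in [0, 1).
    h = int(hashlib.md5(str(key).encode()).hexdigest(), 16)
    return (h % 2**32) / 2**32 < p

def universe_sample(tuples, attr, p):
    # Keep a tuple iff its join key falls in the sampled universe.
    # Both relations use the SAME hash, so matching keys survive together.
    return [t for t in tuples if in_universe(t[attr], p)]

def estimate_join_size(R, S, attr, p=0.01):
    # A surviving key contributes its full fan-out on both sides,
    # so scaling by 1/p gives an unbiased estimate of |R join S|.
    r_cnt = Counter(t[attr] for t in universe_sample(R, attr, p))
    s_cnt = Counter(t[attr] for t in universe_sample(S, attr, p))
    return sum(r_cnt[k] * s_cnt[k] for k in r_cnt.keys() & s_cnt.keys()) / p

Note the skew problem this exposes: a heavy hitter that lands in the universe inflates the estimate by its entire fan-out, and one that misses contributes nothing. Suggestion (a) below treats the heavy hitters exactly instead.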
Suggestions to explore: (a) Instead of a Bernoulli sample, compute and store the heavy hitters separately, then use universe sampling only for the light hitters. Measure the improvement of universe sampling plus heavy hitters over universe sampling plus Bernoulli sampling. (b) Extend the method to multiple joins, in particular to acyclic joins.

--------

A theoretical project. Prove lower bounds for Boolean queries using the Strong Exponential Time Hypothesis (SETH). Consider a Boolean Conjunctive Query Q. The best known algorithm for the problem "given D, is Q(D) true?" runs in time \tilde O(N^{subw(Q)}), where N is the size of the largest relation and subw(Q) is the submodular width of Q. Prove that this is also a lower bound. It suffices to restrict the proof to a small set of queries Q. Suggestions for queries to try (some of these may already be solved in the literature):
-- triangle: Q() = R(x,y) and S(y,z) and T(z,x). subw(Q) = fhtw(Q) = 3/2
-- 4-cycle: Q() = R(x,y) and S(y,z) and T(z,u) and K(u,x). subw(Q) = 3/2, fhtw(Q) = 2
-- 4-clique: Q() = R(x,y), S(x,z), T(x,u), K(y,z), L(y,u), M(z,u). subw(Q) = fhtw(Q) = 2
References:
Daniel Lokshtanov, Dániel Marx, Saket Saurabh: Lower bounds based on the Exponential Time Hypothesis. Bulletin of the EATCS 105: 41-72 (2011)
Mahmoud Abo Khamis, Hung Q. Ngo, Dan Suciu: What Do Shannon-type Inequalities, Submodular Width, and Disjunctive Datalog Have to Do with One Another? PODS 2017: 429-444

--------

A database project, best fit for someone interested in statistics.
Variant 1 (clear goals and deliverables). Perform cardinality estimation using "correlated sampling". Extend the correlated sampling method from the reference below to handle multi-join queries. Experiment with it on (1) the IMDB dataset and queries (links below), and (2) some synthetic multi-join queries of your choosing.
Variant 2 (a bit more theoretical and higher risk). Generalize the AKS sketch to multi-join queries.
References:
David Vengerov, Andre Cavalheiro Menck, Mohamed Zaït, Sunil Chakkappen: Join Size Estimation Subject to Filter Conditions. PVLDB 8(12): 1530-1541 (2015)
https://imdbpy.sourceforge.io/docs/README.sqldb.txt
https://github.com/gregrahn/join-order-benchmark

--------

An information extraction project. Choose the main conference of your field and extract from the web the complete database of all program committee members, from the first year the conference was held until the present. For example, we have such an extraction for the VLDB conference, in the assembla repository below. In this project you will construct a similar database for the organization of your preferred conference. The interesting part of this project is the lessons you learn while automating the process: build tools that would let you repeat this collection for another conference, or even for (say) all ACM-sponsored conferences.
svn co https://subversion.assembla.com/svn/vldb-bot/

--------

A systems project, consisting of an experimental evaluation. While distributed data processing holds the potential for improved scalability, the observation of the COST paper below is that several big data systems get this wrong: they introduce additional overhead to handle the distribution, then show that by increasing the number of servers the performance of the system improves. This is a shocking finding. In this project you will verify that McSherry's observations are correct. Start by reproducing the findings from the 2017 blog post (see the references below), then extend your experiments to other systems/datasets. A sketch of the kind of single-threaded baseline the COST methodology compares against follows below.
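For calibration: the COST paper's single-threaded baselines are short Rust programs; the sketch below is our Python rendering of one of them (label-propagation connected components over an edge list), so the file format and names here are ours, not McSherry's code. The measurement of interest is how many cores a distributed system needs before it outperforms a loop like this (that number is the system's "COST").

import sys

def connected_components(edge_file):
    # Read a whitespace-separated edge list: one "src dst" pair per line.
    edges = []
    with open(edge_file) as f:
        for line in f:
            a, b = line.split()
            edges.append((int(a), int(b)))
    # Label propagation to a fixpoint: each node's label converges to the
    # smallest node id in its component, i.e. one label per component.
    label = {}
    for a, b in edges:
        label.setdefault(a, a)
        label.setdefault(b, b)
    changed = True
    while changed:
        changed = False
        for a, b in edges:
            m = min(label[a], label[b])
            if label[a] != m:
                label[a] = m
                changed = True
            if label[b] != m:
                label[b] = m
                changed = True
    return label

if __name__ == "__main__":
    label = connected_components(sys.argv[1])
    print(len(set(label.values())), "components")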
References:
Frank McSherry, Michael Isard, Derek Gordon Murray: Scalability! But at what COST? HotOS 2015
https://github.com/frankmcsherry/blog/blob/master/posts/2017-09-23.md

--------

A database / programming language project. Implement incomplete databases and the "choice" operator by compiling SQL to Rosette https://emina.github.io/rosette/ or to Alloy http://alloy.mit.edu/alloy/tutorials/online/
The goal is to extend SQL to support the following (from the MayBMS paper below):
-- Choice values; think of them as alternative choices for an attribute. For example:

Product
---------------------------------
| Name  | Color                 |
---------------------------------
| Toy   | red or blue           |
| Car   | red or green          |
| Truck | blue                  |
| Train | red or blue or yellow |
---------------------------------

The meaning is that the color of Toy can be either 'red' or 'blue', each choice leading to a different set of worlds. In total there are 2*2*1*3 = 12 possible worlds, i.e. 12 possible database instances; hence this is an "incomplete database".
Extend the 'insert' statement to:

insert into Product values ('phone', 'blue' or 'green')

-- Choice operator: extend SQL with a choice operator:

select Product.name, choice(Review.score)
from Product, Review
where Product.name = Review.name
group by Product.name

Here choice(....) nondeterministically selects one score for each product, among that product's reviews.
-- Implement functionality over incomplete databases: compute the "certain answers"; compute "conditional certain answers"; etc. (A brute-force illustration of possible worlds and certain answers follows after the references below.)
References:
Jiewen Huang, Lyublena Antova, Christoph Koch, Dan Olteanu: MayBMS: a probabilistic database management system. SIGMOD Conference 2009: 1071-1074
Todd J. Green, Val Tannen: Models for Incomplete and Probabilistic Information. IEEE Data Eng. Bull. 29(1): 17-24 (2006)
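To make "possible worlds" and "certain answers" concrete, here is a brute-force Python sketch (plain enumeration, not the Rosette/Alloy compilation the project asks for; all names are illustrative, not MayBMS syntax). It enumerates the 12 worlds of the Product table above and computes the certain answers of the query "names of blue products":

from itertools import product as cartesian_product

# Each tuple carries a set of alternative values for Color.
PRODUCT = [
    ("Toy",   ["red", "blue"]),
    ("Car",   ["red", "green"]),
    ("Truck", ["blue"]),
    ("Train", ["red", "blue", "yellow"]),
]

def worlds(table):
    # One complete database instance per combination of choices:
    # here 2 * 2 * 1 * 3 = 12 worlds, each mapping name -> one color.
    names = [name for name, _ in table]
    for colors in cartesian_product(*[choices for _, choices in table]):
        yield dict(zip(names, colors))

def certain_answers(table, query):
    # A tuple is a certain answer iff the query returns it in EVERY world.
    return set.intersection(*(query(w) for w in worlds(table)))

blue_products = lambda world: {n for n, c in world.items() if c == "blue"}

print(sum(1 for _ in worlds(PRODUCT)))          # 12 possible worlds
print(certain_answers(PRODUCT, blue_products))  # {'Truck'}: blue in every world

The point of the project is precisely that this enumeration explodes exponentially; compiling to a constraint solver such as Rosette or Alloy lets the solver search the worlds symbolically instead.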