Project ideas for 2018: This list contains a few suggestions; more may be added in the first 2-3 weeks of the quarter. You can choose a project from the list below, come up with your own, or do something in between.

---------

An ML/DB project.

The paper referenced below uses a deep neural network to compress an index; see also the subsequent blog post and the discussion therein. In this project you will extend this idea from one dimension to multiple dimensions; in other words, compress multi-dimensional data. A multidimensional database is just a table where most attribute combinations are present and there is one measured variable. For example:

Sales(productID, storeID, dayOfWeek, quantity)

The table encodes a function f(productID, storeID, dayOfWeek) = quantity, and does so by storing all N^3 combinations of productID/storeID/dayOfWeek. Use ML to "learn" a simpler representation of the function, then combine it with a local search to find the exact value (see the referenced paper). Experiment with some simple datasets.

http://learningsys.org/nips17/assets/papers/paper_22.pdf
http://databasearchitects.blogspot.de/2017/12/the-case-for-b-tree-index-structures.html?m=1
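To make the starting point concrete, here is a minimal sketch of the compression idea on a synthetic Sales table. It assumes scikit-learn and replaces the paper's local search with a simpler exception list; the data, model size, and all names are illustrative assumptions, not taken from the referenced paper.

import numpy as np
from sklearn.neural_network import MLPRegressor

N = 20  # domain size per dimension, so the table has N^3 = 8000 rows
keys = np.array([(p, s, d) for p in range(N)
                           for s in range(N)
                           for d in range(N)])
# A synthetic, learnable measure standing in for real sales data.
quantity = (keys[:, 0] + 2 * keys[:, 1] + 5 * np.sin(keys[:, 2])).round()

# "Compress" the table into a small model...
model = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000, random_state=0)
model.fit(keys, quantity)

# ...plus an exception list holding the exact value wherever the rounded
# prediction is wrong (the referenced paper uses a local search instead).
pred = np.rint(model.predict(keys))
exceptions = {tuple(k): q for k, q, p in zip(keys, quantity, pred) if p != q}
print("exact values stored:", len(exceptions), "of", len(keys))

def lookup(p, s, d):
    # Exact lookup: model prediction, corrected by the exception list.
    key = (p, s, d)
    if key in exceptions:
        return exceptions[key]
    return float(np.rint(model.predict([[p, s, d]])[0]))

print("lookup(3, 7, 2) =", lookup(3, 7, 2),
      "  stored value =", quantity[3 * N * N + 7 * N + 2])

The interesting experimental question is the trade-off: how does the size of the model plus the exception list compare to the original N^3 table as the measure becomes less learnable?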
---------

A theoretical project.

Prove lower bounds for Boolean queries under the Strong Exponential Time Hypothesis (SETH). Consider a Boolean conjunctive query Q. The best known algorithm for the problem "given D, is Q(D) true?" runs in time \tilde{O}(N^{subw(Q)}), where N is the size of the largest relation and subw(Q) is the submodular width of Q. Prove that this is also a lower bound. It suffices to restrict the proof to a small set of queries Q. Suggestions for queries to try (some of these may already be solved in the literature):

-- triangle: Q() = R(x,y) and S(y,z) and T(z,x). subw(Q) = fhtw(Q) = 3/2
-- 4-cycle: Q() = R(x,y) and S(y,z) and T(z,u) and K(u,x). subw(Q) = 3/2, fhtw(Q) = 2
-- 4-clique: Q() = R(x,y), S(x,z), T(x,u), K(y,z), L(y,u), M(z,u). subw(Q) = 6

References:
Daniel Lokshtanov, Dániel Marx, Saket Saurabh: Lower bounds based on the Exponential Time Hypothesis. Bulletin of the EATCS 105: 41-72 (2011)
Mahmoud Abo Khamis, Hung Q. Ngo, Dan Suciu: What Do Shannon-type Inequalities, Submodular Width, and Disjunctive Datalog Have to Do with One Another? PODS 2017: 429-444

---------

A systems implementation (and evaluation) project.

Implement and empirically evaluate the performance of B^\varepsilon-trees. These are write-optimized trees, claimed to be better than LSM-trees or B+-trees. In this project you will validate their claims experimentally. You may either use an existing implementation of B^\varepsilon-trees (if one exists) and perform an extensive performance evaluation, or implement your own, in which case it is OK if your performance evaluation is more limited.

Reference: https://www.usenix.org/publications/login/oct15/bender

---------

A database project, best fit for someone interested in statistics.

Variant 1 (clear goals and deliverables): Perform cardinality estimation using "correlated sampling". Extend the correlated sampling method from the reference below to handle multi-join queries. Experiment with it on (1) the IMDB dataset and queries (referenced below), and (2) some synthetic multi-join queries of your choosing.

Variant 2 (a bit more theoretical and higher risk): Generalize the AMS sketch to multi-join queries.

David Vengerov, Andre Cavalheiro Menck, Mohamed Zaït, Sunil Chakkappen: Join Size Estimation Subject to Filter Conditions. PVLDB 8(12): 1530-1541 (2015)
https://imdbpy.sourceforge.io/docs/README.sqldb.txt
https://github.com/gregrahn/join-order-benchmark
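For orientation, here is a minimal sketch of correlated sampling for a single two-way join, which Variant 1 would extend to multi-join queries; the synthetic relations and all parameters are illustrative assumptions.

import random
import hashlib
from collections import Counter

def h(key):
    # Hash a join key to a pseudo-random number in [0, 1).
    digest = hashlib.sha256(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def correlated_sample(tuples, key_of, p):
    # Keep a tuple iff its join key hashes below p. Using the SAME hash on
    # both relations is what makes the samples correlated: matching keys
    # survive on both sides or on neither.
    return [t for t in tuples if h(key_of(t)) < p]

random.seed(1)
R = [(random.randrange(100), i) for i in range(10_000)]  # R(a, x)
S = [(random.randrange(100), i) for i in range(20_000)]  # S(a, y)

p = 0.1
Rs = correlated_sample(R, lambda t: t[0], p)
Ss = correlated_sample(S, lambda t: t[0], p)

# Each joining pair survives with probability p (not p^2), so scale by 1/p.
counts = Counter(t[0] for t in Ss)
estimate = sum(counts[t[0]] for t in Rs) / p

true_counts = Counter(t[0] for t in S)
true_size = sum(true_counts[t[0]] for t in R)
print("estimate:", round(estimate), " true join size:", true_size)

The multi-join extension is where it gets interesting: with a chain of joins the survival events on different keys interact, and the estimator and its variance have to be reworked.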
---------

A data analysis project; open ended.

Integrate the dblp data with citation data from ArnetMiner. Do some interesting analysis of the citation graph.

References:
http://dblp.uni-trier.de/xml/
https://aminer.org/data

---------

An information extraction project.

Choose the main conference of your field and extract from the web the complete database of all program committee members, from the first year the conference was held until the present. For example, we have such an extraction for the VLDB conference, in the Assembla repository below. In this project you will construct a similar database of the organization of your preferred conference. The interesting part of this project is the lesson you learn in automating the process: build tools good enough that you could repeat the collection for another conference, or even for (say) all ACM-sponsored conferences.

svn co https://subversion.assembla.com/svn/vldb-bot/

---------

A systems project, consisting of experimental evaluation.

While distributed data processing holds the potential for improved scalability, the observation in the COST paper below is that several big data systems get this wrong: they introduce additional overhead to handle the distribution, then show that performance improves as the number of servers increases; meanwhile a competent single-threaded implementation can outperform the entire cluster. This is a shocking finding. In this project, you will verify that McSherry's observations are correct. Start by reproducing the findings from the 2017 blog post (below), then extend your experiments to other systems/datasets.

Frank McSherry, Michael Isard, Derek Gordon Murray: Scalability! But at what COST? HotOS 2015
https://github.com/frankmcsherry/blog/blob/master/posts/2017-09-23.md

---------

A theoretical project.

This project concerns the optimal communication cost of computing a conjunctive query on a large database over p servers. For the join of two relations R \Join S, the optimal communication (load) per server is:

L = O( |Input| / p + \sqrt{|Output| / p} )

Generalize this to more complex conjunctive queries. Two hints:

- Most theoretical algorithms studied so far are hash-based. Try using a sort-based algorithm instead. The reference below describes a simple sort-based algorithm for the join; can it be generalized?
- What is the optimal load in general? This is an open question, but my guess is that the formula is:

L = O( |Input| / p^{1/\tau^*} + ( |Output| / p )^{1/\rho^*} )

where \rho^* and \tau^* are the fractional edge covering number and the fractional edge packing number of the query. In this project you will start from simple queries, study algorithms/lower bounds, then try to extend to more general queries.

Reference: Paraschos Koutris, Semih Salihoglu, Dan Suciu: Algorithmic Aspects of Parallel Data Processing (in press -- ask me for a copy)

---------

A database / programming language project.

Implement incomplete databases and the "choice" operator, by compiling SQL to Rosette (https://emina.github.io/rosette/) or to Alloy (http://alloy.mit.edu/alloy/tutorials/online/). The goal is to extend SQL to support the following:

-- Choice values; think of them as alternative choices for an attribute. For example:

Product
----------------------------------
| Name  | Color                  |
----------------------------------
| Toy   | red or blue            |
| Car   | red or green           |
| Truck | blue                   |
| Train | red or blue or yellow  |
----------------------------------

The meaning is that the color of Toy can be either 'red' or 'blue', each choice leading to a different world. In total there are 2*2*1*3 = 12 possible worlds, i.e., 12 possible database instances; hence this is an "incomplete database". Extend the 'insert' statement accordingly:

insert into Product values ('phone', 'blue' or 'green')

-- Choice operator: extend SQL with a choice operator:

select Product.name, choice(Review.score)
from Product, Review
where Product.name = Review.name
group by Product.name

Here choice(...) nondeterministically selects one of the review scores for each product.

-- Implement functionality over incomplete databases: compute the "certain answers", compute "conditional certain answers", etc.

References:
Jiewen Huang, Lyublena Antova, Christoph Koch, Dan Olteanu: MayBMS: a probabilistic database management system. SIGMOD Conference 2009: 1071-1074
Todd J. Green, Val Tannen: Models for Incomplete and Probabilistic Information. IEEE Data Eng. Bull. 29(1): 17-24 (2006)

---------

An exploratory project, suitable for someone with some knowledge of deep learning.

Design a query language for DNNs (deep neural networks). Consider, for example, a classifier that decides whether to approve a mortgage based on (zip, profession, ..., lots of attributes).

Query 1: John was denied. Would John have been approved had his race been different?
Query 2: Is there any person for whom changing the race would change the outcome?
Query 3: Check whether for every zipcode there exists a person for whom changing the race would change the decision.

The goal of the project is to design a query language in which such questions can be formulated, and to implement it. The project is totally open ended, and quite high risk, since I won't be able to help on the ML side.
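To make the queries concrete, here is a minimal sketch of what Queries 1 and 2 might compile to, assuming scikit-learn and a synthetic mortgage dataset; the classifier, the features, and the binary race encoding are all illustrative assumptions.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Hypothetical features: [income, zipcode, race], race coded 0/1 for brevity.
X = rng.random((1000, 3))
X[:, 2] = rng.integers(0, 2, size=1000)
y = (X[:, 0] + 0.2 * X[:, 2] > 0.6).astype(int)  # a deliberately biased rule

clf = LogisticRegression().fit(X, y)

def decision_flips(model, person, attr):
    # Query 1: does flipping one attribute change the model's decision?
    flipped = person.copy()
    flipped[attr] = 1 - flipped[attr]
    return model.predict([person])[0] != model.predict([flipped])[0]

john = np.array([0.55, 0.3, 0.0])  # denied under the rule above
print("Query 1 (John):", decision_flips(clf, john, attr=2))

# Query 2 wraps an existential quantifier over people around the same test;
# Query 3 would add a universal quantifier over zipcodes on top of that.
print("Query 2:", any(decision_flips(clf, x, attr=2) for x in X))

The language-design question is how to express the counterfactual ("had his race been different") and the quantifier structure declaratively, and how to evaluate the quantifiers efficiently against the network rather than by brute-force enumeration as above.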