CSEP 544: Database Management Systems -- Mini-project
Overview
The goal of the mini-project is to allow
you to experiment on your own with some techniques or
systems discussed in class. The workload should be
comparable to one homework assignment.
Organization
This is an individual project, not a group project.
Deliverables
- Project proposal (1/2 page max): May 21st, 2021
- Final report (3 pages max, including graphs): June 8,
2021
Submit by placing the PDF file in the project
directory of your Gitlab repository.
Topic
You will choose the topic and will submit a
short project proposal. Limit the scope, because you have
only a couple of weeks to work on it.
Choosing a topic: method 1
Choose a
topic inspired by one of the papers from the reading
assignments.
For some papers, an open source system is available; for
others, no code is available and you may have to implement
some algorithm in the paper.
You could validate one or more of the paper's
experiments; try a variation; or compare the paper's
technique/system with some other techniques/systems that
you are familiar with.
Some random ideas (feel free to adapt/extend/or totally
ignore and choose something else):
- PAX paper. Does performance really increase when
switching from NSM to PAX? Implement a very simple
selection query (e.g. in C++) over a row-representation
and a column-representation of a table (main memory only).
Vary the block lengths; vary the record/attribute lengths;
vary the size of the relation; plot the runtime, and draw
conclusions.
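As a starting point for the PAX experiment, here is a minimal sketch of the two layouts (in Python rather than the C++ the bullet suggests, with in-memory lists standing in for real pages; the table sizes and the predicate on the first attribute are invented for illustration). Python timings mostly reflect interpreter overhead, so treat this as a correctness scaffold and port the scan loops to C++ before measuring.

```python
import random

random.seed(0)
NUM_ROWS, NUM_ATTRS = 10_000, 8

# Row representation (NSM-style): one tuple per record.
rows = [tuple(random.randint(0, 100) for _ in range(NUM_ATTRS))
        for _ in range(NUM_ROWS)]

# Column representation (PAX/DSM-style): one list per attribute.
cols = [list(col) for col in zip(*rows)]

def select_rows(threshold):
    """SELECT COUNT(*) WHERE attr0 < threshold, scanning whole records."""
    return sum(1 for rec in rows if rec[0] < threshold)

def select_cols(threshold):
    """The same query, touching only the one column it needs."""
    return sum(1 for v in cols[0] if v < threshold)
```

Any interesting difference will come from cache behavior, which only a compiled-language version can expose.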
- C-Store paper: evaluate the performance improvements
of some of the techniques in Sec. 4. Implement a simple
join query (main memory only) in several ways: over
compressed or uncompressed data; standard join / jive join
/ late materialization. Show the runtime as a function of
the data size. Notice that compression may slow down your
query, but can result in better performance if the data
are fetched from disk.
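To make the materialization comparison concrete, here is a tiny Python sketch of early vs. late materialization for a hash join (the four-row columns are invented for illustration; the paper's jive join and compression variants would layer on top of this same skeleton).

```python
# Columns of two tables R(key, payload_r) and S(key, payload_s).
r_key = [1, 2, 3, 4]
r_val = ['a', 'b', 'c', 'd']
s_key = [2, 4, 4]
s_val = ['x', 'y', 'z']

def early_materialized_join():
    """Glue full tuples together first, then hash-join them."""
    ht = {}
    for k, v in zip(r_key, r_val):
        ht.setdefault(k, []).append(v)
    return [(k, rv, sv) for k, sv in zip(s_key, s_val) for rv in ht.get(k, [])]

def late_materialized_join():
    """Join on the key columns alone, carry row ids, fetch payloads last."""
    ht = {}
    for i, k in enumerate(r_key):
        ht.setdefault(k, []).append(i)
    id_pairs = [(ri, si) for si, k in enumerate(s_key) for ri in ht.get(k, [])]
    return [(r_key[ri], r_val[ri], s_val[si]) for ri, si in id_pairs]
```

Late materialization pays off when few join results survive, because the payload columns are touched only for surviving row ids.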
- How-good-are-they paper: how good are the cardinality
estimators in some of the systems you know? Upload the
JOB benchmark, check the cardinality estimates
using EXPLAIN or a similar system-specific
command, and compare them with the true cardinalities. How
does the error depend on the complexity of the query? On
the selection predicates? On whether you constructed
histograms? Does it differ across systems (e.g. PostgreSQL
vs. SQL Server vs. Snowflake vs. Redshift)? Explain
your findings. Related to this: compare the query plans
produced by different systems, especially for queries with
multiple joins, explain your findings. It's easy to get
carried away here: be careful to choose a topic that you
can do in 2 weeks or so.
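For the error metric, the paper uses the q-error, the symmetric ratio between estimate and truth. A minimal sketch (the estimate/truth pairs below are made up; in practice you would read them off EXPLAIN output or the system's query profile):

```python
def q_error(estimated, actual):
    """q-error: max of the two ratios; 1.0 means a perfect estimate."""
    estimated, actual = max(estimated, 1), max(actual, 1)  # guard against zeros
    return max(estimated / actual, actual / estimated)

# Hypothetical (estimated, true) cardinality pairs for three JOB queries.
pairs = [(1200, 1000), (50, 5000), (3, 3)]
errors = [q_error(e, a) for e, a in pairs]
```

Plotting the q-error distribution per number of joins is one way to show how the error grows with query complexity.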
- Learned indexes: do their claims hold? Ignore the
multi-layer DNNs in their original paper, but check
whether a reasonable regression model can outperform a
B+ tree. You may find open-source code for both learned
indexes and B+ trees; feel free to experiment with those.
Notice that there exist main-memory, cache-conscious
versions of B+ trees,
see here,
it would be more interesting to compare learned indexes
with these.
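The core learned-index idea fits in a few lines: fit a model from key to position, record the model's worst-case error during training, and at lookup time binary-search only the error window. A sketch with a single linear model over synthetic uniform keys (a real experiment would compare against a B+ tree on realistic key distributions):

```python
import bisect
import random

random.seed(1)
keys = sorted(random.sample(range(1_000_000), 10_000))  # distinct sorted keys
n = len(keys)

# Fit position ~ a*key + b by ordinary least squares (closed form).
mean_x = sum(keys) / n
mean_y = (n - 1) / 2
cov = sum((k - mean_x) * (i - mean_y) for i, k in enumerate(keys))
var = sum((k - mean_x) ** 2 for k in keys)
a = cov / var
b = mean_y - a * mean_x

# The model's worst-case misprediction bounds the search window.
max_err = max(abs(int(a * k + b) - i) for i, k in enumerate(keys))

def learned_lookup(key):
    """Predict a position, then binary-search only the error window."""
    pos = int(a * key + b)
    lo = max(0, pos - max_err)
    hi = min(n, pos + max_err + 1)
    i = bisect.bisect_left(keys, key, lo, hi)
    return i if i < n and keys[i] == key else -1
```

On uniform keys the window is tiny, which is exactly the regime where the paper's claims are easiest to reproduce; skewed key distributions are where the comparison gets interesting.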
- LSM trees: here I have several suggestions. The first
is to find an open-source LSM package and experiment with
it: compare it with a B+-tree implementation, and try
adjusting various parameters, then report the performance
(for both reads and writes) as a function of the size of
the data. The second is to compare LSM trees with
B-epsilon trees,
see here
(again: look for an open source implementation). Finally,
one option is to dig deeper into the Monkey paper and
replicate some of their experiments (it's OK to use their
code, if available).
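If you want to understand the read/write trade-off before touching a real package, the essence of an LSM tree fits in a toy class: writes land in an in-memory memtable that is periodically flushed to sorted immutable runs, and reads probe the memtable and then the runs from newest to oldest. A sketch (the class name and tiny memtable limit are invented; real systems add a write-ahead log, Bloom filters, and compaction):

```python
import bisect

class TinyLSM:
    """Toy LSM tree: a memtable flushed to sorted immutable runs."""
    def __init__(self, memtable_limit=4):
        self.memtable = {}
        self.runs = []  # newest first; each run is a sorted list of (key, value)
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            # Flush: freeze the memtable as a new sorted run.
            self.runs.insert(0, sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in self.runs:  # newest to oldest, so overwrites win
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None
```

Note how a read may touch every run: that read amplification is precisely what Bloom filters and the Monkey paper's memory-allocation scheme attack.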
- Snowflake. There are several cloud-based, distributed
database systems: Snowflake, Redshift, Azure Cosmos DB,
BigQuery. An interesting mini-project could be to compare
their features: this would require lots of reading of the
manuals, and very little implementation: a succinct yet
well-informed comparison could be very useful.
Alternatively, you could dig deeper and compare the
performance of any two cloud databases (no more than two
because you will not have time), for example on TPC-H or
on the JOB benchmark. Finally, you could pick a single
cloud database, and dig deeper into their implementation
of distributed joins. Does it support broadcast join?
Hash-partition join? When does it choose one join method
over the other? How well does it cope with skew in the
data? Does it push aggregates past the joins (it should,
but you never know)?
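The two distributed join strategies mentioned above can be simulated in a few lines, which may help you design probe queries for whichever system you pick (worker counts and tables are invented; Python's built-in hash stands in for the system's partitioning function):

```python
def hash_partition(table, key_idx, num_workers):
    """Assign each row to a worker by hashing its join key."""
    parts = [[] for _ in range(num_workers)]
    for row in table:
        parts[hash(row[key_idx]) % num_workers].append(row)
    return parts

def broadcast_join(big_parts, small_table):
    """Ship the whole small table to every worker holding big-table rows."""
    ht = {}
    for row in small_table:
        ht.setdefault(row[0], []).append(row)
    return [row + s for part in big_parts
            for row in part for s in ht.get(row[0], [])]

def partitioned_join(r_table, s_table, num_workers):
    """Hash-partition both inputs on the key; each worker joins its bucket."""
    out = []
    for rp, sp in zip(hash_partition(r_table, 0, num_workers),
                      hash_partition(s_table, 0, num_workers)):
        ht = {}
        for row in sp:
            ht.setdefault(row[0], []).append(row)
        out.extend(row + s for row in rp for s in ht.get(row[0], []))
    return out
```

Broadcast is cheap when one input is small and is immune to key skew; hash partitioning scales to two large inputs but a heavy-hitter key overloads one worker, which is the skew behavior the bullet asks you to probe.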
Choosing a topic: method 2
Choose any other
exploration project that is of interest to you. For
example, if you are exploring a new database-related
technology at your job, your mini-project could consist of
evaluating and benchmarking that technology.
Project Proposal: Details
1/2 page, in PDF format. Please include the following:
- Your name (please use the name you used in Canvas)
- A title of your project
- Short description of what you want to do (could be as
short as 2-3 sentences)
- What system are you planning to use? And do you have
access to it? E.g. you are planning to use Redshift and
already have AWS credits; or are planning to use Redshift
but don't know yet whether you can get credits; or will
use an open-source implementation of LSM tree and have
already found one (or haven't found one yet); or plan to
implement such and such in Java/C++/Rust/whatever.
- What data are you planning to use? Same here: tell us
whether you have access to the data, whether you have
looked at it, and whether you have checked that you can
get it and use it for your mini-project.
- What are you planning to report? E.g. you plan to
show two graphs, the first is the runtime as a function of
the data size, the second is a graph showing the runtime
as a function of the record size. If you are less sure,
you can be more vague, but do make an effort to think
about what you would like to report.
- References: cite any papers that you are planning to use in your mini-project.
You may change the plan in your proposal: we do not
require the final report to follow the proposal strictly.
The goal of the proposal is to force you to start thinking
about the project, and it's OK if you change your mind
later.
Final Report: Details
3 pages, in PDF format. Suggested structure (it may not
apply to all mini-projects):
- Your name (please use the name you used in Canvas)
- A title of your project
- Describe the problem.
- Describe the system(s).
- Describe the data.
- Results: 2-3 graphs.
- Discussion of the results: this is the most
interesting part of your mini-project.
- References: cite any papers that you used in your mini-project.