CSEP 544: Database Management Systems -- Mini-project
Overview
The goal of the mini-project is to allow
you to experiment on your own with some techniques or
systems discussed in class. The workload should be
comparable to one homework assignment.
Organization
This is an individual project, not a group project.
Deliverables
- Project proposal (1/2 page max): May 21st, 2021
- Final report (3 pages max, including graphs): June 8,
2021
Submit by placing the PDF file in the project
directory of your Gitlab repository.
Topic
You will choose the topic and will submit a
short project proposal. Limit the scope, because you have
only a couple of weeks to work on it.
Choosing a topic: method 1
Choose a
topic inspired by one of the papers from the reading
assignments.
For some papers, an open source system is available; for
others, no code is available and you may have to implement
some algorithm in the paper.
You could validate one or more of the paper's
experiments; try a variation; or compare the paper's
technique/system with some other techniques/systems that
you are familiar with.
Some random ideas (feel free to adapt/extend/or totally
ignore and choose something else):
- PAX paper. Does performance really increase when
switching from NSM to PAX? Implement a very simple
selection query (e.g. in C++) over a row-representation
and a column-representation of a table (main memory only).
Vary the block lengths; vary the record/attribute lengths;
vary the size of the relation; plot the runtime, and draw
conclusions.
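As a starting point for the PAX experiment, here is a minimal sketch of the two layouts (in Python rather than the C++ the bullet suggests, with in-memory lists standing in for real pages; the table sizes and the predicate on the first attribute are invented for illustration). Python timings mostly reflect interpreter overhead, so treat this as a correctness scaffold and port the scan loops to C++ before measuring.

```python
import random

random.seed(0)
NUM_ROWS, NUM_ATTRS = 10_000, 8

# Row representation (NSM-style): one tuple per record.
rows = [tuple(random.randint(0, 100) for _ in range(NUM_ATTRS))
        for _ in range(NUM_ROWS)]

# Column representation (PAX/DSM-style): one list per attribute.
cols = [list(col) for col in zip(*rows)]

def select_rows(threshold):
    """SELECT COUNT(*) WHERE attr0 < threshold, scanning whole records."""
    return sum(1 for rec in rows if rec[0] < threshold)

def select_cols(threshold):
    """The same query, touching only the one column it needs."""
    return sum(1 for v in cols[0] if v < threshold)
```

Any interesting difference will come from cache behavior, which only a compiled-language version can expose.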
- C-Store paper: evaluate the performance improvements
of some of the techniques in Sec. 4. Implement a simple
join query (main memory only) in several ways: over
compressed or uncompressed data; standard join / jive join
/ late materialization. Show the runtime as a function of
the data size. Notice that compression may slow down your
query, but can result in better performance if the data
are fetched from disk.
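To make the materialization comparison concrete, here is a tiny Python sketch of early vs. late materialization for a hash join (the four-row columns are invented for illustration; the paper's jive join and compression variants would layer on top of this same skeleton).

```python
# Columns of two tables R(key, payload_r) and S(key, payload_s).
r_key = [1, 2, 3, 4]
r_val = ['a', 'b', 'c', 'd']
s_key = [2, 4, 4]
s_val = ['x', 'y', 'z']

def early_materialized_join():
    """Glue full tuples together first, then hash-join them."""
    ht = {}
    for k, v in zip(r_key, r_val):
        ht.setdefault(k, []).append(v)
    return [(k, rv, sv) for k, sv in zip(s_key, s_val) for rv in ht.get(k, [])]

def late_materialized_join():
    """Join on the key columns alone, carry row ids, fetch payloads last."""
    ht = {}
    for i, k in enumerate(r_key):
        ht.setdefault(k, []).append(i)
    id_pairs = [(ri, si) for si, k in enumerate(s_key) for ri in ht.get(k, [])]
    return [(r_key[ri], r_val[ri], s_val[si]) for ri, si in id_pairs]
```

Late materialization pays off when few join results survive, because the payload columns are touched only for surviving row ids.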
- How-good-are-they paper: how good are the cardinality
estimators in some of the systems you know? Upload the
JOB benchmark, check the cardinality estimates
using EXPLAIN or a similar system-specific
command, and compare them with the true cardinalities. How
does the error depend on the complexity of the query? On
the selection predicates? On whether you constructed
histograms? Does it differ across systems (e.g. PostgreSQL
vs. SQL Server vs. Snowflake vs. Redshift)? Explain
your findings. Related to this: compare the query plans
produced by different systems, especially for queries with
multiple joins, explain your findings. It's easy to get
carried away here: be careful to choose a topic that you
can do in 2 weeks or so.
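For the error metric, the paper uses the q-error, the symmetric ratio between estimate and truth. A minimal sketch (the estimate/truth pairs below are made up; in practice you would read them off EXPLAIN output or the system's query profile):

```python
def q_error(estimated, actual):
    """q-error: max of the two ratios; 1.0 means a perfect estimate."""
    estimated, actual = max(estimated, 1), max(actual, 1)  # guard against zeros
    return max(estimated / actual, actual / estimated)

# Hypothetical (estimated, true) cardinality pairs for three JOB queries.
pairs = [(1200, 1000), (50, 5000), (3, 3)]
errors = [q_error(e, a) for e, a in pairs]
```

Plotting the q-error distribution per number of joins is one way to show how the error grows with query complexity.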
- Learned indexes: do their claims hold? Ignore the
multi-layer DNNs in their original paper, but check
whether a reasonable regression model can outperform a
B+ tree. You may find open-source code for both learned
indexes and B+ trees; feel free to experiment with those.
Notice that there exist main-memory, cache-conscious
versions of B+ trees,
see here,
it would be more interesting to compare learned indexes
with these.
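The core learned-index idea fits in a few lines: fit a model from key to position, record the model's worst-case error during training, and at lookup time binary-search only the error window. A sketch with a single linear model over synthetic uniform keys (a real experiment would compare against a B+ tree on realistic key distributions):

```python
import bisect
import random

random.seed(1)
keys = sorted(random.sample(range(1_000_000), 10_000))  # distinct sorted keys
n = len(keys)

# Fit position ~ a*key + b by ordinary least squares (closed form).
mean_x = sum(keys) / n
mean_y = (n - 1) / 2
cov = sum((k - mean_x) * (i - mean_y) for i, k in enumerate(keys))
var = sum((k - mean_x) ** 2 for k in keys)
a = cov / var
b = mean_y - a * mean_x

# The model's worst-case misprediction bounds the search window.
max_err = max(abs(int(a * k + b) - i) for i, k in enumerate(keys))

def learned_lookup(key):
    """Predict a position, then binary-search only the error window."""
    pos = int(a * key + b)
    lo = max(0, pos - max_err)
    hi = min(n, pos + max_err + 1)
    i = bisect.bisect_left(keys, key, lo, hi)
    return i if i < n and keys[i] == key else -1
```

On uniform keys the window is tiny, which is exactly the regime where the paper's claims are easiest to reproduce; skewed key distributions are where the comparison gets interesting.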
- LSM trees: here I have several suggestions. The first
is to find an open-source LSM package and experiment with
it: compare it with a B+-tree implementation, and try
adjusting various parameters, then report the performance
(for both reads and writes) as a function of the size of
the data. The second is to compare LSM trees with
B-epsilon trees,
see here
(again: look for an open source implementation). Finally,
one option is to dig deeper into the Monkey paper and
replicate some of their experiments (it's OK to use their
code, if available).
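If you want to understand the read/write trade-off before touching a real package, the essence of an LSM tree fits in a toy class: writes land in an in-memory memtable that is periodically flushed to sorted immutable runs, and reads probe the memtable and then the runs from newest to oldest. A sketch (the class name and tiny memtable limit are invented; real systems add a write-ahead log, Bloom filters, and compaction):

```python
import bisect

class TinyLSM:
    """Toy LSM tree: a memtable flushed to sorted immutable runs."""
    def __init__(self, memtable_limit=4):
        self.memtable = {}
        self.runs = []  # newest first; each run is a sorted list of (key, value)
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            # Flush: freeze the memtable as a new sorted run.
            self.runs.insert(0, sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in self.runs:  # newest to oldest, so overwrites win
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None
```

Note how a read may touch every run: that read amplification is precisely what Bloom filters and the Monkey paper's memory-allocation scheme attack.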
- Snowflake. There are several cloud-based, distributed
database systems: Snowflake, Redshift, Azure Cosmos DB,
BigQuery. An interesting mini-project could be to compare
their features: this would require lots of reading of the
manuals, and very little implementation: a succinct yet
well-informed comparison could be very useful.
Alternatively, you could dig deeper and compare the
performance of any two cloud databases (no more than two
because you will not have time), for example on TPC-H or
on the JOB benchmark. Finally, you could pick a single
cloud database, and dig deeper into their implementation
of distributed joins. Does it support broadcast join?
Hash-partition join? When does it choose one join method
over the other? How well does it cope with skew in the
data? Does it push aggregates past the joins (it should,
but you never know)?
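The two distributed join strategies mentioned above can be simulated in a few lines, which may help you design probe queries for whichever system you pick (worker counts and tables are invented; Python's built-in hash stands in for the system's partitioning function):

```python
def hash_partition(table, key_idx, num_workers):
    """Assign each row to a worker by hashing its join key."""
    parts = [[] for _ in range(num_workers)]
    for row in table:
        parts[hash(row[key_idx]) % num_workers].append(row)
    return parts

def broadcast_join(big_parts, small_table):
    """Ship the whole small table to every worker holding big-table rows."""
    ht = {}
    for row in small_table:
        ht.setdefault(row[0], []).append(row)
    return [row + s for part in big_parts
            for row in part for s in ht.get(row[0], [])]

def partitioned_join(r_table, s_table, num_workers):
    """Hash-partition both inputs on the key; each worker joins its bucket."""
    out = []
    for rp, sp in zip(hash_partition(r_table, 0, num_workers),
                      hash_partition(s_table, 0, num_workers)):
        ht = {}
        for row in sp:
            ht.setdefault(row[0], []).append(row)
        out.extend(row + s for row in rp for s in ht.get(row[0], []))
    return out
```

Broadcast is cheap when one input is small and is immune to key skew; hash partitioning scales to two large inputs but a heavy-hitter key overloads one worker, which is the skew behavior the bullet asks you to probe.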
Choosing a topic: method 2
Choose any other
exploration project that is of interest to you. For
example, if you are exploring a new database-related
technology at your job, your mini-project could consist of
evaluating and benchmarking that technology.
Project Proposal: Details
1/2 page, in PDF format. Please include the following:
- Your name (please use the name you used in Canvas)
- A title of your project
- Short description of what you want to do (could be as
short as 2-3 sentences)
- What system are you planning to use? And do you have
access to it? E.g. you are planning to use Redshift and
already have AWS credits; or are planning to use Redshift
but don't know yet whether you can get credits; or will
use an open-source implementation of LSM tree and have
already found one (or haven't found one yet); or plan to
implement such and such in Java/C++/Rust/whatever.
- What data are you planning to use? Same here: tell us
whether you have access to the data, whether you have
looked at it, and whether you have checked that you can
get it and use it for your mini-project.
- What are you planning to report? E.g. you plan to
show two graphs, the first is the runtime as a function of
the data size, the second is a graph showing the runtime
as a function of the record size. If you are less sure,
you can be more vague, but do make an effort to think
about what you would like to report.
- References: cite any papers that you are planning to use in your mini-project.
You may change the plan in your proposal: we do not
require the final report to follow the proposal strictly.
The goal of the proposal is to force you to start thinking
about the project, and it's OK if you change your mind
later.
Final Report: Details
3 pages, in PDF format. Suggested structure (it may not
apply to all mini-projects):
- Your name (please use the name you used in Canvas)
- A title of your project
- Describe the problem.
- Describe the system(s).
- Describe the data.
- Results: 2-3 graphs.
- Discussion of the results: this is the most
interesting part of your mini-project.
- References: cite any papers that you used in your mini-project.