CSEP590D: Reading Assignments
April 7: Dynamic programming
- Paper
link: here
- Review
link: here
- Suggestions:
- Skip the hairy math, i.e. sections 2.1, 2.2, 2.4
- What are the shortcomings of previous approaches that this paper
addresses?
- What assumptions are made by their new algorithm?
- A subtle bug was reported in algorithm EnumerateCmp. What is
that bug?
- Excited about join reordering? Implement their algorithm, and
produce some numbers as in Fig. 3.
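Before reading, it may help to see the dynamic-programming idea in miniature. Below is my own toy DPsize-style sketch under a simple C_out cost model (sum of intermediate result sizes); it is not the paper's csg-cmp-pair enumeration, just the core "best plan for every subset of relations" recurrence:

```python
import math
from itertools import combinations

def dp_join_order(cards, selectivity):
    """cards: {relation: cardinality}; selectivity: {frozenset({a, b}): sel}
    for pairs of relations connected by a join predicate.
    Returns (cost, cardinality, plan) for the cheapest plan under the
    C_out cost model (sum of intermediate result sizes)."""
    rels = list(cards)
    # Base case: a single relation is free to "produce".
    best = {frozenset([r]): (0.0, float(cards[r]), r) for r in rels}

    # Enumerate subsets by increasing size (DPsize-style).
    for size in range(2, len(rels) + 1):
        for subset in map(frozenset, combinations(rels, size)):
            for k in range(1, size):
                for left in map(frozenset, combinations(subset, k)):
                    right = subset - left
                    if left not in best or right not in best:
                        continue
                    # Join only if some predicate connects the two sides.
                    sels = [selectivity[frozenset({a, b})]
                            for a in left for b in right
                            if frozenset({a, b}) in selectivity]
                    if not sels:
                        continue  # skip cross products
                    card = best[left][1] * best[right][1] * math.prod(sels)
                    cost = best[left][0] + best[right][0] + card
                    if subset not in best or cost < best[subset][0]:
                        best[subset] = (cost, card,
                                        (best[left][2], best[right][2]))
    return best[frozenset(rels)]
```

Note how skipping disconnected subsets avoids cross products; the paper's contribution is enumerating exactly the connected subsets efficiently instead of filtering them out as above.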
April 14: Cost Estimation
- Paper
link: here
- Review
link: here
- Suggested discussion points:
- Recap: what are the main assumptions used by cardinality estimators?
- Why are industry benchmarks like TPC-H inadequate for evaluating cardinality estimators?
- What is the difference between a cardinality estimator and a cost estimator? How are they related? What does the paper say about their relative importance?
- Discuss the impact of the search space on query optimization. (This is discussed in Sec. 6)
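As a concrete starting point for the assumptions question, here is a toy textbook estimator (my own illustration, not from the paper) that relies on the uniformity and independence assumptions:

```python
def estimate_cardinality(table_card, ndv, eq_columns):
    """Textbook estimate for a conjunction of equality predicates.
    table_card: number of rows in the table;
    ndv: {column: number of distinct values};
    eq_columns: columns appearing in equality predicates."""
    sel = 1.0
    for col in eq_columns:
        sel *= 1.0 / ndv[col]  # uniformity: each value equally likely
    return table_card * sel    # independence: multiply selectivities
```

For example, `estimate_cardinality(1_000_000, {"make": 100, "model": 1000}, ["make", "model"])` yields 10 rows; if `make` and `model` are in fact correlated (each model determines its make), the true count is closer to 1,000 rows, a 100x underestimate of the kind such estimators produce on correlated real-world data.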
April 21: Compiling and Optimizing Tensor Programs
- Paper
link: here
- Review
link: here
- This website
has many useful resources, including conference talks (I recommend the OOPSLA 2017 talk on TACO)
and a web interface for the compiler.
- Suggested discussion points:
- TACO automatically generates code implementing tensor algebra kernels (what is a kernel?).
Why is it impractical to hand-code a library for all these kernels?
- Suppose T is a 5x5x5 tensor with a single entry of value 1 at coordinate (2, 2, 2).
How does TACO represent T if each dimension is dense?
And what if each dimension is sparse?
- Try to write out (by hand) the pseudo code for multiplying two sparse vectors point-wise,
namely zi = xi*yi.
It should be similar to, but simpler than, the code in Fig. 9.
It looks like a certain join algorithm - which one is it?
- TACO compiles small tensor kernels with a handful of operations.
Can it handle entire tensor programs with thousands of operations?
If not, what's the challenge?
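For the pseudo-code exercise above, here is one possible sketch (my own, assuming each sparse vector is stored as an index-sorted list of (index, value) pairs); the two-pointer scan should give a strong hint about the join question:

```python
def sparse_mul(x, y):
    """Point-wise product z[i] = x[i] * y[i] of two sparse vectors,
    each given as an index-sorted list of (index, value) pairs.
    Only coordinates present in BOTH inputs produce output, and the
    two cursors advance in lockstep over sorted indices."""
    i = j = 0
    z = []
    while i < len(x) and j < len(y):
        xi, xv = x[i]
        yj, yv = y[j]
        if xi == yj:
            z.append((xi, xv * yv))
            i += 1
            j += 1
        elif xi < yj:
            i += 1   # x's coordinate has no match in y
        else:
            j += 1   # y's coordinate has no match in x
    return z
```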
April 28: Amazon Redshift OR Query Compiler
Choose between one of these two papers (you only need to review one paper):
- Paper 1
link: here
- Paper 2 link: here
- Review link (same form, no matter which paper you choose): here
- Suggested discussion points if you choose paper 1 (Redshift).
Skim over the paper (it's very high level), then reflect on:
- What does a Redshift cluster consist of?
- What types of data partitioning does Redshift support? (The guest
lecture is likely to discuss these, but it's easy to imagine what
they do.)
- What was the key metric for the Redshift design team?
- How long (seconds or minutes) did it typically take to launch a
Redshift cluster?
- Suggested discussion points if you choose paper 2 (Query
Compiler). This is a much denser paper, with a lot to learn if you are
patient.
- Our focus in this paper is on the two query evaluation models:
Volcano-style (pull-based) and data-driven (push-based). The most
important piece is Fig. 3.
- The Futamura projection is less relevant to us (but a great concept to
learn, given the opportunity).
- The Volcano iterator model is covered in all textbooks, and we covered it in
class too.
- The data-driven model is a newer model (introduced in HyPer), which is
well described in this paper.
- How do these two models work?
- What is their difference?
- A separate (but important) topic is whether the query engine is
interpreted or compiled. How was the original System R
implemented? How are most commercial and open-source database
systems implemented today? What are the pros and cons?
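To make the two evaluation models concrete before reading Fig. 3, here is a toy sketch (hypothetical code, not any engine's API) of the same pipeline, scan then filter(x > 2) then sum, written both ways:

```python
# Pull-based (Volcano): each operator exposes an iterator; the consumer
# at the top drives execution by pulling one tuple at a time.
def scan_pull(rows):
    yield from rows

def filter_pull(child, pred):
    for row in child:
        if pred(row):
            yield row

def sum_pull(child):
    total = 0
    for row in child:
        total += row
    return total

# Push-based: the data source drives execution and pushes each tuple
# downstream through consume callbacks; control flow is inverted.
def scan_push(rows, consume):
    for row in rows:
        consume(row)

def make_filter_push(pred, consume):
    return lambda row: consume(row) if pred(row) else None

class SumSink:
    def __init__(self):
        self.total = 0

    def consume(self, row):
        self.total += row

rows = [1, 2, 3, 4, 5]

pull_result = sum_pull(filter_pull(scan_pull(rows), lambda x: x > 2))

sink = SumSink()
scan_push(rows, make_filter_push(lambda x: x > 2, sink.consume))
push_result = sink.total
# both compute 3 + 4 + 5 = 12
```

In the pull version each tuple crosses an iterator boundary per operator; in the push version the filter and sink are plain calls along a single downward control path, which is what makes the push model attractive when the pipeline is compiled to machine code.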
May 5: Column Store
- Paper
link: here
- Review
link: here
- Read sections 1, 2, skim over Sec. 3, then read sections 4.1,
4.4, 4.5
- Suggested discussion points:
- What are the differences between column and row oriented data
stores?
- Discuss at least one technique from Section 4.
- What are column stores good for?
- Argue in favor or against: caching compiled code is easier for
column store than row store.
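As a warm-up for the first discussion point, here is a toy illustration (my own, not from the paper) of the two layouts, plus the run-length encoding a column store can apply to a sorted column:

```python
# Row store: each record is stored contiguously.
rows = [
    (1, "us", 10),
    (2, "us", 20),
    (3, "eu", 30),
]

# Column store: one array per attribute.
columns = {
    "id":     [1, 2, 3],
    "region": ["us", "us", "eu"],
    "price":  [10, 20, 30],
}

def rle(values):
    """Run-length encode a column as [(value, run_length), ...]."""
    out = []
    for v in values:
        if out and out[-1][0] == v:
            out[-1] = (v, out[-1][1] + 1)
        else:
            out.append((v, 1))
    return out

# An aggregate over one column touches only that array, not whole rows:
total_price = sum(columns["price"])
compressed_region = rle(columns["region"])
```

The row layout must read every record in full even when a query needs one attribute; the columnar layout reads only the needed array, and homogeneous, often-sorted columns compress far better than mixed-type rows.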
May 12: Photon
- Paper
link: here
- Review
link: here
- Photon is an example of the latest data management architecture
called a Lakehouse. This paper will appear in SIGMOD 2022.
You can skip Section 5.
- Suggested questions to consider in your review (as usual, these
are only suggestions):
- What were the main design decisions that went into
Photon?
- How does Photon differ from DBR (Databricks Runtime)?
- What motivated the team to use a native language (C++)
instead of JVM?
- What are the pros and cons of vectorized query processing on
one hand, and compiled queries on the other hand, according to
the paper?
- Are there any takeaways for you from the experimental
section?
- This requires a bit of research. Snowflake is also offering
a Lakehouse solution. What are the differences (if any)? (I
personally don't know the answer.)
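For the vectorized-vs-compiled question, here is a minimal sketch (hypothetical, not Photon's API) of what "vectorized" means: operators process column batches instead of single tuples, so per-call interpretation overhead is amortized over a whole batch:

```python
BATCH = 4  # batch size; real engines use on the order of thousands

def scan_batches(col):
    """Yield the column in fixed-size batches."""
    for i in range(0, len(col), BATCH):
        yield col[i:i + BATCH]

def filter_batch(batch, pred):
    # One "interpreted" operator invocation handles an entire batch.
    return [v for v in batch if pred(v)]

def total(col, pred):
    t = 0
    for batch in scan_batches(col):
        t += sum(filter_batch(batch, pred))
    return t
```

A compiled engine would instead fuse the filter and sum into one specialized tuple-at-a-time loop for this exact query; the paper discusses the trade-offs between the two approaches.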
May 19: Snowflake
- Paper
link: here
- Review
link: here
- Read sections 1, 2, 3, skim over Sec. 4, and read Sec. 6
- Suggested discussion topics:
- What is elasticity, why is it important, and how is it supported in
Snowflake?
- How is data storage handled in Snowflake, and why? What would have
been the alternatives?
- How are worker failures handled in Snowflake?
- How does Snowflake handle semistructured data?
May 27 (this is one day later than usual): Review a Paper of your Choice
- Please choose any database related paper to review.
- Review
link: here
- Suggestions for what to choose:
- Choose a paper from the "suggested readings" below; or
- Choose a paper that is relevant to your job, or to your
interests. Maybe a paper that you have read before and liked; maybe
a paper that you wanted to read but didn't have time for until now.
- The paper can be a peer-reviewed paper, a technical document,
or even an extensive blog post. The only criterion is that it has
to be technical and related (even vaguely) to databases or data management.
- Please include a link to the paper at the beginning of your
review (so I know what you read).
Other Suggested Readings
- Papers that discuss how to extend dynamic programming to
handle outer joins and semijoins: here
and here.
The second paper may be a bit easier to read.
- An overview paper of Cockroach
Labs: here.
It may be of less interest to most students in class, because it
emphasizes their approach to distributed transactions, which we
don't discuss; instead, we discuss their query optimizer.
- Push vs. pull execution
model: here.
Traditional query engines are interpreted, and that worked
fine because, traditionally, most of the execution cost comes from
the disk I/O, so the interpretation overhead is tiny. These engines
use the Volcano, pull-based iterator model. Today's main memories
are large enough to fit most database instances, and newer
engines compile the query plan into machine code. Here the
pull-based model is suboptimal, and instead they deploy
a push-based model, which is nicely explained in this paper.
- The Cascades
Framework: here.
This is the original paper that described the architecture of the
rule-based query optimization engine; SQL Server, Cockroach Labs and
others have adapted it in one way or another. The paper is high
level, and not an easy read. We will get a better understanding of
the Cascades framework from some of the invited lectures.
- Papers about the HyPer database
engine: here
and here.
Fun to read and informative. The second paper (on unnesting) will
likely become a required reading.
- A random recent paper that I happen to
like: here. It
describes a nice way to extend a database system with user-defined functions.