CSEP590D: Reading Assignments
April 7: Dynamic programming
- Paper
link: here
- Review
link: here
- Suggestions:
- Skip the hairy math, i.e. sections 2.1, 2.2, 2.4
- What are the shortcomings of previous approaches that this paper
addresses?
- What assumptions are made by their new algorithm?
- A subtle bug was reported in algorithm EnumerateCmp. What is
that bug?
- Excited about join reordering? Implement their algorithm, and
produce some numbers as in Fig. 3.
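Before reading, it may help to see the dynamic-programming idea in miniature. Below is my own toy DPsize-style sketch under a simple C_out cost model (sum of intermediate result sizes); it is not the paper's csg-cmp-pair enumeration, just the core "best plan for every subset of relations" recurrence:

```python
import math
from itertools import combinations

def dp_join_order(cards, selectivity):
    """cards: {relation: cardinality}; selectivity: {frozenset({a, b}): sel}
    for pairs of relations connected by a join predicate.
    Returns (cost, cardinality, plan) for the cheapest plan under the
    C_out cost model (sum of intermediate result sizes)."""
    rels = list(cards)
    # Base case: a single relation is free to "produce".
    best = {frozenset([r]): (0.0, float(cards[r]), r) for r in rels}

    # Enumerate subsets by increasing size (DPsize-style).
    for size in range(2, len(rels) + 1):
        for subset in map(frozenset, combinations(rels, size)):
            for k in range(1, size):
                for left in map(frozenset, combinations(subset, k)):
                    right = subset - left
                    if left not in best or right not in best:
                        continue
                    # Join only if some predicate connects the two sides.
                    sels = [selectivity[frozenset({a, b})]
                            for a in left for b in right
                            if frozenset({a, b}) in selectivity]
                    if not sels:
                        continue  # skip cross products
                    card = best[left][1] * best[right][1] * math.prod(sels)
                    cost = best[left][0] + best[right][0] + card
                    if subset not in best or cost < best[subset][0]:
                        best[subset] = (cost, card,
                                        (best[left][2], best[right][2]))
    return best[frozenset(rels)]
```

Note how skipping disconnected subsets avoids cross products; the paper's contribution is enumerating exactly the connected subsets efficiently instead of filtering them out as above.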
April 14: Cost Estimation
- Paper
link: here
- Review
link: here
- Suggested discussion points:
- Recap: what are the main assumptions used by cardinality estimators?
- Why are industry benchmarks like TPC-H inadequate for evaluating cardinality estimators?
- What is the difference between a cardinality estimator and a cost estimator? How are they related? What does the paper say about their relative importance?
- Discuss the impact of the search space on query optimization. (This is discussed in Sec. 6)
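As a concrete starting point for the assumptions question, here is a toy textbook estimator (my own illustration, not from the paper) that relies on the uniformity and independence assumptions:

```python
def estimate_cardinality(table_card, ndv, eq_columns):
    """Textbook estimate for a conjunction of equality predicates.
    table_card: number of rows in the table;
    ndv: {column: number of distinct values};
    eq_columns: columns appearing in equality predicates."""
    sel = 1.0
    for col in eq_columns:
        sel *= 1.0 / ndv[col]  # uniformity: each value equally likely
    return table_card * sel    # independence: multiply selectivities
```

For example, `estimate_cardinality(1_000_000, {"make": 100, "model": 1000}, ["make", "model"])` yields 10 rows; if `make` and `model` are in fact correlated (each model determines its make), the true count is closer to 1,000 rows, a 100x underestimate of the kind such estimators produce on correlated real-world data.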
April 21: Compiling and Optimizing Tensor Programs
- Paper
link: here
- Review
link: here
- This website
has many useful resources, including conference talks (I recommend the OOPSLA 2017 talk on TACO)
and a web interface for the compiler.
- Suggested discussion points:
- TACO automatically generates code implementing tensor algebra kernels (what is a kernel?).
Why is it impractical to hand-code a library for all these kernels?
- Suppose T is a 5x5x5 tensor with a single entry of value 1 at coordinate (2, 2, 2).
How does TACO represent T if each dimension is dense?
And what if each dimension is sparse?
- Try to write out (by hand) the pseudo code for multiplying two sparse vectors point-wise,
namely zi = xi*yi.
It should be similar to, but simpler than, the code in Fig. 9.
It looks like a certain join algorithm - which one is it?
- TACO compiles small tensor kernels with a handful of operations.
Can it handle entire tensor programs with thousands of operations?
If not, what's the challenge?
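For the pseudo-code exercise above, here is one possible sketch (my own, assuming each sparse vector is stored as an index-sorted list of (index, value) pairs); the two-pointer scan should give a strong hint about the join question:

```python
def sparse_mul(x, y):
    """Point-wise product z[i] = x[i] * y[i] of two sparse vectors,
    each given as an index-sorted list of (index, value) pairs.
    Only coordinates present in BOTH inputs produce output, and the
    two cursors advance in lockstep over sorted indices."""
    i = j = 0
    z = []
    while i < len(x) and j < len(y):
        xi, xv = x[i]
        yj, yv = y[j]
        if xi == yj:
            z.append((xi, xv * yv))
            i += 1
            j += 1
        elif xi < yj:
            i += 1   # x's coordinate has no match in y
        else:
            j += 1   # y's coordinate has no match in x
    return z
```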
April 28: Amazon Redshift OR Query Compiler
Choose between one of these two papers (you only need to review one paper):
- Paper 1
link: here
- Paper 2 link: here
- Review link (same form, no matter which paper you choose): here
- Suggested discussion points if you choose paper 1 (Redshift).
Skim over the paper (it's very high level), then reflect on:
- What does a Redshift cluster consist of?
- What types of data partitioning does Redshift support? (The guest
lecture is likely to discuss these, but it's easy to imagine what
they do.)
- What was the key metric for the Redshift design team?
- How long (seconds or minutes) did it typically take to launch a
Redshift cluster?
- Suggested discussion points if you choose paper 2 (Query
Compiler). This is a much denser paper, with a lot to learn if you are
patient.
- Our focus in this paper is on the two query evaluation models:
Volcano-style (pull-based) and data-driven (push-based). The most
important piece is Fig. 3.
- The Futamura projection is less relevant to us (but a great concept to
learn, given the opportunity).
- The Volcano iterator model is covered in all textbooks, and we covered it in
class too.
- The data-driven model is a newer model (introduced in HyPer), which is
well described in this paper.
- How do these two models work?
- What is their difference?
- A separate (but important) topic is whether the query engine is
interpreted or compiled. How was the original System R
implemented? How are most commercial and open-source database
systems implemented today? What are the pros and cons?
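To make the two evaluation models concrete before reading Fig. 3, here is a toy sketch (hypothetical code, not any engine's API) of the same pipeline, scan then filter(x > 2) then sum, written both ways:

```python
# Pull-based (Volcano): each operator exposes an iterator; the consumer
# at the top drives execution by pulling one tuple at a time.
def scan_pull(rows):
    yield from rows

def filter_pull(child, pred):
    for row in child:
        if pred(row):
            yield row

def sum_pull(child):
    total = 0
    for row in child:
        total += row
    return total

# Push-based: the data source drives execution and pushes each tuple
# downstream through consume callbacks; control flow is inverted.
def scan_push(rows, consume):
    for row in rows:
        consume(row)

def make_filter_push(pred, consume):
    return lambda row: consume(row) if pred(row) else None

class SumSink:
    def __init__(self):
        self.total = 0

    def consume(self, row):
        self.total += row

rows = [1, 2, 3, 4, 5]

pull_result = sum_pull(filter_pull(scan_pull(rows), lambda x: x > 2))

sink = SumSink()
scan_push(rows, make_filter_push(lambda x: x > 2, sink.consume))
push_result = sink.total
# both compute 3 + 4 + 5 = 12
```

In the pull version each tuple crosses an iterator boundary per operator; in the push version the filter and sink are plain calls along a single downward control path, which is what makes the push model attractive when the pipeline is compiled to machine code.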
May 5: Column Store
- Paper
link: here
- Review
link: here
- Read sections 1, 2, skim over Sec. 3, then read sections 4.1,
4.4, 4.5
- Suggested discussion points:
- What are the differences between column and row oriented data
stores?
- Discuss at least one technique from Section 4.
- What are column stores good for?
- Argue in favor or against: caching compiled code is easier for
column store than row store.
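As a warm-up for the first discussion point, here is a toy illustration (my own, not from the paper) of the two layouts, plus the run-length encoding a column store can apply to a sorted column:

```python
# Row store: each record is stored contiguously.
rows = [
    (1, "us", 10),
    (2, "us", 20),
    (3, "eu", 30),
]

# Column store: one array per attribute.
columns = {
    "id":     [1, 2, 3],
    "region": ["us", "us", "eu"],
    "price":  [10, 20, 30],
}

def rle(values):
    """Run-length encode a column as [(value, run_length), ...]."""
    out = []
    for v in values:
        if out and out[-1][0] == v:
            out[-1] = (v, out[-1][1] + 1)
        else:
            out.append((v, 1))
    return out

# An aggregate over one column touches only that array, not whole rows:
total_price = sum(columns["price"])
compressed_region = rle(columns["region"])
```

The row layout must read every record in full even when a query needs one attribute; the columnar layout reads only the needed array, and homogeneous, often-sorted columns compress far better than mixed-type rows.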
May 12: Photon
- Paper
link: here
- Review
link: here
- Photon is an example of the latest data management architecture
called a Lakehouse. This paper will appear in SIGMOD 2022.
You can skip Section 5.
- Suggested questions to consider in your review (as usual, these
are only suggestions):
- What were the main design decisions that went into
Photon?
- How does Photon differ from DBR (Databricks Runtime)?
- What motivated the team to use a native language (C++)
instead of JVM?
- What are the pros and cons of vectorized query processing on
one hand, and compiled queries on the other hand, according to
the paper?
- Are there any takeaways for you from the experimental
section?
- This requires a bit of research. Snowflake is also offering
a Lakehouse solution. What are the differences (if any)? (I
personally don't know the answer.)
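For the vectorized-vs-compiled question, here is a minimal sketch (hypothetical, not Photon's API) of what "vectorized" means: operators process column batches instead of single tuples, so per-call interpretation overhead is amortized over a whole batch:

```python
BATCH = 4  # batch size; real engines use on the order of thousands

def scan_batches(col):
    """Yield the column in fixed-size batches."""
    for i in range(0, len(col), BATCH):
        yield col[i:i + BATCH]

def filter_batch(batch, pred):
    # One "interpreted" operator invocation handles an entire batch.
    return [v for v in batch if pred(v)]

def total(col, pred):
    t = 0
    for batch in scan_batches(col):
        t += sum(filter_batch(batch, pred))
    return t
```

A compiled engine would instead fuse the filter and sum into one specialized tuple-at-a-time loop for this exact query; the paper discusses the trade-offs between the two approaches.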
May 19: Snowflake
- Paper
link: here
- Review
link: here
- Read sections 1, 2, 3, skim over Sec. 4, and read Sec. 6
- Suggested discussion topics:
- What is elasticity, why is it important, and how is it supported in
Snowflake?
- How is data storage handled in Snowflake, and why? What would have
been the alternatives?
- How are worker failures handled in Snowflake?
- How does Snowflake handle semistructured data?
May 27 (this is one day later than usual): Review a Paper of your Choice
- Please choose any database related paper to review.
- Review
link: here
- Suggestions for what to choose:
- Choose a paper from the "suggested readings" below; or
- Choose a paper that is relevant to your job, or to your
interests. Maybe a paper that you have read before and liked; maybe
a paper that you wanted to read but didn't have time for until now.
- The paper can be a peer-reviewed paper, a technical document,
or even an extensive blog post. The only criterion is that it has
to be technical and related (even vaguely) to databases or data management.
- Please include a link to the paper at the beginning of your
review (so I know what you read).
Other Suggested Readings
- Papers that discuss how to extend dynamic programming to
handle outer joins and semijoins: here
and here.
The second paper may be a bit easier to read.
- An overview paper of Cockroach
Labs: here.
It may be of less interest to most students in class, because it
emphasizes their approach to distributed transactions, which we
don't discuss; instead, we discuss their query optimizer.
- Push vs. pull execution
model: here.
Traditional query engines are interpreted, and that worked
fine because, traditionally, most of the execution cost comes from
the disk I/O, so the interpretation overhead is tiny. These engines
use the Volcano, pull-based iterator model. Today's main memories
are large enough to fit most database instances, and newer
engines compile the query plan into machine code. Here the
pull-based model is suboptimal, and instead they deploy
a push-based model, which is nicely explained in this paper.
- The Cascades
Framework: here.
This is the original paper that described the architecture of the
rule-based query optimization engine; SQL Server, Cockroach Labs and
others have adapted it in one way or another. The paper is high
level, and not an easy read. We will get a better understanding of
the Cascades framework from some of the invited lectures.
- Papers about the HyPer database
engine: here
and here.
Fun to read and informative. The second paper (on unnesting) will
likely become a required reading.
- A random recent paper that I happen to
like: here. It
describes a nice way to extend a database system with user-defined functions.