Notes on the Reading Assignments

Please do not repost or otherwise distribute the materials available from this website. Some material is available freely on the web, other is behind a paywall, other is private and we only have permission to use the material in class, not distribute.

Due date: Wed. March 31. Submit here.
A Case Against SQL

Explain briefly three peculiar behaviors of SQL

Due date: Wed. April 7. Submit here.
What goes around

Read sections 1-5 and 10. The other sections are not recommended and we will not discuss them in class.

Some suggested topics for discussion in your review:
- What is physical and logical data independence?
- Briefly discuss physical and logical data independence in IMS, Codasyl, and the relational model.
- Speculate what led to the decline of IMS / Codasyl and rise of the relational model.

Due date: Wed. April 14. Submit here.
PAX Read sections 1-4; we will discuss them in class. Sec. 5-7 are optional and will not be covered in class.

Column Store Read sections 1 and 2, skim over Sec. 3; we will read Sec 4 later (see below)

Some suggested topics for discussion in your review:
- Describe the main tradeoffs between row-store and column-store.
- Explain the significance of Fig. 1.2 in the column-store paepr.
- What is the difference between PAX and column-store? What are their pros and cons?

Due date: Wed. April 21. Submit here.
Query Compiler Read sections 1,2,3; skim over section 4 (in particular, check out Fig. 6, which should become clear).

Some topics to focus on (may use them in your review)
- Our focus in this paper is on the two query evaluation models: Volcano-style (pull-based) and data-driven (push-based). The most important piece is Fig. 3.
- The Futamura projection is less relevant to us (but a great concept to learn, given the opportunity).
- The Volcano model: is covered in all textbooks, and we cover it in class too.
- The data-driven model: is a new model (introduce in HyPer), which is well described in this paper
- How do these two models work? What is their difference?
- A separate (but important) topic is whether the query engine is interpreted or compiled. How was the original System R implemented? How are most commercial and open-source database systems implemented today? What are the pros and cons?
Optional: Vectorized v.s. Compiled Sections 1 and 2 are a short introduction to vectorized v.s. data driven query processing.

Due date: Wed. April 28. Submit here.
How good are they?

Read the entire paper. We will discuss most of it in class.

Due date: Mon. May 3. Submit here.
Column Store You have already started to read this paper. Now read Sections 4.1, 4.4., 4.5 (skim over the rest of Sec. 4).

In your review:
- Describe at least one technique from section 4

Due date: Wed. May 12. Submit here.
The case for learned indexes Read sections 1 and 2 (motivation and main idea). Only skim over sec. 3. Be aware that the technique described in section 3 and up is an overkill: read the blog:
Why use learning? (blog)

Optional: for newer, and simpler techniques that use regression in index structures, see here.
Due date: Mon. May 17. Submit here.
LSM (watch this short video first)

This is a very informative, but rather dense paper. Read and review sections 1,2,3. I recommend reading the rest of the paper too, but it is optional and we will not discuss it in class.

Optional: watch the conference presentation of this paper here start at 52:00.

Due date: Mon. May 24 Submit here.
MapReduce Read sections 1,2,3.

A major step backwards (blog)

Hive Read sections 1, 2, and skim through section 4 (focus on the optimizations)

Suggestions for discussion topics:
- How do these three papers fit gogether? What is the big story that they are telling?
- What are some advantages/disadvantages of MapReduce compared to parallel databases? How does MapReduce address skew? How does MapReduce address worker failures?
- Why was MapReduce needed in Facebook? Why was it insufficient? What are some limitations of HiveSQL? Why does it only support only INSERT OVERWRITE? What are some of the optimizations in Hive?

Due date: Wed. May 26 Submit here.
Snowflake

Read sections 1,2,3, skim over 4, and read Sec. 6. Suggestions for topics to address in your review:
- What is elasticity, why is it important, and how is it supported in Snowflake?
- How is data storage handled in Snowflake, and why? What would have been the alternatives?
- How are worker failures handled in Snowflake? How does this compare to MapReduce?
- How does snowflake handle semistructure data?