Please do not repost or otherwise distribute the materials
available from this website. Some material is available freely on
the web, other is behind a paywall, other is private and we only
have permission to use the material in class, not distribute.
-
Due date: Wed. March 31. Submit here.
A Case Against SQL
Explain briefly three peculiar behaviors of SQL
- Due date: Wed. April 7. Submit here.
What goes around
Read sections 1-5 and 10. The other sections are not
recommended and we will not discuss them in class.
Some suggested topics for discussion in your review:
- What is physical and logical data independence?
- Briefly discuss physical and logical data independence in IMS, Codasyl, and the relational model.
- Speculate what led to the decline of IMS / Codasyl and rise of the relational model.
- Due date: Wed. April 14. Submit here.
PAX Read sections 1-4; we
will discuss them in class. Sec. 5-7 are optional and will not
be covered in class.
Column Store Read sections 1 and 2, skim over Sec. 3; we will read Sec 4 later (see below)
Some suggested topics for discussion in your review:
- Describe the main tradeoffs between row-store and column-store.
- Explain the significance of Fig. 1.2 in the column-store paepr.
- What is the difference between PAX and column-store? What
are their pros and cons?
- Due date: Wed. April 21. Submit here.
Query Compiler Read
sections 1,2,3; skim over section 4 (in particular, check out
Fig. 6, which should become clear).
Some topics to focus on (may use them in your review)
- Our focus in this paper is on the two query evaluation
models: Volcano-style (pull-based) and data-driven
(push-based). The most important piece is Fig. 3.
- The Futamura projection is less relevant to us (but a great
concept to learn, given the opportunity).
- The Volcano model: is covered in all textbooks, and we
cover it in class too.
- The data-driven model: is a new model (introduce in HyPer),
which is well described in this paper
- How do these two models work? What is their difference?
- A separate (but important) topic is whether the query engine
is interpreted or compiled. How was the
original System R implemented? How are most commercial and
open-source database systems implemented today? What are the
pros and cons?
Optional: Vectorized v.s. Compiled Sections 1 and 2 are a short introduction to
vectorized v.s. data driven query processing.
- Due date: Wed. April 28. Submit here.
How good are they?
Read the entire paper. We will discuss most of it in class.
- Due date: Mon. May 3. Submit here.
Column Store You have already started to read this paper. Now read Sections 4.1,
4.4., 4.5 (skim over the rest of Sec. 4).
In your review:
- Describe at least one technique from section 4
- Due date: Wed. May 12. Submit here.
The case for learned indexes Read sections 1 and 2 (motivation and main idea). Only skim
over sec. 3. Be aware that the technique described in section
3 and up is an overkill: read the blog:
Why use learning? (blog)
Optional: for newer, and simpler techniques that use regression
in index structures,
see here.
- Due date: Mon. May 17. Submit here.
LSM
(watch this short video first)
This is a very informative, but rather dense paper. Read and
review sections 1,2,3. I recommend reading the rest of the paper
too, but it is optional and we will not discuss it in class.
Optional: watch the conference presentation of this
paper here
start at 52:00.
- Due date: Mon. May 24 Submit here.
MapReduce Read sections 1,2,3.
A major step backwards (blog)
Hive Read sections 1, 2, and skim through section 4 (focus on the optimizations)
Suggestions for discussion topics:
- How do these three papers fit gogether? What is the big story that they are telling?
- What are some advantages/disadvantages of MapReduce compared to parallel databases? How does MapReduce address skew? How does MapReduce address worker failures?
- Why was MapReduce needed in Facebook? Why was it insufficient? What are some limitations of HiveSQL? Why does it only support only INSERT OVERWRITE? What are some of the optimizations in Hive?
- Due date: Wed. May 26 Submit here.
Snowflake
Read sections 1,2,3, skim over 4, and read Sec. 6.
Suggestions for topics to address in your review:
- What is elasticity, why is it important, and how is it supported in Snowflake?
- How is data storage handled in Snowflake, and why? What would have been the alternatives?
- How are worker failures handled in Snowflake? How does this compare to MapReduce?
- How does snowflake handle semistructure data?