Submit paper reviews here.

Date Description

Data Models

Sept 30

Lecture 1: Introduction and the Relational Model

Reading: None

Additional resources:

  • For each lecture, additional resources refer to extra readings that provide more information about the material we cover in class.
  • In the case of book chapters from the R&G book, the chapters listed provide a less detailed and more accessible overview than the papers. It is a good idea to take a look at these book chapters if the papers become confusing. You should still strive to get a depth of understanding at the level of the papers.
  • Chapters 1 and 3 through 5 in R&G book,
  • E.F. Codd. A relational model of data for large shared data banks. Communications of the ACM, 1970 [pdf]
Oct 6

Lecture 2: SQL and Schema Normalization

Reading: None

Additional resources:

  • Chapter 4, "Relational Algebra and Calculus," and Chapter 5, "SQL DML" in R&G.
  • Chapter 19, "Schema Refinement and Normal Forms" in R&G.
  • Chapter 2, "Introduction to Database Design," and Chapter 3.5, "Logical Database Design: ER to Relational" in R&G
Oct 8

Lecture 3: Data Models: A Never-ending Story

Reading: Stonebraker and Hellerstein, "What Goes Around Comes Around." In "Readings in Database Systems" 4th ed. [pdf]. Read only sections 1-4 and 8-11.
Please focus on the following questions in your review:

  • What is physical and logical data independence?
  • Briefly discuss physical and logical data independence in IMS, Codasyl, and the relational model.
  • Speculate what led to the decline of IMS / Codasyl and rise of the relational model.

Submit your paper review here (please use plain text or pdf).

Additional resources:

  • Carey and DeWitt, "Of Objects and Databases: A Decade of Turmoil," VLDB 1996 [pdf] (Very interesting read for those who are interested in object databases).
  • Copeland and Maier, "Making Smalltalk a Database System," SIGMOD 1984 [pdf] (Origin of the term "impedance mismatch").
  • Cattell, "Scalable SQL and NoSQL Data Stores", SIGMOD Record, December 2010 (original publication date).
Oct 13

Lecture 4: Datalog

Reading: Hellerstein, "The Declarative Imperative," SIGMOD Record 2010 [pdf].
Please focus on the following questions in your review:

  • What is the urgency of parallelism?
  • How does the datalog approach to parallelism differ from other approaches?
  • If you were to build networking switches or distributed systems, would you use Datalog-like languages as described in the paper? Why or why not?

Submit your paper review here (please use plain text or pdf).

Additional resources:

  • online tutorials on Datalog (in case you are not familiar with the syntax in order to understand the paper -- we will go through it in class).
  • Chapter 24 in R&G.

Query Execution

Oct 15

Lecture 5: DBMS Architecture and Indexing

Reading: Hellerstein and Stonebraker, "The Anatomy of a Database System." In "Readings in Database Systems" 4th ed. [pdf]. Read only sections 1-4 and skim the rest.
Please focus on the following question in your review:

  • What part(s) of the architecture described in the paper do you think is still relevant? What do you think have changed since the paper was written?

Submit your paper review here (please use plain text or pdf).

Additional resources:

  • Astrahan et al, "System R: Relational Approach to Database Management," ACM TODS 1(2), 1976. Pages 97-137. Read only the following: Section 1, Section 2 up to page 110, Section 3 up to page 122. Also in the Red Book (3rd ed). [pdf].
  • Chamberlin et al, "A History and Evaluation of System R," Communications of the ACM. 24(10): 632 - 646. 1981. Also in Red Book (3rd ed). [pdf].
  • Hellerstein et al, "Generalized Search Trees for Database Systems," VLDB 95, Pages 562-573. Also in Red Book (4th ed). Focus on Sections 1 through 4.2 [pdf].
Oct 20

Lecture 6: Lifecycle of a Query Plan

Reading: Sec. 4 from Hellerstein and Stonebraker, "The Anatomy of a Database System." In "Readings in Database Systems" 4th ed. [pdf], or Sec. 4 from "Architecture of a Database System" (slightly more detailed version of the Red book article) [pdf].

You don't need to turn in a review for this lecture.

Additional resources:

  • Shapiro, "Join processing in database systems with large main memories." ACM Transactions on Database Systems 11(3), 1986. Sections 1 and 2 only [pdf].
  • Chapters 12-14 in R&G.
  • Graefe, "Query Evaluation Techniques for Large Databases." ACM Computing Surveys 25(2), 1993 [pdf].
Oct 22

Lecture 7: Query Optimization

Reading: Selinger et al, "Access Path Selection in a Relational Database Management System." Proceedings of ACM SIGMOD, 1979. Pages 22-34 [pdf].
Please focus on the following questions in your review:

  • What are the main contributions and limitations of this paper?
  • Query optimization is highly dependent on the effectiveness of cost estimation. How does the paper propose to compute the cost of a single relation access path? How about the cost of a complete query plan? What statistics are used? What happens when these statistics are not available for one relation? What are the benefits and limitations of this approach?
  • In addition to computing the cost of a query plan, a query optimizer also needs (1) to define the space of possible plans that it will search and (2) it needs an algorithm to enumerate possible query plans within that space. What query plans does the paper consider? What algorithm does the paper propose to find the best plan in that space? What are the benefits and limitations of this approach?

Submit your paper review here (please use plain text or pdf).

Additional resources:

  • Chaudhuri, "An Overview of Query Optimization in Relational Systems," Proceedings of ACM PODS, 1998 [pdf].

Data Analytics

Oct 27

Lecture 8: Data Warehouses and Column Stores

Reading: Abadi et al, "The Design and Implementation of Modern Column-Oriented Database Systems," Foundations and Trends® in Databases (Vol 5, Issue 3, 2012, pp 197-280) Sections 1, 2, 4 (read 4.1, 4.4., 4.5, skim over the others and skim Section 3) [pdf].
Please focus on the following questions in your review:

  • What are the differences between column and row oriented data stores?
  • Discuss at least one technique from Section 4.
  • What are column stores good for? (Hint: see background reading below)

Submit your paper review here (please use plain text or pdf).

Background on data analytics:

  • Chapter 25 in R&G.
  • Gray et al. "Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals," Data Mining and Knowledge Discovery 1, p. 29-53 (1997) [pdf]
Oct 29

Lecture 9: Parallel Databases

Reading: DeWitt and Gray, "Parallel Database Systems: The Future of High Performance Database Systems," Communications of the ACM. 1992. Sections 1 and 2 only [pdf].
Please focus on the following questions in your review:

  • What are parallel databases good for?
  • How do parallel databases compare to column stores? Are they completely orthogonal?

Submit your paper review here (please use plain text or pdf).

Additional resources:

  • Chapter 22 in R&G.
  • DeWitt et al, "The Gamma Database Machine Project," IEEE Transactions on Knowledge and Data Engineering, Volume 2 Issue 1, March 1990, p. 44-62 [pdf].
Nov 3

Lecture 10: Map Reduce and Spark

Reading: DeWitt and Stonebraker, "MapReduce: A major step backwards," The Database Column, January 2008 (make sure you skim through the comments in addition to the article) [online article].

Reading: Zaharia et al, "Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing," Proceedings of NSDI 2012 Sections 2, 3, and 4 only [pdf].
Please focus on the following questions in your review:

  • Do you agree with the arguments made in the DeWitt and Stonebraker paper? Why or why not?
  • Is Spark just another Map Reduce in terms of being a programming model for data analytics? How do they differ?

Submit your paper review here (please use plain text or pdf).

Additional resources:

Transaction Processing and Recovery

Nov 5

Lecture 11: Transactions: Concurrency Control

Reading: Franklin, "Concurrency Control and Recovery," from The Handbook of Computer Science and Engineering, A. Tucker, ed., CRC Press, Boca Raton, 1997. [pdf].
We will go through this paper in two lectures. For this lecture, please read Sections 1, 2.1, and 3.1 (and please read the remaining sections for next lecture). While reading these sections, focus on the following questions in your review:

  • What is a transaction and what are the ACID properties?
  • What is serializability? What is a serializable schedule?
  • How does two-phase locking (2PL) work?
  • What is the "phantom problem"?
  • What are some benefits and drawbacks of providing the notion of a transaction in a DBMS?

Submit your paper review here (please use plain text or pdf).

Additional resources:

  • Chapter 16, 17 in R&G.
Nov 10

Lecture 12: Transactions: Concurrency Control (Part 2)

Reading: Franklin, "Concurrency Control and Recovery," from The Handbook of Computer Science and Engineering, A. Tucker, ed., CRC Press, Boca Raton, 1997. [pdf].
We will continue from the last lecture. You do not need to submit a paper review.

Nov 12

Lecture 13: Transactions: Recovery

Reading: Franklin, "Concurrency Control and Recovery," from The Handbook of Computer Science and Engineering, A. Tucker, ed., CRC Press, Boca Raton, 1997. [pdf].

For this lecture, please read Sections 2.2, and 3.2 of the article above. Since the details of failure recovery can be quite confusing, please submit any questions that you might have as paper review for this class.

Submit your paper review here (please use plain text or pdf).

Additional resources:

  • Chapter 18 in R&G.
Nov 19

Lecture 14: Distributed Transactions

Reading: Mohan et al, "Transaction Management in the R* Distributed Database Management System," ACM Transactions On Database Systems 11 (4), 1986. [pdf].

While this paper is a bit dated, the algorithm described is still one of the most widely used algorithms for distributed transactions. For your review, please focus on the following questions:

  • Why is a complex two-phase commit protocol needed when commiting a distributed transaction?
  • In the two-phase commit protocol, why do subordinates need to force-write a prepared log record before sending a YES VOTE? To answer this question, use an example failure scenario. Discuss what happens if a subordinate does NOT force-write the prepared log record, then discuss what happens if the subordinate does force-write the prepared log record.
  • Two-phase commit is a blocking protocol. When is it possible for nodes to be blocked?

Submit your paper review here (please use plain text or pdf).

Additional resources:

  • Chapter 22 in R&G.
Nov 24

Lecture 15: Replication and In-memory Transactions

Reading: Stonebraker et al, "The end of an Architectural Era: (It’s Time for a Complete Rewrite)." Proceedings of VLDB, 2007 [pdf].

We will finish our discussion on database replication from last time, and move on to discuss main memory databases.

You don't need to turn in a review for this lecture. But consider the arguments that the authors made in that paper and whether they are valid or not.

Advanced Topics

Dec 1

Lecture 16: Data Streams

Reading: Abadi et al, "Aurora: A New Model and Architecture for Data Stream Management." VLDB Journal, 12(2), 2003. Please read up to Section 5.2 [pdf].

Please focus on the following questions in your review:

  • What are the motivations for data streams?
  • How does SQuAl differ from standard relational algebra? How are they similar?

Submit your paper review here (please use plain text or pdf).

Additional resources:

  • Chandrasekaran et al, "TelegraphCQ: Continuous Dataflow Processing for an Uncertain World." CIDR 2003 [pdf].
  • The STREAM Group, "STREAM: The Stanford Stream Data Manager." IEEE Data Engineering Bulletin, March 2003 [pdf].
  • Meehan et al, "S-Store: Streaming Meets Transaction Processing." PVLDB 8(13), 2015 [pdf].
Dec 3

Lecture 17: Dealing with the Real World

Reading: Narasayya et al, "SQLVM: Performance Isolation in Multi-Tenant Relational Database-as-a-Service." CIDR 2013 [pdf].

Please focus on the following questions in your review:

  • Why is it important to be able to predict performance of each of the clients? Why is it a hard problem?
  • Describe one of the techniques mentioned in the paper for resource accounting.

Submit your paper review here (please use plain text or pdf).

Additional resources:

  • Elmore et al, "Characterizing tenant behavior for placement and crisis mitigation in multitenant DBMSs." SIGMOD 2013 [pdf].
Dec 8

Lecture 18: NoSQL

Reading: Cattell, "Scalable SQL and NoSQL Data Stores." SIGMOD Record, December 2010 [pdf].

Please focus on the following questions in your review:

  • Name one aspect that NoSQL systems differ from traditional relational stores.
  • Is NoSQL just another data model as we discussed earlier in the quarter? Why or why not?

Submit your paper review here (please use plain text or pdf).