CSE 544: Lecture Notes and Reading Assignments

The drop box for paper reviews is https://catalyst.uw.edu/collectit/dropbox/mbalazin/33981

Lecture	Lecture Notes and Reading Assignments
Lecture 1	Title: Introduction and Review of Relational Model Readings: No readings assigned for the first lecture. No review to submit. Lecture notes: lecture1.pdf (the lecture notes will be available after each lecture). Additional resources: For each lecture, additional resources refer to extra readings that provide more information about the material we cover in class. In the case of book chapters from the R&G book, the chapters listed provide a less detailed and more accessible overview than the papers. It is a good idea to take a look at these book chapters if the papers become confusing. You should still strive to get a depth of understanding at the level of the papers. Chapters 1 and 3 through 5 in the R&G book. Optional, additional readings (we will not discuss them in class): Unlike "additional resources" that will directly help you understand the material we cover in class, "optional, additional readings" are only there if you are interested in a topic and would like to read more about it. E.F. Codd. A relational model of data for large shared data banks. Communications of the ACM, 1970 (also in the Red Book 3rd Ed.) [pdf]
Lecture 2	Title: SQL and Schema Normalization Readings: No reading for this class. No paper review. Lecture notes: lecture2.pdf Additional resources: Chapters 4 and 5 in the R&G book. Chapters 2 and 19.1 through 19.7 in the R&G book.
Lecture 3	Title: Data Models Old and New Readings: *Old data models: Michael Stonebraker and Joseph Hellerstein. What Goes Around Comes Around. In "Readings in Database Systems" (aka the Red Book). 4th ed. [pdf] Read sections 1-4 and 8-11 (in other words skip 5-7). New data models*: No reading. We will go over them briefly here and will come back to NoSQL/NewSQL systems at the end of the quarter. While reading the paper, focus on the following questions: What is physical and logical data independence? Briefly discuss physical and logical data independence in IMS, Codasyl, and the relational model. Reflect on the data models that followed the relational model. What is/was the goal of each model? What are the benefits and challenges of each model? Lecture notes: lecture3.pdf Optional, additional readings: M.J. Carey, D.J DeWitt. Of Objects and Databases: A Decade of Turmoil. VLDB 1996. [pdf] Scalable SQL and NoSQL Data Stores by Rick Cattell, SIGMOD Record, December 2010 (Author keeps revising this survey).
Lecture 4	Title: DBMS Architecture and Data Storage Readings: Paper to review: Joseph Hellerstein and Michael Stonebraker. The Anatomy of a Database System. In Red Book (4th ed). Focus on Sections 1-4 and skim the rest [pdf]. The Hellerstein paper should be straightforward to read, so we are not posting any specific reading questions. Discuss a few interesting points form the paper in your review. Lecture notes: lecture4.pdf Additional resources: Chapter 41. Overview of PostgreSQL Internals. Optional, additional readings: Joseph Hellerstein, Michael Stonebraker, and James Hamilton. Architecture of a Database System. Foundations and Trends® in Databases. Volume 1. Issue 2. Michael Stonebraker. Operating System Support for Database Management. Communications of the ACM. Vol 24. Number 7. July 1981. Also in Red Book (3rd ed and 4th ed). [pdf] M.M. Astrahan et al. System R: Relational Approach to Database Management. ACM TODS 1(2), 1976. Pages 97-137. Read only the following: Section 1, Section 2 up to page 110, Section 3 up to page 122. Also in the Red Book (3rd ed). [pdf] A History and Evaluation of System R. Donald D. Chamberlin, Morton A. Astrahan, Michael W. Blasgen, James N. Gray, W. Frank King, Bruce G. Lindsay, Raymond Lorie, James W. Mehl, Thomas G. Price, Franco Putzolu, Patricia Griffiths Selinger, Mario Schkolnick, Donald R. Slutz, Irving L. Traiger, Bradford W. Wade and Robert A. Yost. Communications of the ACM. 24(10): 632 - 646. 1981. Also in Red Book (3rd ed). [pdf]
Lecture 5	Title: Indexing Readings: You do NOT need to submit a review for this lecture. Just skim the following paper before coming: Joseph M. Hellerstein, Jeffrey F. Naughton and Avi Pfeffer. Generalized Search Trees for Database Systems. Proc. 21st Int'l Conf. on Very Large Data Bases, Zürich, September 1995, 562-573. Also in Red Book (3rd ed and 4th ed). Focus on Sections 1 through 4.2. [pdf] As you read the paper, consider the following questions: What are the main contributions of this paper? What are the key insights behind GiST? To extend the GiST, a user must implement 6 methods. What is the role of each method? When is each method used? What are the limitations of this paper? Lecture notes: lecture5.pdf Additional resources: Generally, chapters 8 through 11 (in the R&G book, third ed.) are all relevant to this lecture. Of particular interest: chapter 8 (pages 273-282) and chapter 10 (pages 344-358).
Lecture 6	Title: Query Execution and Operator Algorithms Readings: Leonard Shapiro. Join processing in database systems with large main memories. ACM Transactions on Database Systems 11(3), 1986. Sections 1 and 2 only [pdf] As you read this paper, consider the following questions: The paper presents four algorithms for computing the join of two relations when neither relation can fit entirely in main memory. What key techniques do these algorithms use? How do they differ? What assumptions underly the different algorithms and what can be done if these assumptions do not hold? Lecture notes: lecture6.pdf Additional resources: Chapters 12, 13, and 14(in R&G, third edition). Section 4.4 of Joseph Hellerstein and Michael Stonebraker. The Anatomy of a Database System. (see Lecture 4). Optional, additional readings: G. Graefe. Query Evaluation Techniques for Large Databases. ACM Computing Surveys 25(2), 1993. [pdf]
Lecture 7	Title: Query Optimization Readings: P. Selinger, M. Astrahan, D. Chamberlin, R. Lorie, and T. Price. Access Path Selection in a Relational Database Management System. Proceedings of ACM SIGMOD, 1979. Pages 22-34. Also in the Red Book (3rd ed and 4th ed). [pdf] While reading this paper, try to focus on the following questions What are the main contributions and limitations of this paper? Query optimization is highly dependent on the effectiveness of cost estimation. How does the paper propose to compute the cost of a single relation access path? How about the cost of a complete query plan? What statistics are used? What happens when these statistics are not available for one relation? What are the benefits and limitations of this approach? In addition to computing the cost of a query plan, a query optimizer also needs (1) to define the space of possible plans that it will search and (2) it needs an algorithm to enumerate possible query plans within that space. What query plans does the paper consider? What algorithm does the paper propose to find the best plan in that space? What are the benefits and limitations of this approach? Lecture notes: lecture7.pdf Additional resources: Chapter 15 (in R&G, third edition). Optional, additional reading: Surajit Chaudhuri. An Overview of Query Optimization in Relational Systems. PODS 1998. [pdf] The Volcano Optimizer Generator: Extensibility and Efficient Search. Goetz Graefe and William J. McKenna. ICDE 1993. [pdf]
Lecture 8	Title: Transactions: concurrency control Readings: You do NOT need to submit a review for this lecture. Just skim the following paper before coming: Michael J. Franklin. Concurrency Control and Recovery. The Handbook of Computer Science and Engineering, A. Tucker, ed., CRC Press, Boca Raton, 1997. [pdf] This paper is a very accessible overview of transactions. We will go over this paper in lectures 8 and 9. In lecture 8, we will focus on concurrency control. For this lecture, please read Sections 1, 2.1, and 3.1 (and please read the remaining sections for next lecture). While reading these sections, focus on the following questions What is a transaction and what are the ACID properties? What is serializability? What is a serializable schedule? How does two-phase locking (2PL) work? What is the "phantom problem" What are the four levels of isolation and what are the differences between them? What are some benefits and drawbacks of providing the notion of a transaction in a DBMS? Lecture notes: lecture8.pdf Additional resources: Chapters 16 and 17 (in R&G, third edition). Optional, additional reading:
Lecture 9	Title: Transactions: recovery Readings: You do NOT need to submit a review for this lecture. Just skim the following paper before coming: (Same reading as for last lecture) Michael J. Franklin. Concurrency Control and Recovery. The Handbook of Computer Science and Engineering, A. Tucker, ed., CRC Press, Boca Raton, 1997. [pdf] For this lecture, please read Sections 2.2 and 3.2. The details of failure recovery can get quite confusing. Every time you get confused, write down your question and bring it to class. We will walk through the principles of failure recovery and the details of the ARIES method. Lecture notes: lecture9.pdf Additional resources: Chapter 18 (in R&G, third edition).
Lecture 10	Title: Data Warehousing with Column Stores Readings: The Design and Implementation of Modern Column-Oriented Database Systems Daniel Abadi, Peter Boncz, Stavros Harizopoulos, Stratos Idreos, Samuel Madden. Foundations and Trends® in Databases (Vol 5, Issue 3, 2012, pp 197-280) Sections 1, 2, 4 (read 4.1, 4.4., 4.5, skim over the others and skim Section 3). As you read this paper consider the following questions: What are the differences between a column- and a row- oriented DBMS? Discuss at least one technique from Section 4. Lecture notes: lecture10.pdf Additional resources: Chapter 25 (in R&G, third edition). Optional, additional reading: C-Store: A Column-oriented DBMS. Michael Stonebraker, Daniel Abadi, Adam Batkin, Xuedong Chen, Mitch Cherniack, Miguel Ferreira, Edmond Lau, Amerson Lin, Samuel Madden, Elizabeth O'Neil, Pat O'Neil, Alex Rasin, Nga Tran, Stan Zdonik. In Proceedings of VLDB, 2005. Teaching an Old Elephant New Tricks. Nicolas Bruno (Microsoft). CIDR 2009. Column-Stores vs. Row-Stores: How Different Are They Really? Daniel J. Abadi, Samuel R. Madden, and Nabil Hachem. SIGMOD 2008.
Lecture 11	Title: Data analytics at scale: Parallel DBMSs and MapReduce Readings: There are two papers for this lecture, but we only ask you to read a few subsections from each. As you read the papers, focus on the key ideas. Dave DeWitt and Jim Gray. Parallel Database Systems: The Future of High Performance Database Systems. Communications of the ACM. 1992. Also in Red Book 4th Ed. Sections 1 and 2 only. [pdf] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004. Focus on sections 1 through 4. [pdf] As you read these papers, discuss some of the similarities and differences between a parallel DBMS and MapReduce. Lecture notes: lecture11.pdf Additional resources: Chapter 22 (in R&G, third edition). Optional, additional readings: There are many systems being developed in this space: Hadoop, Impala, Hive, Scope, ...
Lecture 12	Title: In-memory data analytics at scale: Spark Readings: Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2012. Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In NSDI'12. As you read this paper, discuss what Spark brings beyond the capabilities previously offered by MapReduce/Hadoop and parallel database management systems. When/why should users switch to this new engine? Lecture notes: lecture12.pdf (no lecture notes) Additional resources: Chapter 22 (in R&G, third edition). Optional, additional readings: There are many systems being developed in this space: Hadoop, Impala, Hive, Scope, ...
Lecture 13	Title: Distributed transactions and replication You do NOT need to submit a review for this lecture. Just skim the following paper before coming: Readings: C. Mohan, B. Lindsay, and R. Obermarck. Transaction Management in the R* Distributed Database Management System. ACM Transactions On Database Systems 11 (4), 1986. Also in the Red Book (3rd and 4th ed). As you read this paper, consider the following questions: Why is a complex two-phase commit protocol needed when commiting a distributed transaction? In the two-phase commit protocol, why do subordinates need to force-write a prepared log record before sending a YES VOTE? To answer this question, use an example failure scenario. Discuss what happens if a subordinate does NOT force-write the prepared log record, then discuss what happens if the subordinate does force-write the prepared log record. Two-phase commit is a blocking protocol. When is it possible for nodes to be blocked? Lecture notes: lecture13.pdf Optional, additional readings: Chapter 22 (in R&G, third edition).
Lecture 14	Title: NoSQL You do NOT need to submit a review for this lecture. Just skim the following paper before coming: Readings: Scalable SQL and NoSQL Data Stores by Rick Cattell, SIGMOD Record, December 2010 (Author keeps revising this survey). As you read this paper, consider what distinguishes NoSQL systems from traditional relational database management systems. Lecture notes: lecture14.pdf Optional, additional readings:
Lecture 15	Title: In-memory transactions (H-Store) Readings: The end of an Architectural Era: (It’s Time for a Complete Rewrite), M. Stonebraker et. al. VLDB ’07 As you read this paper, consider the new design decisions that the authors propose. Lecture notes: lecture15.pdf Optional, additional readings:
Lecture 16	Title: Spanner (guest speaker Campbell Fraser from Google) Readings: James C. Corbett et. al. Spanner: Google's globally-distributed database. In OSDI'12. As you read this paper, consider what distinguishes Spanner from traditional distributed transactions systems and NoSQL systems. Lecture notes: Video of another talk about Spanner. Optional, additional readings:
Lecture 17	Title: Databases as a cloud service Readings: Aaron J. Elmore, Sudipto Das, Alexander Pucher, Divyakant Agrawal, Amr El Abbadi, and Xifeng Yan. 2013. Characterizing tenant behavior for placement and crisis mitigation in multitenant DBMSs. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data (SIGMOD '13). As you read this paper, think about the new challenges that arise when offering a DBMS as a cloud service. Discuss the challenges addressed in the paper. Lecture notes: lecture17.pdf Optional, additional readings: