CSE 444 / CSEM 544

DATABASE SYSTEM INTERNALS


CSEM 544 READINGS

"Most current researchers [and probably you] were not around for many of the previous eras [of database development], and have limited (if any) understanding of what was previously learned. There is an old adage that he who does not understand history is condemned to repeat it. By presenting "ancient history", we hope to allow future researchers to avoid replaying history." - Stonebraker, Hellerstein

As part of 544M, we ask you to read 5 papers related to the material we cover in class. For each paper, we ask you to submit a 1- to 2-page (single-spaced) write-up that answers a few high-level questions about the paper.

The selected material corresponds to a graduate-level database course. In fact, we read these papers (and more) in the graduate 544 course. The papers collected here are originally sourced from the book "Readings in Database Systems", which is commonly referred to as the "Red Book" within the database community.

Each write-up will be graded as CREDIT/NO-CREDIT. To get credit, the write-up must demonstrate that you read the paper and that you reflected on it.

See the course calendar for deadlines.

Paper 1: Data Models

"What Goes Around Comes Around" by Michael Stonebraker and Joseph Hellerstein [PDF] (focus on sections 1-4 and skim over the rest)

This article was originally published in the Red Book.

While reading this paper, focus on the following questions:

  • What is physical and logical data independence?
  • Briefly discuss physical and logical data independence in IMS, Codasyl, and the relational model.
  • Briefly discuss the different data models that followed the relational model. What was the goal of each model? Did it succeed or not? What are some reasons for this?

DUE Jan 11th

    Paper 2: DBMS Architecture

    "The Anatomy of a Database System" by Joseph Hellerstein and Michael Stonebraker [PDF] (focus on sections 1-4 and skim over the rest)

    This article was originally published in the Red Book.

    For this paper, we do not pose any specific questions. Instead, write a summary of some of the key points in the paper. Make sure that your summary demonstrates that you reflected on the paper. For example, rather than stating that "Some systems use application-level threads while others use processes", summarize the key advantages of each design choice.

    DUE Jan 29th

    Paper 3: Query Optimization

    Before reading this paper, you may want to read the book chapters on query optimization.

    "Access Path Selection in a Relational Database Management System" by P. Selinger, M. Astrahan, D. Chamberlin, R. Lorie, and T. Price [PDF]

    This paper was originally published in the Proceedings of ACM SIGMOD, 1979.

    While reading this paper, focus on the following questions:

  • Query optimization is highly dependent on the effectiveness of cost estimation. How does the paper propose to compute the cost of a single relation access path? How about the cost of a complete query plan? What statistics are used? What happens when these statistics are not available for one relation? What are the benefits and limitations of this approach?
  • In addition to computing the cost of a query plan, a query optimizer also needs (1) to define the space of possible plans that it will search and (2) an algorithm to enumerate possible query plans within that space. What query plans does the paper consider? What algorithm does the paper propose to find the best plan in that space? What are the benefits and limitations of this approach?

    DUE Feb 12th

    Papers 4 & 5: Parallel Data Processing

    "Parallel Database Systems: The Future of High Performance Database Systems" by Dave DeWitt and Jim Gray [PDF] (focus on sections 1 and 2 only)

    "MapReduce: Simplified Data Processing on Large Clusters" by Jeffrey Dean and Sanjay Ghemawat [PDF]

    These papers were originally published in Communications of the ACM, 1992, and OSDI, 2004, respectively.

    Please submit a single write-up for both papers (no more than 2 pages in length). In your write-up, please discuss the similarities and differences between parallel DBMSs and MapReduce systems.

    DUE Mar 12th