CSE 544M: Readings

Please note that this list may be updated.

As part of 544M, we ask you to read five papers related to the material we cover in class. For each paper, submit a 1- to 2-page (single-spaced) write-up that answers a few high-level questions about the paper (Papers 4 and 5 share a single write-up). This material corresponds to a graduate-level database course; in fact, we read these papers (and more) in the graduate 544 course.

Each write-up will be graded as CREDIT/NO-CREDIT. To get credit, the write-up must demonstrate that you read the paper and that you reflected on it.

See the course calendar for deadlines.

Paper 1: Data Models

Michael Stonebraker and Joseph Hellerstein. What Goes Around Comes Around. In “Readings in Database Systems” (aka the Red Book), 4th ed. Focus on Sections 1-4 and skim the rest. pdf

While reading this paper, try to focus on the following questions:

What is physical and logical data independence?
Briefly discuss physical and logical data independence in IMS, CODASYL, and the relational model.
Briefly discuss the different data models that followed the relational model. What was the goal of each model? Did it succeed or not? What are some reasons for this?

Paper 2: DBMS Architecture

Joseph Hellerstein and Michael Stonebraker. The Anatomy of a Database System. In the Red Book (4th ed). Focus on Sections 1-4 and skim the rest. pdf

For this paper, we do not post any specific questions. Please write a summary of some of the key points in the paper, and make sure that your summary demonstrates that you reflected on it. For example, do not simply state that “Some systems use application-level threads while others use processes”; instead, summarize the key advantages of each design choice.

Paper 3: Query Optimization

Before reading this paper, you may want to read the book chapters on query optimization.

P. Selinger, M. Astrahan, D. Chamberlin, R. Lorie, and T. Price. Access Path Selection in a Relational Database Management System. Proceedings of ACM SIGMOD, 1979. Pages 22-34. Also in the Red Book (3rd ed and 4th ed). pdf

While reading this paper, try to focus on the following questions:

Query optimization is highly dependent on the effectiveness of cost estimation. How does the paper propose to compute the cost of a single-relation access path? How about the cost of a complete query plan? What statistics are used? What happens when these statistics are not available for a relation? What are the benefits and limitations of this approach?
In addition to computing the cost of a query plan, a query optimizer also needs (1) a definition of the space of possible plans that it will search and (2) an algorithm to enumerate the query plans within that space. What query plans does the paper consider? What algorithm does the paper propose to find the best plan in that space? What are the benefits and limitations of this approach?
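
To make these three ingredients concrete (a cost model, a plan space, and an enumeration algorithm), here is a minimal Python sketch. It is deliberately not the algorithm from the paper: the plan space is every left-deep join order, the enumeration is brute force, and the cost model, the table cardinalities, and the single join selectivity are all made-up assumptions for illustration. Comparing it against the paper's approach is a useful exercise.

```python
from itertools import permutations

# Hypothetical table cardinalities (number of rows); not taken from the paper.
CARDINALITIES = {"R": 1000, "S": 100, "T": 10}

# Assumed: one fixed selectivity for every join. A real optimizer estimates
# selectivity per predicate from catalog statistics.
JOIN_SELECTIVITY = 0.01


def plan_cost(order, cards=CARDINALITIES, sel=JOIN_SELECTIVITY):
    """Toy cost of a left-deep plan: the summed size of its intermediate results."""
    rows = cards[order[0]]
    cost = 0.0
    for table in order[1:]:
        rows = rows * cards[table] * sel  # estimated size of this join's output
        cost += rows
    return cost


def best_left_deep_plan(tables, cards=CARDINALITIES, sel=JOIN_SELECTIVITY):
    """Brute-force search: try every left-deep join order, keep the cheapest."""
    best = min(permutations(tables), key=lambda order: plan_cost(order, cards, sel))
    return best, plan_cost(best, cards, sel)


if __name__ == "__main__":
    order, cost = best_left_deep_plan(["R", "S", "T"])
    print("cheapest join order:", " -> ".join(order), "| toy cost:", cost)
```

Brute force examines n! orders for n tables, which is one reason a practical optimizer needs a smarter enumeration strategy.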

Papers 4 & 5: Parallel Data Processing

Dave DeWitt and Jim Gray. Parallel Database Systems: The Future of High Performance Database Systems. Communications of the ACM, 1992. Also in the Red Book (4th ed). Read Sections 1 and 2 only. pdf

Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004. pdf

Please submit a single write-up for both papers (no more than 2 pages in length). In your write-up, please discuss the similarities and differences between parallel DBMSs and MapReduce systems.
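
If you have not seen the MapReduce programming model before, the following minimal Python sketch may help while reading. It runs word count, the example used in the MapReduce paper, in a single process. The function names (`map_fn`, `reduce_fn`, `run_mapreduce`) and the in-memory grouping step are our own assumptions for illustration; a real system distributes these phases across many machines and handles failures, which this sketch ignores.

```python
from collections import defaultdict


def map_fn(doc_name, text):
    """Map phase: emit a (word, 1) pair for every word in one document."""
    for word in text.split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce phase: combine all values emitted for one key."""
    return word, sum(counts)


def run_mapreduce(documents, map_fn, reduce_fn):
    """Single-process stand-in for the framework (hypothetical helper)."""
    # "Shuffle": group every intermediate value by its key.
    groups = defaultdict(list)
    for name, text in documents.items():
        for key, value in map_fn(name, text):
            groups[key].append(value)
    # Apply the reducer once per distinct key.
    return dict(reduce_fn(key, values) for key, values in groups.items())


if __name__ == "__main__":
    docs = {"doc1": "a rose is a rose", "doc2": "a daisy"}
    print(run_mapreduce(docs, map_fn, reduce_fn))
```

The user supplies only the map and reduce functions; partitioning, grouping, and scheduling belong to the framework. Keep that division of labor in mind when comparing MapReduce with a parallel DBMS.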