Lecture 12. Parallel data processing
Readings:
There are three papers for this lecture, but we only ask you to read a few subsections from each. If the reading still takes you more than three hours, skim some of the sections. Focus on the key ideas.
- Dave DeWitt and Jim Gray. Parallel Database Systems: The Future of High Performance Database Systems. Communications of the ACM. 1992. Also in Red Book 4th Ed. Sections 1 and 2 only. [pdf]
- Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004. Focus on sections 1 through 4. [pdf]
- C. Olston, B. Reed, U. Srivastava, R. Kumar and A. Tomkins. Pig Latin: A Not-So-Foreign Language for Data Processing. SIGMOD 2008. Only the introduction. [pdf]
As you read these papers, discuss some of the similarities and differences between a parallel DBMS and Pig/MapReduce.
Lecture notes:
Additional resources:
- Chapter 22 (in R&G, third edition).
Optional, additional readings: