CSEP 544: Reading Assignments

The drop box for paper reviews is https://catalyst.uw.edu/collectit/dropbox/suciu/30077

Due Date (before class)
Paper
Sections to Read, Suggestions for Questions to Address
January 14
Stonebraker's blog on "Big Data" (csenetid, uwnetid)
January 21
PAX Paper (csenetid, uwnetid)
Sections 1-4 only
January 28
Join processing in database systems with large main memories (csenetid, uwnetid)

Sections 1,2 only

  • What is the "fudge factor"?
  • The main memory hash-join algorithm (called classic hashing in the paper) stores one relation in main memory. What happens to the algorithm if the amount of main memory is smaller than the size of the relation?
  • What is grace-hash-join?
  • What is hybrid-hash-join?
February 4
Stonebraker's blog on MapReduce v.s. Relational Databases (csenetid, uwnetid)
Add a short comment to the review of the main paper
MapReduce (csenetid, uwnetid)

Sections 1,2,3,4 only. To answer the questions below you may need to do Homework 3 first.

  • What is the difference between a map job and a map task?
  • How is the number of map tasks determined? How is the number of reduce tasks determined?
  • What are the pros/cons in choosing a large/small number of map tasks and reduce tasks?
  • A reduce task has three phases: copy, sort, and reduce. Which of these phases may be started while the map tasks are still running, and which phases cannot be started until all map tasks have completed?
Optional reading: Mining of Massive Datasets, by Rajaraman and Ullman

Sections 2 only.

February 11

What goes around comes around (csenetid, uwnetid)

Sections 1, 2, 3, 4, 8, 9, 10, 11 (in other words, skip 5-7)
February 25
The Declarative Imperative (csenetid, uwnetid)

Sections 1,2,3 only

  • What is the urgency of parallelism?
  • How does the datalog approach to parallelism differ from other approaches?
  • What are the three types of rules in Dedalus? See Fig. 1.
March 11
Stonebraker's blog on NoSQL (csenetid, uwnetid)
Add a short comment to the review of the main paper
Column-store Databases (csenetid, uwnetid)

Sections 1, 2, 4 (read 4.1, 4.4., 4.5, skim over the others)

  • What are the main tradeoffs in column-store databases (i.e. when are they better, and when are they worse than row-oriented databases)?
  • What is the advantage of compression in databases?
  • What is "late materialization"?
  • Explain this statement: "compared to 1980 disk I/O looks 100 times slower to the CPU".