CSEP 544: Reading Assignments

The drop box for paper reviews is https://catalyst.uw.edu/collectit/dropbox/suciu/36474

Due Date (before class)
Paper
Sections to Read, Suggestions for Questions to Address
Oct. 12
Stonebraker's blog on "Big Data" (csenetid, uwnetid)

What goes around comes around (csenetid, uwnetid)

Sections 1, 2, 3, 4, 8, 9, 10, 11 (in other words, skip 5-7)
Oct. 19
PAX Paper (csenetid, uwnetid)
Sections 1-4 only
Oct. 26
Selectivity estimation: conference version (csenetid, uwnetid) or journal version (csenetid, uwnetid)

Choose between the conference version (short) or the journal version (more complete)

  • Why do we care about selectivity estimation (estimating the number of tuples that are returned by a query)?
  • What are the three original simplifying assumptions regarding tuple selectivity?
  • Can you think of example real-world datasets where these assumptions are inaccurate?
  • Why not just store multi-dimensional histograms?
  • What is conditional independence and how does a bayesian network represent conditional independence?
  • How can the author's guarantee that they only need to maintain two dimensional distributions?
  • What did you think of their experimental section? Did they show the applicability of their method? Any flaws?
Nov. 02
Join processing in database systems with large main memories (csenetid, uwnetid)

Sections 1,2 only

  • What is the "fudge factor"?
  • The main memory hash-join algorithm (called classic hashing in the paper) stores one relation in main memory. What happens to the algorithm if the amount of main memory is smaller than the size of the relation?
  • What is grace-hash-join?
  • What is hybrid-hash-join?
Stonebraker's blog on MapReduce v.s. Relational Databases (csenetid, uwnetid)
Add a short comment to the review of the main paper
Nov. 9
MapReduce (csenetid, uwnetid)

Sections 1,2,3,4 only. To answer the questions below you may need to do Homework 3 first.

  • What is the difference between a map job and a map task?
  • How is the number of map tasks determined? How is the number of reduce tasks determined?
  • What are the pros/cons in choosing a large/small number of map tasks and reduce tasks?
  • A reduce task has three phases: copy, sort, and reduce. Which of these phases may be started while the map tasks are still running, and which phases cannot be started until all map tasks have completed?
Optional reading: Mining of Massive Datasets, by Rajaraman and Ullman

Sections 2 only.

Nov. 30
The Declarative Imperative (csenetid, uwnetid)

Sections 1,2,3 only

  • What is the urgency of parallelism?
  • How does the datalog approach to parallelism differ from other approaches?
  • What are the three types of rules in Dedalus? See Fig. 1.
Dec. 7
Stonebraker's blog on NoSQL (csenetid, uwnetid)
Add a short comment to the review of the main paper
Catell: NoSQL (csenetid, uwnetid)
Skip most of 2.1-2.6; skip 3.4, 5.3-5.7, 7.3-7.4
Column-store Databases (csenetid, uwnetid)

Sections 1, 2, 4 (read 4.1, 4.4., 4.5, skim over the others)

  • What are the main tradeoffs in column-store databases (i.e. when are they better, and when are they worse than row-oriented databases)?
  • What is the advantage of compression in databases?
  • What is "late materialization"?
  • Explain this statement: "compared to 1980 disk I/O looks 100 times slower to the CPU".