CSE 544: Reading Assignments

The drop box for paper reviews is https://catalyst.uw.edu/collectit/dropbox/summary/pkoutris/21136

Week of	Paper(s)	Questions to Address
March 26	Optional: Query Language Primer Optional: Finite Model Theory (Ch. 1 and 2)
April 2	Required: Answering queries using views	Sections 1, 2, 3 only. What are some key applications? What is the difference between an equivalent rewriting and a maximally contained rewriting? What is the difference between query answering and query rewriting?
	Optional: Optimizing queries using materialized views
	Required: What goes around comes around	Sections 1, 2, 3, 4, 8, 9, 10, 11 (in other words, skip 5-7)
April 9	Required: The Anatomy of a Database System	Sections 1, 2, 3, 4 only What are the advantages/disadvantages of using a thread v.s. a process per connection? What are the three kinds of parallel database architectures? what are their pros/cons? Why do database systems use a different buffer pool from that of the OS? What is the iterator model for query execution?
April 9	Required: Join processing in database systems with large main memories	Sections 1,2 only What is the "fudge factor"? The main memory hash-join algorithm (called classic hashing in the paper) stores one relation in main memory. What happens to the algorithm if the amount of main memory is smaller than the size of the relation? What is grace-hash-join? What is hybrid-hash-join?
April 18	Required: Consistently Estimating the Selectivity of Conjuncts of Predicates	Sections 1,2,3 only What are the challenges in using multidimensional histograms in cardinality estimation (and plan cost estimation)? Suppose we have one dimensional histograms on R.A, R.B, R.C, R.AB, and R.BC. How do current systems estimate the size of a query SELECT * FROM R where A='a' and B='b' and C='c'? Discuss separately DB2 and SQL Server. How does the approach in the paper estimate the size of the query?
April 25	Required: MapReduce	Sections 1,2,3,4 only. To answer the questions below you may need to do Homework 2 first. What is the difference between a map job and a map task? How is the number of map tasks determined? How is the number of reduce tasks determined? What are the pros/cons in choosing a large/small number of map tasks and reduce tasks? A reduce task has three phases: copy, sort, and reduce. Which of these phases may be started while the map tasks are still running, and which phases cannot be started until all map tasks have completed?
April 30	Required: The Declarative Imperative	Sections 1,2,3 only What is the urgency of parallelism? How does the datalog approach to parallelism differ from other approaches? What are the three types of rules in Dedalus? See Fig. 1.
April 30	Optional: Datalog with negation (Chapter 14)
May 7	Optional: book on Finite Model Theory (ps, pdf)
May 14
May 21	Required: Provenance Semirings	Sections 1,2,3,4 only
May 28	Required: Private Data Analysis	Read the entire paper.