CSED 516: Reading Assignments Schedule, Fall 2020

Week 1: no class

The quarter starts on Wednesday, 9/30; the first lecture is on Tuesday, 10/6.

Week 2: no reading assignment (first lecture is introduction).

Week 3: Review 1, Basic Architecture of Parallel DBMS: submission form

  1. DeWitt and Gray, Parallel Database Systems: The Future of High Performance Database Systems, Communications of the ACM. 1992. [pdf].
    • Skim over the introduction, then read only Section 2, Basic Techniques...
    • Suggested discussion points for your review: What is the consensus architecture for parallel database systems? Which type of scaleup is applicable to machine learning tasks on large training data?
  2. Anurag Gupta et al. Amazon Redshift and the Case for Simpler Data Warehouses. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (SIGMOD '15).
    • Skim over the paper (it's very high level)
    • Suggested discussion points: What does a Redshift cluster consist of? What types of data partitioning does Redshift support? (We will discuss these in detail in a later lecture, but it's easy to imagine what they do; a small partitioning sketch also follows this list.) What was the key metric for the Redshift design team? How long (seconds or minutes) did it typically take to launch a Redshift cluster?
  3. Additional information about Redshift can also be found on the AWS website
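
If you want something concrete to anchor the partitioning discussion, here is a toy single-process Python sketch (mine, not Redshift code; all names are invented) of the three horizontal partitioning schemes discussed by DeWitt and Gray: round-robin, hash, and range. Redshift's EVEN and KEY distribution styles correspond roughly to the first two, while its ALL style instead replicates the whole table on every node.

```python
# Toy illustration (not Redshift code) of horizontal partitioning across nodes;
# all names here are made up for the example.
from bisect import bisect_left

NUM_NODES = 4

def round_robin_partition(rows):
    """Spread rows evenly: row i goes to node i mod NUM_NODES."""
    parts = [[] for _ in range(NUM_NODES)]
    for i, row in enumerate(rows):
        parts[i % NUM_NODES].append(row)
    return parts

def hash_partition(rows, key):
    """Send each row to the node chosen by hashing its partitioning key."""
    parts = [[] for _ in range(NUM_NODES)]
    for row in rows:
        parts[hash(row[key]) % NUM_NODES].append(row)
    return parts

def range_partition(rows, key, boundaries):
    """boundaries: sorted upper bounds splitting the key domain into NUM_NODES ranges."""
    parts = [[] for _ in range(NUM_NODES)]
    for row in rows:
        parts[bisect_left(boundaries, row[key])].append(row)
    return parts

rows = [{"cust_id": i, "state": s} for i, s in enumerate("WA OR CA WA ID CA".split())]
print(hash_partition(rows, key="cust_id"))
print(range_partition(rows, key="cust_id", boundaries=[2, 4, 5]))
```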

Week 4: Review 2, MapReduce, Spark: submission form

  1. Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI'04.
    • Read only sections 1,2,3
  2. D. DeWitt and M. Stonebraker. MapReduce: A major step backwards. In The Database Column (blog), 2008.
  3. Ashish Thusoo et al: Hive - a petabyte scale data warehouse using Hadoop. ICDE 2010: 996-1005.
    • Read sections 1, 2, and skim through section 4 (focus on the optimizations)
  4. M. Zaharia et al. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In NSDI, 2012
  5. Suggested discussion topics:
    • How do these three papers fit together? What is the big story that they are telling?
    • What are some advantages/disadvantages of MapReduce compared to parallel databases? How does MapReduce address skew? How does MapReduce address worker failures?
    • Why was MapReduce needed at Facebook? Why was it insufficient? What are some limitations of HiveQL? Why does it support only INSERT OVERWRITE? What are some of the optimizations in Hive? (A toy MapReduce sketch follows this list.)
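
As a companion to the discussion points above, here is a minimal single-process simulation of the MapReduce programming model on the paper's word-count example. The real system, of course, distributes the map and reduce tasks over many workers and re-executes the tasks of failed workers; nothing below is from the paper's code.

```python
# Minimal single-process simulation of the MapReduce programming model
# (word count); the real system runs map/reduce tasks on many workers.
from collections import defaultdict

def map_fn(doc_id, text):
    for word in text.split():
        yield (word.lower(), 1)          # emit (word, 1) for every occurrence

def reduce_fn(word, counts):
    yield (word, sum(counts))            # sum all counts for one word

def run_mapreduce(inputs, map_fn, reduce_fn):
    intermediate = defaultdict(list)
    for key, value in inputs.items():    # "map" phase over every input record
        for k, v in map_fn(key, value):
            intermediate[k].append(v)    # "shuffle": group values by key
    output = {}
    for k, vs in intermediate.items():   # "reduce" phase over each group
        for out_k, out_v in reduce_fn(k, vs):
            output[out_k] = out_v
    return output

docs = {"d1": "the quick brown fox", "d2": "the lazy dog"}
print(run_mapreduce(docs, map_fn, reduce_fn))
# {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```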

Week 5: Review 3, Snowflake, Dremel (BigQuery): submission form

  1. Dageville et al, The Snowflake Elastic Data Warehouse. SIGMOD Conference 2016: 215-226.
    • Read sections 1,2,3, skim over 4, and read sec. 6
    • Suggested discussion topics:
      • What is elasticity, why is it important, and how is it supported in Snowflake?
      • How is data storage handled in Snowflake, and why? What would have been the alternatives?
      • How are worker failures handled in Snowflake? How does this compare to MapReduce?
      • How does Snowflake handle semistructured data?
  2. Dremel/BigQuery. This paper has lots of important information; please read it carefully. Some useful things to know, and some suggested discussion topics:
    • The "Protocol Buffer" is Google's proprietary data format, very similar to JSON; wherever you read "protocol buffer", imagine "JSON" instead.
    • At some point in its history, Dremel moved from local storage (on the local disk) to disaggregated storage (in GFS). What happened when Dremel first did that, and why?
    • What is "serverless computing" and where did it originate?
    • Is the query plan cost estimation in Dremel better or worse than in traditional Relational Database Management Systems (RDBMS)?
    • All modern relational engines need to offer support for JSON or related data formats. This is challenging, since in JSON we can write nested collections, while the relational data model requires data to be in 1st normal form (1NF). Dremel allows the data to be nested (hence Non-1NF, or NFNF) and uses a clever encoding to represent nested data. The original encoding is shown in Fig. 6, and it's rather difficult to understand; you don't need to spend too much time trying to understand it. They have since switched to the new encoding in Fig. 7, which is much easier to understand. (A much-simplified column-striping sketch follows this list.)
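
To get a feel for the NFNF discussion in the last bullet, here is a much-simplified sketch of "column striping": splitting nested JSON-like records into one value stream per leaf path. It deliberately omits the repetition and definition levels of Fig. 6, which are exactly the extra information Dremel stores so the nesting can be reassembled losslessly; the record and field names below are made up.

```python
# Much-simplified column striping of nested JSON-like records:
# one column per leaf path, with repetition/definition levels omitted.
from collections import defaultdict

def stripe(value, path, columns):
    """Append every leaf value of a nested record to the column named by its path."""
    if isinstance(value, dict):
        for field, child in value.items():
            stripe(child, path + (field,), columns)
    elif isinstance(value, list):            # a repeated field
        for child in value:
            stripe(child, path, columns)
    else:                                    # a leaf value
        columns[".".join(path)].append(value)

records = [
    {"DocId": 10, "Name": [{"Language": [{"Code": "en-us"}, {"Code": "en"}]},
                           {"Language": [{"Code": "en-gb"}]}]},
    {"DocId": 20, "Name": [{"Language": [{"Code": "fr"}]}]},
]

columns = defaultdict(list)
for rec in records:
    stripe(rec, (), columns)

print(dict(columns))
# {'DocId': [10, 20], 'Name.Language.Code': ['en-us', 'en', 'en-gb', 'fr']}
# Without repetition levels we cannot tell which codes belong to which Name --
# that is the problem the Fig. 6 encoding solves.
```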

Weeks 8 and 9: Review 4, Advanced Query Processing: submission form

  1. Paraschos Koutris, Semih Salihoglu, Dan Suciu: Algorithmic Aspects of Parallel Data Processing, Foundations and Trends in Databases 2018.
    • Section 2: read 2.1, and 2.2.2
    • Section 3: read 3.1.1, 3.1.2, skim over 3.1.3, then read 3.3
    • 4.1.1 (takeaway: the Hypercube Algorithm)
    • 4.1.2 (takeaway: how to compute the optimal shares)
    • 4.1.3 (takeaway: quick way to compute optimal shares for a uniform db)
    • 4.1.4 (takeaway: quick way to compute the optimal load for a non-uniform db)
    • 4.1.6 (takeaway: make sure you understand Example 4.5 well)
    • Suggested discussion points:
      • What is the significance of Brent’s theorem?
      • What is the main challenge in computing a join on a distributed cluster?
      • What is the degree threshold for a parallel join?
      • What is the trade-off between computing a query in one round or multiple rounds of communication?
      • What speedup does the HyperCube algorithm achieve: linear? sublinear? Explain. (A toy HyperCube routing sketch follows this list.)
      • Why is Broadcast Join a special case of the HyperCube algorithm?
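
To make the HyperCube takeaways above concrete, here is a toy single-process sketch (not the survey's code) of HyperCube routing for the triangle query Q(x,y,z) :- R(x,y), S(y,z), T(z,x). The grid dimensions and helper names are made up; for a uniform database, the recipe in 4.1.3 gives equal shares p1 = p2 = p3 = p^(1/3), which the sketch hard-codes for p = 8.

```python
# Toy sketch of HyperCube (Shares) routing for the triangle query
# Q(x,y,z) :- R(x,y), S(y,z), T(z,x); grid size and names are made up.
from collections import defaultdict

P1 = P2 = P3 = 2                 # equal shares: p = P1*P2*P3 = 8 servers

def h(value, buckets):
    return hash(value) % buckets

def hypercube_route(R, S, T):
    """Return, for each server (i, j, k), the fragments of R, S, T it receives."""
    servers = defaultdict(lambda: ([], [], []))
    for x, y in R:               # R(x,y) goes to every server (h(x), h(y), *)
        for k in range(P3):
            servers[(h(x, P1), h(y, P2), k)][0].append((x, y))
    for y, z in S:               # S(y,z) goes to every server (*, h(y), h(z))
        for i in range(P1):
            servers[(i, h(y, P2), h(z, P3))][1].append((y, z))
    for z, x in T:               # T(z,x) goes to every server (h(x), *, h(z))
        for j in range(P2):
            servers[(h(x, P1), j, h(z, P3))][2].append((z, x))
    return servers

def local_join(Rf, Sf, Tf):
    """Each server joins only the tuples it received."""
    Tset = set(Tf)
    return [(x, y, z) for (x, y) in Rf for (y2, z) in Sf
            if y == y2 and (z, x) in Tset]

R = [(1, 2), (2, 3)]
S = [(2, 3), (3, 1)]
T = [(3, 1), (1, 2)]
answers = set()
for fragments in hypercube_route(R, S, T).values():
    answers.update(local_join(*fragments))
print(answers)   # each answer is produced at exactly one server, (h(x), h(y), h(z))
```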

Week 10: Review 5, Column Store: submission form

  1. Daniel Abadi, Peter Boncz, Stavros Harizopoulos, Stratos Idreos, Samuel Madden. The Design and Implementation of Modern Column-Oriented Database Systems. Foundations and Trends® in Databases (Vol 5, Issue 3, 2012, pp 197-280)
    • Read sections 1, 2, skim over Sec. 3
    • Read sections 4.1, 4.4, and 4.5
    • Suggested discussion points:
      • What are the differences between column-oriented and row-oriented data stores? (A toy layout sketch follows this list.)
      • Discuss at least one technique from Section 4.
      • What are column stores good for?
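
As a companion to these discussion points, here is a toy contrast (mine, not from the survey) between a row layout and a column layout for the same table, plus run-length encoding of a sorted column, one of the compression techniques covered in Section 4.

```python
# Toy contrast between a row layout and a column layout for the same table,
# plus run-length encoding of a sorted column; all data is made up.
rows = [  # row store: each record stored contiguously
    ("alice", "WA", 35),
    ("bob",   "WA", 41),
    ("carol", "OR", 29),
]

columns = {  # column store: one array per attribute
    "name":  ["alice", "bob", "carol"],
    "state": ["WA", "WA", "OR"],
    "age":   [35, 41, 29],
}

# SELECT avg(age): a column store reads only the 'age' array,
# while a row store must touch every full record in `rows`.
print(sum(columns["age"]) / len(columns["age"]))      # 35.0

def run_length_encode(col):
    """Compress a (typically sorted) column into (value, run_length) pairs."""
    out = []
    for v in col:
        if out and out[-1][0] == v:
            out[-1][1] += 1
        else:
            out.append([v, 1])
    return [tuple(p) for p in out]

print(run_length_encode(sorted(columns["state"])))    # [('OR', 1), ('WA', 2)]
```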