CSED 516: Reading Assignments Schedule, Fall 2022

Please do not repost or otherwise distribute the materials available from this website. Some material is freely available on the web, some is behind a paywall, and some is private: we have permission to use it in class only, not to distribute it.

All reviews are due before the beginning of the lecture. There are no late days for paper reviews.

October 11. Review 1

Submit your review on Canvas
  1. What goes around
    • Read sections 1-5 and 10. The other sections are not recommended and we will not discuss them in class.
  2. A Case Against SQL
Some suggested topics for discussion in your review:

October 18. Review 2

Submit your review on Canvas
  1. How good are they?

For those interested in additional information: this video describes the optimizer of SQL Server, which some consider to be the best in industry.

October 26. Review 3

Submit your review on Canvas
  1. Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI'04.
    • Read only sections 1, 2, and 3
  2. D. DeWitt and M. Stonebraker. MapReduce: A major step backwards. In Database Column (Blog), 2008.
  3. Ashish Thusoo et al: Hive - a petabyte scale data warehouse using Hadoop. ICDE 2010: 996-1005.
    • Read sections 1, 2, and skim through section 4 (focus on the optimizations)
Suggested discussion topics:

November 1. Review 4

Submit your review on Canvas
  1. M. Zaharia et al. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In NSDI, 2012
  2. F. McSherry et al. Scalability! But at what COST? In HotOS, 2015.
Suggested discussion topics:

November 8. Review 5

Submit your review on Canvas
  1. Anurag Gupta et al. Amazon Redshift and the Case for Simpler Data Warehouses. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (SIGMOD '15).
    • Skim over the paper (it's very high level)
    • Suggested discussion points: What does a Redshift cluster consist of? What types of data partitioning does Redshift support? (We will discuss these in detail in a later lecture, but it's easy to imagine what they do.) What was the key metric for the Redshift design team? How long (seconds or minutes) did it typically take to launch a Redshift cluster?
  2. Dageville et al, The Snowflake Elastic Data Warehouse. SIGMOD Conference 2016: 215-226.
    • Read sections 1, 2, and 3, skim over section 4, and read section 6
    • Suggested discussion topics:
      • What is elasticity, why is it important, and how is it supported in Snowflake?
      • How is data storage handled in Snowflake, and why? What would have been the alternatives?
      • How are worker failures handled in Snowflake? How does this compare to MapReduce?
      • How does Snowflake handle semi-structured data?

November 15. Review 6

Submit your review on Canvas
  1. The Design and Implementation of Modern Column-Oriented Database Systems.
    • Read sections 1 and 2, and skim over section 3
    • Read sections 4.1, 4.4, and 4.5
Suggested discussion points: