CSED 516: Reading Assignments Schedule, Fall 2022
Please do not repost or otherwise distribute the materials
available from this website. Some material is freely available on
the web, some is behind a paywall, and some is private: we only
have permission to use it in class, not to distribute it.
All reviews are due before the beginning of the lecture. There
are no late days for paper reviews.
October 11. Review 1
Submit your review on Canvas
- What goes around. Read sections 1-5 and 10. The other sections are not recommended and we will not discuss them in class.
- A Case Against SQL
Some suggested topics for discussion in your review:
- What is physical and logical data independence?
- Briefly compare data independence in IMS, Codasyl, and the relational model.
- Speculate what led to the decline of IMS / Codasyl and rise of the relational model.
- Briefly explain three peculiar behaviors of SQL.
October 18. Review 2
Submit your review on Canvas
- How good are they?
For those interested in additional information: this video
describes the optimizer of SQL Server, which some consider to be
the best in industry.
October 26. Review 3
Submit your review on Canvas
- Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI'04.
- D. DeWitt and M. Stonebraker. MapReduce: A major step backwards. In Database Column (Blog), 2008.
- Ashish Thusoo et al: Hive - a petabyte scale data warehouse using Hadoop. ICDE 2010: 996-1005.
- For the Hive paper, read sections 1 and 2, and skim section 4 (focus on the optimizations).
Suggested discussion topics:
- How do these three papers fit together? What is the big story that they are telling?
- What are some advantages/disadvantages of MapReduce compared to parallel databases? How does MapReduce address skew? How does MapReduce address worker failures?
- Why was MapReduce needed at Facebook? Why was it insufficient? What are some limitations of HiveQL? Why does it support only INSERT OVERWRITE? What are some of the optimizations in Hive?
November 1. Review 4
Submit your review on Canvas
- M. Zaharia et al. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In NSDI, 2012
- McSherry. Scalability, but at what cost?
Suggested discussion topics:
- What are the main novel concepts that Spark introduces over MapReduce?
- What are the main lessons from McSherry's blog?
November 8. Review 5
Submit your review on Canvas
- Anurag Gupta et al. Amazon Redshift and the Case for Simpler Data Warehouses. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (SIGMOD '15).
- Skim over the paper (it's very high level)
- Suggested discussion points: What does a Redshift cluster consist of? What types of data partitioning does Redshift support? (We will discuss these in detail in a later lecture, but it's easy to imagine what they do.) What was the key metric for the Redshift design team? How long (seconds or minutes) did it typically take to launch a Redshift cluster?
- Dageville et al. The Snowflake Elastic Data Warehouse. SIGMOD Conference 2016: 215-226.
- Read sections 1, 2, and 3, skim section 4, and read section 6.
- Suggested discussion topics:
- What is elasticity, why is it important, and how is it supported in Snowflake?
- How is data storage handled in Snowflake, and why? What would have been the alternatives?
- How are worker failures handled in Snowflake? How does this compare to MapReduce?
- How does Snowflake handle semi-structured data?
November 15. Review 6
Submit your review on Canvas
- The Design and Implementation of Modern Column-Oriented Database Systems.
- Read sections 1 and 2, and skim section 3.
- Read sections 4.1, 4.4, and 4.5.
Suggested discussion points:
- What are the differences between column and row oriented data stores?
- Discuss at least one technique from Section 4.
- What are column stores good for?