CSED 516: Reading Assignments Schedule, Fall 2021

Please do not repost or otherwise distribute the materials available from this website. Some material is freely available on the web, some is behind a paywall, and some is private: we have permission to use it in class only, not to distribute it.

All reviews are due before the beginning of the lecture. There are no late days for paper reviews.

October 12. Review 1

Submit your review here.

What goes around: read sections 1-5 and 10. The other sections are not recommended, and we will not discuss them in class.
A Case Against SQL

Some suggested topics for discussion in your review:

What is physical and logical data independence?
Briefly compare data independence in IMS, Codasyl, and the relational model.
Speculate what led to the decline of IMS / Codasyl and rise of the relational model.
Briefly explain three peculiar behaviors of SQL.

October 19. Review 2

Submit your review here.

How good are they?

This is a very good paper; I recommend reading it entirely. We will discuss several aspects of this paper in class.

For those interested in additional information: this video describes the optimizer of SQL Server, which some consider to be the best in industry.

October 26. Review 3

Submit your review here.

Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI'04.
- Read only sections 1, 2, and 3
D. DeWitt and M. Stonebraker. MapReduce – a major step backward. In Database Column (Blog), 2008.
Ashish Thusoo et al: Hive - a petabyte scale data warehouse using Hadoop. ICDE 2010: 996-1005.
- Read sections 1, 2, and skim through section 4 (focus on the optimizations)

November 2. Review 4

Submit your review here.

M. Zaharia et al. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In NSDI, 2012
McSherry et al. Scalability, but at what cost?

November 9. Review 5

Submit your review here.

Anurag Gupta et al. Amazon Redshift and the Case for Simpler Data Warehouses. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (SIGMOD '15).
- Skim the paper (it's very high level)
- Suggested discussion points: What does a Redshift cluster consist of? What types of data partitioning does Redshift support? (We will discuss these in detail in a later lecture, but it's easy to imagine what they do.) What was the key metric for the Redshift design team? How long (seconds or minutes) did it typically take to launch a Redshift cluster?
Dageville et al, The Snowflake Elastic Data Warehouse. SIGMOD Conference 2016: 215-226.
- Read sections 1, 2, and 3, skim section 4, and read section 6
- Suggested discussion topics:
  - What is elasticity, why is it important, and how is it supported in Snowflake?
  - How is data storage handled in Snowflake, and why? What would have been the alternatives?
  - How are worker failures handled in Snowflake? How does this compare to MapReduce?
  - How does Snowflake handle semistructured data?

November 16. Review 6

Submit your review here.

BigQuery (Dremel)

This paper has lots of important information; please read it carefully. Some useful things to know, and some suggested discussion topics:

The "Protocol Buffer" is Google's own data format, very similar to JSON; wherever you read "protocol buffer", imagine "JSON" instead.
At some point in its history, Dremel moved from local storage (on the local disk) to disaggregated storage (in GFS). What happened when Dremel first did that, and why?
What is "serverless computing" and where did it originate?
Is the query plan cost estimation in Dremel better or worse than in traditional Relational Database Management Systems (RDBMS)?
All modern relational engines need to offer support for JSON or related data formats. This is challenging, since in JSON we can write nested collections, while the relational data model requires data to be in 1st normal form (1NF). Dremel allows the data to be nested (hence Non-1NF, or NFNF), and uses a clever encoding to represent nested data. The original encoding is shown in Fig. 6; it is rather difficult to understand, and you don't need to spend too much time on it. Dremel has since switched to the new encoding in Fig. 7, which is much easier to understand.
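To make the tension between nested data and 1NF concrete, here is a minimal Python sketch. It is not Dremel's encoding from either figure; it only illustrates the naive alternative of flattening a nested record into 1NF rows. The sample document and the helper name to_1nf are made up for illustration.

```python
import json

# A nested (non-1NF) record: "phones" is a collection inside the record.
# The record itself is a hypothetical example, not taken from the paper.
doc = json.loads('{"id": 1, "name": "Alice", "phones": ["555-1234", "555-9876"]}')

def to_1nf(record, repeated_field):
    """Flatten one repeated field into one row per element (naive 1NF)."""
    # Copy every field except the repeated one; these values get duplicated.
    scalars = {k: v for k, v in record.items() if k != repeated_field}
    # Emit one flat row per element of the nested collection.
    return [{**scalars, repeated_field: value} for value in record[repeated_field]]

rows = to_1nf(doc, "phones")
for row in rows:
    print(row)
```

Each output row repeats id and name, and a record with an empty phones list would produce no rows at all. These are the kinds of costs that an engine avoids by storing nested data natively.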

November 23. Review 7

Submit your review here.

The Design and Implementation of Modern Column-Oriented Database Systems.
- Read sections 1 and 2, skim section 3
- Read sections 4.1, 4.4, and 4.5

Suggested discussion points:

What are the differences between column-oriented and row-oriented data stores?
Discuss at least one technique from Section 4.
What are column stores good for?