In this course, we will study the specialized systems and algorithms that have been developed
to work with data at scale, including parallel database
systems, MapReduce and its contemporaries, graph systems,
streaming systems, and
others. We will also cover core techniques of cloud platforms and important
scalable algorithms.
Instructor: Magdalena (magda) Balazinska, magda at cs.washington.edu.
Office hour: Tuesdays 4:30pm-5pm in class or over Skype by appointment.
TA: Parmita Mehta, parmita at cs.washington.edu. Office hour:
Thursdays 4pm-5pm in CSE482.
TA: Matthew Liu, liux44 at cs.washington.edu. Office hour: 4:30pm-5:30pm in the 4th floor breakout/study area.
Lectures: Tuesdays -- 5pm - 7:50pm
Sections: Tuesdays -- 8pm - 8:50pm
Location: ARC G070.
The workload in the class involves the following:
- 15%: Reading assigned papers and writing short statements.
Each statement should be at most one page, written as a set of bullet points, and
should demonstrate that you read and thought about the paper.
- 45%: Three hands-on assignments using big data systems in the
Amazon Web Services cloud (Redshift, Hadoop/Hive, Spark).
- 15%: Mini hands-on assignments using other big data systems
(Vertica, GraphX, Beam).
Please complete two out of the three
mini-assignments; you can choose which two. If you submit all three,
we will drop your lowest score.
- 25%: Short final project
- The final project is open ended. The only requirement is that it uses a big data system.
We recommend one of the following: Build on one of the systems that we used in class and do
something more with it. For example, you can benchmark a system's performance
on a specific task (e.g., ingesting data, making backups, or rescaling a cluster) or
on a new set of queries (e.g., run queries on a different dataset or run new types of queries).
Another good type of project is to try a system that we did not
have a chance to use in class and report
what you find in terms of ease of use and performance. The best projects will ask a question and vary one or more
parameters to see how performance changes. We expect each final
report to include at least one graph with performance
numbers (you can have more than one). Beyond that graph, you can also include a qualitative
discussion based on what you observed and what you read in
the documentation. The scale of the final project should be similar to
that of the other three assignments.
- The most important component of the final project is the final report, due
during finals week. Please submit a 4-5 page report using the
ACM SIG template (either Word or LaTeX, whichever you
prefer). The report should have the following
sections:
- Section 1 - Introduction: Motivation and an executive summary of what you
did and what you found.
- Section 2 - Evaluated System(s): A brief description of the
overall architecture of the system or systems that you worked
with. Focus in particular on architectural elements that are
relevant to your final project. You may reuse
figures showing the system architecture from the system
documentation, but indicate where they come from if you do.
- Section 3 - Problem Statement and Method: What question did
you ask? What did you try? What approach and method did you use?
- Section 4 - Results: What did you find? Include both
quantitative and qualitative results.
- Section 5 - Conclusion
- We will also have final presentations in class on the last
day. Each person will get 4 minutes to present and should have 3-4
slides. The presentation template is here. The template includes the
three slides that we recommend each presentation include.
- We will assign a grade out of 10 for the final project: 2 points for the
presentation and 8 points for the report.
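For the benchmarking-style projects described above, the core work is a measurement loop: run a task several times at each setting of one parameter and record a robust timing for your graph. The sketch below is only a minimal, hypothetical illustration of that loop; `run_workload` is a stand-in function, and in a real project it would instead submit a query or task to a system such as Redshift, Hive, or Spark.

```python
import time
import statistics

def run_workload(scale):
    # Hypothetical stand-in for a real task (e.g., running a query on a
    # cluster); here we just sort a list whose size grows with `scale`.
    data = list(range(scale * 100_000, 0, -1))
    sorted(data)

def benchmark(scales, trials=3):
    # Time each parameter setting several times and keep the median,
    # which is more robust to outliers than a single run.
    results = {}
    for scale in scales:
        times = []
        for _ in range(trials):
            start = time.perf_counter()
            run_workload(scale)
            times.append(time.perf_counter() - start)
        results[scale] = statistics.median(times)
    return results

if __name__ == "__main__":
    for scale, seconds in benchmark([1, 2, 4, 8]).items():
        # Each (scale, seconds) pair is one data point for your graph.
        print(f"scale={scale}: {seconds:.3f}s")
```

Whatever the system, report medians (or means with error bars) over several trials rather than single runs, and state in the report how many trials you used.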
All assignments and projects are to be done individually.
Please submit your readings, assignments, mini-assignments, final project presentation, and final
project report by adding them to your GitLab repository.
Link to Final Project Presentations Schedule.
Link to GRADEBOOK.
Link to Message Board.
Each week, after lecture, we will have a 50-min section that will
give you hands-on demonstrations and tutorials of various big
data systems and cloud services. Each section will be connected to
either a full assignment or a mini assignment.
The schedule is subject to change, so please check
this website regularly for updates.
How it all fits together
Week 1: Relational Database Management Systems (review)
- No readings. Welcome to class!
Week 2: Parallel shared-nothing DBMSs & Cloud Deployments
- DeWitt and Gray, "Parallel Database Systems: The Future of High
Performance Database Systems," Communications of the
ACM. 1992. Section 2 [pdf].
This is an old paper, and old papers can be especially confusing to
read. As you read this paper, write down your questions and make
sure to ask them in class. Focus on Section 2. Read Section 1 too, but it
covers dated context, so don't worry if you get confused as
you read it.
- Anurag Gupta, Deepak Agarwal, Derek Tan, Jakub Kulesza, Rahul
Pathak, Stefano Stefani, and Vidhya Srinivasan. 2015. Amazon
Redshift and the Case for Simpler Data Warehouses. In Proceedings of
the 2015 ACM SIGMOD International Conference on Management of Data
(SIGMOD '15). [pdf]. Additional information about Redshift can also
be found on the AWS website: https://aws.amazon.com/redshift/.
Week 3: MapReduce (Hadoop)
- Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data
Processing on Large Clusters. OSDI'04. [pdf].
Week 4: Best of Both Worlds Integration
- Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad
Chakka, Ning Zhang, Suresh Anthony, Hao Liu, Raghotham Murthy: Hive
- a petabyte scale data warehouse using Hadoop. ICDE 2010: 996-1005. [pdf].
- [This one is super quick to read. No need to include in your write-up]
Teradata Aster Database. [pdf].
- Jure Leskovec, Anand Rajaraman, and Jeffrey D. Ullman. Mining
of Massive Datasets. Chapter 3 (sections 3.1 through 3.4). [pdf]
Week 5: In-memory analytics
- M. Zaharia et al. Resilient distributed datasets: A
fault-tolerant abstraction for in-memory cluster computing. In NSDI,
2012. [pdf].
- MLlib: Machine Learning in Apache Spark
Xiangrui Meng, Joseph Bradley, Burak Yavuz, Evan Sparks, Shivaram Venkataraman, Davies Liu, Jeremy Freeman, DB Tsai, Manish Amde, Sean Owen, Doris Xin, Reynold Xin, Michael Franklin, Reza Zadeh, Matei Zaharia, Ameet Talwalkar
Journal of Machine Learning Research, 17 (34), Apr. 2016. [pdf]
and also the online documentation available here. Make sure
to click on "MLlib Guide".
- [Optional paper - Read only if you want] Spark SQL: Relational Data Processing in Spark
Michael Armbrust, Reynold Xin, Yin Huai, Davies Liu, Joseph K. Bradley, Xiangrui Meng, Tomer Kaftan, Michael Franklin, Ali
Ghodsi, Matei Zaharia. ACM SIGMOD Conference 2015, May. 2015. [pdf].
Week 6: Graph Processing
- Da Yan, Yingyi Bu, Yuanyuan Tian and Amol Deshpande (2017), "Big Graph Analytics Platforms", Foundations and Trends in Databases: Vol. 7: No. 1-2, pp 1-195. Read Chapters 3-5.[pdf].
Week 7: Column-store DBMSs
- [Submit a write-up] The Design and Implementation of Modern Column-Oriented Database Systems. Daniel Abadi, Peter Boncz, Stavros Harizopoulos, Stratos Idreos, Samuel Madden. Foundations and Trends® in Databases (Vol 5, Issue 3, 2012, pp 197-280). Sections 1, 2, 4 (read 4.1, 4.4, 4.5, skim over the others and skim Section 3). [pdf].
- [No write-up required] Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, Theo Vassilakis:
Dremel: Interactive Analysis of Web-Scale Datasets. PVLDB 3(1):
330-339 (2010). [pdf].
Week 8: Stream Processing
- [Submit a write-up] Daniel J. Abadi, Don Carney, Ugur Çetintemel, Mitch Cherniack, Christian Convey, Sangdon Lee, Michael Stonebraker, Nesime Tatbul, and Stan Zdonik. 2003. Aurora: a new model and architecture for data stream management. The VLDB Journal 12, 2 (August 2003), 120-139. [pdf].
- [Submit a write-up] Tyler Akidau, Robert Bradshaw, Craig Chambers, Slava Chernyak, Rafael J. Fernández-Moctezuma, Reuven Lax, Sam McVeety, Daniel Mills, Frances Perry, Eric Schmidt, and Sam Whittle. 2015. The dataflow model: a practical approach to balancing correctness, latency, and cost in massive-scale, unbounded, out-of-order data processing. Proc. VLDB Endow. 8, 12 (August 2015), 1792-1803. [pdf].
- [No write-up required] Jure Leskovec, Anand Rajaraman, and Jeffrey D. Ullman. Mining
of Massive Datasets. Chapter 4 (sections 4.1 through 4.4). [pdf]
* Subscription: If you are registered for this class, your
@u.washington.edu email address will automatically be added to the class
mailing list (refreshed daily). You can set up a forwarding address at
myuw.washington.edu or change your subscription address here.
* Archive: You can access the archive for the class mailing list HERE.