In this course, we will study the specialized systems and algorithms that have been developed
to work with data at scale, including parallel database
systems, MapReduce and its contemporaries, graph systems,
streaming systems, and
others. We will also cover core techniques of cloud platforms and important
scalable algorithms.
Instructor: Magdalena (magda) Balazinska, magda at cs.washington.edu.
Office hour: Tuesdays 4pm-5pm in CSE584.
TA: Parmita Mehta, parmita at cs.washington.edu. Office hour:
Thursdays 4pm-5pm in CSE482.
Lectures: Tuesdays, 5pm-7:50pm
Sections: Tuesdays, 8pm-8:50pm
Location: CSE 403
The workload in the class involves the following:
- 20 %: Reading assigned papers and writing short statements on
the course Slack channel, csed516-papers (link below).
- 40 %: Two hands-on assignments using big data systems in the
Amazon Web Services cloud.
- 40 %: Final project
- Part 1 of the project asks you to build on homework 1 and
exercise Amazon Redshift in some interesting way: Try to assess
its performance on certain types of queries, evaluate its other
features, or try to ingest and analyze a different dataset. For
part 1, please submit a short 3-page write-up describing what you
did and what you found. The write-up should be formatted like a
conference paper.
- Part 2 of the project asks you to build on homework 2 and
similarly exercise Apache Spark in some interesting way. Please
also submit a 3-page write-up describing what you did and what you found.
- The final report asks you to combine part 1 and part 2
into a single 6-to-8-page report. We will provide
feedback after part 1 and part 2, but only the final report will be graded.
It is due during finals week (see schedule below).
- We will also have final presentations in class on the last day.
All assignments and projects are to be done individually.
Link to DROPBOX.
Link to GRADEBOOK.
Link to Final Project Presentation Schedule.
Each week, after lecture, we will have a 50-min section that will
give you hands-on demonstrations and tutorials of various big
data systems and cloud services.
The schedule is subject to change, so please check
this website regularly for updates.
How does it all fit together?
Week 1: Relational Database Management Systems (review)
- No readings. Welcome to class!
Week 2: Parallel shared-nothing DBMSs & Cloud Deployments
- DeWitt and Gray, "Parallel Database Systems: The Future of High
Performance Database Systems," Communications of the
ACM. 1992. Section 2 [pdf].
This is an old paper, and old papers can be especially confusing to
read. As you read it, write down your questions and make
sure to ask them in class. Focus on Section 2. Read Section 1 as well, but it
goes into some dated context, so don't worry if parts of that
section are confusing.
- Anurag Gupta, Deepak Agarwal, Derek Tan, Jakub Kulesza, Rahul
Pathak, Stefano Stefani, and Vidhya Srinivasan. 2015. Amazon
Redshift and the Case for Simpler Data Warehouses. In Proceedings of
the 2015 ACM SIGMOD International Conference on Management of Data
(SIGMOD '15). [pdf]. Additional information about Redshift can also
be found on the AWS website: https://aws.amazon.com/redshift/.
Week 3: MapReduce (MapReduce/Hadoop)
- Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data
Processing on Large Clusters. OSDI'04. [pdf].
Week 4: Best of Both Worlds Integration
- Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad
Chakka, Ning Zhang, Suresh Anthony, Hao Liu, Raghotham Murthy: Hive - a
petabyte scale data warehouse using Hadoop. ICDE 2010: 996-1005. [pdf].
- [This one is super quick to read. No need to comment on Slack]
Teradata Aster Database. [pdf].
- Jure Leskovec, Anand Rajaraman, and Jeffrey D. Ullman. Mining
of Massive Datasets. Chapter 3 (sections 3.1 through 3.4). [pdf]
Week 5: In-memory analytics
- M. Zaharia et al. Resilient distributed datasets: A
fault-tolerant abstraction for in-memory cluster computing. In NSDI,
2012. [pdf].
- MLlib: Machine Learning in Apache Spark.
Xiangrui Meng, Joseph Bradley, Burak Yavuz, Evan Sparks, Shivaram Venkataraman, Davies Liu, Jeremy Freeman, DB Tsai, Manish Amde, Sean Owen, Doris Xin, Reynold Xin, Michael Franklin, Reza Zadeh, Matei Zaharia, Ameet Talwalkar.
Journal of Machine Learning Research, 17 (34), Apr. 2016. [pdf]
and also online documentation available here. Make sure
to click on "MLLib Guide".
- [Optional paper - Read only if you want] Spark SQL: Relational Data Processing in Spark
Michael Armbrust, Reynold Xin, Yin Huai, Davies Liu, Joseph K. Bradley, Xiangrui Meng, Tomer Kaftan, Michael Franklin, Ali
Ghodsi, Matei Zaharia. ACM SIGMOD Conference 2015, May 2015. [pdf].
Week 6: In-depth Spark tutorial
Week 7: Column-store DBMSs and Big Data Systems Wrap Up
- The Design and Implementation of Modern Column-Oriented Database Systems. Daniel Abadi, Peter Boncz, Stavros Harizopoulos, Stratos Idreos, Samuel Madden. Foundations and Trends® in Databases (Vol 5, Issue 3, 2012, pp 197-280). Read Sections 1, 2, and 4 (in Section 4, read 4.1, 4.4, and 4.5 and skim the rest); skim Section 3. [pdf].
- The Myria Big Data Management and Analytics System and Cloud
Services. The Myria Team. CIDR 2017. [pdf].
Week 8: Graph Processing
- Da Yan, Yingyi Bu, Yuanyuan Tian and Amol Deshpande (2017), "Big Graph Analytics Platforms", Foundations and Trends in Databases: Vol. 7: No. 1-2, pp 1-195. Read Chapters 3-5. [pdf].
Week 9: Stream Processing
- Daniel J. Abadi, Don Carney, Ugur Çetintemel, Mitch Cherniack, Christian Convey, Sangdon Lee, Michael Stonebraker, Nesime Tatbul, and Stan Zdonik. 2003. Aurora: a new model and architecture for data stream management. The VLDB Journal 12, 2 (August 2003), 120-139. [pdf].
- Tyler Akidau, Robert Bradshaw, Craig Chambers, Slava Chernyak, Rafael J. Fernández-Moctezuma, Reuven Lax, Sam McVeety, Daniel Mills, Frances Perry, Eric Schmidt, and Sam Whittle. 2015. The dataflow model: a practical approach to balancing correctness, latency, and cost in massive-scale, unbounded, out-of-order data processing. Proc. VLDB Endow. 8, 12 (August 2015), 1792-1803. [pdf].
- Jure Leskovec, Anand Rajaraman, and Jeffrey D. Ullman. Mining
of Massive Datasets. Chapter 4 (sections 4.1 through 4.4). [pdf]
* Subscription: If you are registered for this class, your email
address @u.washington.edu will automatically be added to the class
mailing list (refreshed daily). You can set up a forwarding address at
myuw.washington.edu or change your subscription address here.
* Archive: You can access the archive for the class mailing list HERE.