In this course, we will study the specialized systems and algorithms that have been developed
to work with data at scale, including parallel database
systems, MapReduce and its contemporaries, graph systems,
streaming systems, and
others. We will also cover core techniques of cloud platforms and important
scalable algorithms.
Instructor: Magdalena (magda) Balazinska, magda at cs.washington.edu.
Office hour: Tuesdays 4:30pm-5pm in class or over Skype by appointment.
TA: Parmita Mehta, parmita at cs.washington.edu. Office hour:
Thursdays 4pm-5pm in CSE482.
TA: Matthew Liu, liux44 at cs.washington.edu. Office hour: 4:30pm-5:30pm in the 4th floor breakout/study area.
Lectures: Tuesdays -- 5pm - 7:50pm
Sections: Tuesdays -- 8pm - 8:50pm
Location: ARC G070.
The workload in the class involves the following:
- 15%: Reading assigned papers and writing short statements.
Each statement should be at most one page, written as a set of bullet points, and
should demonstrate that you read and thought about the paper.
- 45%: Three hands-on assignments using big data systems in the
Amazon Web Services cloud (Redshift, Hadoop/Hive, Spark).
- 15%: Mini hands-on assignments using other big data systems
(Vertica, GraphX, Beam).
Please complete two out of the three
mini-assignments; you can choose which two. If you submit all three,
we will drop your lowest score.
- 25%: Short final project
- The final project is open ended. The only requirement is that it uses a big data system.
We recommend one of the following: Build on one of the systems that we used in class and do
something more with it. For example, you can benchmark a system's performance
on a specific task (e.g., ingesting data, making backups, or rescaling a cluster) or
on a new set of queries (e.g., run queries on a different dataset or run new types of queries).
Another good type of project is to try a system that we did not
have a chance to use in class and report
what you find in terms of ease of use and performance. The best projects will ask a question and vary one or more
parameters to see how performance changes. We expect each final
report to include at least one graph with performance
numbers (you can have more than one). Beyond that graph, you can also include a qualitative
discussion based on what you observed and what you read in
the documentation. The scale of the final project should be similar to
that of the other three assignments.
- The most important component of the final project is the final report, due
during finals week. Please submit a 4-5 page report using the
ACM SIG template (either Word or LaTeX, whichever you
prefer). The report should have the following
sections:
- Section 1 - Introduction: Motivation and an executive summary of what you
did and what you found.
- Section 2 - Evaluated System(s): A brief description of the
overall architecture of the system or systems that you worked
with. Focus in particular on architectural elements that are
relevant to your final project. You may reuse
figures showing the system architecture from the system
documentation, but indicate where they come from if you do.
- Section 3 - Problem Statement and Method: What question did
you ask? What did you try? What approach and method did you use?
- Section 4 - Results: What did you find? Include both
quantitative and qualitative results.
- Section 5 - Conclusion
- We will also have final presentations in class on the last
day. Each person will get 4 minutes to present and should have 3-4
slides. The presentation template is here. The template includes the
three slides that we recommend each presentation include.
- We will assign a grade out of 10 for the final project: 2 points for the
presentation and 8 points for the report.
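For the benchmarking-style projects described above, the core work is a measurement loop: run a task several times at each setting of one parameter and record a robust timing for your graph. The sketch below is only a minimal, hypothetical illustration of that loop; `run_workload` is a stand-in function, and in a real project it would instead submit a query or task to a system such as Redshift, Hive, or Spark.

```python
import time
import statistics

def run_workload(scale):
    # Hypothetical stand-in for a real task (e.g., running a query on a
    # cluster); here we just sort a list whose size grows with `scale`.
    data = list(range(scale * 100_000, 0, -1))
    sorted(data)

def benchmark(scales, trials=3):
    # Time each parameter setting several times and keep the median,
    # which is more robust to outliers than a single run.
    results = {}
    for scale in scales:
        times = []
        for _ in range(trials):
            start = time.perf_counter()
            run_workload(scale)
            times.append(time.perf_counter() - start)
        results[scale] = statistics.median(times)
    return results

if __name__ == "__main__":
    for scale, seconds in benchmark([1, 2, 4, 8]).items():
        # Each (scale, seconds) pair is one data point for your graph.
        print(f"scale={scale}: {seconds:.3f}s")
```

Whatever the system, report medians (or means with error bars) over several trials rather than single runs, and state in the report how many trials you used.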
All assignments and projects are to be done individually.
Please submit your readings, assignments, mini-assignments, final project presentation, and final
project report by adding them to your GitLab repository.
Link to Final Project Presentations Schedule.
Link to GRADEBOOK.
Link to Message Board.
Each week, after lecture, we will have a 50-min section that will
give you hands-on demonstrations and tutorials of various big
data systems and cloud services. Each section will be connected to
either a full assignment or a mini assignment.
The schedule is subject to change, so please check
this website regularly for updates.
How it all fits together
Week 1: Relational Database Management Systems (review)
- No readings. Welcome to class!
Week 2: Parallel shared-nothing DBMSs & Cloud Deployments
- DeWitt and Gray, "Parallel Database Systems: The Future of High
Performance Database Systems," Communications of the
ACM. 1992. Section 2 [pdf].
This is an old paper, and old papers can be especially confusing to
read. As you read this paper, write down your questions and make
sure to ask them in class. Focus on Section 2. Read Section 1 too, but it
covers dated context, so don't worry if you get confused as
you read it.
- Anurag Gupta, Deepak Agarwal, Derek Tan, Jakub Kulesza, Rahul
Pathak, Stefano Stefani, and Vidhya Srinivasan. 2015. Amazon
Redshift and the Case for Simpler Data Warehouses. In Proceedings of
the 2015 ACM SIGMOD International Conference on Management of Data
(SIGMOD '15). [pdf]. Additional information about Redshift can also
be found on the AWS website: https://aws.amazon.com/redshift/.
Week 3: MapReduce (Hadoop)
- Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data
Processing on Large Clusters. OSDI'04. [pdf].
Week 4: Best of Both Worlds Integration
- Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad
Chakka, Ning Zhang, Suresh Anthony, Hao Liu, Raghotham Murthy: Hive
- a petabyte scale data warehouse using Hadoop. ICDE 2010: 996-1005. [pdf].
- [This one is super quick to read. No need to include in your write-up]
Teradata Aster Database. [pdf].
- Jure Leskovec, Anand Rajaraman, and Jeffrey D. Ullman. Mining
of Massive Datasets. Chapter 3 (sections 3.1 through 3.4). [pdf]
Week 5: In-memory analytics
- M. Zaharia et al. Resilient distributed datasets: A
fault-tolerant abstraction for in-memory cluster computing. In NSDI,
2012. [pdf].
- MLlib: Machine Learning in Apache Spark
Xiangrui Meng, Joseph Bradley, Burak Yavuz, Evan Sparks, Shivaram Venkataraman, Davies Liu, Jeremy Freeman, DB Tsai, Manish Amde, Sean Owen, Doris Xin, Reynold Xin, Michael Franklin, Reza Zadeh, Matei Zaharia, Ameet Talwalkar
Journal of Machine Learning Research, 17 (34), Apr. 2016. [pdf]
and also the online documentation available here. Make sure
to click on "MLlib Guide".
- [Optional paper - Read only if you want] Spark SQL: Relational Data Processing in Spark
Michael Armbrust, Reynold Xin, Yin Huai, Davies Liu, Joseph K. Bradley, Xiangrui Meng, Tomer Kaftan, Michael Franklin, Ali
Ghodsi, Matei Zaharia. ACM SIGMOD Conference 2015, May. 2015. [pdf].
Week 6: Graph Processing
- Da Yan, Yingyi Bu, Yuanyuan Tian and Amol Deshpande (2017), "Big Graph Analytics Platforms", Foundations and Trends in Databases: Vol. 7: No. 1-2, pp 1-195. Read Chapters 3-5.[pdf].
Week 7: Column-store DBMSs
- [Submit a write-up] The Design and Implementation of Modern Column-Oriented Database Systems. Daniel Abadi, Peter Boncz, Stavros Harizopoulos, Stratos Idreos, Samuel Madden. Foundations and Trends® in Databases (Vol 5, Issue 3, 2012, pp 197-280). Sections 1, 2, 4 (read 4.1, 4.4, 4.5, skim over the others and skim Section 3). [pdf].
- [No write-up required] Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, Theo Vassilakis:
Dremel: Interactive Analysis of Web-Scale Datasets. PVLDB 3(1):
330-339 (2010). [pdf].
Week 8: Stream Processing
- [Submit a write-up] Daniel J. Abadi, Don Carney, Ugur Çetintemel, Mitch Cherniack, Christian Convey, Sangdon Lee, Michael Stonebraker, Nesime Tatbul, and Stan Zdonik. 2003. Aurora: a new model and architecture for data stream management. The VLDB Journal 12, 2 (August 2003), 120-139. [pdf].
- [Submit a write-up] Tyler Akidau, Robert Bradshaw, Craig Chambers, Slava Chernyak, Rafael J. Fernández-Moctezuma, Reuven Lax, Sam McVeety, Daniel Mills, Frances Perry, Eric Schmidt, and Sam Whittle. 2015. The dataflow model: a practical approach to balancing correctness, latency, and cost in massive-scale, unbounded, out-of-order data processing. Proc. VLDB Endow. 8, 12 (August 2015), 1792-1803. [pdf].
- [No write-up required] Jure Leskovec, Anand Rajaraman, and Jeffrey D. Ullman. Mining
of Massive Datasets. Chapter 4 (sections 4.1 through 4.4). [pdf]
* Subscription: If you are registered for this class, your
@u.washington.edu email address will automatically be added to the class
mailing list (refreshed daily). You can set up a forwarding address at
myuw.washington.edu or change your subscription address here.
* Archive: You can access the archive for the class mailing list HERE.