In this
advanced graduate course, we will analyze the design and study the
effectiveness and performance of a selection of big data management
systems. We will study both batch- and stream-processing systems.
Instructor: Magdalena (magda) Balazinska, magda at cs.washington.edu.
Office hour: Mondays 12pm-1pm in CSE 584.
TA: Cyrus Rashtchian, cyrash at cs.washington.edu. Office hour:
Fridays 9:30am-10:30am in the theory lab (CSE 306).
Lectures: Mondays and Wednesdays -- 9am-10:20am
Location: MGH 251
The workload in the class involves the following:
- 20 %: Read assigned papers and participate in class discussions.
- 40 %: Co-organize the demonstration and hands-on tutorial for one big
data management system.
- Add your name to the schedule spreadsheet.
- Plan to meet with the course instructor a week and a half before
your demo and tutorial.
- 40 %: Perform a focused benchmark-study of a set of big data
systems. Present the results of the study in class at the end of the
quarter. You can work in teams of up to four people. A team of N
people should do a comparative study that involves N systems.
- April 17th: One page project idea description.
- May 8th: Two-page project milestone.
- May 31st: Final project presentations.
- June 6th: Final project write-ups due, formatted as
a blog post.
Link to FINAL PROJECTS REPOSITORY.
Link to GRADEBOOK.
Link to DROP BOX.
Please use the drop box to submit your project idea, milestone, and final write-up.
An exciting component of this course is the practical, hands-on
tutorials in class. All tutorial materials are publicly available on
GitHub in the following repository: https://github.com/mbalazin/cse599c-17sp-tutorials
Note that this schedule is subject to change, so please check
this website regularly for updates.
How it all fits together?
Week 1: Parallel DBMSs & MapReduce
- [Read this paper] DeWitt and Gray, "Parallel Database Systems: The Future of High
Performance Database Systems," Communications of the
ACM. 1992. Sections 1 and 2 only [pdf].
- [Skim this paper] DeWitt et al, "The Gamma Database Machine Project," IEEE
Transactions on Knowledge and Data Engineering, Volume 2 Issue 1,
March 1990, p. 44-62 [pdf].
- [Skim this paper] Goetz Graefe. Encapsulation of parallelism in the Volcano query
processing system.
SIGMOD 1990. [pdf].
- [Read this paper] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data
Processing on Large Clusters. OSDI'04. [pdf].
Week 2: Best of Both Worlds Integration
- [Skim for Monday] Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi,
David J. DeWitt, Samuel Madden, Michael Stonebraker: A comparison of
approaches to large-scale data analysis. SIGMOD 2009:
165-178. [pdf].
- [Read for Monday] Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad
Chakka, Ning Zhang, Suresh Anthony, Hao Liu, Raghotham Murthy: Hive - a
petabyte scale data warehouse using Hadoop. ICDE 2010: 996-1005. [pdf].
- [Read for Wednesday] Teradata Database: Introduction to Teradata. Release 14.0.
Chapter 4 only. [pdf].
- [Read for Wednesday] Hybrid Row-Column Partitioning in Teradata.
Mohammed Al-Kateb, Paul Sinclair, Grace Au, Carrie
Ballinger. PVLDB 2016. [pdf].
- [Skim for Wednesday] Integrating Hadoop and Parallel DBMS.
Yu Xu, Pekka Kostamaa, Like Gao. SIGMOD Conference 2010. [pdf].
- [Skim for Wednesday] Teradata Aster Database. [pdf].
Week 3: Column-store DBMSs
- [Read sections] The Design and Implementation of Modern Column-Oriented Database Systems. Daniel Abadi, Peter Boncz, Stavros Harizopoulos, Stratos Idreos, Samuel Madden. Foundations and Trends® in Databases (Vol 5, Issue 3, 2012, pp 197-280). Sections 1, 2, 4 (read 4.1, 4.4, 4.5, skim over the others and skim Section 3). [pdf].
- [Read] A. Lamb, M. Fuller, R. Varadarajan, N. Tran,
B. Vandiver, L. Doshi, and C. Bear. The Vertica Analytic Database: C-store 7 Years Later.
Proc. VLDB Endow., 5(12):1790-1801, Aug. 2012. [pdf].
Week 4: In-memory analytics
- [Read] M. Zaharia et al. Resilient distributed datasets: A
fault-tolerant abstraction for in-memory cluster computing. In NSDI,
2012. [pdf].
- [Read] Spark SQL: Relational Data Processing in Spark.
Michael Armbrust, Reynold Xin, Yin Huai, Davies Liu, Joseph
K. Bradley, Xiangrui Meng, Tomer Kaftan, Michael Franklin, Ali
Ghodsi, Matei Zaharia. ACM SIGMOD Conference 2015, May 2015. [pdf].
Week 5: Parallel DBMS on Hadoop
- [Read] M. Kornacker et al. Impala: A modern, open-source SQL engine for Hadoop. In CIDR, 2015. [pdf].
Week 6: University of Washington Big Data Engine
- [Read] The Myria Team. The Myria Big Data Management and
Analytics System and Cloud Services. In CIDR 2017 [pdf].
Week 7: Machine-Learning Focused Systems
- [Read] M. Abadi et al. TensorFlow: Large-scale machine learning on
heterogeneous systems. In OSDI, 2016. [pdf].
Week 9: Stream and Batch Processing
- [Reading] The Streaming 101 and 102 blog posts are a good read; 102
takes some time if students really dig deep. They link transitively
to the MillWheel and Spark Streaming papers, which are deeper
technical reads and interesting. I'd recommend those two, plus the
VLDB paper on Dataflow:
http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf
* Subscription: If you are registered for this class, your email
address @u.washington.edu will automatically be added to the class
mailing list (refreshed daily). You can set up a forwarding address at
myuw.washington.edu or change your subscription address here.
* Archive: You can access the archive for the class mailing list HERE.