CSE 599c: Big Data Management Systems, Spring 2017

In this advanced graduate course, we will analyze the design and study the effectiveness and performance of a selection of big data management systems. We will study both batch- and stream-processing systems.

Instructor: Magdalena (magda) Balazinska, magda at cs.washington.edu. Office hour: Mondays 12pm-1pm in CSE 584.

TA: Cyrus Rashtchian, cyrash at cs.washington.edu. Office hour: Fridays 9:30am-10:30am in the theory lab (CSE 306).

Lectures: Mondays and Wednesdays -- 9am-10:20am

Location: MGH 251

The workload in the class involves the following:

20%: Read assigned papers and participate in class discussions.
40%: Co-organize the demonstration and hands-on tutorial for one big data management system.
- Add your name to the schedule spreadsheet.
- Plan to meet with the course instructor a week and a half prior to your demo and tutorial.
40%: Perform a focused benchmark study of a set of big data systems. Present the results of the study in class at the end of the quarter. You can work in teams of up to four people. A team of N people should do a comparative study that involves N systems.
- April 17th: One page project idea description.
- May 8th: Two-page project milestone.
- May 31st: Final project presentations.
- June 6th: Final project write-ups due. Formatted as a blog post.

Link to FINAL PROJECTS REPOSITORY.

Link to GRADEBOOK.

Link to DROP BOX. Please use the dropbox to submit your project idea, milestone, and final paper.

An exciting component of this course is its practical, hands-on tutorials in class. All tutorial materials are publicly available on GitHub in the following repository: https://github.com/mbalazin/cse599c-17sp-tutorials

Note that this schedule is subject to change, so please check this website regularly for updates.

How it all fits together?

Week 1: Parallel DBMSs & MapReduce

[Read this paper] DeWitt and Gray, "Parallel Database Systems: The Future of High Performance Database Systems," Communications of the ACM. 1992. Sections 1 and 2 only [pdf].
[Skim this paper] DeWitt et al., "The Gamma Database Machine Project," IEEE Transactions on Knowledge and Data Engineering, Volume 2 Issue 1, March 1990, p. 44-62 [pdf].
[Skim this paper] Goetz Graefe. Encapsulation of parallelism in the Volcano query processing system. SIGMOD 1990. [pdf].
[Read this paper] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI'04. [pdf].

Week 2: Best of Both Worlds Integration

[Skim for Monday] Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. DeWitt, Samuel Madden, Michael Stonebraker: A comparison of approaches to large-scale data analysis. SIGMOD 2009: 165-178. [pdf].
[Read for Monday] Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Ning Zhang, Suresh Anthony, Hao Liu, Raghotham Murthy: Hive - a petabyte scale data warehouse using Hadoop. ICDE 2010: 996-1005. [pdf].
[Read for Wednesday] Teradata Database: Introduction to Teradata. Release 14.0, Chapter 4 only. [pdf].
[Read for Wednesday] Hybrid Row-Column Partitioning in Teradata. Mohammed Al-Kateb, Paul Sinclair, Grace Au, Carrie Ballinger. PVLDB 2016. [pdf].
[Skim for Wednesday] Integrating Hadoop and Parallel DBMS. Yu Xu, Pekka Kostamaa, Like Gao. SIGMOD Conference 2010. [pdf].
[Skim for Wednesday] Teradata Aster Database. [pdf].

Week 3: Column-store DBMSs

[Read sections] The Design and Implementation of Modern Column-Oriented Database Systems. Daniel Abadi, Peter Boncz, Stavros Harizopoulos, Stratos Idreos, Samuel Madden. Foundations and Trends® in Databases (Vol 5, Issue 3, 2012, pp 197-280). Sections 1, 2, 4 (read 4.1, 4.4, 4.5; skim the others and skim Section 3). [pdf].
[Read] A. Lamb, M. Fuller, R. Varadarajan, N. Tran, B. Vandiver, L. Doshi, and C. Bear. The Vertica Analytic Database: C-store 7 Years Later. Proc. VLDB Endow., 5(12):1790-1801, Aug. 2012. [pdf].

Week 4: In-memory analytics

[Read] M. Zaharia et al. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In NSDI, 2012. [pdf].
[Read] Spark SQL: Relational Data Processing in Spark. Michael Armbrust, Reynold Xin, Yin Huai, Davies Liu, Joseph K. Bradley, Xiangrui Meng, Tomer Kaftan, Michael Franklin, Ali Ghodsi, Matei Zaharia. ACM SIGMOD Conference 2015, May 2015. [pdf].

Week 5: Parallel DBMS on Hadoop

[Read] M. Kornacker et al. Impala: A modern, open-source SQL engine for Hadoop. In CIDR, 2015. [pdf].

Week 6: University of Washington Big Data Engine

[Read] The Myria Team. The Myria Big Data Management and Analytics System and Cloud Services. In CIDR 2017 [pdf].

Week 7: Machine-Learning Focused Systems

[Read] M. Abadi et al. TensorFlow: Large-scale machine learning on heterogeneous systems. In OSDI, 2016. [pdf].

Week 9: Stream and Batch Processing

[Reading] The Streaming 101 and 102 blog posts are a good read; 102 takes some time if you really dig deep. They link transitively to the MillWheel and Spark Streaming papers, which are deeper, interesting technical reads. I'd recommend those two, plus the VLDB paper on Dataflow: http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf

* Subscription: If you are registered for this class, your @u.washington.edu email address will automatically be added to the class mailing list (refreshed daily). You can set up a forwarding address at myuw.washington.edu or change your subscription address here.

* Archive: You can access the archive for the class mailing list HERE.
