DATA516/CSED516: Scalable Data Systems and Algorithms, Fall 2019

In this course, we will study the specialized systems and algorithms that have been developed to work with data at scale, including parallel database systems, MapReduce and its contemporaries, graph systems, streaming systems, and others. We will also cover the core techniques behind cloud platforms and important scalable algorithms.

Instructor:

Dan Suciu (suciu at cs.washington.edu)
Office hours: Wednesdays at 5pm (CSE 662)

TA:

Brandon Haynes (bhaynes at cs.washington.edu)
Office hours: Mondays at 5:30pm (CSE 291)

TA:

Deepanshu Gupta (guptad2 at cs.washington.edu)
Office hours: Thursdays at 4pm (CSE 220)

Lectures: Tuesdays 5pm - 7:50pm

Sections: Tuesdays 8pm - 8:50pm

Location: Gates Center G04

The schedule is subject to change, so please check this website regularly for updates.

How does all this fit together?

Reading assigned papers and writing short statements (15%)

Each statement should be at most one page long and written as a set of bullet points. It should demonstrate that you read and thought about the paper.

Three hands-on assignments (45%)

Each assignment involves using a big data system in the Amazon Web Services cloud (Redshift, Hadoop/Hive, Spark).

Mini hands-on assignments (15%)

Mini-assignments require using other big data systems (Vertica, GraphX, Beam).

Short final project (25%)

See this document for a list of final project ideas.

The final project is open-ended. The only requirement is that it uses a big data system. We recommend one of the following:

Build on one of the systems that we used in class and do something more with it. For example, you can benchmark a system's performance on a specific task (e.g., ingesting data, making backups, or rescaling a cluster) or on a new set of queries (e.g., run queries on a different dataset or run new types of queries).

Try a system that we did not have a chance to use in class and report what you find in terms of ease of use and performance.

The best projects will ask a question and vary one or more parameters to see how performance changes. We expect each final report to include at least one graph with performance numbers (you can have more than one). Beyond the graph, you can also include a qualitative discussion based on what you observed and what you read in the documentation. The scale of the final project should be similar to that of the three hands-on assignments.
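As a rough illustration of "vary one or more parameters to see how performance changes", here is a minimal Python sketch of a timing harness. It is not part of the course materials: run_workload is a hypothetical stand-in for whatever query or task you would run against a real system, and the input sizes are arbitrary assumptions.

```python
import statistics
import time


def run_workload(n_rows):
    """Hypothetical stand-in for a real query: aggregate n_rows synthetic values."""
    return sum(x % 7 for x in range(n_rows))


def benchmark(workload, param_values, repeats=3):
    """Time the workload `repeats` times per parameter value; return median seconds."""
    results = {}
    for value in param_values:
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            workload(value)
            timings.append(time.perf_counter() - start)
        # The median is less sensitive to a single slow run than the mean.
        results[value] = statistics.median(timings)
    return results


if __name__ == "__main__":
    sizes = [10_000, 100_000, 1_000_000]  # assumed parameter values to sweep
    for size, seconds in benchmark(run_workload, sizes).items():
        print(f"{size:>9} rows: {seconds:.4f} s")
```

The resulting dictionary maps each parameter value to a median runtime, which can be plotted directly as the performance graph for the report. In a real project the workload would submit a query to the system under test rather than compute locally.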

The most important component of the final project is the final project report, which is due during finals week. Please submit a 4-5 page report. Use the ACM SIG template (either Word or LaTeX, whichever you prefer). The final project report should have the following sections:

We will also have final presentations in class on the last day. Each person will get 5 minutes to present and should have 3-4 slides. The presentation template is here. The template includes an optional title slide and the three slides that we recommend for each presentation.

All assignments and projects are to be done individually.

Please submit your readings, assignments, mini-assignments, final project presentation, and final project report by adding them to your GitLab repository.

Each week, after lecture, we will have a 50-minute section that will give you hands-on demonstrations and tutorials of various big data systems and cloud services. Each section will be connected to either a full assignment or a mini-assignment.

All homework assignments are stored in your repository and are also pinned on Piazza.

Message Board