CS547 | Home

Note

Classes for Spring 2020 will be conducted via Zoom. Check the course handout for more information.

Add Code

Based on current enrollment, we expect to be able to enroll everyone with reasonable prerequisites. To request an add code, please fill out this form.

Note that this course will be significantly different from previous years (2018 and before). See the course schedule for details.

Logistics

Lectures: Tuesdays and Thursdays 10:00-11:20am via Zoom. You can find links to the Zoom lectures on Canvas.
Public resources: The lecture slides and assignments will be posted online as the course progresses. We are happy for anyone to use these resources, but we cannot grade the work of any students who are not officially enrolled in the class.
Office hours: Information here.
Contact: Students should ask all course-related questions in the EdDiscussion forum, where you will also find announcements. For external enquiries, personal matters, or in emergencies, you can email us at cse547-instructors@cs.washington.edu. Please do not use other emails (such as the TAs' UW email IDs) for questions about the class, as these may not be answered in a timely manner.

Content

What is this course about?

The course will discuss data mining and machine learning algorithms for analyzing very large amounts of data. The emphasis will be on MapReduce and Spark as tools for creating parallel algorithms that can process data at this scale.
Topics include: frequent itemsets and association rules, near-neighbor search in high-dimensional data, locality-sensitive hashing (LSH), dimensionality reduction, recommendation systems, clustering, link analysis, large-scale supervised machine learning, data streams, and mining the Web for structured data.

This course is modeled after CS246: Mining Massive Datasets by Jure Leskovec at Stanford University.

Reference Text

The following text is useful, but not required. It can be downloaded for free, or purchased from Cambridge University Press.
Leskovec-Rajaraman-Ullman: Mining of Massive Datasets

Prerequisites

Students are expected to have the following background:

Knowledge of basic computer science principles and skills, at a level sufficient to write a reasonably non-trivial computer program (e.g., CS332, CS373 or equivalent are recommended).
Good knowledge of Python and Java will be extremely helpful since several assignments will require the use of Spark/Hadoop.
Familiarity with basic probability theory (any introductory probability course).
Familiarity with writing rigorous proofs (e.g., CS311 or equivalent).
Familiarity with basic linear algebra (e.g., Math 308 or equivalent).
Familiarity with algorithmic analysis (e.g., CS332/CS373; CS417/CS421 would be more than necessary).

Students may refer to the following materials for an overview and review of the expected background. Related questions can be posted on EdDiscussion or asked during office hours.

Probability and Proof Techniques
Linear Algebra
Spark Tutorial (a video is available through Stanford CS246)

Recordings of the recitations are available on Panopto. You will need to log in with your UW NetID to watch the videos. Note: the audio might be a little off, particularly in the Spark recitation video, due to technical difficulties.

Students may enroll without these prerequisites, but should expect a significant increase in workload to learn them concurrently (as a rule of thumb, 10 hours per week per missing prerequisite).

Schedule

Note: Lectures will be conducted via Zoom. Links are posted on Canvas.

Lecture slides will be posted here shortly before each lecture.

This schedule is subject to change.

Date	Description	Course Materials	Events	Deadlines
Tue Mar 31	Introduction; MapReduce and Spark [slides]	Course Information: handout Suggested Readings: Ch1: Data Mining Ch2: Large-Scale File Systems and Map-Reduce	Start planning course project [Teams signup form]
Thu Apr 2	Frequent Itemsets Mining [slides]	Suggested Readings: Ch6: Frequent itemsets	[Colab 0], [Colab 1] & Assignment 1 out [handout, bundle file]
Thu Apr 2	Recitation: Spark	1:00-3:00pm via Zoom
Tue Apr 7	Locality-Sensitive Hashing I [slides]	Suggested Readings: Ch3: Finding Similar Items (Sect. 3.1-3.4)
Tue Apr 7	Recitation: Probability and Proof Techniques	3:30-5:30pm via Zoom
Thu Apr 9	Locality-Sensitive Hashing II [slides]	Suggested Readings: Ch3: Finding Similar Items (Sect. 3.5-3.8)	[Colab 2] out	Colab 0, Colab 1 due
Thu Apr 9	Recitation: Linear Algebra [notes] taken from 2019	1:00-3:00pm via Zoom
Tue Apr 14	Clustering [slides]	Suggested Readings: Ch7: Clustering (Sect. 7.1-7.4)
Thu Apr 16	Dimensionality Reduction [slides]	Suggested Readings: Ch11: Dimensionality Reduction (Sect. 11.4)	[Colab 3] & Assignment 2 out [handout, bundle file]	Colab 2 & Assignment 1 due
Tue Apr 21	Recommender Systems I [slides]	Suggested Readings: Ch9: Recommendation systems
Thu Apr 23	Recommender Systems II [slides]	Suggested Readings: Ch9: Recommendation systems	[Colab 4] out	Colab 3 due, Project Proposal due (no late periods)
Tue Apr 28	PageRank [slides]	Suggested Readings: Ch5: Link Analysis (Sect. 5.1-5.3, 5.5)
Thu Apr 30	Link Spam and Introduction to Social Networks [slides]	Suggested Readings: Ch5: Link Analysis (Sect. 5.4) Ch10: Analysis of Social Networks (Sect. 10.1-10.2, 10.6)	[Colab 5] & Assignment 3 out [handout, bundle file]	Colab 4 & Assignment 2 due
Tue May 5	Community Detection in Graphs [slides]	Suggested Readings: Ch10: Analysis of Social Networks (Sect. 10.3-10.5)
Thu May 7	Graph Representation Learning [slides]	Suggested Readings: Ch10: Analysis of Social Networks (Sect. 10.7-10.8)	[Colab 6] out	Colab 5 due, Project Milestone due Sunday, May 10 at midnight (no late periods)
Tue May 12	Large-Scale Machine Learning I [slides]	Suggested Readings: Ch12: Large-Scale Machine Learning
Thu May 14	Large-Scale Machine Learning II [slides]	Suggested Readings: Ch12: Large-Scale Machine Learning	[Colab 7] & Assignment 4 out [handout, bundle file]	Colab 6 & Assignment 3 due
Tue May 19	Mining Data Streams I [slides]	Suggested Readings: Ch4: Mining data streams (Sect. 4.1-4.3)
Thu May 21	Mining Data Streams II [slides]	Suggested Readings: Ch4: Mining data streams (Sect. 4.4-4.7)	[Colab 8] out	Colab 7 due
Tue May 26	Course Project Meetings (optional)	Sign up for meeting slots on EdDiscussion
Thu May 28	Optimizing Submodular Functions [slides]	Suggested Readings: TimeMachine: Timeline Generation for Knowledge-base Entities by Althoff, Dong, Murphy, Alai, Dang, Zhang. KDD 2015.	[Colab 9] out	Assignment 4 due
Tue Jun 2	Causal Inference I [slides]			Colab 8 due
Thu Jun 4	Causal Inference II [slides]
Sun Jun 7				Colab 9 due & Final Report due & Presentation video due (no late periods)
Mon Jun 8		10:00am-1:00pm	Virtual Project Presentations