Teaching Assistants

Ken Gu (Head TA)
Dong He
Hao Peng


What is this course about?

The course will discuss data mining and machine learning algorithms for analyzing very large amounts of data. The emphasis will be on MapReduce and Spark as tools for creating parallel algorithms that can process data at this scale.
Topics include: frequent itemsets and association rules, near neighbor search in high-dimensional data, locality sensitive hashing (LSH), dimensionality reduction, recommendation systems, clustering, link analysis, large-scale supervised machine learning, data streams, and mining the Web for structured data.

This course is modeled after CS246: Mining Massive Datasets by Jure Leskovec at Stanford University.

Reference Text

The following text is useful, but not required. It can be downloaded for free, or purchased from Cambridge University Press.
Leskovec-Rajaraman-Ullman: Mining of Massive Datasets


Students are expected to have the following background:

Students may refer to the following materials for an overview and review of the expected background. Related questions can be posted on EdDiscussion or during Office Hours.

Recordings of the recitations will be available on Panopto. You will need to log in with your UW NetID to watch the videos.

Students may decide to enroll without knowledge of these prerequisites, but should expect a significant increase in workload to learn them concurrently (as a rule of thumb, about 10 hours per week per missing prerequisite).

Accessibility & Accommodations

Embedded in the core values of the University of Washington is a commitment to ensuring access to a quality higher education experience for a diverse student population. Disability Resources for Students (DRS) recognizes disability as an aspect of diversity that is integral to society and to our campus community. DRS serves as a partner in fostering an inclusive and equitable environment for all University of Washington students. The DRS office is in 011 Mary Gates Hall. Please see the UW resources at: http://depts.washington.edu/uwdrs/current-students/accommodations/.

Washington state law requires that UW develop a policy for accommodation of student absences or significant hardship due to reasons of faith or conscience, or for organized religious activities. The UW’s policy, including more information about how to request an accommodation, is available at Religious Accommodations Policy.
Accommodations must be requested within the first two weeks of this course using the Religious Accommodations Request form (https://registrar.washington.edu/students/religious-accommodations-request/).


Note: Lectures will be held in person in room CSE2 G010. Zoom links are posted on Canvas.
Note: Recordings of the lectures will be available on Panopto.

Lecture slides will be posted here shortly before each lecture.

This schedule is subject to change.

| Date | Description | Course Materials | Events | Deadlines |
| --- | --- | --- | --- | --- |
| Wed Mar 30 | Introduction; MapReduce and Spark / Frequent Itemsets Mining | Course Information: handout. Suggested Readings: Ch1: Data Mining; Ch2: Large-Scale File Systems and Map-Reduce; Ch6: Frequent itemsets | Start planning course project [Teams signup form]; [Colab 0], [Colab 1] & Assignment 1 [handout, bundle file] out | |
| Fri Apr 1 | Recitation: Spark | | 7:30 - 8:30pm via Zoom (see Ed/Canvas) | |
| Tue Apr 5 | Recitation: Linear Algebra | | 7:30 - 8:30pm via Zoom (see Ed/Canvas) | |
| Wed Apr 6 | Locality Sensitive Hashing / Theory of Locality Sensitive Hashing | Suggested Readings: Ch3: Finding Similar Items (Sect. 3.1-3.4 and 3.5-3.8) | Colab 2 [Colab 2] out | Colab 0, Colab 1 due |
| Thu Apr 7 | Recitation: Probability and Proof Techniques | | 7:30 - 8:30pm via Zoom (see Ed/Canvas) | |
| Wed Apr 13 | Clustering / Dimensionality Reduction | Suggested Readings: Ch7: Clustering (Sect. 7.1-7.4); Ch11: Dimensionality Reduction (Sect. 11.4) | Colab 3 [Colab 3] & Assignment 2 [handout, bundle file] out | Colab 2 & Assignment 1 due |
| Fri Apr 15 | Project Team Signup | | | [Teams signup form] due |
| Wed Apr 20 | Recommender Systems I / Recommender Systems II | Suggested Readings: Ch9: Recommendation systems | Colab 4 [Colab 4] out | Colab 3 due, Project Proposal due (no late periods) |
| Wed Apr 27 | PageRank / Link Spam and Introduction to Social Networks | Suggested Readings: Ch5: Link Analysis (Sect. 5.1-5.5); Ch10: Analysis of Social Networks (Sect. 10.1-10.2, 10.6) | Colab 5 [Colab 5] & Assignment 3 [handout, bundle file] out | |
| Sun May 1 | | | | Colab 4 & Assignment 2 due |
| Wed May 4 | Community Detection in Graphs / Graph Representation Learning | Suggested Readings: Ch10: Analysis of Social Networks (Sect. 10.3-10.5, 10.7-10.8) | Colab 6 [Colab 6] out | |
| Sun May 8 | | | | Project Milestone & Colab 5 due (no late periods) |
| Wed May 11 | Large-Scale Machine Learning | Suggested Readings: Ch12: Large-Scale Machine Learning | Colab 7 [Colab 7] & Assignment 4 [handout, bundle file] out | |
| Sun May 15 | | | | Colab 6 & Assignment 3 due |
| Wed May 18 | Mining Data Streams | Suggested Readings: Ch4: Mining data streams (Sect. 4.1-4.7) | Colab 8 [Colab 8] out | |
| Sun May 22 | | | | Colab 7 due |
| Wed May 25 | Course Project Meetings / Optimizing Submodular Functions. Sign up for meeting slots on EdDiscussion | Suggested Readings: TimeMachine: Timeline Generation for Knowledge-base Entities by Althoff, Dong, Murphy, Alai, Dang, Zhang. KDD 2015. | Colab 9 [Colab 9] out | |
| Sun May 29 | | | | Assignment 4 due |
| Wed Jun 1 | Causal Inference | | | Colab 8 due (6pm) |
| Sun Jun 5 | | | | Colab 9 due & Final Report due & Presentation video due (no late periods) |
| Mon Jun 6 | Virtual Project Presentations | | 6:30pm - 9:20pm PT (on Zoom, see Canvas) | |