Logistics

Instructor

Teaching Assistants

Alex Okeson
Alon Milchgrub
Jessica Perry
Mathew Luo
Nicasia Beebe-Wang
Swati Padmanabhan

Content

What is this course about?

The course covers data mining and machine learning algorithms for analyzing very large amounts of data. The emphasis is on MapReduce and Spark as tools for building parallel algorithms that scale to data of this size.
Topics include: frequent itemsets and association rules, near-neighbor search in high-dimensional data, locality-sensitive hashing (LSH), dimensionality reduction, recommendation systems, clustering, link analysis, large-scale supervised machine learning, data streams, and mining the Web for structured data.

This course is modeled after CS246: Mining Massive Datasets by Jure Leskovec at Stanford University.

Reference Text

The following text is useful, but not required. It can be downloaded for free, or purchased from Cambridge University Press.
Leskovec-Rajaraman-Ullman: Mining of Massive Datasets

Prerequisites

Students are expected to have the following background:

Students may refer to the following materials for an overview and review of the expected background. Related questions can be posted on Piazza or asked during office hours.

Recordings of the recitations are available on Panopto. You will need to log in with your UW NetID to watch the videos. Note: the audio may be slightly off, particularly in the Spark recitation video, due to technical difficulties.

Students may enroll without these prerequisites, but should expect a significant increase in workload to learn them concurrently (as a rule of thumb, about 10 hours per week per missing prerequisite).


Schedule

Lecture slides will be posted here shortly before each lecture.

This schedule is subject to change.

Date | Description | Course Materials | Events | Deadlines
Tue Apr 2 Introduction; MapReduce and Spark
[slides]
Suggested Readings:
  1. Ch1: Data Mining
  2. Ch2: Large-Scale File Systems and Map-Reduce
Assignment 0 out
[handout, bundle file]
Start planning course project
[Teams signup form]
Thu Apr 4 Frequent Itemsets Mining
[slides]
Suggested Readings:
  1. Ch6: Frequent itemsets
Assignment 1 out
[handout, bundle file]
Thu Apr 4 Recitation: Spark
[slides]
[video recording]
2:30-4:30pm in GWN 201
Tue Apr 9 Locality-Sensitive Hashing I
[slides]
Suggested Readings:
  1. Ch3: Finding Similar Items (Sect. 3.1-3.4)
Tue Apr 9 Recitation: Probability and Proof Techniques
[notes]
[video recording]
3:30-5:20pm in PAA A102
Thu Apr 11 Locality-Sensitive Hashing II
[slides]
Suggested Readings:
  1. Ch3: Finding Similar Items (Sect. 3.5-3.8)
Thu Apr 11 Recitation: Linear Algebra
[notes]
No video recording for this recitation, due to technical difficulties
3:30-5:20pm in SIG 134
Tue Apr 16 Clustering
[slides]
Suggested Readings:
  1. Ch7: Clustering (Sect. 7.1-7.4)
Thu Apr 18 Dimensionality Reduction
[slides]
Suggested Readings:
  1. Ch11: Dimensionality Reduction (Sect. 11.4)
Assignment 2 out
[handout, bundle file]
Assignment 0 & Assignment 1 due
Tue Apr 23 Recommender Systems I
[slides]
Suggested Readings:
  1. Ch9: Recommendation systems
Thu Apr 25 Recommender Systems II
[slides]
Suggested Readings:
  1. Ch9: Recommendation systems
Project Proposal due (no late periods)
Tue Apr 30 PageRank
[slides]
Suggested Readings:
  1. Ch5: Link Analysis (Sect. 5.1-5.3, 5.5)
Thu May 2 Link Spam and Introduction to Social Networks
[slides]
Suggested Readings:
  1. Ch5: Link Analysis (Sect. 5.4)
  2. Ch10: Analysis of Social Networks (Sect. 10.1-10.2, 10.6)
Assignment 3 out
[handout, bundle file]
Assignment 2 due
Tue May 7 Community Detection in Graphs
[slides]
Suggested Readings:
  1. Ch10: Analysis of Social Networks (Sect. 10.3-10.5)
Thu May 9 Graph Representation Learning
[slides]
Suggested Readings:
  1. Ch10: Analysis of Social Networks (Sect. 10.7-10.8)
Project Milestone due (no late periods)
Tue May 14 Guest Lecture: Jevin West - Memory in large networks and mining the scientific literature
[slides]
Suggested Readings:
  1. Memory in network flows and its effects on spreading dynamics and community detection
Thu May 16 Guest Lecture: Su-In Lee - Explainable Machine Learning in Precision Medicine
Assignment 4 out
[handout, bundle file]
Assignment 3 due
Tue May 21 Large-Scale Machine Learning I
[slides]
Suggested Readings:
  1. Ch12: Large-Scale Machine Learning
Thu May 23 Large-Scale Machine Learning II
[slides]
Suggested Readings:
  1. Ch12: Large-Scale Machine Learning
Tue May 28 Mining Data Streams I
[slides]
Suggested Readings:
  1. Ch4: Mining data streams (Sect. 4.1-4.3)
Thu May 30 Mining Data Streams II
[slides]
Suggested Readings:
  1. Ch4: Mining data streams (Sect. 4.4-4.7)
Sat Jun 1 Assignment 4 due
Tue Jun 4 Course Project Meetings
Sign up for meeting slots on Piazza
Thu Jun 6 Optimizing Submodular Functions
[slides]
Suggested Readings:
  1. TimeMachine: Timeline Generation for Knowledge-base Entities by Althoff, Dong, Murphy, Alai, Dang, Zhang. KDD 2015.
Sun Jun 9 Final Report due (no late periods)
Mon Jun 10 Poster Presentations
10:00am-1:00pm in Allen Center Atrium
Poster upload to Gradescope due at 10am (no late periods)