Logistics

Instructor


Teaching Assistants


Ken Gu
(Head TA)
Yikun Zhang
 
Esteban Safranchik
 
Aniket Rege
 

Content

What is this course about?

This course covers data mining and machine learning algorithms for analyzing very large amounts of data. The emphasis is on MapReduce and Spark as tools for creating parallel algorithms that can process data at this scale.
Topics include: Frequent itemsets and Association rules, Near Neighbor Search in High Dimensional Data, Locality Sensitive Hashing (LSH), Dimensionality reduction, Recommendation Systems, Clustering, Link Analysis, Large scale supervised machine learning, Data streams, Mining the Web for Structured Data.

This course is modeled after CS246: Mining Massive Datasets by Jure Leskovec at Stanford University.

Reference Text

The following text is useful, but not required. It can be downloaded for free, or purchased from Cambridge University Press.
Leskovec-Rajaraman-Ullman: Mining of Massive Datasets

Prerequisites

Students are expected to have the following background:

Students may refer to the following materials for an overview and review of the expected background. Related questions can be posted on Ed or asked during office hours.

Students may decide to enroll without knowledge of these prerequisites but should expect a significant increase in workload to learn them concurrently (e.g. 10 hours per week per missing prerequisite as a rule of thumb).


Accessibility & Accommodations

Embedded in the core values of the University of Washington is a commitment to ensuring access to a quality higher education experience for a diverse student population. Disability Resources for Students (DRS) recognizes disability as an aspect of diversity that is integral to society and to our campus community. DRS serves as a partner in fostering an inclusive and equitable environment for all University of Washington students. The DRS office is in 011 Mary Gates Hall. Please see the UW resources at the accommodations website.

Washington state law requires that UW develop a policy for accommodation of student absences or significant hardship due to reasons of faith or conscience, or for organized religious activities. The UW’s policy, including more information about how to request an accommodation, is available at the religious accommodations policy website.
Accommodations must be requested within the first two weeks of this course using the religious accommodations request form.


Schedule

Note: Lectures will be held in person in room CSE2 G010.

Lecture slides will be posted here shortly before each lecture.

This schedule is subject to change.

Date Description Course Materials Events Deadlines
Tue Jan 3 Introduction; MapReduce and Spark
[slides]
Course Information: handout

Suggested Readings:
  1. Ch1: Data Mining
  2. Ch2: Large-Scale File Systems and Map-Reduce
Start planning course project
[Teams signup form]
Thu Jan 5 Frequent Itemsets Mining
[slides]
Suggested Readings:
  1. Ch6: Frequent itemsets
[Colab 0], [Colab 1] & Assignment 1 [handout, bundle file, web link] out
Thu Jan 5 Recitation: Spark

3:30-5:00 CSE2 371
Tue Jan 10 Locality-Sensitive Hashing I
[slides]
Suggested Readings:
  1. Ch3: Finding Similar Items (Sect. 3.1-3.4)
Tue Jan 10 Recitation: Probability and Proof Techniques
[notes, web link]
3:30-5:00 CSE2 371 or Zoom
Thu Jan 12 Locality-Sensitive Hashing II
[slides]
Suggested Readings:
  1. Ch3: Finding Similar Items (Sect. 3.5-3.8)
[Colab 2] out Colab 0, Colab 1 due
Thu Jan 12 Recitation: Linear Algebra
[notes, web link]
3:30-5:00 CSE2 371 or Zoom
Tue Jan 17 Clustering
[slides]
Suggested Readings:
  1. Ch7: Clustering (Sect. 7.1-7.4)
Thu Jan 19 Dimensionality Reduction
[slides]
Suggested Readings:
  1. Ch11: Dimensionality Reduction (Sect. 11.4)
[Colab 3] & Assignment 2 [handout, bundle file, web link] out
Colab 2 & Assignment 1 due
Tue Jan 24 Recommender Systems I
[slides]
Suggested Readings:
  1. Ch9: Recommendation systems
Thu Jan 26 Recommender Systems II
[slides]
Suggested Readings:
  1. Ch9: Recommendation systems
[Colab 4] out
Colab 3 due, Project Proposal due (no late periods)
Tue Jan 31 PageRank
[slides]
Suggested Readings:
  1. Ch5: Link Analysis (Sect. 5.1-5.3, 5.5)
Thu Feb 2 Link Spam and Introduction to Social Networks
[slides]
Suggested Readings:
  1. Ch5: Link Analysis (Sect. 5.4)
  2. Ch10: Analysis of Social Networks (Sect. 10.1-10.2, 10.6)
[Colab 5] & Assignment 3 [handout, bundle file, web link] out
Colab 4 & Assignment 2 due
Tue Feb 7 Community Detection in Graphs
[slides]
Suggested Readings:
  1. Ch10: Analysis of Social Networks (Sect. 10.3-10.5)
Thu Feb 9 Graph Representation Learning
[slides]
Suggested Readings:
  1. Ch10: Analysis of Social Networks (Sect. 10.7-10.8)
[Colab 6] out
Colab 5 due
Sun Feb 12 Project Milestone due (no late periods)
Tue Feb 14 Large-Scale Machine Learning I
[slides]
Suggested Readings:
  1. Ch12: Large-Scale Machine Learning
Thu Feb 16 Large-Scale Machine Learning II
[slides]
Suggested Readings:
  1. Ch12: Large-Scale Machine Learning
[Colab 7] & Assignment 4 [handout, bundle file, web link] out
Colab 6 & Assignment 3 due
Tue Feb 21 Mining Data Streams I
[slides]
Suggested Readings:
  1. Ch4: Mining data streams (Sect. 4.1-4.3)
Thu Feb 23 Mining Data Streams II
[slides]
Suggested Readings:
  1. Ch4: Mining data streams (Sect. 4.4-4.7)
[Colab 8] out
Colab 7 due
Tue Feb 28 Project Office Hours
Thu Mar 2 Optimizing Submodular Functions
[slides]
Suggested Readings:
  1. TimeMachine: Timeline Generation for Knowledge-base Entities by Althoff, Dong, Murphy, Alai, Dang, Zhang. KDD 2015.
[Colab 9] out
Assignment 4 due
Tue Mar 7 Causal Inference I
[slides]
Colab 8 due
Thu Mar 9 Causal Inference II
[slides]
Sun Mar 12 Colab 9 due & Final Report due & Presentation video due (no late periods)
Mon Mar 13 10:30am-12:20pm PT Project Presentations