Logistics

Instructor


Teaching Assistants


Ashish Sharma (Head TA)
Ayse Berceste Dincer
Zian Fu
Qifan Huang
Andrew Wei

Content

What is this course about?

The course covers data mining and machine learning algorithms for analyzing very large amounts of data. The emphasis is on MapReduce and Spark as tools for building parallel algorithms that scale to such datasets.
Topics include: frequent itemsets and association rules, near-neighbor search in high-dimensional data, locality-sensitive hashing (LSH), dimensionality reduction, recommendation systems, clustering, link analysis, large-scale supervised machine learning, data streams, and mining the Web for structured data.

This course is modeled after CS246: Mining Massive Datasets by Jure Leskovec at Stanford University.

Reference Text

The following text is useful, but not required. It can be downloaded for free, or purchased from Cambridge University Press.
Leskovec-Rajaraman-Ullman: Mining of Massive Datasets

Prerequisites

Students are expected to have the following background:

Students may refer to the following materials for an overview and review of the expected background. Related questions can be posted on EdDiscussion or asked during office hours.

Recordings of the recitations will be made available on Panopto.

Students may enroll without these prerequisites, but should expect a significant increase in workload to learn them concurrently (as a rule of thumb, about 10 hours per week per missing prerequisite).


Schedule

Note: Lectures will be conducted via Zoom. Links are posted on Canvas.

Lecture slides will be posted here shortly before each lecture.

This schedule is subject to change.

Date Description Course Materials Events Deadlines
Tue Mar 30 Introduction; MapReduce and Spark
[slides]
Course Information: handout
Suggested Readings:
  1. Ch1: Data Mining
  2. Ch2: Large-Scale File Systems and Map-Reduce
Start planning course project [Teams signup form]
Thu Apr 1 Frequent Itemsets Mining
[slides]
Suggested Readings:
  1. Ch6: Frequent Itemsets
[Colab 0] [Colab 1] & Assignment 1 [handout, bundle file] out
Thu Apr 1 Recitation: Spark
4:00-6:00pm via Zoom
Tue Apr 6 Locality-Sensitive Hashing I
[slides]
Suggested Readings:
  1. Ch3: Finding Similar Items (Sect. 3.1-3.4)
Tue Apr 6 Recitation: Probability and Proof Techniques
1:00-3:00pm via Zoom
Thu Apr 8 Locality-Sensitive Hashing II
[slides]
Suggested Readings:
  1. Ch3: Finding Similar Items (Sect. 3.5-3.8)
[Colab 2] out; Colab 0, Colab 1 due
Thu Apr 8 Recitation: Linear Algebra
1:00-3:00pm via Zoom
Tue Apr 13 Clustering
[slides]
Suggested Readings:
  1. Ch7: Clustering (Sect. 7.1-7.4)
Thu Apr 15 Dimensionality Reduction
[slides]
Suggested Readings:
  1. Ch11: Dimensionality Reduction (Sect. 11.4)
[Colab 3] & Assignment 2 [handout, bundle file] out; Colab 2 & Assignment 1 due
Tue Apr 20 Recommender Systems I
[slides]
Suggested Readings:
  1. Ch9: Recommendation Systems
Thu Apr 22 Recommender Systems II
Suggested Readings:
  1. Ch9: Recommendation Systems
[Colab 4] out; Colab 3 due; Project Proposal due (no late periods)
Tue Apr 27 PageRank
Suggested Readings:
  1. Ch5: Link Analysis (Sect. 5.1-5.3, 5.5)
Thu Apr 29 Link Spam and Introduction to Social Networks
Suggested Readings:
  1. Ch5: Link Analysis (Sect. 5.4)
  2. Ch10: Analysis of Social Networks (Sect. 10.1-10.2, 10.6)
[Colab 5] & Assignment 3 out; Colab 4 & Assignment 2 due
Tue May 4 Community Detection in Graphs
Suggested Readings:
  1. Ch10: Analysis of Social Networks (Sect. 10.3-10.5)
Thu May 6 Graph Representation Learning
Suggested Readings:
  1. Ch10: Analysis of Social Networks (Sect. 10.7-10.8)
[Colab 6] out; Colab 5 due; Project Milestone due (no late periods)
Tue May 11 Large-Scale Machine Learning I
Suggested Readings:
  1. Ch12: Large-Scale Machine Learning
Thu May 13 Large-Scale Machine Learning II
Suggested Readings:
  1. Ch12: Large-Scale Machine Learning
[Colab 7] & Assignment 4 out; Colab 6 & Assignment 3 due
Tue May 18 Mining Data Streams I
Suggested Readings:
  1. Ch4: Mining Data Streams (Sect. 4.1-4.3)
Thu May 20 Mining Data Streams II
Suggested Readings:
  1. Ch4: Mining Data Streams (Sect. 4.4-4.7)
[Colab 8] out; Colab 7 due
Tue May 25 Course Project Meetings (optional)
Sign up for meeting slots on EdDiscussion
Thu May 27 Optimizing Submodular Functions
Suggested Readings:
  1. TimeMachine: Timeline Generation for Knowledge-base Entities by Althoff, Dong, Murphy, Alai, Dang, Zhang. KDD 2015.
[Colab 9] out; Assignment 4 due
Tue Jun 1 Causal Inference I
Colab 8 due
Thu Jun 3 Causal Inference II
Sun Jun 6 Colab 9 due; Final Report due & Presentation video due (no late periods)
Mon Jun 7 10:00am-1:00pm Virtual Project Presentations