Logistics

Instructor


Teaching Assistants


Yikun Zhang (Head TA)
Zhitao Yu
Mingyu Lu
William Howard-Snyder
Oscar Liu

Content

What is this course about?

The course will discuss data mining and machine learning algorithms for analyzing very large amounts of data. The emphasis will be on MapReduce and Spark as tools for creating parallel algorithms that can process very large amounts of data.
Topics include: Frequent itemsets and Association rules, Near Neighbor Search in High Dimensional Data, Locality Sensitive Hashing (LSH), Dimensionality reduction, Recommendation Systems, Clustering, Link Analysis, Large scale supervised machine learning, Data streams, Mining the Web for Structured Data.
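To give a flavor of the MapReduce programming model the course emphasizes, here is a minimal word-count sketch in plain Python (not Spark) showing the map, shuffle, and reduce phases; the function names are illustrative, not part of any framework:

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every document.
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["the"])  # 3
```

In Spark, the same computation is a few lines over an RDD or DataFrame, with the shuffle handled by the cluster; the point of the sketch is only the phase structure.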

This course is modeled after CS246: Mining Massive Datasets by Jure Leskovec at Stanford University.

Reference Text

The following text is useful, but not required. It can be downloaded for free, or purchased from Cambridge University Press.
Leskovec-Rajaraman-Ullman: Mining of Massive Datasets

Prerequisites

Students are expected to have the following background:

Students may refer to the following materials for an overview and review of the expected background. Related questions can be posted on Ed or asked during office hours.

Students may decide to enroll without this background, but they should expect a significant increase in workload to learn these prerequisites concurrently (e.g., 10 hours per week per missing prerequisite as a rule of thumb).


Accessibility & Accommodations

Embedded in the core values of the University of Washington is a commitment to ensuring access to a quality higher education experience for a diverse student population. Disability Resources for Students (DRS) recognizes disability as an aspect of diversity that is integral to society and to our campus community. DRS serves as a partner in fostering an inclusive and equitable environment for all University of Washington students. The DRS office is in 011 Mary Gates Hall. Please see the UW resources at the accommodations website.

Washington state law requires that UW develop a policy for accommodation of student absences or significant hardship due to reasons of faith or conscience, or for organized religious activities. The UW’s policy, including more information about how to request an accommodation, is available at the religious accommodations policy website.
Accommodations must be requested within the first two weeks of this course using the religious accommodations request form.


Use of ChatGPT and Related Tools

It is hard to find someone who has not heard about ChatGPT and related tools, and these tools are undeniably useful for generating ideas, providing suggestions, and more. In this class, we will ask you to follow these ethical guidelines when using generative AI such as ChatGPT.

1. Whether you use generative AI or read someone’s blog post or code, you are expected to ensure that the final product is your own, original work.
2. You may use generative AI for assignments and exams unless I specify that it may not be used. If guidelines are provided you should additionally follow those guidelines.
3. Unlike blog posts and research articles, you do not need to quote text or label code produced by generative AI when you use it. However, you must do the following, or you will face academic consequences, including but not limited to failing an assignment or an exam.
- You must cite the AI program you used in the artifact you hand in. You must also add the prompts you used to your assignment or exam in an appendix.
- If it copies text from other sources and you don’t provide proper attribution, you will be held accountable for that.
- If it provides ideas that are not your own, you will need to find and properly cite the original source.
- You must still comply with the academic integrity policies of the institution. This includes refraining from using generative AI to plagiarize or cheat.
4. When you use generative AI, you will be held to the same standards as for any other assignment, regardless of whether you or the AI created the content, including:
- If you turn in artifacts that contain false or incomplete claims, you will be graded accordingly
- If you turn in code that does not compile or is incomplete, you will be graded accordingly
- You will be graded on the critical thinking, writing quality, accuracy, and accessibility of the things that you produce.

We recommend that you use generative AI in moderation. Generative AI can help you summarize text, improve grammar, write code, collect relevant resources to read, and generate ideas. However, if you rely solely on it, you may limit your own development in critical thinking and writing, and if your writing becomes narrower and shallower in scope as a result, this may affect your grade.

Also, please note that using such tools may amount to donating your data to the companies that deploy them. Please take reasonable steps to avoid making our assignments easier in future iterations of the course (e.g., once the tool provides a correct answer, don’t give it positive feedback). To summarize, you may use generative AI unless otherwise specified. However, you must use it ethically, check its work, and ensure that you do not cheat or plagiarize when using it. Further, you will most likely not receive a high grade if you rely on it to the exclusion of your own critical thinking, writing, and other skills.

This policy was written by Jen Mankoff with minor adaptations.


Schedule

Note: Lectures will be in person in room CSE2 G01.

Lecture slides will be posted here shortly before each lecture.

This schedule is subject to change.

Date Description Course Materials Events Deadlines
Tue March 26 Introduction; MapReduce and Spark
[slides]
Suggested Readings:
  1. Ch1: Data Mining
  2. Ch2: Large-Scale File Systems and Map-Reduce
Start planning course project
[Teams signup form]
Tue March 26 Recitation: Spark

3:30-5:00 CSE2 G01 [Colab 0] [Colab 1] out
Thu March 28 Frequent Itemsets Mining
[slides]
Suggested Readings:
  1. Ch6: Frequent itemsets
Assignment 1 [handout, bundle file, web link] out
Thu March 28 Recitation: Probability and Proof Techniques
[notes] [web link]
3:30-5:00 CSE2 G04
Tue April 2 Locality-Sensitive Hashing I
[slides]
Suggested Readings:
  1. Ch3: Finding Similar Items (Sect. 3.1-3.4)
Tue April 2 Recitation: Linear Algebra
[notes] [web link]
3:30-5:00 CSE2 G01
Thu April 4 Locality-Sensitive Hashing II
[slides]
Suggested Readings:
  1. Ch3: Finding Similar Items (Sect. 3.5-3.8)
[Colab 2] Colab 2 out; Colab 0, Colab 1 due
Thu April 4 Recitation: Big Data Trick
[Colab Big Data]
3:30-5:00 CSE2 G04
Tue April 9 Clustering
[slides]
Suggested Readings:
  1. Ch7: Clustering (Sect. 7.1-7.4)
Thu April 11 Dimensionality Reduction
[slides]
Suggested Readings:
  1. Ch11: Dimensionality Reduction (Sect. 11.4)
[Colab 3] Colab 3 & Assignment 2 out
[handout, bundle file, web link]
Colab 2
& Assignment 1 due
Tue April 16 Recommender Systems I
[slides]
Suggested Readings:
  1. Ch9: Recommendation systems
Thu April 18 Recommender Systems II
[slides]
Suggested Readings:
  1. Ch9: Recommendation systems
[Colab 4] Colab 4 out
Colab 3 due,
Project Proposal due (no late periods)
Tue April 23 PageRank
[slides]
Suggested Readings:
  1. Ch5: Link Analysis (Sect. 5.1-5.3, 5.5)
Thu April 25 Link Spam and Introduction to Social Networks
[slides]
Suggested Readings:
  1. Ch5: Link Analysis (Sect. 5.4)
  2. Ch10: Analysis of Social Networks (Sect. 10.1-10.2, 10.6)
[Colab 5] Colab 5
& Assignment 3 out
[handout, bundle file, web link]
Assignment 2 due
Fri April 26
Colab 4 due
Tue April 30 Community Detection in Graphs
[slides]
Suggested Readings:
  1. Ch10: Analysis of Social Networks (Sect. 10.3-10.5)
Thu May 2 Graph Representation Learning
Suggested Readings:
  1. Ch10: Analysis of Social Networks (Sect. 10.7-10.8)
[Colab 6] Colab 6 out
Fri May 3
Colab 5 due
Sun May 5 Project Milestone due (no late periods)
Tue May 7 Large-Scale Machine Learning I
Suggested Readings:
  1. Ch12: Large-Scale Machine Learning
Thu May 9 Large-Scale Machine Learning II
Suggested Readings:
  1. Ch12: Large-Scale Machine Learning
Colab 7
& Assignment 4 out
Assignment 3 due
Fri May 10
Colab 6 due
Tue May 14 Mining Data Streams I
Suggested Readings:
  1. Ch4: Mining data streams (Sect. 4.1-4.3)
Thu May 16 Mining Data Streams II
Suggested Readings:
  1. Ch4: Mining data streams (Sect. 4.4-4.7)
Colab 8 out
Fri May 17
Colab 7 due
Tue May 21 Project Office Hours
Thu May 23 Optimizing Submodular Functions
Suggested Readings:
  1. TimeMachine: Timeline Generation for Knowledge-base Entities by Althoff, Dong, Murphy, Alai, Dang, Zhang. KDD 2015.
Colab 9 out
Assignment 4 due
Tue May 28 Causal Inference I
Colab 8 due
Thu May 30 Causal Inference II
Sun June 2 Colab 9 due
& Final Report due & Presentation video due (no late periods)
Mon June 3 10:30am-12:20pm PT Project Presentations