Logistics

Instructor



Content

What is this course about?

This course covers data mining and machine learning algorithms for analyzing very large amounts of data. The emphasis is on MapReduce and Spark as tools for building parallel algorithms that scale to such data.
Topics include: Frequent itemsets and Association rules, Near Neighbor Search in High Dimensional Data, Locality Sensitive Hashing (LSH), Dimensionality reduction, Recommendation Systems, Clustering, Link Analysis, Large scale supervised machine learning, Data streams, Mining the Web for Structured Data.

This course is modeled after CS246: Mining Massive Datasets by Jure Leskovec at Stanford University.

Reference Text

The following text is useful, but not required. It can be downloaded for free, or purchased from Cambridge University Press.
Leskovec-Rajaraman-Ullman: Mining of Massive Datasets

Prerequisites

Students are expected to have the following background:

Students may refer to the following materials for an overview and review of the expected background. Related questions can be posted on Ed or asked during office hours.

Students may decide to enroll without knowledge of these prerequisites, but should expect a significant increase in workload to learn them concurrently (e.g., 10 hours per week per missing prerequisite as a rule of thumb).


Accessibility & Accommodations

Embedded in the core values of the University of Washington is a commitment to ensuring access to a quality higher education experience for a diverse student population. Disability Resources for Students (DRS) recognizes disability as an aspect of diversity that is integral to society and to our campus community. DRS serves as a partner in fostering an inclusive and equitable environment for all University of Washington students. The DRS office is in 011 Mary Gates Hall. Please see the UW resources at the accommodations website.

Washington state law requires that UW develop a policy for accommodation of student absences or significant hardship due to reasons of faith or conscience, or for organized religious activities. The UW's policy, including more information about how to request an accommodation, is available at the religious accommodations policy website.
Accommodations must be requested within the first two weeks of this course using the religious accommodations request form.


Use of ChatGPT and Related Tools

It is hard to find someone who has not heard about ChatGPT and related tools, and these tools are undeniably useful for generating ideas, providing suggestions, and more. In this class, we will ask you to follow these ethical guidelines when using generative AI such as ChatGPT.

1. Whether you use generative AI or read someone's blog post or code, you are expected to ensure that the final product is your own, original work.
2. You may use generative AI for assignments and exams unless I specify that it may not be used. If guidelines are provided you should additionally follow those guidelines.
3. Unlike blog posts and research articles, you do not need to attribute artifacts/quote text or label code produced by generative AI when you use it. However, you must do the following or you will face academic consequences including but not limited to failing an assignment or an exam.
- You must cite the AI program you used in the artifact you hand in. You must also add the prompts you used to your assignment or exam in an appendix.
- If it copies text from other sources and you don't provide proper attribution, you will be held accountable for that.
- If it provides ideas that are not your own, you will need to find and properly cite the original source.
- You must still comply with the academic integrity policies of the institution. This includes refraining from using generative AI to plagiarize or cheat.
4. You will be held to the same standards when you use generative AI as for any assignment, regardless of whether you or the AI created something, including:
- If you turn in artifacts that contain false or incomplete claims, you will be graded accordingly.
- If you turn in code that does not compile or is incomplete, you will be graded accordingly.
- You will be graded based on the critical thinking and writing skills, accuracy, and accessibility of the things that you produce.

We recommend that you use generative AI in moderation. Generative AI can help you summarize text, improve grammar, write code, collect relevant resources to read, and generate ideas. However, if you start to rely solely on it, you may limit your own development in critical thinking and writing. If your writing becomes narrower and shallower in scope as a result, this may impact your grade.

Also, please note that using such tools may imply donating your data to the companies that deployed them. Please take reasonable steps to avoid making our assignments easier in future iterations of the course (e.g., once the tool provides a correct answer, don’t give it positive feedback).

To summarize, you may use generative AI unless otherwise specified. However, you must use it ethically, check its work, and ensure that you do not cheat or plagiarize when using it. Further, you will most likely not receive a high grade if you rely on it to the exclusion of your own critical thinking, writing, and other skills.

This policy was written by Jen Mankoff with minor adaptations.


Schedule

Note: Lectures will be in person in room CSE2 G10.

Lecture slides will be posted here shortly before each lecture.

This schedule is subject to change.

Date Description Course Materials Events Deadlines
Thurs Apr 3 Introduction; MapReduce and Spark
Frequent Itemsets Mining
Suggested Readings:
  1. Ch1: Data Mining
  2. Ch2: Large-Scale File Systems and Map-Reduce
  3. Ch6: Frequent itemsets
Start planning course project
[Teams signup form]
[Colab 0] [Colab 1] & Assignment 1 [handout, bundle file,web link] out
Fri Apr 4 Recitation: Spark
slides
7:30 - 8:30pm via Zoom (See Ed/Canvas)
Wed Apr 9 Recitation: Probability and Proof Techniques
Notes [ HTML Version ]
7:30 - 8:30pm via Zoom (See Ed/Canvas) Colab 0, Colab 1 due
Thurs Apr 10 Locality Sensitive Hashing
Theory of Locality Sensitive Hashing
Suggested Readings:
  1. Ch3: Finding Similar Items (Sect. 3.1-3.4 and 3.5-3.8)
Colab 2 out
Fri Apr 11 Recitation: Linear Algebra
Notes [ HTML Version ]
7:30 - 8:30pm via Zoom (See Ed/Canvas)
Wed Apr 16 Recitation: Big Data Tricks
[Colab Big Data]
7:30 - 8:30pm via Zoom (See Ed/Canvas) Colab 2
& Assignment 1 due
Thurs Apr 17 Clustering
Dimensionality Reduction
Suggested Readings:
  1. Ch7: Clustering (Sect. 7.1-7.4)
  2. Ch11: Dimensionality Reduction (Sect. 11.4)
Colab 3 & Assignment 2 out
Sat Apr 19 Project Team Signup
[Teams signup form] due
Sun Apr 20 Project Proposal
Project Proposal due (no late periods)
Wed Apr 23 Colab 3 due
Thurs Apr 24 Recommender Systems I
Recommender Systems II
Suggested Readings:
  1. Ch9: Recommendation systems
Colab 4 out
Thurs May 1 PageRank
Link Spam and Introduction to Social Networks
Suggested Readings:
  1. Ch5: Link Analysis (Sect. 5.1-5.5)
  2. Ch10: Analysis of Social Networks (Sect. 10.1-10.2, 10.6)
Colab 5 & Assignment 3 out
Wed May 7 Colab 4
& Assignment 2 due
Thurs May 8 Community Detection in Graphs
Graph Representation Learning
Suggested Readings:
  1. Ch10: Analysis of Social Networks (Sect. 10.3-10.5, 10.7-10.8)
Colab 6 out
Sun May 11 Project Milestone due (no late periods)
Wed May 14 Colab 5 due
Thurs May 15 Large-Scale Machine Learning
Suggested Readings:
  1. Ch12: Large-Scale Machine Learning
Colab 7
& Assignment 4 out
Wed May 21 Colab 6
& Assignment 3 due
Thurs May 22 Mining Data Streams
Suggested Readings:
  1. Ch4: Mining data streams (Sect. 4.1-4.7)
Colab 8 out
Wed May 28 Colab 7 due
Thurs May 29 Course Project Meetings
Optimizing Submodular Functions
Suggested Readings:
  1. TimeMachine: Timeline Generation for Knowledge-base Entities by Althoff, Dong, Murphy, Alai, Dang, Zhang. KDD 2015.
Colab 9 out
Wed Jun 4 Assignment 4 & Colab 8 due
Thurs Jun 5 Causal Inference
Sun Jun 8 Colab 9 due
& Final Report due (no late periods)
Wed Jun 11 Presentation video due (no late periods)
Thurs Jun 12 6:30pm - 9:20pm PT (TBD) Project Presentations