Content
What is this course about?
The course will discuss data mining and machine learning algorithms for analyzing very large amounts of data. The emphasis will be on MapReduce and Spark as tools for creating parallel algorithms that can process very large amounts of data.
Topics include: Frequent itemsets and Association rules, Near Neighbor Search in High Dimensional Data, Locality Sensitive Hashing (LSH), Dimensionality reduction, Recommendation Systems, Clustering, Link Analysis, Large scale supervised machine learning, Data streams, Mining the Web for Structured Data.
This course is modeled after CS246: Mining Massive Datasets by Jure Leskovec at Stanford University.
Reference Text
The following text is useful, but not required. It can be downloaded for free, or purchased from Cambridge University Press.
Leskovec-Rajaraman-Ullman: Mining of Massive Dataset
Prerequisites
Students are expected to have the following background:
- Knowledge of basic computer science principles and skills, at a level sufficient to write a reasonably non-trivial computer program (e.g., CS332, CS373 or equivalent are recommended).
- Good knowledge of Python and Java will be extremely helpful since several assignments will require the use of Spark/Hadoop.
- Familiarity with basic probability theory (any introductory probability course).
- Familiarity with writing rigorous proofs (e.g., CS311 or equivalent).
- Familiarity with basic linear algebra (e.g., Math 308 or equivalent).
- Familiarity with algorithmic analysis (e.g., CS332/CS373; CS417/CS421 would be more than necessary).
Students may refer to the following materials for an overview and review of the expected background. Related questions can be posted on EdDiscussion or during Office Hours.
- Probability and Proof Techniques
- Linear Algebra
- Spark Tutorial (a video is available through Stanford CS246)
Recordings of the recitations are available on Panopto. You will need to login using your UW netid in order to watch the videos. Note: the audio might be a little off, particularly in the Spark recitation video, due to technical difficulties.
Students may decide to enroll without knowledge of these prerequisites but expect an significant increase in work load to learn these concurrently (e.g. 10 hours per week per missing prerequisite as a rule of thumb).