CSE 446, Winter 2018
Machine Learning
TAs:
Kousuke Ariga,
Benjamin Evans,
Xingfan Huang,
Sean Jaffe,
Vardhman Mehta,
Patrick Spieker,
Jeannette Yu,
Kaiyu Zheng.
Contact: cse446-staff@cs.washington.edu
PLEASE COMMUNICATE WITH THE INSTRUCTOR AND TAs ONLY THROUGH THIS
EMAIL (unless there is a reason for privacy in your email).
Class lectures: MWF 9:30-10:20am, Room: SIG 134
Office Hours:
***Please double check the website before
you arrive for location changes/cancellations.***
Kousuke Ariga: Wednesday 1:30-2:30pm, 2nd floor breakout
Benjamin Evans: Tuesday 9:30-10:30am, CSE 021 (last OH: 9:30am-12pm and 3-4pm, CSE 614)
Xingfan Huang: Tuesday 11:00am-12:00pm, CSE 021
Sean Jaffe: Thursday 2:00-3:00pm, CSE 007
Sham Kakade: Monday 2:45-4:15, CSE 436
Vardhman Mehta: Friday 2:30-3:30pm, CSE 007
Patrick Spieker: Thursday 12:30-1:20pm, CSE 021
Jeannette Yu: Wednesday 11:30am-12:30pm, CSE 021
Kaiyu Zheng: Monday 11:00am-12:00pm, CSE 021 (last OH: Monday 9:30am-12pm and Tuesday 11am-12pm, CSE 614)
About the Course and Prerequisites
Machine learning explores the study and construction of algorithms
that can learn from data. This study combines ideas from both
computer science and statistics. The study of learning from data is
playing an increasingly important role in numerous areas of science
and technology.
This course is designed to provide a thorough grounding in the
fundamental methodologies and algorithms of
machine learning. The topics of the course draw from classical
statistics, machine learning, data mining, Bayesian statistics, and
optimization.
Prerequisites: Students entering the class should be comfortable with
programming (e.g., Python) and should have a working knowledge of
probability, statistics, algorithms, and linear algebra.
Discussion Forum and Email Communication
IMPORTANT: All class announcements will be broadcast via
Canvas. Please post questions about
homeworks, projects, and lectures to the Canvas discussion board. If you have a question about personal
matters, please email the instructors list:
cse446-staff@cs.washington.edu.
Material and textbooks
The primary reading assignments will be from the following two books:
A Course in Machine Learning, Hal
Daume.
Machine Learning: A Probabilistic Perspective, Kevin Murphy.
Other helpful textbooks are:
From a more theoretical perspective: Understanding Machine Learning: From Theory to Algorithms, Shai Shalev-Shwartz and Shai Ben-David.
More statistical: The Elements of Statistical
Learning: Data Mining, Inference, and Prediction, Trevor Hastie, Robert
Tibshirani, and Jerome Friedman.
A little more Bayesian: Pattern Recognition and Machine Learning, Chris Bishop.
From an AI angle: Machine Learning, Tom Mitchell.
Policies
Grades will be based on four assignments (40%), a midterm (20%), and a
final (40%). NEW: we will also consider another weighting scheme of
assignments (60%), a midterm (15%), and a final (25%), and we will
take the max of these two schemes. Extra credit will be added after
the max is taken, weighted the same regardless of which weighting
scheme is used (see the sketch below). This is to encourage students
to actively work on the HWs (including the Extra Credit). In a small
number of cases, grades may be adjusted after this breakdown: e.g.
grades will (significantly) drop for failure to submit all the HWs;
grades may go up for particularly remarkable exam scores or for
consistently remarkable homeworks.
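To make the max-of-two-schemes rule concrete, here is a minimal sketch in
Python (the function and variable names are hypothetical, and scores are
assumed to be fractions in [0, 1]); it only illustrates the rule above and
is not an official grading script.

```python
# Minimal sketch of the grading rule above (illustrative only).
# Assumes all scores are fractions in [0, 1]; names are hypothetical.

def course_grade(hw, midterm, final, extra_credit=0.0):
    """Course grade under the better of the two weighting schemes."""
    scheme_a = 0.40 * hw + 0.20 * midterm + 0.40 * final
    scheme_b = 0.60 * hw + 0.15 * midterm + 0.25 * final
    # Extra credit is added after the max is taken, with the same weight
    # regardless of which scheme was better.
    return max(scheme_a, scheme_b) + extra_credit

# Example: strong homework scores make the second scheme the better one.
print(course_grade(hw=0.95, midterm=0.70, final=0.80))  # 0.875
```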
Exams:
If you are not able to make the exam dates (and do not have an exception
based on UW policies), then do not enroll in the course. Exams
will not be given on alternative dates.
Homeworks:
Homework must be done individually: each
student must hand in their own answers. In addition, each student must
submit their own code in the programming part of the
assignment (we may run your code). It is
acceptable for students to discuss problems with each other; it is not
acceptable for students to look at another student's written answers.
It is acceptable for students to discuss coding questions with others;
it is not acceptable for students to look at another student's code.
You must also indicate on each homework with whom you collaborated.
We expect students not to copy, refer to, or seek out solutions in
published material on the web, in other textbooks, or in solutions
from previous years or other courses when preparing their answers.
Students are certainly encouraged to read extra material for a deeper
understanding. If you do happen to find an assignment's answer, it
must be acknowledged clearly with an appropriate citation on the
submitted solution.
HW LATE POLICY: Homeworks must be submitted by the posted due date.
You are allowed up to 2 LATE DAYS for the homeworks throughout the
entire quarter, which will automatically be deducted if your
assignment is late: each day (or part thereof) by which an assignment
is late uses one late day, until both are spent. After the two late
days are used up, any assignment turned in late will incur a reduction
of 33% in the final score for each day (or part thereof) that it is
late. For example, an assignment up to 24 hours late incurs a penalty
of 33%, one up to 48 hours late incurs a penalty of 66%, and anything
later receives no credit. A sketch of this computation is given below.
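The same policy as a minimal sketch in Python (the function name and the
convention of counting any partial day as a full late day are illustrative
assumptions, not an official calculator):

```python
# Minimal sketch of the late policy above (illustrative only).
import math

def late_penalty_multiplier(hours_late, free_late_days_remaining=2):
    """Fraction of the score kept, per the late policy above."""
    if hours_late <= 0:
        return 1.0
    days_late = math.ceil(hours_late / 24)             # any part of a day counts
    penalized_days = max(0, days_late - free_late_days_remaining)
    if penalized_days == 0:
        return 1.0                                     # covered by free late days
    if penalized_days == 1:
        return 1.0 - 0.33                              # up to 24 hours past late days
    if penalized_days == 2:
        return 1.0 - 0.66                              # up to 48 hours past late days
    return 0.0                                         # any later: no credit

# Example: 30 hours late with both late days already used.
print(late_penalty_multiplier(30, free_late_days_remaining=0))  # ~0.34
```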
Academic and Personal Integrity
The instructor expects (and believes) that each student will conduct
himself or herself with integrity. While the TAs will follow the
course and university policies with regard to grading and
proctoring, it is ultimately up to you to conduct yourself with academic
and personal integrity for a number of important reasons.
Diversity and Gender in STEM
While many academic disciplines have historically been dominated by
one cross-section of society, the study of and participation in STEM
disciplines is a joy that the instructor hopes everyone can pursue. It
is not obvious to the instructor what the best solution is. At the
least, the instructor encourages students to be mindful of these
issues and, in good faith, to try to take steps to fix them. You are
the next generation here.
Readings
The required readings are for your benefit and they encompass material
that you are required to understand. The extra reading is provided to
give you additional background. Please do the required readings before
each class.
Section Materials
- Week 1 - Section 1: Python review
- Basics and packages (numpy, pandas, matplotlib): [slides]
- Virtual environment: [slides][handout]
- Week 2 - Section 2: Linear algebra review I, expected value, notations
- Linear algebra basics in Jupyter Notebook: [HTML]
- Expected value, notations: [slides]
- Week 3 - Section 3: Linear algebra review II, probability, Bayesian optimal classification
- Notes on inner/outer product, projection, probability, etc.: [pdf]
- Week 4 - Section 4: margin of separability, principal component analysis overview
- PCA Jupyter Notebook [HTML]
- Week 5 - Section 5: Midterm review
- Week 6 - No section
- Week 7 - Section 6: GD and SGD clarifications
- Week 8 - Section 7: PyTorch Quick Overview
- PyTorch Introduction with Comparison to Tensorflow [slides]
- PyTorch Jupyter Notebook [HTML]
- Week 9 - Section 8: Neural Nets and PyTorch review
- PyTorch Neural Net (XOR) Jupyter Notebook [HTML]
Lecture Notes and Readings
- Week 1: [Jan 3] Introduction.
- [Jan 5] Decision Trees and Supervised Learning
- Week 2: [Jan 8] The Supervised Learning Problem Setting
- Lectures: [slides] [annotated slides]
- Reading:
- (same as last time, CIML: Ch. 1, **Make sure you read/understand
the "Math Review: Expected Values" Box on page 15**)
- [Overfitting]
- [The Central Limit Theorem]: understand the statement and how it
relates to (and quantifies the rate of convergence in) the law of
large numbers; a compact statement is sketched after this lecture's
readings.
- Extra Readings:
- [Train, Test, and Dev sets] (the terminology of dev and validation
sets is not standard).
- [Generalization error] Think of f_n on the wikipage as what the
algorithm returns with n samples.
- Other slides: the generalization and overfitting presentation is good.
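As a quick reminder, here is a compact statement of the theorem in the
standard i.i.d., finite-variance setting (a sketch only, not a substitute
for the reading):

```latex
% For i.i.d. X_1, \dots, X_n with mean \mu and variance \sigma^2 < \infty,
% the sample mean \bar{X}_n = \tfrac{1}{n}\sum_{i=1}^n X_i satisfies
\sqrt{n}\,\frac{\bar{X}_n - \mu}{\sigma} \xrightarrow{\;d\;} \mathcal{N}(0,1)
\quad \text{as } n \to \infty .
% So \bar{X}_n - \mu is typically of size O(\sigma/\sqrt{n}), which quantifies
% the rate at which the law of large numbers (\bar{X}_n \to \mu) takes effect.
```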
- [Jan 10] Limits of Learning and Inductive Bias
- [Jan 12] Geometry: Nearest Neighbors and K-means
- Lectures: [slides] [annotated slides]
- Reading:
- CIML: Ch. 3
- Murphy: k-means 11.4.2.5
- Extra Readings:
- Murphy: more k-means 11.4.2.6, 11.4.2.7
- Week 3: [Jan 17] The Perceptron Algorithm
- [Jan 19] The perceptron algorithm convergence proof; voting
- Week 4: [Jan 22] Unsupervised Learning
- [Jan 24] Unsupervised learning: principal components analysis
- [Jan 26] PCA (continued)
- Week 5: [Jan 29] Learning as Loss Minimization; Least Squares
- [Jan 31] Regularization and Optimization; Gradient Descent
- [Feb 2] Probabilistic Models; the Log Loss
- Week 6: [Feb 5] Optimization: Gradient Descent & Stochastic Gradient Descent
- [Feb 7] MIDTERM
- [Feb 9] Midterm review; GD/SGD + Practical Issues
- Week 7: [Feb 12] Guest lecture: John Thickstun; GD/SGD + Practical Issues
- [Feb 14] Probabilistic estimation: MLE and MAP
- [Feb 16] Multi-Class Classification
- Week 8: [Feb 21] Non-convexity: Feature mappings (kernels)
and neural networks
- Lectures Notes: [pdf]
- Reading:
- Extra Readings:
- [Feb 23] Neural Nets & Backpropagation
- Lectures Notes: [pdf]
- Reading:
- Bishop: [Bishop]
5.1, 5.3, 5.5
- Extra Readings:
- Week 9: [Feb 26] Auto-Differentiation, Computation Graphs,
and the Baur-Strassen Theorem
- Lectures Notes: [pdf]
- Reading:
- Bishop: [Bishop]
5.1, 5.3, 5.5
- Extra Readings:
- [Feb 28] Initialization/Weight symmetries, saddle points,
and non-convex optimization
- Lectures Notes: [pdf]
- Reading:
- A more modern backprop presentation [here]. This
also discusses "saturation".
- Extra Readings:
- There is a lot of material out there on how to initialize networks;
the standard schemes basically make sense from scaling considerations
(keep the variance of the activations roughly constant across layers).
See [Xavier-Initialization] and [here]; a small sketch follows this
list.
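To make "scaling considerations" concrete, here is a small NumPy sketch of
the Xavier/Glorot scaling (toy layer sizes and inputs; this is an
illustration, not the initializer used in lecture or in any particular
library):

```python
# Sketch of the scaling idea behind Xavier/Glorot initialization.
import numpy as np

def xavier_init(fan_in, fan_out, rng=np.random.default_rng(0)):
    """Draw weights with variance 2 / (fan_in + fan_out)."""
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_in, fan_out))

# With this scaling, the variance of a layer's pre-activations stays close
# to the variance of its inputs, so signals neither blow up nor vanish as
# depth grows.
x = np.random.default_rng(1).normal(size=(1000, 256))   # fake inputs
W = xavier_init(256, 256)
print(x.var(), (x @ W).var())                            # similar magnitudes
```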
- [Mar 2] Structured neural nets: Convolutions and
Convolutional Neural Nets (and maybe RNNs)
- Lectures Notes: [pdf]
- Reading:
- [Conv Nets]
- Also, if you are not familiar with convolutions, see [wiki]; a small
numerical sketch follows this lecture's readings.
- Extra Readings:
- Some representational issues [here]. Should
be taken with a grain of salt as what we can represent is not
necessarily what we can easily find with gradient descent.
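For a concrete picture of the convolution operation itself, here is a tiny
NumPy sketch (cross-correlation, as is conventional in conv nets; the toy
image and kernel are made up, and real layers also handle channels,
strides, and padding):

```python
# Toy 2-D "convolution" (cross-correlation) in plain NumPy.
import numpy as np

def conv2d(image, kernel):
    """Valid cross-correlation of a 2-D image with a 2-D kernel."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Each output entry is the sum of an elementwise product of the
            # kernel with the image patch under it.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
edge_kernel = np.array([[1.0, -1.0]])   # responds to horizontal changes
print(conv2d(image, edge_kernel))        # all -1: image rises by 1 per column
```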
- Week 10: [Mar 5] Probabilistic graphical models (and
structured models)
- topics: inference, Gaussian mixture models, topic mixture models, and hidden Markov models
- Lectures Notes: [pdf]
- Reading:
- Murphy: Mixture Models and Mixture of Gaussians 11.1,
11.2, 11.2.1.
- Murphy: HMMs 17.3
- Extra Readings:
- [Mar 7] The EM algorithm by example: The "topic" modeling problem
- Lectures Notes: [pdf]
- Reading:
- Murphy: EM (or Bishop Ch 9)
- Extra Readings:
- [Mar 9] Deep Blue, AlphaGo, and AI...