CSE 547/Stat 548, Spring 2018

Machine Learning for Big Data

Instructor: Sham Kakade

TAs: Adam Gustafson, Rahul Kidambi, Alon Milchgrub, Krishna Pillutla

Contact: cse547-instructors@cs.washington.edu

PLEASE COMMUNICATE TO THE INSTUCTOR AND TAs ONLY THROUGH THIS EMAIL (unless there is a reason for privacy in your email).

Discussion: Canvas discussion board

Class lectures: TTh 10:00-11:20am, Room: MOR 220

Office Hours:

Please double check the website before you arrive for location changes/cancellations.

Rahul Kidambi: Mondays - 4:30-5:30 pm at CSE 678.

Alon Milchgrub: Tuesdays at 12:30-1:30 pm at CSE 007.

Adam Gustafson: Wednesdays - 1:30 - 2:30 pm at CMU B023 (Stat Tutoring Center)

Krishna Pillutla: Thursdays, 4.30 - 5.30 pm. Location: CSE 678.

Prerequisites and About the Course

Prerequisites: To be successful in this course, students should have prior exposure to a graduate machine learning course, at the level of CSE 546. We also expect students to have a command of linear algebra, algorithms, statistics, probability, and programming (in python).

Data analysis methods in machine learning and statistics play a central role in industry and science. The growth of the Web and improvements in data collection technology in science have lead to a rapid increase in the magnitude and complexity of these analysis tasks. This growth is driving the need for scalable, parallel and online algorithms and models that can handle this "Big Data". This course will provide a broad foundation.

In particular, we focus on the challenges associated with datasets of large size and dimensionality, including settings where the dimensionality of the data is growing faster than the number of data points. We will also explore the computational foundations associated with performing these analyses in the context of parallel and cloud architectures.

Discussion Forum and Email Communication

IMPORTANT: All class announcements will be broadcasted using Canvas. Please send questions about homeworks, projects and, lectures to the Canvas discussion board . If you a question of personal matters, please email the instructors list: cse547-instructors@cs.washington.edu

Optional textbooks

Material in the following optional textbooks may be helpful:

Optional Textbook: Understanding Machine Learning: From Theory to Algorithms, Shai Shalev-Shwartz and Shai Ben-David.

Optional Textbook: Pattern Recognition and Machine Learning, Chris Bishop.

Optional Textbook: Machine Learning: A Probabilistic Perspective, Kevin Murphy.

Optional Textbook: The Elements of Statistical Learning: Data Mining, Inference, and Prediction Trevor Hastie, Robert Tibshirani, Jerome Friedman.

Optional textbook: Machine Learning , Tom Mitchell.

AWS Credits (and using another remote cluster)

IMPORTANT: Obtain AWS credits ASAP using this [link]. If you have technical problems, please post to the discussion board. If you have already used these credits, please contact the instructors at the course mailing list. YOU MUST TAKE CARE IN USING THESE AWS CREDITS OVER THE ENTIRE TERM, WHICH WILL BE USED IN YOUR PROJECT. If you go over, penalties will be applied.

If you do not plan to use AWS and have access to another remote cluster, you must let the instructors know this (e.g. on HW1). Your cluster resources must be such that you have to think about job scheduling and other parallelization issues.

Policies

ALL HOMEWORK MUST BE SUBMITTED, EVEN IF IT IS FOR 0 CREDIT, IN ORDER TO PASS THE CLASS. (Empty homeworks do not count.) There will be a poster session on Thurs, May 31, from 9-noon. YOU MUST BE PRESENT AT THE POSTER SESSION TO PASS THE CLASS. THESE POLICIES APPLY EVEN IF YOU ARE USING THE PASS/NO-PASS GRADING OPTION. Do not take the class if you are not able to fulfill these requirements.

Grades:

Grades will be based on four assignments (60%) and a course project (40%). In a small number of cases, grades may be adjusted after this breakdown, e.g. grades may increase for particularly meticulous HWs, course participation, or activity on discussion boards.

Homeworks:

ALL HOMEWORK MUST BE SUBMITTED, EVEN IF IT IS FOR 0 CREDIT, IN ORDER TO PASS THE CLASS. (Empty homeworks do not count.)

The entire HW must be submitted in one single typed pdf document (not handwritten). This document must contain all plots. Latex is preferable, though another comparable typesetting method is also acceptable.

Homework must be done individually: each student must hand in their own answers. In addition, each student must submit their own code in the programming part of the assignment (we may run your code). It is acceptable for students to discuss problems with each other; it is not acceptable for students to look at another students written answers. It is acceptable for students to discuss coding questions with others; it is not acceptable for students to look at another students code. You must also indicate on each homework with whom you collaborated with.

We expect the students not to copy, refer to, or seek out solutions in published material on the web or from other textbooks (or solutions from previous years or other courses) when preparing their answers. Students are certainly encouraged to read extra material for a deeper understanding. If you do happen to find an assignment's answer, it must be acknowledged clearly with an appropriate citation on the submitted solution.

HW LATE POLICY: Homeworks must be submitted by the posted due date. You are allowed up to 2 LATE DAYs for the homeworks throughout the entire quarter, which will automatically be deducted if your assignment is late. In particular, for any day in which an assignment is late by up to 24 hours, then one late day will be used (up to two late days). After two of the late days are used up, any assignment turned in late will incur a reduction of 33% in the final score, for each day (or part thereof) if it is late. For example, if an assignment is up to 24 hours late, it incurs a penalty of 33%. Else if it is up to 48 hours late, it incurs a penalty of 66%. Any longer, it will receive NO CREDIT.

RE-GRADING POLICY: All regrading requests must be submitted within one week after the grades are posted, even if the request is one where the instructors made an error of omission in grading (or an error in recording the score). The instructors will not accept regrade requests after this time period. This is to ensure we can address all concerns in a timely manner. All grading related requests must be submitted to the instructor mailing list.

Project Page.

See the project page link for details. Information about software can be found here.

You are expected to complete the final project for the class. This will provide you with an opportunity to apply the machine learning concepts you have learned. We will update the project requirements and due dates during the quarter. There will be a poster session on Thurs, May 31, from 9-noon. YOU MUST BE PRESENT AT THE POSTER SESSION TO PASS THE CLASS.

Readings

The required readings are for your benefit and they encompass material that you are required to understand. The extra reading is provided to give you additional background. Please do the required readings before each class.

Academic and Personal Integrity

The instructor expects (and believes) that each student will conduct himself or herself with integrity. While the TAs will follow the course and university policies with regards to grading and proctoring, it is ultimately up to you to conduct yourself with academic and personal integrity for a variety of good reasons (that go well beyond the class itself).

Diversity and Gender in STEM

While many academic disciplines have historically been dominated by one cross section of society, the study of and participation in STEM disciplines is a joy that the instructor hopes that everyone can pursue. The instructor encourages students to both be mindful of these issues, and, in good faith, try to take steps to fix them. You are the next generation here.

Recitation Materials

Recitation 1: PyTorch and AWS
- [slides]

Lecture Notes and Readings

Lecture 1: [Mar 27] Intro. Course overview. Review: Backprop (and Auto-Diff motivation).
- Lectures Notes: [pdf]
- Required reading:
  - A more modern backprop presentation [here]. This also discusses "saturation". The notation for a and z is different than in the lecture notes.
- Extra readings:
  - [Bishop] 5.1, 5.3, 5.5. I find Bishop's presentation to be one of the better ones.
  - [A better understanding of backprop]. What you gain from coding it up!(and a good read as well.)
  - [Multi-layer Perceptrons] A reasonable discussion of MLPs definition; also backprop pseudocode (the lack of a derivation of backprop makes the pseudocode harder to follow)

Lecture 2: [Mar 29] Auto-Differentiation and Computation Graphs.
- Auto-diff is a game changer in applied ML.
- Lectures Notes: [pdf]
- Required reading:
  - [ A modern survey of ML Libraries.] While this does not connect to the complexity results on auto-diff (or discuss when auto-diff provides correct results), it is a good survey.
- Extra Readings:
  - [Proof of the Baur-Strassen Thm.] An elementary proof as to why AutoDiff is really possible (ignore all the stuff about fields. We will cover the argument in class.) There are earlier results as well, though this one gives a clear model of computation).

Lecture 3: [Apr 3] Recitation: AWS, PyTorch
- It is your responsibility to get PyTorch installed.

Lecture 4: [Apr 5] The Time Complexity of AD (and the Bauer-Strassen Theorem)
- Lectures Notes: [pdf]
- Required reading:
  - [Automatic differentiation.]
- Extra reading:
  - [survey paper] All great ideas get reinvented many times!

Lecture 5: [Apr 10] Dynamic Computation Graphs; "Dimension Free" Optimization
- We have a black box for computing gradients, how do we use it?
- Lectures Notes: [pdf]
- Required reading: Gradient Descent
  - 9.1, 9.2 and 9.3 of Boyd, Stephen and Lieven Vandenburghe. Convex Optimization. Cambridge: Cambridge University Press, 2004.
- Extra reading:
  - Auto-Differentiating Linear Algebra.

Lecture 6: [Apr 12] "Dimension Free" Optimization, Stationary Points, and GD/SGD
- Lectures Notes: [pdf]
- Required reading:
  - Survey (and the paper) of various ML optimization tools in packages.
  - Blog post. Escaping (second order) saddle points.
- Extra reading:
  - The details: guarantees for GD converging to seconda order stationary points.
  - "On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima".
  - "Don't Decay the Learning Rate, Increase the Batch Size"

Lecture 7: [Apr 17] SGD, Momentum/Nesterov's acceleration, and adaptive gradient methods
- Lectures Notes: [pdf]
- Required reading:
  - Le Cun, Leon Bottou Yann. "Large Scale Online Learning." NIPS, 2004.
- Extra reading:
  - BFGS, and L-BFGS

Lecture 8: [Apr 19] Adaptive gradient methods and the Nearest-Neighbor problem
- Lectures Notes: [pdf]
- Required reading:
  - Momentum insights. This looks to be a good treatment (and not misleading like most treatments).
  - The Conjugate Gradient Algorithm. The one true (batch) algorithm. (L-BFGS is like CG but what one should use on most problems, if you can compute exact gradients)
  - The "K-nearest neighbor" problem
- Extra reading:
  - Adagrad and Adam.
  - Technically, momentum and acceleration only help due to minibatching, as shown here, along with a provable fix.
    That said, mini-batching is pretty helpful to do, and, with momentum, mini-batching is even more important!
  - Bubeck. Convex Optimization: Algorithms and Complexity.
    If you are interested in why "momentum" methods are "optimal" (in a precise sense). See Chapter 3 and 3.5. The whole monograph is excellent and worth reading.
  - Adaptive methods do not always help. Though they do help in some circumstances, quite a bit.

Lecture 9: Information theoretic metric learning [Apr 24]
- Guest lecture: Adam Gustafson
- Lectures Notes: [pdf]
- Required reading:
  - A nice algorithm for metric learning. It could be interesting to try on the project.

Lecture 10: Approximate neighbor finding and locality-sensitive hashing (LSH). Background on random projections.
- Lectures Notes: [pdf]
- Lectures notes:
- Required reading:
  - Approximate nearest neighbors by locality-sensitive hashing (LSH):
    survey paper.
  - Hash tables
- Extra reading:
  - The JL proof is pretty clean (and, if you interested in theoretical aspects of ML/probabilistic analysis, it is worthwhile to understand it). It is really based on the central limit theorem.
  - Voronoi diagrams

Lecture 11: LSH, continued
- Lectures Notes: [pdf]
- Required reading:
  - Johnson-Lindenstrauss. Understanding random projections is helpful.

Lecture 12: KD-trees, Ball trees, Cover trees
- Lectures Notes: [pdf]
- Required reading:
  - Johnson-Lindenstrauss. Understanding random projections is helpful. KD-trees tutorial
  - Randomized Partition Trees or LSH?
    Sinha, K. This is an interesting idea, and you coud try out a variant for your project. This is also worth a read.
- Extra reading:
  - Practical insights for nearest neighbor methods:
    Liu, Gray, Moore, and Yang.
  - Cover Trees: NN search in general distance functions (e.g. in arbitrary metric spaces)
    "Cover Trees for Nearest Neighbor."
  - Moore, Andrew W. "An introductory tutorial on kdtrees."
  - Ball Trees
    Wikipedia

Lecture 13: Minimizing sums of functions; variance reduction; parallelization
- Extra reading:

Lecture 14: Minimizing sums of functions.
- Required reading:
  - Stochastic dual coordinate ascent
    Shalev-Shwartz & Zhang. Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization.
- Extra reading:
  - Convex optimization overview:
    Bubeck. Convex Optimization: Algorithms and Complexity.

Lecture 15: Finite Sum optimizations; (tentative) Tradeoffs in large scale learning. (tentative)
- Lectures: [Faster methods for minimizing sums of loss functions; SVRG]
- Required reading:
  - SVRG
    Johnson & Zhang. Accelerating Stochastic Gradient Descent using Predictive Variance Reduction.
- Extra reading:
  - Non-convex SVRG:
    Stochastic Variance Reduction for Nonconvex Optimization.

Lecture 16: Averaging, Averaging, Averaging (and statistics)
- Plus: mini-batching, parameter mixing, hog-wild, ...
- Lecture notes: [SGD, Averaging: a one dimensinal example]
- Required reading:
  - Le Cun, Leon Bottou Yann. "Large Scale Online Learning." NIPS, 2004.
  - Some slides: [Averaging and Tradeoffs]
  - It works even in non-convex settings: tail averaging in an LSTM model.
- Extra reading:
  - SGD + iterate averaging is (asymptotically) statistically optimal:
    Polyak & Juditsky. Acceleration of Stochastic Approximation by Averaging.
  - A (non-asymptotic) and sharp analysis of this
    Characterizing the Optimality of SGD.

Lecture 17: Averaging; Bandits: Non-Adaptive and Adaptive Sampling
- Lecture notes: [bandits]
- Required reading:
  - Ch 2. Introduction to Multi-Armed Bandits.
  - [wiki page]
- Extra reading:
  - K-arm bandits and UCB
    Chapters 1 and 2
  - Survey paper for the (related) "best-arm" identification problem
    Best-arm Identification Algorithms.

Lecture 18: UCB; Bayesian methods: Gittins index and Thompson Sampling
- Lecture notes: [bandits]
- Required reading:
  - Thompson sampling
    Experimental comparisons
- Extra reading:
  - The Gittins Index: A (beautiful) Bayes optimal algorithm
    A very short proof.
  - Relations between Thompson sampling and UCB
    Thompson sampling and Posterior sampling.

Lecture 19: Bayesian methods: Thompson Sampling; Linear Bandits and Contextual Bandits
- Lecture notes: [bandits]
- Required reading:
  - Bandits Tutorial website.
    look at linear bandits: slides for a large set of actions.
  - A "generic" (and sub-optimal) contextual bandit algo
    Epoch-greedy algorithm
- Extra reading:
  - Gaussian Process Bandits (extending lin UCB to Kernels and high dimensions)
    Algorithm and Analysis.
  - Linear bandits and Lin UCB
    Proof of correctness
  - Contextual linear bandits (Lin UCB algorithm)
    Same idea. A modified analysis.
  - Thompson sampling for contextual linear bandits
    Proof of correctness.