Course Description

Weather forecasting, predicting stock market trends, and deciding which ads to present on a web page are examples of sequential prediction problems. Online learning is a powerful and popular way of dealing with such problems. An online learning algorithm observes a stream of examples and makes a prediction for each element in the stream. The algorithm receives immediate feedback about each prediction and uses this feedback to improve its accuracy on subsequent predictions. In contrast to statistical machine learning, online learning algorithms don’t make stochastic assumptions about the data they observe, and even handle situations where the data is generated by a malicious adversary. There has been a recent surge of scientific research on online learning algorithms, largely due to their broad applicability to web-scale prediction problems.

This course, Online Learning (CSE599s), will provide a rigorous introduction to state-of-the-art online learning algorithms with an emphasis on algorithm design and theoretical analysis. Topics include: background in convex analysis, regret minimization, online optimization, expert problems, online learning with partial feedback and explore/exploit tradeoffs (a.k.a. bandit problems) and a selection of advanced topics.

Planned Topics

After motivating and defining the online learning problem, we will present different families of online learning algorithms. All of the algorithms discussed in this course are both practical and come with theoretical worst-case guarantees.

Follow the regularized leader The follow the regularized leader (FTRL) algorithm family is in some ways the most fundamental algorithm in online learning. We will introduce and analyze a simple instance of the algorithm using a general-purpose and very powerful lemma. Some necessary background from convex analysis will also be introduced.

Online gradient descent Online gradient descent algorithms are close relatives of stochastic gradient descent. We will present a regret bound analysis for linear functions, and show how that same bound applies for free to general convex function. This immediately gives us algorithms for logistic regression and linear support vector machines, for example.

Adaptive per-coordiante learning rates This simple modification to gradient descent can produce dramatically better performance on large datasets. We illustrate the problem with global learning rates via one-dimensional examples, and then extend the analysis from the previous lecture to provide tighter bounds when adaptive per-coordinate learning rates are used.

Follow the proximally-regularized leader We show that follow-the-regularized leader and online gradient descent are closely related. However, their behavior differs in some important respects. We show how L1 regularization for producing sparse models (e.g., the Lasso algorithm) can be implemented much more effectively in the FTRL context.

Alternative notions of regret So far, we have analyzed algorithms by comparing their performance to a fixed best-in-hindsight comparator. However, for some applications other notions of regret are more appropriate, for example, when the "true" model of the world is changing.

Mirror descent Mirror descent generalizes online gradient descent, and allows us to develop algorithms for many other problems. We consider the best-experts problem, where on each round we must choose an "expert" from a set whose advice we follow for that round.

Bandit algorithms Algorithms for the k-armed bandit problem extend the best-experts problem with partial feedback. On each round you only find out about the reward obtained by following the expert you chose; you have no way of knowing directly whether or not some other expert would have been better. As an example, consider showing ads to a user on a webpage --- we model the ads as "experts" and have no way of knowing whether the user would have clicked on an ad if we didn't choose to show it.

High probability bounds Most of the regret bounds for experts and bandit algorithms to this point only hold in expectation; the algorithms use randomization, and if we get unlucky then we can do worse than the bound suggests we should. However, with some modification to the algorithms, we can design algorithms that guarantee good performance with high probability.

Stochastic bandits The bandit algorithms introduced so far make no statistical assumption about the rewards of the experts/actions. Can we get better bounds if we know rewards follow some (unknown) distributions?

Contextual bandits Basic bandit algorithms must choose an action without the aid of any additional context. Contextual bandit algorithms, however, can leverage additional information about the current state of the world each time a decision is made. For example, when placing ads on a webpage, perhaps we should make that decision differently depending on the content of the webpage.

Advanced topics Advanced topics may include an examination of connections between game theory and online learning; the application of online learning algorithms to batch machine learning problems; expert and bandit problems with combinatorial structure; and selective sampling and label efficiency.


This course assumes background in basic probability theory and linear algebra. Key mathematical concepts will be reviewed before they are used, but a certain level of mathematical maturity is expected. The course is theoretical and does not involve any programming.


The course can be taken either for a grade or for credit/no-credit. Grades will be given based on homework assignments, lecture notes, and class participation. 


Apr 18, 2012

Updated homework 2 with minor typos fixed in 3(d) and 3(e).

Apr 9, 2012

Homework exercise 2 posted.

Apr 5, 2012

Lecture 3 notes posted.

Apr 2, 2012

The scribe sign-up sheet is available. Email Brendan (mcmahan[at] if you have issues accessing.

Mar 29, 2012

Homework exercise 1 published here

Mar 13, 2012

Added course details.