CSE 599, Spring 2019

Reinforcement Learning and Bandits

Quick links:
Notes Lectures Homework Office Hours Projects

Instructors: Alekh Agarwal Sham Kakade

Class lectures: MW 1:30-2:50pm, Room: CSE2 G04

Contact: rl_instructors@cs.washington.edu

Please communicate to the instructors and TA only through this account. Emails not sent to this list, with regards to the course, will not be responded to in a timely manner.

Discussion: Piazza discussion board

Please send all questions about homeworks, lectures, and policies to the Piazza discussion board.

Announcements:

Please make sure you monitor announcements from Piazza. It is important for you to make sure you get these announcements in a timely manner.

Office Hours:

Please double check the website before you arrive for location changes/cancellations.

Krishna Pillutla: Wednesdays, 3-4 pm, CSE2 152

About the Course and Prerequisites

This course focuses on several theoretical foundations of sequential decision making. The course is concerned with the general problem of reinforcement learning and sequential decision making, going from algorithms for small-state Markov decision processes to methods that handle large state spaces. We also cover sequential decision making in the multi-armed bandit framework and proceed to the more general contextual bandit problem. We focus on the main algorithms, performance guarantees, lower bounds, and applications.

Prerequisites: Students entering the class should have a commanding grasp of machine learning, probability and statistics, optimization, and linear algebra. Students who are weak in these areas should take the course at a later date.

Grading Policies

Grades will be based on three assignments (50%) and a course project (50%). ALL HOMEWORK MUST BE SUBMITTED, EVEN IF IT IS FOR 0 CREDIT, IN ORDER TO PASS THE CLASS. (Empty homeworks do not count.) There will be a poster session on ~~Thursday, June 6th from 10-noon~~ Wednesday, June 5th from 1.30-3.30 pm. YOU MUST BE PRESENT AT THE POSTER SESSION TO PASS THE CLASS.

Project:

See the project page link for details. You are expected to complete a project for the class. This will provide you with an opportunity to go in depth in one reinforcement learning topic you have learned about.

Homeworks:

All homework will be mathematical in nature, focussing on the theory of RL and bandits; there will not be a programming component. The entire HW must be submitted in one single typed pdf document (not handwritten). There will be two homework sets and an optional homework 3, in addition to a homework 0. HW0 is MANDATORY to pass to satisfactory level; it is to check your knowledge of the prerequisites in probability, statistics, and linear algebra.

Homework Grading:

The cumulative homework score will be obtained by the following weighting scheme: 20%, 40%, 40% for the mandatory components of HW0-HW2 respectively. All optional and extra credit components, including the optional HW3 will be weighed in to an additional 30% (i.e., the maximum cumulative grade on the HW is 130/100). If the cumulative optional work for extra credit is equivalent to the work of one well-done homework, it will be assigned the full 30%. Possible exceptions apply, based on instructors' judgement, to exceptional HW or extra credit which may be assigned a higher grade.

Homework Rules:

Homework must be done individually: each student must hand in their own answers. It is acceptable for students to discuss problems with each other; it is not acceptable for students to look at another students written answers. You must also indicate on each homework with whom you collaborated with.

HW Late Policy:

Homeworks must be submitted by the posted due date. You are allowed up to 4 total LATE DAYs for the homeworks throughout the entire quarter and up to 2 late days PER HOMEWORK assignment; these will be automatically deducted if your assignment is late. For example, any day in which an assignment is late by up to 24 hours, then one late day will be used (up to two late days). After your late days are used up, late penalties will be applied: any assignment turned in late will incur a reduction in score by 33% for each late day, so if an assignment is up to 24 hours late, it incurs a penalty of 33%. Else if it is up to 48 hours late, it incurs a penalty of 66%. And any longer, it will receive no credit. We will track all your late days and any deductions will be applied in computing the final grades. If you are unable to turn in HWs on time, aside from permitted days, then do not enroll in the course.

Re-grading policy:

All re-grading requests (for the homework and the midterm) must be submitted (on Gradescope) within seven days after any grades are released. This policy is to ensure that we can address any concerns in a timely and fair manner. The focus of office hours and in person discussions are solely limited to asking knowledge related questions.

Related courses

The course will follow lectures notes. Helpful books are:

Markov Decision Processes: Discrete Stochastic Dynamic Programming, by Martin Puterman.
Neuro-Dynamic Programming, by Dimitri Bertsekas and John Tsitsiklis.
Dynamic Programming and Optimal Control, Vol. I and Vol II, by D. P. Bertsekas.
Algorithms of Reinforcement Learning, by Csaba Szepesvari. (pdf available online)
Reinforcement Learning: An Introduction, by Rich Sutton and Andrew Barto. (draft available online)

Here are some related courses, with relevant material available online:

Nan Jiang, Statistical Reinforcement Learning
Shipra Agrawal, Reinforcement Learning
Alekh Agarwal and Alex Slivkins, Bandits and Reinforcement Learning
Byron Boots, Adaptive Control and Reinforcement Learning
Sergey Levine, Deep RL

Honor Code

The instructors expect (and believes) that each student will conduct himself or herself with academic (and personal) integrity. While the TA will follow the course and university policies with regards to grading (see CSE conduct policy), it is ultimately up to you to conduct yourself with academic and personal integrity for a number of reasons that go beyond the scope of just this class.

Diversity and Gender in STEM

While many academic disciplines have historically been dominated by one cross section of society, the study of and participation in STEM disciplines is a joy that the instructor hopes that everyone can pursue, regardless of their socio-economic background, race, gender, etc. The instructor encourages students to both be mindful of these issues, and, in good faith, try to take steps to fix them. You are the next generation here.

Homework

Homeworks should be submitted via Gradescope. There will be a log of any changes to the HW (e.g. typos fixed, clarifications, etc) in the discussion board.

Homework 0
- Due Mon, Apr 8th.
- hw0

Homework 1
- Due Thurs, Apr 18th.
- hw1

Homework 2
- Due Friday, May 24th.
- hw2

Homework 3 (Optional)
- Due Saturday, June 15th.
- hw3

Lecture Notes

RL Notes: [pdf]

Contextual Bandits Notes: [pdf]

Off Policy Evaluation and Learning: [pdf]

Optimal Algorithms for i.i.d. Contextual Bandits [pdf]

Notes will be periodically updated, so please get the latest copy.
Please let us know of typos.

Lectures and Readings

Week 1: [Apr 1] Markov Decision Processes.
- logistics; dynamic programming; policy/value iteration

[Apr 5] Generative models and minimax sample complexity
- Reading:
  - paper
- Extra Readings:
  - related idea with variance reduction

Week 2: [Apr 8] Generative models and minimax sample complexity
- Extra Readings:
  - related idea with variance reduction

[Apr 10] R-MAX and Model based RL
- Lecture notes:
- Reading:
  - paper
- Extra Readings:
  - Chapter 8

Week 3: [Apr 15] R-MAX and Model based RL
- Lecture notes:

[Apr 17] Policy gradient methods + variants (natural gradient)
- Lecture notes:

Week 4: [Apr 22] Policy gradients/Actor-critic/Variance reduction
- Lecture notes:

[Apr 24] Function approximation. / Q-value iteration.
- Lecture notes:

Week 5: [Apr 29] Off-policy methods/ Q-learning
- Lecture notes:

[May 1] CPI and Dagger
- Lecture notes:

Week 6: [May 6] Intro to the contextual bandit setting
- Topics: Setup, UCB, tau-greedy algorithm
- Lecture notes [pdf]

[May 8] Contextual Bandits
- Topics: Off-policy evaluation and learning
- Lecture Notes [pdf]

Week 7: [May 13] Efficient and optimal algorithms for i.i.d. contextual bandits (Part 1)
- Lecture notes [pdf]
- References: paper 1, paper 2

[May 15] Efficient and optimal algorithms for i.i.d. contextual bandits (Part 2)
- Lecture notes [pdf]
- References: paper 1, paper 2

Week 8: [May 20] Lin Bandits + Thompson sampling
- Lecture notes:

[May 22] Exploration in Contextual RL
- Lecture notes: TBA
- References: paper1 paper2

Week 9: [May 27] holiday

[May 29] Exploration with rich observations. Bellman rank.

Week 10: [June 3] Apprenticeship Learning, Imitation Learning, and Behavioral Cloning

[Jun 5] Poster Session
- Lecture notes: