CSE 599, Spring 2019
Reinforcement Learning and Bandits
Class lectures: MW 1:30-2:50pm, Room: CSE2 G04
Please communicate to the instructors and TA only through this
account. Emails not sent to this list, with regards to the course,
will not be responded to in a timely manner.
Please send all questions about
homeworks, lectures, and policies to the Piazza discussion board.
Please make sure you monitor announcements from
Piazza. It is important for you to make sure you get these announcements in a timely manner.
***Please double check the website before
you arrive for location changes/cancellations.***
Krishna Pillutla: Wednesdays, 3-4 pm, CSE2 152
About the Course and Prerequisites
This course focuses on
several theoretical foundations of sequential decision making. The course
is concerned with the general problem of reinforcement learning and
sequential decision making, going from algorithms for small-state
Markov decision processes to methods that handle large state
spaces. We also cover sequential decision making in the multi-armed
bandit framework and proceed to the more general contextual bandit
problem. We focus on the main algorithms, performance guarantees,
lower bounds, and applications.
Prerequisites: Students entering the class should have a
commanding grasp of machine learning, probability and statistics,
optimization, and linear algebra. Students who are weak in these areas
should take the course at a later date.
Grades will be based on three assignments (50%) and a course
project (50%). ALL HOMEWORK MUST BE SUBMITTED, EVEN IF IT IS FOR 0 CREDIT, IN ORDER
TO PASS THE CLASS. (Empty homeworks do not count.) There will be a
poster session on
Thursday, June 6th from 10-noon Wednesday, June 5th from 1.30-3.30 pm. YOU MUST BE PRESENT AT THE POSTER SESSION TO PASS THE
See the project page link for
You are expected to complete a project for the class. This will
provide you with an opportunity to go in depth in one reinforcement
learning topic you have learned about.
Homeworks: All homework will be mathematical in nature,
focussing on the theory of RL and bandits; there will not be a
programming component. The entire HW must be submitted in one single
typed pdf document (not handwritten). There will be two homework
sets and an optional homework 3, in addition to a homework 0.
HW0 is MANDATORY to pass to satisfactory
level; it is to check your knowledge of the prerequisites in
probability, statistics, and linear algebra.
The cumulative homework score will be obtained by the following weighting scheme: 20%, 40%, 40% for the mandatory components of HW0-HW2 respectively.
All optional and extra credit components, including the optional HW3 will be weighed in to an additional 30% (i.e., the maximum cumulative grade on the HW is 130/100). If the cumulative optional work for extra credit is equivalent to the work of one well-done homework, it will be assigned the full 30%.
Possible exceptions apply, based on instructors' judgement, to exceptional HW or extra credit which may be assigned a higher grade.
Homework must be done individually: each
student must hand in their own answers. It is
acceptable for students to discuss problems with each other; it is not
acceptable for students to look at another students written answers.
You must also indicate on each homework with whom you
HW Late Policy:
Homeworks must be submitted by the posted due
date. You are allowed up to 4 total LATE DAYs for the homeworks
throughout the entire quarter and up to 2 late days PER HOMEWORK assignment;
these will be automatically deducted if your assignment is late. For
example, any day in which an assignment is late by up to 24 hours,
then one late day will be used (up to two late days). After your late
days are used up, late penalties will be applied: any assignment
turned in late will incur a reduction in score by 33% for each late
day, so if an assignment is up to 24 hours late, it incurs a penalty
of 33%. Else if it is up to 48 hours late, it incurs a penalty of
66%. And any longer, it will receive no credit.
We will track all your late days
and any deductions will be applied in computing the final grades.
If you are unable to turn in HWs on time, aside from permitted days, then
do not enroll in the course.
All re-grading requests (for the
homework and the midterm) must be submitted (on Gradescope) within seven days after
any grades are released. This policy is to ensure that we can address
any concerns in a timely and fair manner. The focus of office hours
and in person discussions are solely limited to asking knowledge
The course will follow lectures notes. Helpful books are:
- Markov Decision Processes: Discrete Stochastic Dynamic Programming, by Martin Puterman.
- Neuro-Dynamic Programming, by Dimitri Bertsekas and John
- Dynamic Programming and Optimal Control, Vol. I and
Vol II, by D. P. Bertsekas.
- Algorithms of Reinforcement Learning, by Csaba Szepesvari. (pdf available online)
- Reinforcement Learning: An Introduction, by Rich Sutton and Andrew Barto. (draft available online)
Here are some related courses, with relevant material available online:
The instructors expect (and believes) that each student will conduct
himself or herself with academic (and personal) integrity. While the TA will follow the
course and university policies with regards to grading (see CSE
conduct policy), it is ultimately up to you to conduct yourself with
academic and personal integrity for a number of reasons that
go beyond the scope of just this class.
Diversity and Gender in STEM
While many academic disciplines have historically been dominated
by one cross section of society, the study of and participation in
STEM disciplines is a joy that the instructor hopes that everyone can
pursue, regardless of their socio-economic background, race, gender,
etc. The instructor encourages students to both be mindful of these
issues, and, in good faith, try to take steps to fix them. You are the
next generation here.
Homeworks should be submitted via Gradescope. There will be a
log of any changes to the HW (e.g. typos fixed,
clarifications, etc) in the discussion board.
- Homework 2
- Due Friday, May 24th.
- Homework 3 (Optional)
- Due Saturday, June 15th.
RL Notes: [pdf]
Contextual Bandits Notes: [pdf]
Off Policy Evaluation and Learning: [pdf]
Optimal Algorithms for i.i.d. Contextual Bandits [pdf]
- Notes will be periodically updated, so please get the
- Please let us know of typos.
- Week 1: [Apr 1] Markov Decision Processes.
- logistics; dynamic programming; policy/value iteration
- [Apr 5] Generative models and minimax sample complexity
- Week 2: [Apr 8] Generative models and minimax sample complexity
- [Apr 10] R-MAX and Model based RL
- Lecture notes:
- Extra Readings:
- Week 3: [Apr 15] R-MAX and Model based RL
- [Apr 17] Policy gradient methods + variants (natural gradient)
- Week 4: [Apr 22] Policy gradients/Actor-critic/Variance reduction
- [Apr 24] Function approximation. / Q-value iteration.
- Week 5: [Apr 29] Off-policy methods/ Q-learning
- [May 1] CPI and Dagger
- Week 6: [May 6] Intro to the contextual bandit setting
- [May 8] Contextual Bandits
- Week 7: [May 13] Efficient and optimal algorithms for i.i.d. contextual bandits (Part 1)
- [May 15] Efficient and optimal algorithms for i.i.d. contextual bandits (Part 2)
- Week 8: [May 20] Lin Bandits + Thompson sampling
- [May 22] Exploration in Contextual RL
- Week 9: [May 27] holiday
- [May 29] Exploration with rich observations. Bellman rank.
- Week 10: [June 3] Apprenticeship Learning, Imitation Learning, and Behavioral Cloning
- [Jun 5] Poster Session