CSE 599, Reinforcement Learning and Bandits, Spring 2019

Your Course Project

Important Dates

Thu, Apr 25: Project Proposal
Thu, May 9: Project Milestone
Wed, June 5th, 1.30pm-3.30pm : Poster Session
Thu, June 13: Project Report

Project Guidelines

Projects should be done in groups of 2-3, with the intention of exploring some concept from the class in greater detail. If you cannot find team members for a group, send an email to the instructors. The projects can be of many different forms. They can consist of a reading project that synthesizes several papers on a topic. In several cases, there will be an opportunity to further include additional empirical or theoretical investigations along with a discussion of these papers. We list ideas for a few project topics on this page. Of course, it is okay to do projects on topics not listed here as well.

Project Ideas

Exploration in Tabular MDPs: The RMAX paper discussed in the lectures has been improved upon in its dependence on various problem parameters, with nearly matching upper and lower bounds in some cases. For a reading project, synthesize up to 3 or 4 such improvements. Some suggestions of readings are Azar et al., Jin et al., Zanette and Brunskill and references therein. Additional directions for a theoretical or empirical project:
- Theory: Pick a paper above for the episodic setting and extend it to the discounted, infinite horizon case.
- Empirical: Take a standard exploration benchmark like the chain MDP. Evaluate each algorithm as the horizon, state space and reward sparsity are varied.
RL with continuous actions: For a reading project, study RL techniques that extend to continuous action spaces. An example comes from control-theoretic models such as Linear-Quadratic Regulators. A good starting point might be the tutorial from Ben Recht, recent results on policy gradients method such as this and this and a comparison with model-based approaches here. Additional directions for a theoretical or empirical project:
- Theory: Most LQR results focus on spherical covariances for the Gaussian noise. What happens when the covariance is non-spherical? Does exploration play a more important role in getting the right dimension dependence?
- Empirical: Evaluate the robustness of various LQR algorithms to model mismatch, when the actual MDP does not obey the LQR model. A common generalization is piecewise LQR. Which models can handle the piecewise structure empirically?
Off-policy Policy evaluation in RL: How to learn a good policy when the agent cannot take actions in the environment, but only has access to a dataset of recorded experiences from executing a fixed behavior policy? Or even simpler, how to estimate the expected reward a policy will attain without actually executing it? For a reading project, start with the paper on eligibility traces, along with several improvements such as doubly robust, MAGIC, MRDR and the recent paper on breaking the curse of horizon. For an empirical project, consider evaluating these estimators in a common set of environments while varying the degree of similarity between behavior and evaluation policies.
Learning from experts: Often learning by trial-and-error as in RL can be expensive and/or unsafe. In such situations, a common alternative is to learn a good agent policy by trying to imitate a human expert. Several works have been done in this setting, including CPI, DAgger, SEARN, AggreVaTe and its deep version, as well as more programmatic aspects. For a reading project, prepare a summary of your picks amongst these and other related papers. You can also consider other closely related settings such as inverse reinforcement learning or apprenticeship learning. How robust are various techniques to a suboptimal expert? Can you analyze this in theory or empirically?
Hierarchical RL: When the action space is complex, and/or the planning horizon is long, it is often considered helpful to impose a hierarchy on the action space. This has been formalized in theory using frameworks such as options. There is some theory on the benefits of doing this here and here. For a reading project, survey these or other papers which focus on techniques for hierarchical RL. Additional ideas for a theoretical or empirical project:
- Theory: While there are some works on learning with options, little is known in theory about the benefits in terms of sample complexity when the options themselves have to be learned. Can you try to analyze this in a multitask RL setting similar to Brunskill and Li?
- Empirical: Can you empirically study the benefits of known versus learned options? For instance, in a 4-room like environment, how fast does the agent learn when knowing options to get to the door versus learning them through an option learning method such as option-critic.
Exploration with rich observations: Most RL theory assumes a small state space. How to scale-up exploration to problems where the states are perceived as high-dimensional sensory readings? Some recent works make progress in this direction, such as the Bellman rank paper, its model-based analogue and a specialization to block MDPs, as well as Thompson Sampling based approaches. For a reading project on theory, study these or other related papers on the topic. Additional ideas for a theoretical or empirical project:
- Theory: All the works above are in a PAC setting, and wasteful in obvious ways sample-wise. Pick one, and turn it into a no-regret algorithm.
- Empirical: There is a proven computational hardness for the OLIVE algorithm here, but we can still attempt a solution with gradient descent. How does it compare with the PCID and Thompson Sampling approaches?
Minimax bounds for off-policy evaluation in Contextual Bandits: The SWITCH paper presents a lower bound for off-policy evaluation in a minimax sense for the contextual bandit setting, and then presents a more adaptive estimator depending on the ease of modeling the reward function. Can you refine the lower bounds in the paper to take advantage of known reward structures? For instance, if the expected reward is known to be a linear function of the context, what is the correct minimax rate?
Off-policy policy optimization in Contextual Bandits: When the goal is not just evaluation of a single policy, but finding a good policy from a class, how to best use logged data? Existing works like POEM show the benefits of variance regularization in specific parametric settings. Can you extend the variance regularization techniques from the EXP4.P to do a POEM style result for general policy sets? Empirically, how does the choice of various off-policy learning methods effect a contextual bandit algorithm (such as to train the greedy policy in epsilon-greedy)?
Efficient realizable contextual bandit learning Many contextual bandit algorithms (such as LinUCB) assume that the reward has some nice structure given the context. Beyond the linear setting, general structures have been analyzed here and here. However, the algorithms retain a significant computational overhead empirically compared with the best agnostic algorithms. Starting with the Regressor Elimination algorithm, can you come up with an efficient version borrowing ideas from the taming the monster paper? Can you ensure no forced uniform exploration like in the RegCB paper?

Grading

The grading of the final project is split amongst two deliverables:

A project poster presentation (30% of the grade).
A final report (70% of the grade).

Your final report will be evaluated by the following criteria:

Merit: Do you have sound reasoning for the approach? Is the question well motivated and are you taking a justifiably simple approach or, if you are choosing a more complicated method, do you have sound reasoning for doing this?
Technical depth: How technically challenging was what you did? Did you use a package or write your own code? It is fine if you use a package, though this means other aspects of your project must be more ambitious.
Presentation: How well did you explain what you did, your results, and interpret the outcomes? Did you use good graphs and visualizations? How clear was the writing? Did you justify your approach?

Poster Session

We will hold a poster session in TBD. Each team will be given a stand to present a poster summarizing the project motivation, methodology, and results. The poster session will give you a chance to show off the hard work you put into your project, and to learn about the projects of your peers.

Here are some details on the poster format:

We will provide poster boards that are 32x40.
Both one large poster (recommended) and several pinned pages are OK. The TAs can help with printing your posters, provided you give them enough notice.

You must submit your poster on Canvas.

Project Report

Your final submission will be a project report on Gradescope. Your write up should be 8 pages maximum in NeurIPS format, not including references. You must use the NIPS LaTex format. You should describe the task you solved, your approach, the algorithms, the results, and the conclusions of your analysis. Note that, as with any conference, the page limits are strict. Papers over the limit will not be considered.