Summary of Reinforcement Learning
General problem is learning to act optimally based only on rewards accumulated from repeated trials
Fundamental question is whether to learn the transition model explicitly (model-based methods) or to learn values or action-values directly (model-free methods such as Q-learning)
Most techniques are based on the usual MDP formulation: full observability, infinite horizon, maximization of expected discounted total reward
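To make the MDP formulation concrete, here is a minimal value-iteration sketch on a hypothetical two-state, two-action MDP (the particular states, transition probabilities, and rewards are illustrative inventions, not taken from the notes); it computes the optimal discounted infinite-horizon values via the Bellman optimality update.

```python
GAMMA = 0.9  # discount factor

# Hypothetical toy MDP: P[s][a] = list of (probability, next_state, reward)
P = {
    0: {0: [(1.0, 0, 0.0)],                   # "stay": no reward
        1: [(0.8, 1, 5.0), (0.2, 0, 0.0)]},   # "move": usually reach state 1
    1: {0: [(1.0, 1, 1.0)],                   # "stay": small steady reward
        1: [(1.0, 0, 0.0)]},                  # "move": back to state 0
}

def value_iteration(P, gamma, tol=1e-8):
    """Iterate V(s) <- max_a sum_{s'} p(s'|s,a) [r + gamma V(s')] to a fixed point."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v_new = max(sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                        for outcomes in P[s].values())
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V

V = value_iteration(P, GAMMA)
print(V)
```

A reinforcement learner does not know `P` in advance; this sketch only shows the objective that the learning methods below are converging toward.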
Most techniques guarantee convergence provided the state space is “fully explored” (every state-action pair is tried infinitely often in the limit)
- if full exploration cannot be guaranteed, i.e., if the agent is to be “deployed” before training is complete, there is some advantage to exploration: deliberately acting suboptimally in order to learn more
- the tradeoff between the expected value of exploration and the expected value of acting optimally on current knowledge can be represented formally (though only weakly)
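The exploration tradeoff can be sketched with epsilon-greedy action selection on a hypothetical two-armed bandit (the payoffs and the epsilon value are illustrative assumptions): a purely greedy learner can lock onto the first arm it tries and never discover the better one, while a small exploration probability keeps estimates of both arms accurate.

```python
import random

random.seed(0)

# Hypothetical bandit: arm 0 pays 1.0 always; arm 1 pays 2.5 half the
# time (mean 1.25, the better arm). Illustrative numbers only.
def pull(arm):
    return 1.0 if arm == 0 else (2.5 if random.random() < 0.5 else 0.0)

def run(epsilon, trials=5000, alpha=0.1):
    Q = [0.0, 0.0]  # estimated value of each arm
    for _ in range(trials):
        if random.random() < epsilon:
            a = random.randrange(2)        # explore: act suboptimally to learn
        else:
            a = 0 if Q[0] >= Q[1] else 1   # exploit current estimates
        Q[a] += alpha * (pull(a) - Q[a])   # incremental value update
    return Q

Q_greedy = run(epsilon=0.0)   # pure exploitation: arm 1 is never sampled
Q_explore = run(epsilon=0.1)  # exploration keeps both estimates informed
print(Q_greedy, Q_explore)
```

The greedy run leaves `Q[1]` at its initial value forever, illustrating why convergence guarantees require that the state (here, action) space be fully explored.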