Summary of Reinforcement Learning
General problem is learning to act optimally based only on rewards accumulated from repeated trials
Fundamental question is whether to learn the transition model explicitly (model-based methods) or to learn values or action-values directly (model-free methods such as Q-learning)
Most techniques are based on the usual MDP formulation: full observability, infinite horizon, maximization of expected discounted total reward
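To make the MDP formulation concrete, here is a minimal value-iteration sketch on a hypothetical two-state, two-action MDP (the particular states, transition probabilities, and rewards are illustrative inventions, not taken from the notes); it computes the optimal discounted infinite-horizon values via the Bellman optimality update.

```python
GAMMA = 0.9  # discount factor

# Hypothetical toy MDP: P[s][a] = list of (probability, next_state, reward)
P = {
    0: {0: [(1.0, 0, 0.0)],                   # "stay": no reward
        1: [(0.8, 1, 5.0), (0.2, 0, 0.0)]},   # "move": usually reach state 1
    1: {0: [(1.0, 1, 1.0)],                   # "stay": small steady reward
        1: [(1.0, 0, 0.0)]},                  # "move": back to state 0
}

def value_iteration(P, gamma, tol=1e-8):
    """Iterate V(s) <- max_a sum_{s'} p(s'|s,a) [r + gamma V(s')] to a fixed point."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v_new = max(sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                        for outcomes in P[s].values())
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V

V = value_iteration(P, GAMMA)
print(V)
```

A reinforcement learner does not know `P` in advance; this sketch only shows the objective that the learning methods below are converging toward.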
Most techniques guarantee convergence provided the state space is “fully explored” (every state-action pair is tried infinitely often in the limit)
- if full exploration cannot be guaranteed, i.e., if the agent is to be “deployed” before training is complete, there is some advantage to exploration: deliberately acting suboptimally in order to learn more
- the tradeoff between the expected value of exploration and the expected value of acting optimally on current knowledge can be represented formally (though only weakly)
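The exploration tradeoff can be sketched with epsilon-greedy action selection on a hypothetical two-armed bandit (the payoffs and the epsilon value are illustrative assumptions): a purely greedy learner can lock onto the first arm it tries and never discover the better one, while a small exploration probability keeps estimates of both arms accurate.

```python
import random

random.seed(0)

# Hypothetical bandit: arm 0 pays 1.0 always; arm 1 pays 2.5 half the
# time (mean 1.25, the better arm). Illustrative numbers only.
def pull(arm):
    return 1.0 if arm == 0 else (2.5 if random.random() < 0.5 else 0.0)

def run(epsilon, trials=5000, alpha=0.1):
    Q = [0.0, 0.0]  # estimated value of each arm
    for _ in range(trials):
        if random.random() < epsilon:
            a = random.randrange(2)        # explore: act suboptimally to learn
        else:
            a = 0 if Q[0] >= Q[1] else 1   # exploit current estimates
        Q[a] += alpha * (pull(a) - Q[a])   # incremental value update
    return Q

Q_greedy = run(epsilon=0.0)   # pure exploitation: arm 1 is never sampled
Q_explore = run(epsilon=0.1)  # exploration keeps both estimates informed
print(Q_greedy, Q_explore)
```

The greedy run leaves `Q[1]` at its initial value forever, illustrating why convergence guarantees require that the state (here, action) space be fully explored.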