MDP Model (continued)
The agent has a value function that measures how good its course of action is.
- The value function might depend arbitrarily on the entire history: v({s0, a0, s1, a1, ...}) ∈ ℝ
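As a rough sketch (the trajectory type and the particular function are my assumptions, not notation from the text), a history-dependent value function is just an arbitrary real-valued map over trajectories:

```python
from typing import Sequence, Tuple

State, Action = str, str
History = Sequence[Tuple[State, Action]]  # {s0, a0, s1, a1, ...}

def v(history: History) -> float:
    """An arbitrary real-valued function of the entire trajectory;
    here, purely for illustration, the number of visits to "s1"."""
    return float(sum(1 for s, _ in history if s == "s1"))
```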
The agent’s behavior is evaluated over a finite horizon or in the limit over an infinite horizon.
The agent’s task is to construct a policy that maximizes the expectation of the value function over the specified horizon.
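To make the objective concrete, here is a minimal sketch, assuming a toy two-state MDP whose states, actions, transition probabilities, and rewards are all invented for illustration. It fixes one stationary policy, takes the value of a history to be a sum of per-state rewards (one special case of an arbitrary trajectory function, as above), and estimates the expectation of that value over a finite horizon by Monte Carlo simulation:

```python
import random

# Toy two-state MDP; every name and number is an assumption for illustration.
TRANSITIONS = {
    ("s0", "a0"): [("s0", 0.7), ("s1", 0.3)],
    ("s0", "a1"): [("s0", 0.2), ("s1", 0.8)],
    ("s1", "a0"): [("s0", 0.5), ("s1", 0.5)],
    ("s1", "a1"): [("s0", 0.1), ("s1", 0.9)],
}
REWARDS = {"s0": 0.0, "s1": 1.0}  # per-step reward for occupying a state

def policy(state: str) -> str:
    """A stationary deterministic policy: always choose a1."""
    return "a1"

def v(history) -> float:
    """Value of a history: here, the sum of per-state rewards."""
    return sum(REWARDS[s] for s, _ in history)

def estimate_policy_value(start: str = "s0", horizon: int = 10,
                          episodes: int = 10_000) -> float:
    """Monte Carlo estimate of E[v(history)] under the policy
    over a finite horizon of fixed length."""
    total = 0.0
    for _ in range(episodes):
        state, history = start, []
        for _ in range(horizon):
            action = policy(state)
            history.append((state, action))
            # Sample the next state from the transition distribution.
            nxt, probs = zip(*TRANSITIONS[(state, action)])
            state = random.choices(nxt, weights=probs)[0]
        total += v(history)
    return total / episodes

if __name__ == "__main__":
    print(f"estimated expected value: {estimate_policy_value():.3f}")
```

Finding the policy that maximizes this estimate, rather than merely evaluating a fixed one, is the agent's task described above.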