Properties of the Model
Assuming
- full observability
- bounded and stationary rewards
- time-separable value function
- discount factor
- infinite horizon
Optimal policy is stationary
- Choice of action ai depends only on si
- Optimal policy is of the form ?(s) = a
- which is of fixed size |S|, regardless of the # of stages