Policy Iteration
Note: value iteration never actually computes a policy: you can back one out at the end, but during computation it is irrelevant.
Policy iteration as an alternative
- Initialize π_0(s) to some arbitrary vector of actions
- Loop
- Compute v_{π_i}(s) according to the previous formula
- For each state s, re-compute the best action via a one-step lookahead:
- π_{i+1}(s) = argmax_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ v_{π_i}(s') ]
- Policy guaranteed to be at least as good as last iteration
- Terminate when π_i(s) = π_{i+1}(s) for every state s
Guaranteed to terminate and produce an optimal policy. In practice it typically converges in fewer iterations than value iteration, though this advantage is not guaranteed in theory.
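The loop above can be sketched in code. This is a minimal illustration on a hypothetical two-state, two-action MDP: the transition table T, rewards R, and discount GAMMA are made-up values for demonstration, not part of the original notes.

```python
STATES = [0, 1]
ACTIONS = [0, 1]
GAMMA = 0.9
# T[s][a] = list of (next_state, probability); R[s][a] = immediate reward.
# These numbers are illustrative assumptions.
T = {
    0: {0: [(0, 0.9), (1, 0.1)], 1: [(1, 1.0)]},
    1: {0: [(0, 1.0)], 1: [(1, 0.9), (0, 0.1)]},
}
R = {0: {0: 0.0, 1: 1.0}, 1: {0: 0.0, 1: 2.0}}

def evaluate(policy, sweeps=500):
    """Policy evaluation: iterate the Bellman equation for the fixed policy."""
    v = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        for s in STATES:
            a = policy[s]
            v[s] = R[s][a] + GAMMA * sum(p * v[s2] for s2, p in T[s][a])
    return v

def greedy(v):
    """Policy improvement: pick the action maximizing the one-step lookahead."""
    return {
        s: max(ACTIONS,
               key=lambda a: R[s][a] + GAMMA * sum(p * v[s2] for s2, p in T[s][a]))
        for s in STATES
    }

def policy_iteration():
    policy = {s: ACTIONS[0] for s in STATES}  # arbitrary initial policy pi_0
    while True:
        v = evaluate(policy)
        new_policy = greedy(v)
        if new_policy == policy:  # pi_i == pi_{i+1} for every state: terminate
            return policy, v
        policy = new_policy

pi, v = policy_iteration()
print(pi)  # the optimal action in each state
```

Note that termination is exact (policies are compared for equality), unlike value iteration, which stops when values change by less than a tolerance.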
Variant: take updates into account as early as possible.
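One common reading of this variant is a Gauss-Seidel-style in-place sweep: each value update immediately uses values already refreshed during the current sweep, instead of reading only the previous sweep's values. The toy chain below (GAMMA, R, NEXT are made-up illustrative values) contrasts the two styles; whether this is exactly the variant the notes intend is an assumption.

```python
GAMMA = 0.9
R = [1.0, 0.0, 5.0]   # hypothetical reward for entering each of 3 states
NEXT = [1, 2, 2]      # deterministic transition: state s -> NEXT[s]

def sweep_synchronous(v):
    # Every update reads the values from the *previous* sweep only.
    return [R[NEXT[s]] + GAMMA * v[NEXT[s]] for s in range(3)]

def sweep_in_place(v):
    # Each update immediately reuses values refreshed earlier in this sweep,
    # so new information can propagate across several states in one pass.
    v = list(v)
    for s in reversed(range(3)):  # sweep order matters for in-place updates
        v[s] = R[NEXT[s]] + GAMMA * v[NEXT[s]]
    return v

v0 = [0.0, 0.0, 0.0]
print(sweep_synchronous(v0))
print(sweep_in_place(v0))
```

Starting from all zeros, one in-place sweep already propagates the terminal reward back to every state, while the synchronous sweep moves it only one step; the fixed point reached is the same either way.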