Q-Learning
The premise: learn the optimal action a for state s directly
The function Q(s, a) is (an estimate of) the expected future reward associated with executing a in state s:
- from Q(s, a) the optimal action π*(s) is obtained by taking the max over actions: π*(s) = argmax_a Q(s, a)
- we want to learn this Q function directly
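As a small illustration of the "taking the max" step, the sketch below reads a greedy action off a tabular Q function. The table layout (a dict keyed by (state, action) pairs), the name greedy_action, and the example states and values are all assumptions made for illustration; they are not part of these notes.

```python
def greedy_action(Q, state, actions):
    """Return the action with the highest estimated value Q[(state, a)]."""
    # Unseen (state, action) pairs default to 0.0 (an assumption of this sketch).
    return max(actions, key=lambda a: Q.get((state, a), 0.0))


# Hypothetical Q table for a single state "s0" with two actions.
Q = {("s0", "left"): 0.2, ("s0", "right"): 0.7}
best = greedy_action(Q, "s0", ["left", "right"])
```

When several actions tie for the maximum, this version returns the first one listed; other tie-breaking rules are equally valid.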
Learning framework: the agent repeatedly
- Takes some action dictated by the current Q function
- Gets some reward r
- Updates the Q function to reflect the observed reward
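The loop above can be sketched in code under some stated assumptions. The notes do not say how the Q function is updated, so this sketch uses the standard tabular Q-learning rule, Q(s, a) += alpha * (r + gamma * max_a' Q(s', a') - Q(s, a)), with epsilon-greedy action selection. The environment (a 5-state chain with a reward at the right end), the function names step and q_learning, and the values of alpha, gamma, and epsilon are all hypothetical choices for illustration.

```python
import random


def step(state, action):
    """Hypothetical 5-state chain: action 1 moves right, action 0 moves left.
    Reaching state 4 gives reward 1 and ends the episode."""
    next_state = min(state + 1, 4) if action == 1 else max(state - 1, 0)
    reward = 1.0 if next_state == 4 else 0.0
    done = next_state == 4
    return next_state, reward, done


def q_learning(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    actions = [0, 1]
    Q = {(s, a): 0.0 for s in range(5) for a in actions}
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Take some action dictated by the Q function (epsilon-greedy,
            # with ties between equally valued actions broken at random).
            if rng.random() < epsilon:
                action = rng.choice(actions)
            else:
                best = max(Q[(state, a)] for a in actions)
                action = rng.choice([a for a in actions if Q[(state, a)] == best])
            # Get some reward r.
            next_state, reward, done = step(state, action)
            # Update Q toward the target r + gamma * max_a' Q(s', a').
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
            target = reward + gamma * best_next
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state = next_state
    return Q


Q = q_learning()
# Greedy policy read off the learned table for the non-terminal states.
policy = {s: max([0, 1], key=lambda a: Q[(s, a)]) for s in range(4)}
```

The occasional random action (epsilon) is one simple way to keep the agent exploring; without it, an agent that always follows its current Q estimates may never try the actions that turn out to be best.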