Q-Learning

The function Q(s, a) is (an estimate of) the expected future reward associated with executing action a in state s:

- from Q(s, a), the optimal action π*(s) is obtained by taking the argmax over actions: π*(s) = argmax_a Q(s, a)
- we want to learn this Q function directly
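
As a rough illustration of reading the optimal action off a Q function, here is a minimal sketch. The table contents, the state name "s0", and the action names are made up for the example; only the argmax step comes from the notes.

```python
from collections import defaultdict

# Hypothetical Q table: maps (state, action) -> estimated future reward.
# Unseen pairs default to 0.0.
Q = defaultdict(float)
Q[("s0", "left")] = 0.2
Q[("s0", "right")] = 0.7

ACTIONS = ["left", "right"]  # assumed action set


def greedy_action(Q, state, actions):
    """Return the action with the largest Q(state, action).

    Ties are broken in favour of the action listed first.
    """
    return max(actions, key=lambda a: Q[(state, a)])


best = greedy_action(Q, "s0", ACTIONS)
```

Here `best` is the action whose estimated value is highest in state "s0"; no model of the environment is needed to pick it.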

Learning framework: the agent repeatedly
- Takes some action dictated by the Q function
- Receives some reward r
- Updates the Q function using the observed reward
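
The loop above can be sketched as tabular Q-learning. The notes do not spell out the update, so this sketch assumes the standard one-step rule Q(s, a) ← Q(s, a) + α [r + γ max_a' Q(s', a') − Q(s, a)]. The five-state chain environment, the epsilon-greedy exploration, and the values of ALPHA, GAMMA, and EPSILON are all assumptions chosen for illustration.

```python
import random

N_STATES = 5          # assumed toy chain: states 0..4, state 4 is terminal
ACTIONS = [-1, +1]    # move left or move right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1  # assumed hyperparameters


def step(state, action):
    """Hypothetical environment: reward 1 only on reaching the last state."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def choose_action(Q, state, rng):
    """Epsilon-greedy: mostly follow Q, occasionally explore at random."""
    if rng.random() < EPSILON:
        return rng.choice(ACTIONS)
    best = max(Q[(state, a)] for a in ACTIONS)
    # Break ties randomly so the all-zero initial table does not get stuck.
    return rng.choice([a for a in ACTIONS if Q[(state, a)] == best])


def train(episodes=200, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # 1. Take some action dictated by the Q function.
            action = choose_action(Q, state, rng)
            # 2. Get some reward r (and observe the next state).
            next_state, reward, done = step(state, action)
            # 3. Update Q toward the one-step target.
            if done:
                target = reward
            else:
                target = reward + GAMMA * max(Q[(next_state, a)] for a in ACTIONS)
            Q[(state, action)] += ALPHA * (target - Q[(state, action)])
            state = next_state
    return Q


Q = train()
```

After training on this toy chain, moving right has the higher Q value in every non-terminal state, so the greedy policy walks straight to the rewarding state.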

The premise: learn the optimal action a for state s directly