MDP Model of Agency
Time is discrete, actions have no duration, and their effects occur instantaneously. So we can model time and change as an alternating sequence of states and actions {s0, a0, s1, a1, …}, which is called a history or trajectory.
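A rough sketch of this bookkeeping (the state and action names below are invented for illustration): a history can be kept as a single alternating list, with states at even positions and actions at odd positions.

```python
# A history/trajectory is just the alternating sequence s0, a0, s1, a1, ...
history = ["s0"]                 # start in the initial state s0

def record(history, action, next_state):
    """Append the action taken and the state it produced."""
    history.append(action)
    history.append(next_state)

record(history, "a0", "s1")
record(history, "a1", "s2")
print(history)                   # ['s0', 'a0', 's1', 'a1', 's2']
```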
At time i the agent consults a policy to determine its next action ai:
- the agent has “full observational powers”: at time i it knows the entire history {s0, a0, s1, a1, …, si} accurately
- the policy may depend arbitrarily on the entire history up to this point (a small sketch follows this list)
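A minimal sketch of such a history-dependent policy (the states, actions, and decision rule here are made up for illustration; nothing is specific to any particular problem). The point is only that the function receives the whole history, not just the current state.

```python
def history_policy(history):
    """history = [s0, a0, s1, a1, ..., si]; return the next action."""
    si = history[-1]                # the current state, always available
    past_actions = history[1::2]    # a0, a1, ..., a_{i-1}
    if si == "goal":                # part of the rule may look only at si...
        return "stop"
    # ...but a history-dependent policy may also consult the whole past,
    # e.g. alternate directions based on the most recent action taken:
    if past_actions and past_actions[-1] == "left":
        return "right"
    return "left"

print(history_policy(["s0"]))                                  # 'left'
print(history_policy(["s0", "left", "s1"]))                    # 'right'
print(history_policy(["s0", "left", "s1", "right", "goal"]))   # 'stop'
```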
Taking an action causes a stochastic transition to a new state, governed by transition probabilities of the form Prob(sj | si, a):
- the fact that si and a alone, rather than the full history, suffice to predict the future is the Markov assumption (see the sampling sketch below)
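A small sketch of what the Markov assumption buys (the transition table P below is invented for the example): because the next state depends only on (si, a), a table keyed by state–action pairs fully specifies the dynamics, and sampling the next state needs no memory of earlier states or actions.

```python
import random

# Prob(sj | si, a) as a table keyed by (state, action); each row sums to 1.
P = {
    ("s0", "go"): {"s0": 0.2, "s1": 0.8},
    ("s1", "go"): {"s1": 0.5, "s2": 0.5},
    ("s2", "go"): {"s2": 1.0},
}

def step(si, a):
    """Sample sj from Prob(. | si, a); the earlier history is irrelevant."""
    dist = P[(si, a)]
    return random.choices(list(dist.keys()), weights=list(dist.values()))[0]

s = "s0"
for a in ["go", "go", "go"]:
    s_next = step(s, a)
    print(f"({s}, {a}) -> {s_next}")
    s = s_next
```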