- "MDP Enhancements" (15 points)
Starting with the given code in RunApp.py, Grid.py, and MDP.py, implement the following in the file MDP.py:
(a) a method generateAllStates(self) that produces the set of all the states of any MDP.
(Although the theory of Markov Decision Processes does not require this set to be finite,
you may assume that all examples in this assignment have finite state spaces.)
This method should be added to the class MDP in the file MDP.py, and the results should be
stored in the member variable known_states.
(Note that the random episode method there already finds some of the states and puts them
in this set, but it is not guaranteed to find them all.)
Your method should use Breadth-First Search or Depth-First Search, starting from the start
state, to find them all. You may re-use any of your own code from previous assignments for
this. (A sketch of one possible approach appears after this item.)
(b) a method ValueIteration(self, discount, niterations). This method should also be added
to the class MDP in MDP.py. It should start by creating a hash table in self.V, where each
state is a key and each value is initialized to 0. It should then perform up to niterations
of Bellman updates to obtain an approximation of V*, the value of each state assuming
optimal behavior from that state forward. (A second sketch appears after this item as well.)
Note that in Grid.py, a distinction is made between an Action (which is part of the MDP
formulation) and an Operator (which we use to implement actions). The effect of an action is,
in general, nondeterministic, whereas an operator has a deterministic effect.
In Grid.py, an action is mapped probabilistically (according to the transition function T) to
one of the operators.
If the precondition of the randomly selected operator is not satisfied,
then we define the result of the action to be the same state (no change of state).
Thus we follow the same convention used in the Berkeley lectures.
Note that you should NOT edit Grid.py, and you should NOT turn it in. Your versions of
MDP.py and RunApp.py will be tested by an autograder with the original Grid.py file.
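Here is a minimal sketch of how part (a) might look, assuming the MDP object exposes
self.start_state, a list self.actions, and a transition function self.T(s, a) that returns
(successor, probability) pairs; the actual attribute names in the starter code may differ.

    from collections import deque

    # Sketch for part (a): Breadth-First Search over the reachable state space.
    # self.start_state, self.actions, and self.T(s, a) -> list of
    # (successor, probability) pairs are assumed names; adapt to the starter code.
    def generateAllStates(self):
        """Populate self.known_states with every state reachable from the start state."""
        self.known_states = set([self.start_state])
        frontier = deque([self.start_state])
        while frontier:
            s = frontier.popleft()
            for a in self.actions:
                for (s_next, prob) in self.T(s, a):
                    if prob > 0 and s_next not in self.known_states:
                        self.known_states.add(s_next)
                        frontier.append(s_next)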
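And a minimal sketch of part (b). Each Bellman update replaces V(s) with the maximum over
actions a of the sum over successors s' of T(s, a, s') * [R(s, a, s') + discount * V(s')].
The reward interface self.R(s, a, s_next) used below is an assumption; the starter code may
instead attach rewards to states or to (state, action) pairs.

    # Sketch for part (b): Value Iteration via repeated Bellman updates.
    # self.T(s, a) -> list of (successor, probability) pairs and
    # self.R(s, a, s_next) -> reward are assumed interfaces.
    def ValueIteration(self, discount, niterations):
        """Approximate V* with up to niterations rounds of Bellman updates."""
        self.V = {s: 0.0 for s in self.known_states}
        for _ in range(niterations):
            new_V = {}
            for s in self.known_states:
                # Best expected one-step return over all actions from s.
                new_V[s] = max(
                    (sum(prob * (self.R(s, a, s_next) + discount * self.V[s_next])
                         for (s_next, prob) in self.T(s, a))
                     for a in self.actions),
                    default=0.0)  # states with no applicable actions keep value 0
            self.V = new_V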
- "Value Iteration Report" (10 points)
Next, show the results of applying your ValueIteration method to the Grid World MDP provided
in the starter file Grid.py.
To do this, create a special output method GW_Values_string(V_dict) in RunApp.py that accepts
the dictionary self.V (created in the previous problem) and returns a string that represents
it as a nicely formatted map of the 4 columns by 3 rows of Grid World. Each value should be
shown to 3 decimal places.
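A possible formatting sketch, assuming Grid World states are (column, row) tuples indexed
from 0 with row 0 at the bottom; adjust the indexing to match the representation actually
used in Grid.py.

    # Sketch of the report helper for RunApp.py. The (column, row) tuple state
    # representation is an assumption about Grid.py.
    def GW_Values_string(V_dict):
        """Return the 4x3 Grid World values as a string, 3 decimal places per cell."""
        lines = []
        for row in range(2, -1, -1):                 # print the top row first
            cells = ["{:8.3f}".format(V_dict.get((col, row), 0.0))
                     for col in range(4)]
            lines.append(" ".join(cells))
        return "\n".join(lines)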
- "Q-Learning" (20 points).
Add a new feature to MDP.py to implement Q-Learning. This should be callable as
Grid_MDP.QLearning(discount, nEpisodes, epsilon). Assume the following:
each episode starts in the start state and continues until the DEAD state is reached.
The learning rate alpha is computed at each transition as alpha = 1/N(s,a), the reciprocal
of the number of times that action a has been taken from state s so far in the training
(across all episodes so far).
Epsilon is the parameter for epsilon-greedy learning; try epsilon = 0.05 and epsilon = 0.2.
This means that when epsilon is 0.05, the optimal action (based on current Q values) is
taken with probability 0.95 and a random action is taken with probability 0.05.
Create and use a new hash table self.QValues in class MDP; the keys of this table should be
(state, action) tuples.
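One possible shape for the Q-Learning method. The simulation call self.take_action(s, a),
which samples a successor state and a reward, and the literal name of the DEAD state are
assumptions about the starter code; the epsilon-greedy choice, the 1/N(s,a) learning rate,
and the (state, action) keys follow the description above.

    import random

    # Sketch of tabular Q-Learning with epsilon-greedy exploration.
    # self.take_action(s, a) -> (next_state, reward) and the DEAD terminal
    # state name are assumed interfaces; adapt them to the starter code.
    def QLearning(self, discount, nEpisodes, epsilon):
        """Learn self.QValues over nEpisodes episodes, with alpha = 1/N(s,a)."""
        self.QValues = {(s, a): 0.0
                        for s in self.known_states for a in self.actions}
        visit_counts = {}                       # N(s, a) across all episodes
        for _ in range(nEpisodes):
            s = self.start_state
            while s != "DEAD":
                # Epsilon-greedy action selection from the current Q values.
                if random.random() < epsilon:
                    a = random.choice(self.actions)
                else:
                    a = max(self.actions, key=lambda act: self.QValues[(s, act)])
                visit_counts[(s, a)] = visit_counts.get((s, a), 0) + 1
                alpha = 1.0 / visit_counts[(s, a)]
                s_next, reward = self.take_action(s, a)
                best_next = max(self.QValues.get((s_next, act), 0.0)
                                for act in self.actions)
                # Standard Q-Learning update toward the sampled target.
                self.QValues[(s, a)] += alpha * (reward + discount * best_next
                                                 - self.QValues[(s, a)])
                s = s_next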
- "Learned Optimal Policy" (5 points).
Implement a method extractPolicy() in class MDP that takes the QValues and creates a new
hash table optPolicy whose keys are states and whose values are actions. This hash table
should give the optimal action for each state.
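A minimal sketch of extractPolicy, assuming self.QValues and self.known_states are already
populated and that self.actions lists the actions available in every state.

    # Sketch of greedy policy extraction from the learned Q values.
    def extractPolicy(self):
        """Build self.optPolicy mapping each known state to its argmax-Q action."""
        self.optPolicy = {}
        for s in self.known_states:
            self.optPolicy[s] = max(self.actions,
                                    key=lambda a: self.QValues.get((s, a), 0.0))
        return self.optPolicy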
- "Display of Q-Values" (5 points).
In a similar fashion to the Value Iteration report above, provide a method
GW_QValues_string(Q_dict) in RunApp.py that takes the dictionary produced by your Q-Learning
method and returns a string showing a nicely formatted map of Grid World with the resulting
Q values.
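A possible formatting sketch; the (column, row) state tuples and the action names in the
ACTIONS list are assumptions about Grid.py and should be replaced with whatever the starter
code actually uses.

    # Sketch of the Q-value report helper for RunApp.py. ACTIONS and the
    # (column, row) state representation are assumptions about Grid.py.
    ACTIONS = ["North", "South", "East", "West"]

    def GW_QValues_string(Q_dict):
        """Return a 4x3 map; each cell lists one Q value per action, labeled by initial."""
        lines = []
        for row in range(2, -1, -1):
            cells = []
            for col in range(4):
                qs = ["{}:{:7.3f}".format(a[0], Q_dict.get(((col, row), a), 0.0))
                      for a in ACTIONS]
                cells.append("[" + " ".join(qs) + "]")
            lines.append("  ".join(cells))
        return "\n".join(lines)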
- "Display of Optimal Policy" (5 points).
Provide a method GW_Policy_string() in RunApp.py that calls extractPolicy() and then returns
a string showing a nicely formatted map of Grid World with the resulting optimal policy.
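A possible sketch; Grid_MDP as a module-level MDP instance in RunApp.py and the
(column, row) state tuples are assumptions.

    # Sketch of the policy report helper for RunApp.py. Grid_MDP is assumed to
    # be the Grid World MDP instance created elsewhere in RunApp.py.
    def GW_Policy_string():
        """Return a 4x3 map of Grid World showing the learned action per cell."""
        policy = Grid_MDP.extractPolicy()
        lines = []
        for row in range(2, -1, -1):
            cells = ["{:>6}".format(str(policy.get((col, row), "-")))
                     for col in range(4)]
            lines.append(" ".join(cells))
        return "\n".join(lines)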
- Extra Credit (10 points).
Formulate the Towers of Humptulips problem and apply your ValueIteration and Q-Learning
methods to it. You do not have to produce a fancy map of values or Q values for this
problem, but you should print out the results of ValueIteration and Q-Learning by extracting
and sorting the key-value pairs from your V hash, QValues hash, and optPolicy hash.
Finally, demonstrate the optimal policy found by sending the agent into the problem space
following that policy and printing out the sequence of moves it makes in one or more
episodes of "exploitation" of the optimal policy.