Assignment 6: Bayes' Rule, Markov Decision Processes, and Reinforcement Learning (Part 2)
CSE 415: Introduction to Artificial Intelligence
The University of Washington, Seattle, Spring 2016
The reading for the MDP and Reinforcement Learning part of this assignment is
Chapter 3 of Sutton and Barto (see the Readings webpage).
Part 2 is due Saturday, May 21 (extended from the original Friday, May 20 deadline), via
Catalyst CollectIt
at 11:59 PM.
Part II. Programming (60 points).
- "MDP Enhancements" (15 points)
Starting with the given code in Grid.py and MDP.py, implement the following:
(a) a method generateAllStates(self) to produce a set of all the states of any
MDP. This should be added to the class MDP in the file MDP.py.
The results should be stored in the member variable known_states.
(Note that the random episode method there already finds some of the states
and puts them in this set, but is not guaranteed to find them all.)
Your method should use Breadth-First Search or Depth-First Search to find them all,
starting from the start state; a BFS sketch appears after this list. You may re-use
any of your own code from previous assignments for this.
(b) a method ValueIterations(self, discount, niterations). This method should also be added
to the class MDP in MDP.py. It should start by creating a hash table in self.V, where each state
is a key and the value for each state is set to 0. It should then perform up to niterations
of Bellman updates to get an approximation of V*, the values associated with each state assuming
optimal behavior from that state forward (see the value-iteration sketch after this list).
- "Grid World Value Iterations" (10 points)
Show the results of applying your ValueIterations method to the Grid World MDP provided to
you in starter file Grid.py.
To do this, you should create a special output method in Grid.py that will accept
the dictionary self.V (that you created in the previous problem) and print it as a
nicely formatted map of the 4 columns by 3 rows of Grid World. Each value should be
shown to 3 decimal places (a possible formatting approach is sketched after this list).
- "Q-Learning" (20 points).
Add a new feature to MDP.py to implement Q-Learning. This should be callable with the
following: Grid_MDP.QLearning(discount, nEpisodes, epsilon). Assume the following:
Each episode starts in the start state and continues until the DEAD state is reached.
The learning rate alpha will be computed at each transition according to
alpha = 1/N(s,a), which is the reciprocal of the count of times that action a has been
taken from state s so far in the training (in all of the episodes so far).
Epsilon is the parameter for epsilon-greedy learning. Try epsilon = 0.05 and
epsilon = 0.2. This means that when epsilon is 0.05, the optimal action (based on
current Q values) will be taken with probability 0.95 and a random action will be
taken with probability 0.05. Create and use a new hash table self.QValues in
class MDP. The keys of this table should be (state, action) tuples.
(A Q-Learning sketch appears after this list.)
- "Learned Optimal Policy" (5 points).
Implement a method extractPolicy() in class MDP
that takes the QValues and creates a new hash table optPolicy whose keys are states
and whose values are actions. This hash table should give the optimal action for
each state (see the policy-extraction sketch after this list).
- "Display of Q-Values" (5 points).
As with ValueIterations, provide a method in Grid.py to call the Q-Learning and print out
a nicely formatted map of Grid World that shows the resulting Q values.
- "Display of Optimal Policy" (5 points).
Provide a method in Grid.py that calls extractPolicy() and then displays
a nicely formatted map of Grid World that shows the resulting optimal policy.
- Extra Credit (10 points).
Formulate the Towers of Humptulips problem and apply your ValueIterations
and Q-Learning methods to it. You do not have to produce a fancy map of values or
Q values for this problem, but you should print out the results of ValueIterations
and Q-Learning by extracting and sorting the key-value pairs from your
V hash, QValues hash, and optPolicy hash.
Finally, demonstrate the optimal policy that was found: send the agent into the problem space
following that policy and print out the sequence of moves it makes in one or more
"exploitation" episodes.
Updates and Corrections
The deadline was moved to Saturday, May 21 (announced on May 20).
If necessary, further updates and corrections will be posted here and/or mentioned in class or on
GoPost.
Feedback Survey
After submitting your solution, please answer this survey.