Assignment 6: Bayes' Rule, Markov Decision Processes, and Reinforcement Learning (Part 2)
CSE 415: Introduction to Artificial Intelligence
The University of Washington, Seattle, Spring 2017
The reading for the MDP and Reinforcement Learning part of this assignment is Chapter 3 of Sutton and Barto (see the Readings webpage).
Part 2 is due Monday, May 22 (extended from Friday, May 19), via Catalyst CollectIt at 11:59 PM.
 
Part II. Programming (60 points).

  1. "MDP Enhancements" (15 points)

    Starting with the given code in RunApp.py, Grid.py and MDP.py, implement the following in the file MDP.py:

    (a) a method generateAllStates(self) to produce a set of all the states of any MDP. (Although the theory of Markov Decision Processes does not require that this set be finite, you may assume that all our examples in this assignment have finite state spaces.) This should be added to the class MDP in the file MDP.py. The results should be stored in the member variable known_states. (Note that the random-episode method there already finds some of the states and puts them in this set, but it is not guaranteed to find them all.) Your method should use Breadth-First Search or Depth-First Search to find them all, starting from the start state. You may re-use any of your own code from previous assignments for this.

    (b) a method ValueIteration(self, discount, niterations). This method should also be added to the class MDP in MDP.py. It should start by creating a hash table in self.V, where each state is a key and the value for each state is set to 0. It should then perform up to niterations of Bellman updates to get an approximation of V*, the values associated with each state assuming optimal behavior from that state forward. (A sketch of both methods appears after the note below.)

    Note that in Grid.py, there is a distinction made between an Action (which is part of the MDP formulation) and an Operator (which we use to implement actions). The effect of an action is nondeterministic, in general. On the other hand, an operator has a deterministic effect. In Grid.py, an action gets mapped probabilistically (according to the transition function T) to one of the operators. If it turns out that the precondition of the randomly selected operator is not satisfied, then we define the result of the action to be the same state (no change of state). Thus we follow the same convention used in the Berkeley lectures. Note that you should NOT edit Grid.py, and you should NOT turn it in. Your versions of MDP.py and RunApp.py will be tested with the original Grid.py file by an autograder.
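
    The following is a minimal sketch of the two methods, not the official solution. It assumes the MDP object exposes self.start_state, a list self.actions, a transition model self.T(s, a) that returns (next_state, probability) pairs, and a reward function self.R(s, a, sprime); rename these to match the attributes actually defined in the starter MDP.py, and handle the terminal (DEAD) state however the starter code expects.

        from collections import deque

        def generateAllStates(self):
            """Find every reachable state by Breadth-First Search from the start state."""
            self.known_states = set([self.start_state])
            frontier = deque([self.start_state])
            while frontier:
                s = frontier.popleft()
                for a in self.actions:
                    for (sprime, p) in self.T(s, a):
                        if p > 0 and sprime not in self.known_states:
                            self.known_states.add(sprime)
                            frontier.append(sprime)

        def ValueIteration(self, discount, niterations):
            """Approximate V* by repeated Bellman updates over all known states."""
            self.V = {s: 0.0 for s in self.known_states}
            for _ in range(niterations):
                newV = {}
                for s in self.known_states:
                    # Bellman update: V(s) = max_a sum_{s'} T(s,a,s') * [R(s,a,s') + discount * V(s')]
                    newV[s] = max(sum(p * (self.R(s, a, sprime) + discount * self.V[sprime])
                                      for (sprime, p) in self.T(s, a))
                                  for a in self.actions)
                self.V = newV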

  2. "Value Iteration Report" (10 points)

    Next, show the results of applying your ValueIteration method to the Grid World MDP provided to you in the starter file Grid.py. To do this, you should create a special output method GW_Values_string(V_dict) in RunApp.py that will accept the dictionary self.V (which you created in the previous problem) and return a string that represents it as a nicely formatted map of the 4 columns by 3 rows of Grid World. Each value should be shown to 3 decimal places.
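
    One possible shape for this formatter, assuming Grid World states are (column, row) tuples with columns 0-3 and rows 0-2; adjust the indexing (and the handling of the wall cell) to whatever state representation Grid.py actually uses.

        def GW_Values_string(V_dict):
            """Return the values as a 4-column by 3-row map, 3 decimal places each."""
            lines = []
            for row in range(2, -1, -1):              # print the top row of the map first
                cells = []
                for col in range(4):
                    v = V_dict.get((col, row))
                    cells.append("%8.3f" % v if v is not None else "   -----")  # "-----": no state here
                lines.append(" ".join(cells))
            return "\n".join(lines)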

  3. "Q-Learning" (20 points).

    Add a new feature to MDP.py to implement Q-Learning. This should be callable with the following: Grid_MDP.QLearning(discount, nEpisodes, epsilon). Assume the following: Each episode starts in the start state and continues until the DEAD state is reached. The learning rate alpha will be computed at each transition according to alpha = 1/N(s,a), which is the reciprocal of the count of times that action a has been taken from state s so far in the training (across all episodes). Epsilon is the parameter for epsilon-greedy learning. Try epsilon = 0.05 and epsilon = 0.2. This means that when epsilon is 0.05, the optimal action (based on current Q values) will be taken with probability 0.95 and a random action will be taken with probability 0.05. Create and use a new hash table self.QValues in class MDP. The keys of this table should be (state, action) tuples. (A sketch of the learner appears below.)
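
    Here is a minimal sketch of the learner, under two loud assumptions: take_action(s, a) is a hypothetical one-step simulator (returning a sampled next state and reward) standing in for however the starter code actually simulates a transition, and DEAD is a placeholder for the actual terminal state object in Grid.py.

        import random

        DEAD = "DEAD"   # placeholder; use the actual terminal state from Grid.py

        def QLearning(self, discount, nEpisodes, epsilon):
            """Tabular Q-learning with epsilon-greedy exploration and alpha = 1/N(s,a)."""
            self.QValues = {(s, a): 0.0 for s in self.known_states for a in self.actions}
            counts = {}                                   # N(s, a): visit counts per state-action pair
            for _ in range(nEpisodes):
                s = self.start_state
                while s != DEAD:                          # each episode runs until the DEAD state
                    # Epsilon-greedy selection: random action with probability epsilon
                    if random.random() < epsilon:
                        a = random.choice(self.actions)
                    else:
                        a = max(self.actions, key=lambda act: self.QValues[(s, act)])
                    sprime, r = self.take_action(s, a)    # hypothetical one-step simulator
                    counts[(s, a)] = counts.get((s, a), 0) + 1
                    alpha = 1.0 / counts[(s, a)]          # learning rate 1 / N(s, a)
                    best_next = max(self.QValues[(sprime, act)] for act in self.actions)
                    # Q-learning update toward the sampled target r + discount * max_a' Q(s', a')
                    self.QValues[(s, a)] += alpha * (r + discount * best_next - self.QValues[(s, a)])
                    s = sprime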

  4. "Learned Optimal Policy" (5 points).

    Implement a method extractPolicy() in class MDP that takes the QValues and creates a new hash table optPolicy whose keys are states and whose values are actions. This hash table should give the optimal action for each state.
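
    A short sketch, assuming self.QValues, self.known_states, and self.actions from the earlier questions:

        def extractPolicy(self):
            """Build self.optPolicy: for each state, the action with the highest Q value."""
            self.optPolicy = {}
            for s in self.known_states:
                self.optPolicy[s] = max(self.actions, key=lambda a: self.QValues[(s, a)])
            return self.optPolicy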

  5. "Display of Q-Values" (5 points).

    In a similar fashion to Question 2 with ValueIteration, provide a method GW_QValues_string(Q_dict) in RunApp.py that calls the Q-Learning and returns a string showing a nicely formatted map of Grid World with the resulting Q values.
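
    One possible layout, again assuming (column, row) states: print each grid row as a block with one line per action, each Q value to 3 decimal places.

        def GW_QValues_string(Q_dict):
            """Return a map of Grid World with one line per action for each grid row."""
            actions = sorted({a for (_, a) in Q_dict}, key=str)
            lines = []
            for row in range(2, -1, -1):
                for a in actions:
                    cells = []
                    for col in range(4):
                        q = Q_dict.get(((col, row), a))
                        cells.append("%8.3f" % q if q is not None else "   -----")
                    lines.append("%-8s " % str(a) + " ".join(cells))
                lines.append("")                          # blank line between grid rows
            return "\n".join(lines)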

  6. "Display of Optimal Policy" (5 points).

    Provide a method GW_Policy_string() in RunApp.py that calls extractPolicy() and then displays a nicely formatted map of Grid World that shows the resulting optimal policy.
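
    A possible sketch, assuming RunApp.py has access to the MDP instance under the name Grid_MDP (as in Question 3) and that states are (column, row) tuples:

        def GW_Policy_string():
            """Return a 4-by-3 map of Grid World showing the chosen action in each cell."""
            policy = Grid_MDP.extractPolicy()             # Grid_MDP: the MDP instance in RunApp.py
            lines = []
            for row in range(2, -1, -1):
                cells = ["%8s" % str(policy.get((col, row), "-----")) for col in range(4)]
                lines.append(" ".join(cells))
            return "\n".join(lines)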

  7. Extra Credit (10 points).

    Formulate the Towers of Humptulips problem and apply your ValueIteration and Q-Learning methods to it. You do not have to produce a fancy map of values or Q values for this problem, but you should print out the results of ValueIteration and Q-Learning by extracting and sorting the key-value pairs from your V hash, QValues hash, and optPolicy hash.

    Finally, demonstrate the optimal policy by sending the agent into the problem space under that policy and printing out the sequence of moves it makes in one or more "exploitation" episodes. (A sketch of such a run appears below.)
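
    The helpers below are one way to do the printing and the exploitation run; print_sorted and exploit are hypothetical names, and take_action and DEAD are the same assumed simulator and terminal-state placeholder used in the Q-Learning sketch.

        def print_sorted(table, title):
            """Print a hash table's key-value pairs, sorted by key."""
            print(title)
            for key in sorted(table, key=str):
                print(" ", key, "->", table[key])

        def exploit(mdp, max_steps=100):
            """Run one exploitation episode: follow optPolicy and print each move."""
            s = mdp.start_state
            for _ in range(max_steps):
                if s == DEAD:                             # terminal state, as in the Q-Learning sketch
                    break
                a = mdp.optPolicy[s]
                print("In state", s, "take action", a)
                s, _ = mdp.take_action(s, a)              # hypothetical one-step simulator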

Updates and Corrections

The deadline for Part 2 was moved to Monday, May 22. If necessary, more updates and corrections will be posted here and/or mentioned in class or on GoPost.

Feedback Survey

After submitting your solution, please answer this survey