Assignment 6: Bayes' Rule, Markov Decision Processes, and Reinforcement Learning (Part 2)
CSE 415: Introduction to Artificial Intelligence
The University of Washington, Seattle, Spring 2016
The reading for the MDP and Reinforcement Learning part of this assignment is Chapter 3 of Sutton and Barto (see the Readings webpage).
Part 2 is due Saturday, May 21 (extended from Friday, May 20), via Catalyst CollectIt at 11:59 PM.
 
Part II. Programming (60 points).

  1. "MDP Enhancements" (15 points)

    Starting with the given code in Grid.py and MDP.py, implement the following: (a) a method generateAllStates(self) that produces the set of all states of any MDP. This should be added to the class MDP in the file MDP.py, and the results should be stored in the member variable known_states. (Note that the random-episode method there already finds some of the states and puts them in this set, but it is not guaranteed to find them all.) Your method should use Breadth-First Search or Depth-First Search to find them all, starting from the start state. You may reuse any of your own code from previous assignments for this.
    (b) a method ValueIterations(self, discount, niterations). This method should also be added to the class MDP in MDP.py. It should start by creating a hash table in self.V, in which each state is a key and each value is initialized to 0. It should then perform up to niterations of Bellman updates to approximate V*, the value of each state assuming optimal behavior from that state onward.
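
    One possible shape for these two methods is sketched below. This is only a starting point, not a required implementation; it assumes the MDP class exposes self.start_state, a list self.actions, a transition model self.T(s, a) returning (next_state, probability) pairs, and a reward function self.R(s, a, sprime), so adjust the names to whatever the starter code actually uses.

        from collections import deque   # for the BFS frontier; put this at the top of MDP.py

        def generateAllStates(self):
            """Find every state reachable from the start state with Breadth-First Search."""
            self.known_states = set()
            frontier = deque([self.start_state])
            while frontier:
                s = frontier.popleft()
                if s in self.known_states:
                    continue
                self.known_states.add(s)
                for a in self.actions:
                    for (sp, p) in self.T(s, a):
                        if p > 0 and sp not in self.known_states:
                            frontier.append(sp)

        def ValueIterations(self, discount, niterations):
            """Approximate V* by repeated Bellman updates over all known states."""
            self.V = {s: 0.0 for s in self.known_states}
            for _ in range(niterations):
                newV = {}
                for s in self.known_states:
                    # Bellman update: V(s) <- max_a sum_s' T(s,a,s') * [R(s,a,s') + discount * V(s')]
                    qvals = [sum(p * (self.R(s, a, sp) + discount * self.V[sp])
                                 for (sp, p) in self.T(s, a))
                             for a in self.actions]
                    newV[s] = max(qvals) if qvals else 0.0
                self.V = newV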

  2. "Grid World Value Iterations" (10 points)

    Show the results of applying your ValueIterations method to the Grid World MDP provided to you in starter file Grid.py. To do this, you should create a special output method in Grid.py that will accept the dictionary self.V (that you created in the previous problem) and print it as a nicely formatted map of the 4 columns by 3 rows of Grid World. Each value should be shown to 3 decimal places.
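
    For example, if states are represented as (column, row) tuples, a printer along the following lines would work; the name printValues and the tuple representation are just assumptions, so adapt them to the state representation used in Grid.py.

        def printValues(V, ncols=4, nrows=3):
            """Print V as a 4-column by 3-row map, top row first, values to 3 decimal places."""
            for row in range(nrows - 1, -1, -1):
                cells = []
                for col in range(ncols):
                    s = (col, row)                 # assumes (column, row) state tuples
                    cells.append("%8.3f" % V[s] if s in V else "   -----")
                print(" ".join(cells))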

  3. "Q-Learning" (20 points).

    Add a new feature to MDP.py to implement Q-Learning. This should be callable as Grid_MDP.QLearning(discount, nEpisodes, epsilon). Assume the following: Each episode starts in the start state and continues until the DEAD state is reached. The learning rate alpha is computed at each transition as alpha = 1/N(s,a), the reciprocal of the number of times that action a has been taken from state s so far in the training (across all episodes). Epsilon is the parameter for epsilon-greedy learning. Try epsilon = 0.05 and epsilon = 0.2. This means that when epsilon is 0.05, the optimal action (based on current Q values) will be taken with probability 0.95 and a random action will be taken with probability 0.05. Create and use a new hash table self.QValues in class MDP. The keys of this table should be (state, action) tuples.
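
    The following sketch shows one way the learning loop could be organized. It is not the required implementation: it assumes generateAllStates has already been called, that DEAD names the terminal state defined in Grid.py, and that there is a simulator helper such as self.takeAction(s, a) that samples a successor from T and returns (next_state, reward); substitute whatever the starter code actually provides.

        import random   # put this at the top of MDP.py

        def QLearning(self, discount, nEpisodes, epsilon):
            """Tabular Q-learning with epsilon-greedy exploration and alpha = 1/N(s, a)."""
            self.QValues = {(s, a): 0.0 for s in self.known_states for a in self.actions}
            N = {}                                     # counts of (state, action) visits
            for _ in range(nEpisodes):
                s = self.start_state
                while s != DEAD:                       # episode runs until the DEAD state
                    if random.random() < epsilon:
                        a = random.choice(self.actions)                                    # explore
                    else:
                        a = max(self.actions, key=lambda act: self.QValues[(s, act)])      # exploit
                    sp, r = self.takeAction(s, a)      # assumed helper: sample T, return (s', reward)
                    N[(s, a)] = N.get((s, a), 0) + 1
                    alpha = 1.0 / N[(s, a)]
                    best_next = max(self.QValues.get((sp, act), 0.0) for act in self.actions)
                    # Q(s,a) <- Q(s,a) + alpha * [r + discount * max_a' Q(s',a') - Q(s,a)]
                    self.QValues[(s, a)] += alpha * (r + discount * best_next - self.QValues[(s, a)])
                    s = sp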

  4. "Learned Optimal Policy" (5 points).

    Implement a method extractPolicy() in class MDP that takes the QValues and creates a new hash table optPolicy whose keys are states and whose values are actions. This hash table should give the optimal action for each state.
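
    A minimal sketch, assuming self.known_states, self.actions, and self.QValues are already populated:

        def extractPolicy(self):
            """For each state, record the action with the highest Q value."""
            self.optPolicy = {}
            for s in self.known_states:
                self.optPolicy[s] = max(self.actions,
                                        key=lambda a: self.QValues[(s, a)])
            return self.optPolicy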

  5. "Display of Q-Values" (5 points).

    As with ValueIterations, provide a method in Grid.py that calls the Q-Learning and prints out a nicely formatted map of Grid World showing the resulting Q values.
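
    This can follow the same pattern as the value-map printer above, except that each grid row gets one line per action. A possible shape (again assuming (column, row) state tuples):

        def printQValues(QValues, actions, ncols=4, nrows=3):
            """For each row of the grid, print one line per action showing Q((col, row), a)."""
            for row in range(nrows - 1, -1, -1):
                for a in actions:
                    cells = []
                    for col in range(ncols):
                        key = ((col, row), a)
                        cells.append("%8.3f" % QValues[key] if key in QValues else "   -----")
                    print(str(a).ljust(6), " ".join(cells))
                print()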

  6. "Display of Optimal Policy" (5 points).

    Provide a method in Grid.py that calls extractPolicy() and then displays a nicely formatted map of Grid World that shows the resulting optimal policy.
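
    For example, under the same assumed (column, row) state representation:

        def printPolicy(optPolicy, ncols=4, nrows=3):
            """Print the chosen action for each cell of Grid World, top row first."""
            for row in range(nrows - 1, -1, -1):
                cells = [str(optPolicy.get((col, row), '----')).center(8) for col in range(ncols)]
                print(" ".join(cells))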

  7. Extra Credit (10 points).

    Formulate the Towers of Humptulips problem and apply your ValueIterations and Q-Learning methods to it. You do not have to produce a fancy map of values or Q values for this problem, but you should print out the results of ValueIterations and Q-Learning by extracting and sorting the key-value pairs from your V hash, QValues hash, and optPolicy hash.

    Finally, demonstrate the optimal policy found by sending the agent into the problem space following the optimal policy and printing out the sequence of moves it makes in one or more episodes of "exploitation" of the optimal policy.
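
    One way to do this, under the same assumptions used in the Q-Learning sketch above (a DEAD terminal state and a self.takeAction(s, a) simulator helper), is sketched below; the name runGreedyEpisode is only illustrative.

        def runGreedyEpisode(self, max_steps=100):
            """Follow self.optPolicy from the start state, printing each move taken."""
            s = self.start_state
            for _ in range(max_steps):
                if s == DEAD:
                    break
                a = self.optPolicy[s]
                print("In state", s, "take action", a)
                s, _reward = self.takeAction(s, a)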

Updates and Corrections

The deadline was moved to Sat. May 21 on 20-May. If necessary, further updates and corrections will be posted here and/or mentioned in class or on GoPost.

Feedback Survey

After submitting your solution, please answer this survey.