Assignment 5: Markov Decision Processes and Q-Learning
CSE 415: Introduction to Artificial Intelligence
The University of Washington, Seattle, Winter 2018
The search theme continues here, except that now our agents operate in a world in which actions may have uncertain outcomes. The interactions are modeled probabilistically using the technique of Markov Decision Processes. Within this framework, the assignment focuses on two approaches to having the agent maximize its expected utility: (1) by a form of planning, in which we assume that the parameters of the MDP (especially T and R) are known, which takes the form of Value Iteration, and (2) by an important form of reinforcement learning, called Q-learning, in which we assume that the agent does not know the MDP parameters.
Part A (Value Iteration) is due Friday, February 16 via Catalyst CollectIt at 11:59 PM.
Part B (Q-Learning) is due Wednesday, February 21 via Catalyst CollectIt at 11:59 PM.
 
What to Do. Download the starter code and try running the file TOH_MDP.py. This will start up a graphical user interface, assuming that you are running the standard distribution of Python 3.6, which includes the Tkinter user-interface technology. You can see that some menus are provided. Some of the menu choices are handled by the starter code, such as setting some of the parameters of the MDP. However, you will need to implement the code that actually runs the Value Iteration and the Q-Learning.

PART A: Implement Value Iteration (40 points).

There is a starter file called YourUWNetID_VI.py. Rename the file, substituting your own UWNetID for "YourUWNetID". Also, change the corresponding import name within TOH_MDP.py.

Complete the implementation of the functions there:

  1. one_step_of_VI(S, A, T, R, gamma, Vk)
    which computes one iteration of Value Iteration from the given MDP information plus the current state values, which are provided in a dictionary Vk whose keys are states and whose values are floats. (A minimal sketch appears after this list.)
  2. return_Q_values(S, A)
    which should return the Q-state values that were computed during the most recent call to one_step_of_VI. See the starter code for more details.
  3. extract_policy(S, A)
    Using the most recently computed Q-state values, determine the implied policy to maximize expected utility. See the starter code for more details.
  4. apply_policy(s)
    Return the action that your current best policy implies for state s.
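To make the intended computation concrete, here is a minimal sketch of one_step_of_VI. It assumes that T(s, a, sp) returns a transition probability, that R(s, a, sp) returns a reward, and that the function returns the new value dictionary together with the Q-values; the exact signatures and return conventions in the starter code may differ, so follow the starter code's comments.

def one_step_of_VI(S, A, T, R, gamma, Vk):
    # One Bellman backup over all states; Vk maps states to their current values.
    Q = {}
    Vk1 = {}
    for s in S:
        for a in A:
            # Q(s,a) = sum over successors s' of T(s,a,s') * (R(s,a,s') + gamma * Vk(s'))
            Q[(s, a)] = sum(T(s, a, sp) * (R(s, a, sp) + gamma * Vk[sp]) for sp in S)
        # The new state value is the best available Q-value at s.
        Vk1[s] = max(Q[(s, a)] for a in A)
    return Vk1, Q

Saving Q (for example, in a module-level variable) is one way to let return_Q_values and extract_policy report the values from the most recent call.
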
PART B: Implement Q-Learning (40 points).

NOTE: UPDATED STARTER CODE WITH FULL SUPPORT FOR THE BASIC IMPLEMENTATIONS IN PART B WAS RELEASED ON WED. FEB. 14 at 11:45 PM. The link on this page to the starter code goes to the latest version. (A bug-fix release on Feb. 15 at 11:45 PM fixed a critical bug in which your choose_next_action function was not called after the last transition of an episode, even though you need the call in order to find out the reward for transitioning out of a goal state.)

Start by renaming the skeleton file YourUWNetID_Q_Learn.py to be your own file, and change the TOH_MDP.py file to import the renamed file (as well as your custom Value Iteration file).

Implement Q-learning to work in two contexts. In one context, a user can "drive" the agent through the problem space by selecting actions through the GUI. After each action is executed by the system, your handle_transition function will be called, and you should program it to use the information provided in the call to do your Q-value update.
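For reference, the core of that update is the standard Q-learning rule. The sketch below assumes a module-level dictionary Qvalues keyed by (state, action), a list ACTIONS, and constants ALPHA and GAMMA; these names and the parameter order of handle_transition are illustrative, not necessarily those of the starter code.

def handle_transition(s, a, sp, r):
    # Sample-based Bellman update after observing (s, a) leading to state sp with reward r.
    # (If sp is the special Terminal state, best_next should be treated as 0.)
    best_next = max(Qvalues.get((sp, ap), 0.0) for ap in ACTIONS)
    sample = r + GAMMA * best_next
    Qvalues[(s, a)] = (1 - ALPHA) * Qvalues.get((s, a), 0.0) + ALPHA * sample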

In the second context, your function choose_next_action will be called, and it should do two things: (1) use the information provided about the last transition to update a Q-value (as in the first context), and (2) select an action for the next transition. That action may simply be an optimal action (based on the current Q-values) or, better, one chosen to make a controlled compromise between exploration and exploitation.

To control that compromise, you should implement epsilon-greedy Q-learning. You should provide two alternatives: (a) fixed epsilon, with the value specified by the GUI or the system (including a possible autograder), and (b) custom epsilon-greedy learning, in which your program sets an epsilon value and changes it during the session so that the agent gradually focuses more on exploitation and less on exploration.
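A minimal sketch of epsilon-greedy selection follows, using the same assumed Qvalues and ACTIONS names as above; epsilon would be either the fixed value from the GUI or whatever value your custom schedule currently holds.

import random

def epsilon_greedy_choice(s, epsilon):
    # With probability epsilon, explore by picking a random action.
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    # Otherwise exploit: pick an action with the highest current Q-value.
    return max(ACTIONS, key=lambda a: Qvalues.get((s, a), 0.0))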

You are encouraged to develop another means of controlling the exploitation/exploration tradeoff. Implementing the use of an exploration function is optional and worth 5 points of extra credit. The GUI provides a menu item to indicate whether you are using one; however, it is up to your code to handle it when that option is selected.
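One common form of exploration function, in the textbook's f(u, n) style, treats rarely tried (state, action) pairs optimistically. The sketch below is only one possibility, not a required design; the visit-count dictionary N and the constants R_PLUS and N_E are assumptions you would define and tune yourself.

def exploration_value(s, a):
    # Pretend untried or rarely tried pairs are worth an optimistic R_PLUS.
    if N.get((s, a), 0) < N_E:
        return R_PLUS
    return Qvalues.get((s, a), 0.0)

def choose_with_exploration_function(s):
    # Greedy with respect to the exploration values rather than the raw Q-values.
    return max(ACTIONS, key=lambda a: exploration_value(s, a))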

One unusual feature of some of the MDPs we use in studying AI is the presence of goal states (or "exit states") and exit actions. To make your code consistent with the way we describe these in lecture, adhere to the following rules when implementing the code that chooses an action to return:

  • (a) When the current state is a goal state, always return the "Exit" action and never return any of the other six actions.
  • (b) When the current state is NOT a goal state, always return one of the six directional actions, and never return the Exit action.
Here is an example of how you can test a state s to find out if it is a goal state:
if is_valid_goal_state(s):
  print("It's a goal state; return the Exit action.")
elif s==Terminal_state:
  print("It's not a goal state. But if it's the special Terminal state, return None.")
else:
  print("it's neither a goal nor the Terminal state, so return some ordinary action.")
Note that the Terminal_state variable is only available in the starter code version issued as a5-starter-code-Feb-14b.tar or later.

Here is the starter code for this assignment. Additional updates and fixes for any bugs in the starter code will be issued as they become available, and some additional features to support the experiments are planned for release over the weekend.

Report for Part B: Q-Learning Experiments (20 points).

The key theme of this assignment is how Q-learning allows an agent to learn a good policy without actually knowing the T and R functions of the MDP. We explore this idea by first using Value Iteration to obtain the correct answers for V*(s), Q*(s, a), and pi*(s). This requires us to have T and R given in advance.

Then we try to find out what it takes for an agent to learn various approximations of these through Q-Learning without being given T and R. After you have implemented both Value Iteration and Q-Learning, you are ready to find out what it takes to get the Q-Learning results close to those of Value Iteration.

We'll address the following general question in several ways: How many iterations, and what processes for controlling the parameters of Q-Learning, will be needed to get the results of Q-Learning "close" to the optimal values?

Here is a list of criteria that may be used to decide whether the results of Q-Learning are "in the ballpark", good, or as good as the optimal values:

  1. Are the state values obtained using Q-Learning (QL) identical to those obtained by Value Iteration (VI)? Usually the answer is no, because these are floating point numbers that gradually converge, but often do not reach their asymptotes in practice.
  2. Is the mean-squared error between V*(s) and Vq(s) less than theta, for some given version of the state values from QL represented by Vq, and for some threshold theta? (A sketch of this computation appears after this list.)
  3. Is the policy pi[QL] extracted from the QL results identical to pi*?
  4. Is the percentage of mismatching policy values less than a threshold?
  5. Do the policies match along the golden path (solution path for the main goal)?
  6. Alternatively, does the VI policy based on some pre-convergence iteration match the policy pi[QL] along the silver path (path to the secondary goal)?
  7. Along the golden path, what fraction of the actions in pi[QL] match their counterparts in pi* ?
  8. Proper policy evaluation is a process of computing the expected value of each state according to the policy. The values obtained during QL may be good enough to imply a policy that can then be evaluated to yield more accurate state values. While this is probably not a practical operation for a robot, it could be a useful offline technique for determining how good some QL results are.
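Two of these criteria are easy to compute directly. The sketch below assumes V_star and V_q are dictionaries of state values and pi_star and pi_q are dictionaries mapping states to actions; the names are illustrative only.

def mean_squared_error(V_star, V_q):
    # Criterion 2: mean of (V*(s) - Vq(s))^2 over all states.
    return sum((V_star[s] - V_q.get(s, 0.0)) ** 2 for s in V_star) / len(V_star)

def policy_mismatch_fraction(pi_star, pi_q, states):
    # Criterion 4: fraction of states where the two policies disagree.
    mismatches = sum(1 for s in states if pi_q.get(s) != pi_star[s])
    return mismatches / len(states)
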
Here is what to report with your Part B turn-in.

With NDISKS=2, two goal states, and gamma=0.9, using passive learning (the user drives the agent via the "console"): (a) How many episodes are required, at a minimum, to achieve a policy that includes the silver path? (b) How many additional episodes, at a minimum, are required to make the policy switch to following the golden path?

With NDISKS=2, one goal state, and gamma=0.9, if you use alpha=0.1 and epsilon=0.2, how many episodes of active Q-learning does it take to achieve a policy that matches the optimal policy along the golden path (optimal solution path)?

First, using NDISKS=3, one goal state, and gamma=0.9, find a way to make your agent learn, starting with all Q values at 0, a policy that includes the golden path. This may require a large number of transitions and episodes, and custom schedules for controlling alpha and epsilon in the Q-Learning. The learning must be entirely active and must not involve any user-driven transitions. If you used an exploration function in controlling the exploration/exploitation tradeoff, describe it next. To get the 5 points of extra credit, not only implement the exploration function, but describe the difference it made in how soon (in terms of numbers of episodes and transitions) the agent learned a policy that included the golden path. (You need to compare two runs: one using the exploration function and one without; at least one of the runs should find a policy that includes the golden path.)

When you have achieved this, take a screen shot of the policy and the Q values your agent has learned. Then describe the details of your strategy, beginning with how your agent adjusted the alpha value. Then describe how it adjusted epsilon. Report the number of episodes used and the number of transitions.
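As one illustration of the kind of schedule your report might describe (not a required design), alpha could decay with the number of visits to each (state, action) pair while epsilon decays per episode. The visit-count dictionary N and the constants in this sketch are assumptions.

def alpha_for(s, a):
    # Learn quickly from early visits to a pair, more conservatively later.
    return 1.0 / (1.0 + N.get((s, a), 0))

def epsilon_for(episode_number):
    # Mostly explore at first, then shift toward exploitation, with a floor.
    return max(0.05, 0.5 * (0.99 ** episode_number))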

What to Turn In.

For Part A, turn in your Value Iteration Python file, named with the scheme YourUWNetID_VI.py. Also, turn in your slightly modified version of TOH_MDP.py, without changing its name. (The modification this time is that it imports your Value Iteration file. You can leave the import of YourUWNetID_Q_Learn as it is in the starter code for this first submission.)

For Part B, turn in your Q_Learn Python file, named with the scheme YourUWNetID_Q_Learn.py, and a version of TOH_MDP.py that imports both of your custom modules (your Value Iteration module and your Q_Learn module). Also turn in your report, which should be named in the related style: YourUWNetID_A5_report.pdf.

Do not turn in any of the other starter code files (Vis_TOH_MDP.py, TowersOfHanoi.py, or the unmodified skeleton files "YourUWNetID_VI.py" and "YourUWNetID_Q_Learn.py").

Updates and Corrections:

Starter code last updated Feb. 23 with bug fixes for those who want to do automatic training until convergence. Previous update: Feb. 19, midnight, adding an option to show the golden path. Previous update: Feb. 19 at 2:45 PM, with a bug fix for the Use-Exploration-Function menu item.

Last major release: Feb. 18, 6:15 PM. This release is intended to be feature-complete; any additional releases will be for minor modifications or bug fixes rather than significant new features. In this release you can find new support for comparing policies, state values and Q values between the two forms of processing MDPs (Value Iteration and Q-Learning). There are also bug fixes for some annoying UI issues or sensitivity of orderings of data.

Previous major releases: Feb. 15 (fixing a bug with Q-values not updating after transitioning to the Terminal state); Feb. 14, 11:40 PM; spec. last updated Feb. 14, 11:45 PM). If necessary, additional updates and corrections will be posted here and/or mentioned in class, in GoPost, or via the mailing list.

Feedback Survey

After submitting your Part B files, please answer this survey.