Assignment 6: Value Iteration and Q-Learning

CSE 415: Introduction to Artificial Intelligence
The University of Washington, Seattle, Spring 2026

Assignment 6 Overview

CSE 415: Introduction to Artificial Intelligence

The University of Washington, Seattle, Spring, 2026

Due: Friday, May 22, via Gradescope at 11:59 PM.

Partnerships are optional for A6 (see "Partnerships").

Introduction

Files you’ll edit:
solver_utils.py Utilities for various reinforcement learning algorithms
a6_report.txt Starter template for your report
Files you should read but NOT edit:
solvers.py Reinforcement learning solvers for MDPs
toh_mdp.py Definitions for the TOH-world MDP
Files you will not edit:
autograder.py Assignment autograder
gui.py Graphical User Interface for the TOH-world MDP
test_cases Test cases directory to support the autograder

This assignment not only offers an opportunity to implement the Value Iteration algorithm, but also an opportunity to see how reinforcement learning techniques can be applied to problem solving of the sort we studied earlier with state-space search.

The interactions between a problem-solving agent and its problem-domain environment are modeled probabilistically using the technique of Markov Decision Processes. Within this framework, the assignment focuses on two approaches to having the agent maximize its expected utility: (A) by a form of planning, in which we assume that the parameters of the MDP (especially the transition model TT and the reward function RR) are known, which takes the form of Value Iteration, and (B) by an important form of reinforcement learning, called Q-learning, in which we assume that the agent does not know the MDP parameters.

The particular MDPs we are working with in this assignment are variations of a “TOH World”, meaning Towers-of-Hanoi World. We can think of an agent as trying to solve a TOH puzzle instance by wandering around in the state space that corresponds to that puzzle instance. If it solves the puzzle, it will get a big reward. Otherwise, it might get nothing, or perhaps a negative reward for wasting time. In Part 1, the agent is allowed to know the transition model and reward function. In Part 2, the agent is not given that information, but has to learn about good policies by exploring and seeing what states it can get to and how good they seem to be based on what rewards it can get and where states seem to lead to.

Getting Started


Download and unzip the starter code. Make sure you are using at least Python 3.9 which includes the Tkinter user-interface. Note that a specific dependency in Python 3.9 might not be available automatically on Windows so if Windows users see the error No module named 'tzdata', you can resolve it by running pip install tzdata.

In a command line, run

python gui.py

This will start up a graphical user interface, and the interface can help you debug. You can see that some menus are provided. Some menu choices are handled by the starter code, such as setting some parameters of the MDP. However, you will need to implement the code to actually run the Value Iteration and the Q-Learning and some menu buttons will not work correctly until you implement the corresponding functions.

TOH world picture.
Figure 1. An early description of the problem “La Tour d’Hanoi” from the 1895 book by Edouard Lucas. Your program will determine an optimal policy for solving this puzzle, as well as compute utility values for each state of the puzzle.

A Note on Python Type Annotation

The first thing you will probably notice is that our code that you will be interacting with, for the most part, is annotated with types. Python is a flexible language that’s great for prototyping small programs, and part of the flexibility comes from its nature as a dynamically typed language. In other words, all the type checking, if it exists at all, happens during runtime. This however can cause issues since it is hard to know what kind of arguments a specific function is expecting. Then students may have a hard time tracking down errors. So for this assignment we type annotated our starter code to help you determine what’s an appropriate input to a function. Furthermore, if you use an IDE with strong support for static inference such as PyCharm, you’ll see an error if your code doesn’t type check. If you wish to read a quick introduction on Type Annotation, this document is a good place to start. If you wish you type-check your code, you can consider the Python Software Foundation funded Python Type Checking program, mypy. You can install mypy with pip install mypy and type check your solution file with

mypy solver_utils.py

Note that mypy will throw a lot of errors if you haven’t completed all the questions. You may want to add some temporary values to suppress these errors before you actually implement those functions.

A Note on Style

With a handful of exceptions, our entire starter code is compliant with the official Python style guide PEP 8 and an industrial extension Google Python Style Guide. We encourage you to follow the same tradition. Again, if you use PyCharm, PEP 8 violations will be pointed out to you.

Autograding

The included autograder has a number of options. You can view it by

python autograder.py --help

Some commonly used flags are: --no-graphics, which disables displaying the GUI, -q which grades only one question, and -p, which grades a particular part.

Part 1: Value Iteration (35 points)

Part 1 is mainly concerned with value iteration. Now would be a good time to review the Bellman equations.

You can test your Part 1 implementations with

python autograder.py -p p1

Q1: Value Iteration Implementation (25 points)

In solver_utils.py, complete the function that computes one step of value iteration:

def value_iteration(
     mdp: tm.TohMdp, v_table: tm.VTable
) -> Tuple[tm.VTable, tm.QTable, float]:
     """..."""

Before you write any code, it would be a good idea to take a look at the toh_mdp.py file, which defines all the components of TOH world MDP. In particular, TohState, Operator, TohAction, and most importantly, TohMdp are the classes you will be interacting the most, and you should familiarise yourself with the way these objects interact with each other. When in doubt of the details of any functionality, the best way is to consult the documentation (provided as comments in the code) and the implementation itself.

Make sure you also read the documentation in the code to avoid some common pitfalls.

You can test your implementation with

python autograder.py -q q1

The autograder will display your value iteration process for a sample TOH MDP. You might notice that the autograder logs lots of additional info. You can suppress this by setting LOGLEVEL from INFO to WARNING in your shell, i.e.,

LOGLEVEL=WARNING python autograder.py -q q1

Note: Our reference solution takes 12 lines.

Q2: Extracting Policies (10 points)

In solver_utils.py, complete the function that extracts the policy from the value table:

def extract_policy(
        mdp: tm.TohMdp, q_table: tm.QTable
) -> tm.Policy:
    """..."""

You can test your implementation with

python autograder.py -q q2

The autograder will display your value iteration process for a sample TOH MDP as well as the policy during each iteration. You can observe how the policy changes as the values change. Finally, the autograder will simulate an agent’s movement using your learned policy. If your implementation is correct, your agent mostly follows the optimal trajectory. You can view the optimal path by selecting “MDP Rewards” -> “Show golden path (optimal solution)”.

Note: Our reference solution takes 2 lines.

Part 2: Q learning (70 points)

You can test your Part 2 implementations with

python autograder.py -p p2

Q3: Q update (20 points)

Q learning performs an update on the Q-values table upon observing a transition, which in this case is represented as a (s,a,r,s)(s, a, r, s') tuple.

Implement the q_update function, which updates the q_table based on the observed transition, with the specified learning rate alpha:

def q_update(
        mdp: tm.TohMdp, q_table: tm.QTable,
        transition: Tuple[tm.TohState, tm.TohAction, float, tm.TohState],
        alpha: float) -> None:

You can test your implementation with

python autograder.py -q q3

Note: Our reference solution takes 5 lines.

Q4: Extract Value Table from Q-value table (10 points)

Q learning only works with the Q-value table, but sometimes it is useful to know the (estimated) value for a certain state. Implement the extract_v_table function, which computes a value table from the Q-value table:

You can test your implementation with

python autograder.py -q q4

Note: Our reference solution takes 2 lines.

Q5: ϵ\epsilon-Greedy Learning (10 points)

At the core of on-policy Q learning is the exploration strategy, and ϵ\epsilon-greedy is a commonly used baseline method. In this assignment, your agent is assumed to perform on-policy learning, i.e., the agent actively participates (chooses the next action) when acting and learning, instead of passively learn by observing a sequence of transitions. Implement the choose_next_action below.

def choose_next_action(
        mdp: tm.TohMdp, state: tm.TohState, epsilon: float, q_table: tm.QTable,
        epsilon_greedy: Callable[[List[tm.TohAction], float], tm.TohAction]
) -> tm.TohAction:

Make sure you use the supplied epsilon_greedy function to sample the action. You should not need to use the random module. That is handled for you in the QLearningSolver.epsilon_greedy function in solvers.py. Make sure you read its code.

You can test your implementation with

python autograder.py -q q5

Note that this time the autograder will do a sanity check of your implementation first, and then use your exploration strategy to run for 10,000 Q learning steps with ϵ=0.2\epsilon=0.2. You will receive full credit if after 10,000 Q updates, your agent is able to learn the optimal policy on the solution path (it’s ok if the policy for other states are suboptimal).

Note: Our reference solution takes 4 lines.

Q6: Custom Epsilon (10 points)

If you use a constant ϵ\epsilon, your program will waste time exploring unnecessary states as time approaches infinity. A common way to mitigate this is to use a function for ϵ\epsilon that depends on the time step. For example, see slides 44-46 on GLIE: Greedy in the Limit with Infinite Exploration in this PPT presentation. Implement the custom_epsilon to design your function of ϵ\epsilon:

You can test your implementation with

python autograder.py -q q6

Same as Q5, the autograder is going to use your custom ϵ\epsilon function to perform Q learning, and you will receive full credit if after 10,000 Q updates, your agent is able to learn the optimal policy on the solution path (it’s ok if the policy for other states are suboptimal).

Note: Our reference solution takes 1 line.

Q7: Exploration Function (10 points)

While epsilon-greedy Q-Learning uses randomness to control the exploration, it is also possible to control the exploration deterministically. One way to do this is by using "exploration functions" that incentivize the agent to explore parts of the state space that have not been visited much as others.

Implement two functions in solver_utils.py. First, implement the exploration function itself, which returns an optimistically adjusted value for a Q-state based on its Q-value and how many times it has been visited:

def exploration_fn(u: float, n: int) -> float:

A typical form uses a decreasing bonus: return u + R_plus / math.sqrt(n + 1), where R_plus is an optimistic upper bound on any achievable reward. This ensures under-explored Q-states always receive a positive bonus that shrinks as visits accumulate, so the agent never hard-locks into a greedy policy before Q-values have converged. Then implement the action-selection function that uses it:

def choose_next_action_with_exploration_fn(
        mdp: tm.TohMdp, state: tm.TohState, q_table: tm.QTable,
        visit_counts: Dict[Tuple[tm.TohState, tm.TohAction], int],
        exploration_fn: Callable[[float, int], float]
) -> tm.TohAction:

This function should pick the action that maximizes exploration_fn(Q(state, action), visit_counts[(state, action)]) over all available actions, and increment visit_counts[(state, chosen_action)] before returning. When multiple actions tie for the highest adjusted value, break ties randomly (e.g., using random.choice()). The visit_counts dictionary is maintained for you in solvers.py — do not edit solvers.py. To activate this exploration strategy, set solver.use_exploration_fn = True on a QLearningSolver instance.

You can test your implementation with

python autograder.py -q q7

The autograder will sanity-check your exploration_fn and then run 5,000 Q-learning steps (3 disks, no noise, alpha=0.1). You will receive full credit if your agent learns the optimal policy on the solution path (it's ok if the policy for other states are suboptimal).

Note: Our reference solution takes 1 line for exploration_fn and 4 lines for choose_next_action_with_exploration_fn.

Q8: Custom Alpha -- learning rate (10 points)

Design and implement a scheme for gradually lowering the value of alpha (the learning rate) that is used in the Q-update formula. The purpose of starting with a relatively high value (close to 1) and gradually lowering it, is to immediately obtain plausible Q-value estimates (vs the zeros with which they have been initialized), but over time, to reduce the influence of new samples, since giving them a lot of influence makes the Q-updates "forget" the influence of samples further back in time.

def custom_alpha(n_step: int) -> float:

To activate your custom alpha schedule, set solver.use_custom_alpha = True and solver.alpha = None on a QLearningSolver instance.

You can test your implementation with

python autograder.py -q q8

Same as Q6, the autograder will run 10,000 Q-learning steps using your custom alpha schedule. You will receive full credit if your agent learns the optimal policy on the solution path.

Note: Our reference solution takes 1 line.

Part 3: Report (40 points)

Q9: Report (40 points)

The starter code includes a file a6_report.txt that serves as both your Q9 report and your Learning Diary Entry (LDE) for this assignment. You do not need to submit a separate LDE. Fill in all four parts of the template and submit it alongside solver_utils.py to Gradescope. Name the file a6_report_firstname_lastname.txt (submitting partner's name). If you are working in a partnership, include both partners' names and IDs in the header.

The template has four parts:

  • Part A — Learning Diary: Process (not Q9-graded): Goals and Activities — a brief narrative of what you set out to do and how your work unfolded.
  • Part B — Report Questions (40 points, graded): The six questions below. Questions 3 and 5 require you to run experiments and report numerical results; a short table is expected. Keep answers for the other questions to 3–4 sentences.
  • Part C — Learning Diary: Reflection (not Q9-graded): Challenges & Solutions, Reflections/Learnings, and Next Steps.
  • Part D — For the Record (not Q9-graded): Your raw AI prompting log (up to 100K bytes). Question 6 in Part B is your reflective summary; Part D is the supporting record.

Part B questions (40 points total):

  1. In the context of the Towers-of-Hanoi World MDP, explain how the Value Iteration algorithm uses the Bellman equations to iteratively compute the value table. (5 points)
  2. How did you decide your custom epsilon function? What thoughts went into that and what would you change to further optimize your exploration? If your function was strong, explain why. (5 points)
  3. Run Q-Learning with three different exploration strategies: constant epsilon 0.2, your custom epsilon schedule, and your exploration function. Use the 3-disk MDP with 20% noise, training for 10,000 steps in each case. Choose three states on or near the solution path, report the learned values for each strategy in a table, and explain any discrepancies you observe. (10 points)
  4. For what value of the discount factor gamma does Value Iteration fail to find a policy that reaches the goal? Explain why. (5 points)
  5. Explain how your custom alpha function works and whether it is effective in gradually lowering the learning rate without lowering it too much. Compare Q-learning results with and without your custom alpha after 10,000 training steps, using your Value Iteration results as a benchmark. (10 points)
  6. What parts of this assignment did you use AI assistance for? What did the AI get wrong, or what did you have to verify, correct, or rethink? (5 points)

Submission

Submit two files to Gradescope:

  • solver_utils.py — your implementation.
  • a6_report_firstname_lastname.txt — your completed report template. This file also serves as your Learning Diary Entry for A6; no separate LDE submission is required.

You can run the entire test suite with

python autograder.py

Partnerships

Partnerships of two are permitted for this assignment. If you are in a partnership, do the following: (1) In each source file that you turn in, include a comment at the top of the file of the form

  # Partnership? YES
  # Submitting partner: John Doe
  # Other partner: Jane Smith
and (2) indicate when you submit to GradeScope that this is a group submission and identify your partner.

Updates and Corrections

If needed, updates and corrections will appear here and/or in ED.