Assignment 6: AEMS

Heaven or hell
I can't tell.

In this assignment you will implement an online POMDP solver, named AEMS2.

Introduction

As we have seen in class, the agent can’t directly observe the world-states in a POMDP model, thus the agent maintains a posterior probability distribution over states to represent its belief of what world-state it is actually in. This posterior probability is called the belief state:

\(b_t(s) = P(s | s_0, a_0, z_1, a_1, ..., a_{t-1}, z_{t})\)

The belief state is updated in two steps after each action: first because of performing the action (\(a_t\)) and then based on the new observation (\(z_{t+1}\)) that the agent received because of this action:

\begin{align} b'_t(s) &= P(s | s_0, a_0, z_1, a_1, ..., a_{t-1}, z_{t}), a_{t}\\ &= \sum_{s'} P(s'|s_0, ..., a_{t-1}, z_t) P(s|a_t, s')\\ &= \sum_{s'} b_t(s) P(s|a_t, s') \end{align} \begin{align} b_{t+1}(s) &= P(s | s_0, ..., z_{t}, a_{t}, z_{t+1})\\ &= P(s | s_0, ..., a_{t-1}, z_{t}, a_{t}) P(s|a_t, z_{t+1})\\ &= b'_t(s) P(s|z_{t+1})\\ &\propto b'_t(s) P(z_{t+1}|s, a_t)\\ &\propto b'_t(s) O(s, a_t, z_{t+1}) \end{align}

The belief state represents the agent’s partial understanding of which state of the world it is actually in. Therefore, a POMDP may actually be seen as an MDP, whose states are the belief states (probability distributions over world states) of the original model. This means that in theory we could apply the same techniques used in solving MDPs to find the optimal policy of a POMDP model.

However, the belief state space is a continuous, real-valued probability distribution over states, which is of infinite size. As a result, MDP solvers such as VI, PI, and even RTDP are too computationally expensive. This is true even when we know the world dynamics encoded by the transition, reward functions, and observation functions. While we may use the same fundamental principles of search such as greedy action selection and admissible heuristics, we need to develop more complex algorithms with better convergence rates to operate on POMDP models.

Since the problem is so hard, an offline solver may never finish computing the complete optimal policy, even for a given initial state. We turn to an online POMDP solver, which interleaves planning with executing actions. By considering a number of actions in an anytime fashion, the agent is able to choose a good action when it must act in the real world. The online solver represents its ever-expanding investigation into the best actions by growing a search tree. Our goal is to come up with the most efficient way to approximate the true value function of the current belief state by using smart heuristics, and smart node expansion strategies.

And-Or Tree

To represent the search tree of an online POMDP solver, we can encode belief states during the search process (\(b\) and \(b’\)) as nodes in the tree. As the belief state is updated by two aspects, i.e. action and observation, there are two different types of belief nodes: OR nodes representing parts of the search where an action is picked (\(b_t \rightarrow b’_t\)) and AND nodes representing parts of the search where the observations are considered (\(b'_t \rightarrow b_{t+1}\)).

Note that since the agent can pick the actions, only one branch of an OR node would be optimal. In contrast, at an AND node, it may be possible to observe any observation so all branches need to be considered.

And-Or Tree: Triangles represent OR nodes where the agent should pick one action from all available ones. OR node contains the belief state. Circles represent AND nodes where the agent observes. The probability of each observation depends on the observation function and the posterior probability of states in the current node (\(b’\)). This And-Or tree has two heuristics, one upper and one lower bound (numbers in brackets).

Search: Heuristics and Pruning

Similar to what we saw with RTDP, we can estimate the value of each belief node by using an admissible heuristic (i.e. an upper bound) and update the bound after each expansion. The action selection strategy at the root is greedy as before. We note from class that QMDP always over-estimates (is an upper bound) so it can be used as a heuristic.

As we learned before, search performance is highly dependent of the heuristic. The problem is that finding a general tight upper bound is not easy. As a result, a greedy algorithm using the various available upper bounds does not perform well.

Recall alpha-beta pruning in adversarial search where we used a branch-and-bound mechanism to efficiently prune the search tree. We can utilize the same principles here and select two heuristics (one upper and one lower bound) which will enable us to prune nodes with their upper bound lower than the lower bound of at least one other node. In fact, there is an online POMDP algorithm that is based on pruning (Ros et al 2008). However, without sufficiently tight bounds, pruning turns out to not be very effective in practice.

One approach to improving branch-and-bound pruning is to try to minimize the difference between the upper and lower bounds, referred to as Error Minimization. We can think of this difference as the agent’s uncertainty about the value of an action, or alternatively, as the error of its estimated value.

Anytime Error Minimization Search (AEMS) is based on this idea. At each step it expands and updates the node with maximum error. In this assignment you will implement the AEMS2 algorithm. You need to read the original paper to fully understand it as a part of the assignment. However, we have already shown the main intuition behind it.

Question 1 (8 points) - Understanding AEMS2

Read the paper and answer the following questions in the writeup:

(2 points) What does “anytime” mean? Why is it important in an online solver?
(2 points) How is the error calculated in AEMS2? Is it just the difference between upper bound and lower bound? Write the formula.
(4 points) Write down the error term that AEMS2 would use for the unexpanded nodes in the And-Or tree figure (\(b_2\) to \(b_8\)). (Discount factor \(\gamma = 0.95\))

Implementation

Important: For this part of the assignment, you will need to have NumPy installed.

If you don't already have NumPy installed, you may install it through your distribution package manager if you are using Linux.

Alternatively, you can install NumPy using pip: pip install numpy (you may need to use pip3), or conda: conda install numpy.

Please use this link to download the zip file for the assignment code.

We will provide you the model parser, an environment simulator (sorry, not as fancy as Pacman), and a tight lower bound. You are required to implement an upper bound, a loose lower bound, and most importantly AEMS2 for an infinite horizon POMDP.

Files you'll edit:
`mdpsolver.py`	Both of the heuristics you implement.
`aems.py`	Your implementation of AEMS2
Files you will not edit:
`pomdp.py`	Store the model in a pomdp class (learn how T, O, and R are presented)
`offlineSolver.py`	General offline solver for POMDPs (look at the abstract methods)
`onlineSolver.py`	General online solver for POMDPs (look at the abstract methods)
`environment.py`	Changes the real state of the environment by the agent’s action, simulate and return an observation with gained instant reward
`policyReader.py`	Gives the value function or the best function based on a provided policy (used for PBVI as the lower bound).
`evaluate.py`	Evaluate a POMDP solver by running it multiple times.
`test.py`	Test and grade! You can check the tests in test folder.

The above builds on a general POMDP solver that works with any model expressed in a common POMDP format, developed by Anthony Cassandra (see here: http://www.pomdp.org/code/pomdp-file-spec.html.)

A few environments are provided in /examples/env:

`Tiger`	Two closed doors, opening one gives you a large reward. Behind the other one is a tiger! (Full description: Kaelbling et al., 1998)
`Hallway`	A navigation environment with 57 states. (Full description: QMDP paper by Littman et al., 1995)
`Hallway2`	A navigation environment with 57 states. (Full description: QMDP paper by Littman et al., 1995)
`TagAvoid`	Laser tag game for robots! (Full description: PBVI paper by Pineau et al., 03)
`RockSample_4_4`	Scientific exploration by a rover (Full description: HSVI paper by Smith & Simmons, 04)

You do not need to worry too much about the format. Just take a look at the pomdp parser in pomdp.py and learn how transition, observation, and reward functions are represented. Remember that we only work with indices in the code. Names of the actions, states, and observations do not matter.

Important: The parameter named “precision” determines the computing precision of your methods. Use it to avoid extra computation in value iteration or heuristic updates.

Question 2 (5 points) - Upper bound: QMDP

QMDP is one of the very first algorithms suggested for solving POMDPs. It is offline and very fast, but not really a very good approximation. Therefore, QMDP is often used to calculate a bound for other online solvers. It assumes that after only one action, the agent will learn what state it is in (an optimistic assumption). As a result, after one action the problem is basically an MDP. Consequently, the agent can use the value functions of the MDP of the same model. Therefore, the value function of a belief in an infinite horizon QMDP is:

\begin{align} Q^{\text{MDP}} (s, a) &= R(s, a) + \gamma \sum_{s'} T(s, a, s') V^{\text{MDP}} (s')\\ Q^{\text{MDP}} (b_t, a) &= \sum_{s} b_t(s) Q^{\text{MDP}} (s, a)\\ V^{\text{QMDP}} &= \max_a Q^{\text{MDP}} (b_t, a) \end{align}

Implement the QMDP class in mdpsolver.py. Although it is offline, you should make it as fast as possible, for example by using vectorization (using matrix multiplications instead of loops).

Hint: How can you obtain \(R(s_{\text{start}}, a)\) from your general 4D reward function? Is it worth saving it after one computation?

You can evaluate the performance of QMDP with the following command:

python3 evaluate.py QMDP [Environment] [Number of Runs] [precision]

For example the following command gives you the average total reward of QMDP for 1000 runs in Hallway problem (Computing precision = 0.01):

python3 evaluate.py QMDP Hallway 1000 0.01

You can test your QMDP implementation with the following command:

python3 test.py q2

Question 3 (3 points) - Lower bound: Worst possible future reward

Implement a lower bound as follows: After the (expected) immediate reward of the next action, the agent always receive the worst possible outcome (minimum of all \(R(s, a, s’)\) no matter what the belief is):

\begin{align} R_{\text{min}} &= \min_{a,s,s'} R(s, a, s')\\ V^{\text{worst}}(b_t) &= \max_a \sum_s b_t (s) R(s, a) + (\gamma + \gamma^2 + ...) R_{\text{min}}\\ \end{align}

Hint: What is \((\gamma + \gamma^2 + ...)\) for \(\gamma< 1\)?

Implement this bound in the MinMDP class. You can evaluate it with the following command:

python3 evaluate.py MinMDP [Environment] [Number of Runs]  [precision]

You can test your implementation by running the following command:

python3 test.py q3

Question 4 (8 points) - AEMS2

Implement AEMS2 using the bounds above. Remember you have a limited time for each action so try to optimize your code to be efficient.

You can evaluate your code with the following command:

python3 evaluate.py AEMS2 [LowerBound] [UpperBound] [time limit in sec] [Environment] [Number of Runs]  [precision]

For example this will return the average reward of AEMS2 using QMDP and MinMDP on Hallway environment over 200 runs, if for each action the solver has at most 100 ms (0.1 s) with computing precision of 0.01:

python3 evaluate.py AEMS2 MinMDP QMDP 0.1 Hallway 200 0.01

You can test your implementation by running the following command:

python3 test.py q4

Question 5 (8 points) - Playing with the time limit and heuristics

We have provided you the policy of a point based method for RockSample_4_4 and TagAvoid. Evaluate the performance of AEMS2 with PBVI as the lower bound for these environments for various time limits: 5 ms, 10 ms, 20 ms, 50 ms, 100 ms, 200 ms, and 500 ms. Compare using PBVI with MinMDP by following the same procedure with MinMDP.

Each evaluation should be done on for at least 500 runs for time limits less than or equal to 100ms. Minimum numbers of runs required for 200ms and 500ms are 200 and 100, respectively. Report the results in the form of two plots (one for each environment). Y axis is the average reward over the runs, and X axis is the time limit. Each plot contains two sets of 7 data points (connect the points). Show PBVI results with blue and MinMDP with red. Feel free to use any tool to generate the plots. Briefly explain the results (Compare the heuristics, and also time limits).Precision of .1 is sufficient for this report.

Example commands you need to run for this question:

python3 evaluate.py AEMS2 PBVI QMDP 0.25 RockSample_4_4 500 0.1

python3 evaluate.py AEMS2 MinMDP QMDP 0.01 TagAvoid 500 0.1

For longer time limits, you can evaluate AEMS2 for smaller number of runs and then report the average, e.g instead of 500 runs with one command use 50 runs 10 times and calculate the average by hand.