The Towers of Humptulips
CSE 415: Introduction to Artificial Intelligence
The University of Washington, Seattle, Spring 2016
Here, we define the Towers of Humptulips problem. Its problem space is essentially that of a 4-disk Towers of Hanoi instance. There are some variations to the story, as well. First of all the "pegs" here are cedar stumps, which means that the disks are pretty large, and there is a significant cost to some transitions. Furthermore, the Humptulips tribe has determined that it is bad luck to ever have (a) the smallest and largest disks together without others on the same stump, while the other two disks are on their own separate stumps. The little disk perhaps represents a small creature who would be eaten by the big-disk monster, were it not for one of the other creatures (middle-sized disks) protecting it. As a consequence, certain states are almost taboo, in terms of their reward values. By the way, the reward function R(s, a, sp) in this problem yields a reward that depends only on two things: the size of the disk moved and the new state reached.

There is a -50 reward for arriving in one of these 6 states:

    [[4,1],[2],[3]], [[4,1],[3],[2]]
    [[2],[4,1],[3]], [[3],[4,1],[2]]
    [[2],[3],[4,1]], [[3],[2],[4,1]]
There is a 1000 point reward for arriving in the state.
    [[],[],[4,3,2,1]]
Arriving in any other state does not contribute to the reward. However, the part of the reward that depends on the disk moved is defined this way: Moving the small disk: -1; the second smallest disk: -2; the second largest disk: -4; the largest disk: -8. To determine the reward R(s, a, sp), just add the cost of the disk moved to the reward associated with destination state sp.

Of course, learning to solve this problem means learning an optimal policy for moving the disks to get to the 20-point goal state without passing through any of the taboo states.

You should assume that the actions for this problem are the following:

A1-2. Try to move a disk from Stump 1 to Stump 2.
A1-3. Try to move a disk from Stump 1 to Stump 2.
A2-1. Try to move a disk from Stump 2 to Stump 1.
A2-3. Try to move a disk from Stump 2 to Stump 1.
A3-1. Try to move a disk from Stump 3 to Stump 1.
A3-2. Try to move a disk from Stump 3 to Stump 2.
      
You may also assume that if the action's normal operator has its precondition satisfied, then that operator is applied with probability 1. If the operator's precondition is not satisfied, then the action does not change the state. This is the "deterministic" transitions model.
You may wish to try the following rule, in order to find out the effects of randomness: The normal operator is selected with probability 0.5, and there is a 0.1 chance of each of the other operators being selected. This is the "noisy" transitions model. No matter which operator gets randomly selected by the MDP simulation, that operator either produces a new state (if its precondition is satisfied) or does not change the state, due to its precondition not being satisfied.