Due: Monday
12/15/03 2:30pm –
hand in to Alicen Smith, room 546.

Instructions: Open
book, but please no collaboration or discussion of the problems except via email
to Dan (who will likely answer to everyone). If any problem is underspecified,
feel free to make assumptions. Just state and justify your assumptions.

1.In our
discussion of Graphplan, we talked about its relation
to constraint satisfaction techniques. This problem explores another such connection:
the connection between mutual-exclusion reasoning and k-consistency
preprocessing. Consider the problem of determining if there exists an n-step
parallel plan to a planning problem (i.e. a plan which achieves its goal by executing
zero, one or more actions per time step, for n time steps). This can be
expressed as a CSP in several ways, one of which has Boolean action and
proposition variables. If there are |A| actions, then one uses n|A| action variables and the i-j^{th}such variable denotes whether the ith action is true at the jth
time step. (Note that no-op actions must be considered to be members of the set
of actions, A). One requires (n+1)|P| proposition
variables, which similarly denote whether proposition I is true or false at a
time step. Constraints specify that the initial conditions are true at time
step 0, that the goals hold at time n, that an action implies all of its
preconditions, that an action implies all of its effects, that if a proposition
is true at time step t, then there exists an action at the previous level which
makes that proposition true (this could be a no-op, which is an action having P
as sole precondition and effect, and that two action variables may not both be
true at a time step if one action deletes the other’s preconditions.

a.[2 points]
Given this model, do you need to explicitly record constraints saying that two
action variables may not be both be true at a time
step if one deletes a proposition that the other makes true?Explain in one clear and concise English
sentence.

b.[6 points]
Once we write the complete set of primitive constraints (i.e. including those
from the previous question if necessary and perhaps some others which Dan has
forgotten) one could solve the planning problem by using any constraint
satisfaction algorithm: e.g. conflict-directed backjumping,
etc. However, it might be possible to preprocess the CSP, simplifying it in a
way (e.g. by eliminating inconsistent values from the domains of certain
variables) that would speed the subsequent search. For example, one might use
arc consistency, or path consistency, or k-consistency for some k.Then again, one might view traditional Graphplan mutual exclusion reasoning as a form of
preprocessing. I.e. if we used a planning to generate a subset of the action
and proposition variables above, one can think of that process as eliminating
the “true” value from the domain of variables which aren’t in
the graph. Fred Foosball has gone on record as stating “There exists a
constant, m, such that that mutual exclusion reasoning prunes exactly the
values that m-consistency would”. Do you think Fred is right, partly
right, or dead wrong?Explain as clearly
as possible.

2.Consider
an undiscounted MDP having three states X, Y, Z with rewards -1, -2, and 0
respectively. State Z is terminal, but two actions (a and b) are available for
X and Y. Executing a in X takes the agent to Y with
0.8 probability, otherwise it stays put. Executing a
in Y takes the agent to X with 0.8 probability, otherwise it stays put.
Executing b in either X or Y takes the agent to Z with 0.9 probability,
otherwise it stays put.

a.[5
points] Assuming an initial policy has the policy of always executing
action b, perform policy iteration, showing each step
in full, until you arrive at the optimal policy. What is value of each
state?You’ll be graded on
how clearly you work through the policy iteration algorithm, so be neat.

b.[2
points] What happens to policy iteration if the initial policy has a in both
states? (Answer with one clear and concise English sentence).

c.[2
points] Does discounting change things? How? (Answer with one clear and concise
English sentence).

d.[2
points] Does the numeric value of the discount factor affect the final policy?
(If so, explain with one clear and concise English sentence).

3.The
textbook defines the Bellman equation (eq 17.5) for a
discounted MDP assuming that the reward is a function of a state, but what if
the reward is a function of the source state, action executed and
the outcome state: R(s, a, s’)?

a.[2
points] Write the Bellman equation for this case

b.[5
points] Show how an MDP with reward function R(s, a, s’) can be
mechanically transformed into an equivalent MDP whose reward function is R(s),
such that optimal policies in the new MDP correspond to optimal policies in the
original MDP. Hint, this is a little bit like the
conversion from POMDPs into an MDP over belief space.
You’ll need to define a new space for the new MDP.

4.Consider
playing tic-tac-toe against an opponent who plays randomly. Specifically,
suppose that the opponent chooses any open space with uniform probability,
unless the move is forced (in which case it makes the obvious correct move).

a.[3
points] Formulate the problem of learning an optimal tic-tac-toe strategy in
this case as a Q-learning task.

b.[1 point]
Will your program succeed if the opponent plays optimally rather than randomly?

c.[1 point]
What if your opponent changed its strategy after playing a while.
Under what condition would your agent adapt to the new strategy?

d.[1 point]
Is there a way to exploit the fact that the tic-tac-toe board,
is symmetric? (Answer with one clear and concise English sentence, describing
what you might do and another sentence explaining the benefit(s).

e.[1 point]
Could symmetry analysis hurt the performance of the learner? (Explain with one
clear and concise English sentence).