From: Danny Wyatt (danny_at_cs.washington.edu)
Date: Sun Nov 23 2003 - 22:21:56 PST
Acting Optimally in Partially Observable Stochastic Domains
Anthony R. Cassandra, Leslie Pack Kaelbling, Michael L. Littman
Summary
The authors describe how a POMDP over the environment's state space can be
recast as an MDP over the continuous space of belief states; they present a
form of value iteration that partitions belief space into regions of belief
states that share the same optimal action; and they show how these
partitions can be represented as a policy graph.
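To make the recasting concrete, here is a minimal sketch of the Bayesian
belief update that serves as the transition function of the belief MDP. The
array layout and names (T and O indexed by action) are my own assumptions
for illustration, not the paper's notation.

    import numpy as np

    def belief_update(b, a, o, T, O):
        """One step of the belief MDP's transition function.

        b : current belief over environment states, shape (|S|,)
        T : T[a, s, s2] = P(s2 | s, a)   (assumed layout)
        O : O[a, s2, o] = P(o | s2, a)   (assumed layout)
        """
        # Unnormalized P(s2, o | b, a): predict forward, then weight
        # each successor state by how well it explains the observation.
        b_next = O[a, :, o] * (T[a].T @ b)
        return b_next / b_next.sum()

Because the next belief depends only on (b, a, o), the belief itself is a
fully observable, Markovian state.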
Important Ideas
Searching through belief space returns the problem to full observability,
since the agent always knows its own belief. Defining rewards over belief
states also has the benefit of collapsing environment-based rewards and
information-gathering rewards into a single quantity, so the same policy can
weigh the two against each other.
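A hedged sketch of that reward collapse, under the same assumed array layout
as above (R[s, a] being the environment reward; this is an illustration, not
the paper's code):

    def belief_reward(b, a, R):
        """Expected immediate reward of action a under belief b.

        R[s, a] = environment reward for action a in state s (assumed layout).
        Information-gathering actions pay off indirectly: they sharpen
        future beliefs, which raises the expected reward of later actions.
        """
        return float(b @ R[:, a])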
Approximating the optimal value function with a finite set of vectors
partitions belief space into regions that share an optimal action, and the
belief states in a region often transition to the same next region after
that action. This allows the entire policy to be represented as a small
finite automaton that no longer needs to maintain belief probabilities at
all; it simply transitions on the observations it receives.
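Two short sketches of those ideas, again using assumed data structures
rather than the paper's: the first picks an action via the maximizing
alpha-vector, the second walks a policy graph whose edges are labelled by
observations, with no belief tracking at all.

    import numpy as np

    def greedy_action(b, vectors, actions):
        """Piecewise-linear value function: V(b) = max_k vectors[k] . b.

        vectors : array of alpha-vectors, shape (K, |S|)
        actions : actions[k] is the action attached to vectors[k]
        """
        k = int(np.argmax(vectors @ b))
        return actions[k], float(vectors[k] @ b)

    def run_policy_graph(start_node, graph, observations):
        """Execute a policy graph: graph[node] = (action, {obs: next_node}).

        The agent only follows observation-labelled edges; no belief
        update is ever computed online.
        """
        node = start_node
        for o in observations:
            action, edges = graph[node]
            yield action
            node = edges[o]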
Flaws
I couldn't follow all of the (necessarily summarized) details of how their
Witness algorithm fits into value iteration, but that did not seem to be the
main point of the paper.
Open Questions
Personally, I would like to hear more about any work towards some kind
of "unified theory of MDPs" that considers all of the different
techniques we've learned about so far. (Of course, the authors of this
paper can't be blamed for not predicting the later papers we've read.)