# Partially observable Markov decision process

A partially observable Markov decision process (POMDP) is a generalization of a Markov decision process (MDP). A POMDP models an agent decision process in which it is assumed that the system dynamics are determined by an MDP, but the agent cannot directly observe the underlying state. Instead, it must maintain a probability distribution over the set of possible states, based on a set of observations and observation probabilities, and the underlying MDP.

The POMDP framework is general enough to model a variety of real-world sequential decision processes. Applications include robot navigation problems, machine maintenance, and planning under uncertainty in general. The framework originated in the operations research community, and was later taken over by the artificial intelligence and automated planning communities.

An exact solution to a POMDP yields the optimal action for each possible belief over the world states. The optimal action maximizes (or minimizes) the expected reward (or cost) of the agent over a possibly infinite horizon. The sequence of optimal actions is known as the optimal policy of the agent for interacting with its environment.

## Definition

### Formal definition

A discrete-time POMDP models the relationship between an agent and its environment. Formally, a POMDP is a tuple ${\displaystyle (S,A,T,R,\Omega ,O,\gamma )}$, where

- ${\displaystyle S}$ is a set of states,
- ${\displaystyle A}$ is a set of actions,
- ${\displaystyle T}$ is a set of conditional transition probabilities ${\displaystyle T(s'\mid s,a)}$ between states,
- ${\displaystyle R:S\times A\to \mathbb {R} }$ is the reward function,
- ${\displaystyle \Omega }$ is a set of observations,
- ${\displaystyle O}$ is a set of conditional observation probabilities ${\displaystyle O(o\mid s',a)}$,
- ${\displaystyle \gamma \in [0,1]}$ is the discount factor.

At each time period, the environment is in some state ${\displaystyle s\in S}$. The agent takes an action ${\displaystyle a\in A}$, which causes the environment to transition to state ${\displaystyle s'}$ with probability ${\displaystyle T(s'\mid s,a)}$. At the same time, the agent receives an observation ${\displaystyle o\in \Omega }$ which depends on the new state of the environment with probability ${\displaystyle O(o\mid s',a)}$. Finally, the agent receives a reward equal to ${\displaystyle R(s,a)}$. Then the process repeats. The goal is for the agent to choose actions at each time step that maximize its expected future discounted reward: ${\displaystyle E\left[\sum _{t=0}^{\infty }\gamma ^{t}r_{t}\right]}$. The discount factor ${\displaystyle \gamma }$ determines how much immediate rewards are favored over more distant rewards. When ${\displaystyle \gamma =0}$ the agent only cares about which action will yield the largest expected immediate reward; when ${\displaystyle \gamma =1}$ the agent cares about maximizing the expected sum of future rewards.
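As a concrete illustration of these dynamics, the following sketch simulates one episode of a small hypothetical POMDP. The arrays `T`, `Z` (standing in for ${\displaystyle O}$), `R`, the horizon, and the uniformly random placeholder policy are toy assumptions made purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy POMDP with 2 states, 2 actions, 2 observations.
# T[s, a, s'] : transition probabilities T(s' | s, a)
# Z[a, s', o] : observation probabilities O(o | s', a)
# R[s, a]     : immediate reward R(s, a)
T = np.array([[[0.9, 0.1], [0.5, 0.5]],
              [[0.2, 0.8], [0.5, 0.5]]])
Z = np.array([[[0.85, 0.15], [0.15, 0.85]],
              [[0.5, 0.5], [0.5, 0.5]]])
R = np.array([[1.0, 0.0],
              [0.0, 1.0]])
gamma = 0.95

s = rng.integers(2)            # hidden initial state, unknown to the agent
discounted_return = 0.0
for t in range(100):
    a = rng.integers(2)                 # placeholder policy: act uniformly at random
    discounted_return += gamma ** t * R[s, a]
    s = rng.choice(2, p=T[s, a])        # environment moves to s' ~ T(. | s, a)
    o = rng.choice(2, p=Z[a, s])        # agent receives o ~ O(. | s', a)
print(discounted_return)
```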

### Discussion

The difficulty is that the agent does not know the exact state it is in. To act, it must maintain a probability distribution over the possible states ${\displaystyle S}$, known as the belief state, and update this distribution after every action taken and observation received. Because the belief state summarizes the history of actions and observations, the belief process is itself Markovian, which allows the agent to reason about its next action without reference to the full past.

It is instructive to compare the above definition with the definition of a Markov decision process. An MDP does not include the observation set or the observation probabilities, because the agent always knows the current state of the environment with certainty.

## Belief update

An agent needs to update its belief upon taking the action ${\displaystyle a}$ and observing ${\displaystyle o}$. Since the state is Markovian, maintaining a belief over the states solely requires knowledge of the previous belief state, the action taken, and the current observation. The operation is denoted ${\displaystyle b'=\tau (b,a,o)}$. Below we describe how this belief update is computed.

Let ${\displaystyle b}$ be a probability distribution over the state space ${\displaystyle S}$: ${\displaystyle b(s)}$ denotes the probability that the environment is in state ${\displaystyle s}$. After taking action ${\displaystyle a}$ and reaching ${\displaystyle s'}$, the agent observes ${\displaystyle o\in \Omega }$ with probability ${\displaystyle O(o\mid s',a)}$. The updated belief is then

${\displaystyle b'(s')=\eta O(o\mid s',a)\sum _{s\in S}T(s'\mid s,a)b(s),}$

where ${\displaystyle \eta =1/\Pr(o\mid b,a)}$ is a normalizing constant with ${\displaystyle \Pr(o\mid b,a)=\sum _{s'\in S}O(o\mid s',a)\sum _{s\in S}T(s'\mid s,a)b(s)}$.
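A minimal sketch of this update for a finite POMDP, assuming (as a hypothetical layout) that the models are stored as NumPy arrays `T[s, a, s']` for ${\displaystyle T(s'\mid s,a)}$ and `Z[a, s', o]` for ${\displaystyle O(o\mid s',a)}$:

```python
import numpy as np

def belief_update(b, a, o, T, Z):
    """tau(b, a, o): Bayesian belief update for a finite POMDP.

    b : current belief over states, shape (|S|,)
    T : T[s, a, s'], transition probabilities T(s' | s, a)
    Z : Z[a, s', o], observation probabilities O(o | s', a)
    """
    # Predicted distribution over s': sum_s T(s' | s, a) b(s)
    predicted = b @ T[:, a, :]
    # Weight by the likelihood of the received observation O(o | s', a)
    unnormalized = Z[a, :, o] * predicted
    # Dividing by the sum applies eta = 1 / Pr(o | b, a);
    # the update is undefined if Pr(o | b, a) = 0.
    return unnormalized / unnormalized.sum()
```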

## Belief MDP

A Markovian belief state allows a POMDP to be formulated as a Markov decision process where every belief is a state. The resulting belief MDP is thus defined on a continuous state space, since there are infinitely many beliefs for any given POMDP.[1] The belief MDP is defined as a tuple ${\displaystyle (B,A,\tau ,r,\gamma )}$, where

- ${\displaystyle B}$ is the set of belief states over the POMDP states,
- ${\displaystyle A}$ is the same set of actions as for the original POMDP,
- ${\displaystyle \tau }$ is the belief state transition function,
- ${\displaystyle r:B\times A\to \mathbb {R} }$ is the reward function on belief states,
- ${\displaystyle \gamma }$ is the discount factor, equal to the ${\displaystyle \gamma }$ of the original POMDP.

Here ${\displaystyle \tau }$ and ${\displaystyle r}$ are derived from the original POMDP. The belief state transition function is

${\displaystyle \tau (b,a,b')=\sum _{o\in \Omega }\Pr(b'\mid b,a,o)\Pr(o\mid b,a),}$

where ${\displaystyle \Pr(o\mid b,a)}$ is the value derived in the previous section and

${\displaystyle \Pr(b'\mid b,a,o)={\begin{cases}1&{\text{if the belief update with arguments }}b,a,o{\text{ returns }}b'\\0&{\text{otherwise}}\end{cases}}}$

The belief MDP reward function ${\displaystyle r}$ is the expected reward from the POMDP reward function over the belief state distribution:

${\displaystyle r(b,a)=\sum _{s\in S}b(s)R(s,a).}$

The belief MDP is not partially observable anymore, since at any given time the agent knows its belief, and by extension the state of the belief MDP.
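Under the same assumed array layout as in the sketches above, the derived quantities ${\displaystyle \Pr(o\mid b,a)}$ and ${\displaystyle r(b,a)}$ could be computed as follows (an illustrative sketch, not a complete belief MDP solver):

```python
import numpy as np

def observation_prob(b, a, o, T, Z):
    """Pr(o | b, a) = sum_{s'} O(o | s', a) sum_s T(s' | s, a) b(s)."""
    return float(Z[a, :, o] @ (b @ T[:, a, :]))

def belief_reward(b, a, R):
    """r(b, a) = sum_s b(s) R(s, a): expected immediate reward under the belief."""
    return float(b @ R[:, a])
```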

### Policy and value function

The agent's policy ${\displaystyle \pi }$ specifies an action ${\displaystyle a=\pi (b)}$ for any belief ${\displaystyle b}$. Here it is assumed the objective is to maximize the expected total discounted reward over an infinite horizon. When ${\displaystyle R}$ defines a cost, the objective becomes the minimization of the expected cost.

The expected reward for policy ${\displaystyle \pi }$ starting from belief ${\displaystyle b_{0}}$ is defined as

${\displaystyle V^{\pi }(b_{0})=\sum _{t=0}^{\infty }\gamma ^{t}r(b_{t},a_{t})=\sum _{t=0}^{\infty }\gamma ^{t}E{\Bigl [}R(s_{t},a_{t})\mid b_{0},\pi {\Bigr ]}}$

where ${\displaystyle \gamma <1}$ is the discount factor. The optimal policy ${\displaystyle \pi ^{*}}$ is obtained by optimizing the long-term reward.

${\displaystyle \pi ^{*}={\underset {\pi }{\mbox{argmax}}}V^{\pi }(b_{0})}$

where ${\displaystyle b_{0}}$ is the initial belief.
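As a toy illustration, ${\displaystyle V^{\pi }(b_{0})}$ can be estimated by Monte Carlo rollouts: simulate trajectories of the hidden state while the agent follows its belief-based policy, and average the discounted returns. The model arrays, the myopic one-step policy, the horizon, and the number of episodes below are all hypothetical choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy model, same layout as the earlier sketches.
T = np.array([[[0.9, 0.1], [0.5, 0.5]],
              [[0.2, 0.8], [0.5, 0.5]]])
Z = np.array([[[0.85, 0.15], [0.15, 0.85]],
              [[0.5, 0.5], [0.5, 0.5]]])
R = np.array([[1.0, 0.0],
              [0.0, 1.0]])
gamma, horizon, episodes = 0.95, 60, 2000

def policy(b):
    # Example belief-based policy: maximize the expected immediate reward r(b, a).
    return int(np.argmax(b @ R))

def belief_update(b, a, o):
    u = Z[a, :, o] * (b @ T[:, a, :])
    return u / u.sum()

def rollout(b0):
    # One sampled trajectory: the hidden state evolves, the agent only tracks its belief.
    s = rng.choice(2, p=b0)
    b, ret = b0, 0.0
    for t in range(horizon):
        a = policy(b)
        ret += gamma ** t * R[s, a]
        s = rng.choice(2, p=T[s, a])
        o = rng.choice(2, p=Z[a, s])
        b = belief_update(b, a, o)
    return ret

b0 = np.array([0.5, 0.5])
print(np.mean([rollout(b0) for _ in range(episodes)]))  # Monte Carlo estimate of V^pi(b0)
```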

The optimal policy, denoted by ${\displaystyle \pi ^{*}}$, yields the highest expected reward value for each belief state, compactly represented by the optimal value function ${\displaystyle V^{*}}$. This value function is the solution to the Bellman optimality equation:

${\displaystyle V^{*}(b)=\max _{a\in A}{\Bigl [}r(b,a)+\gamma \sum _{o\in \Omega }O(o\mid b,a)V^{*}(\tau (b,a,o)){\Bigr ]}}$

For finite-horizon POMDPs, the optimal value function is piecewise-linear and convex.[2] It can be represented as a finite set of vectors. In the infinite-horizon formulation, the optimal value function remains convex, and a finite vector set can approximate it arbitrarily closely. Value iteration applies a dynamic programming update to gradually improve the value until convergence to an ${\displaystyle \epsilon }$-optimal value function, while preserving its piecewise linearity and convexity.[3] By improving the value, the policy is implicitly improved. Another dynamic programming technique, called policy iteration, explicitly represents and improves the policy instead.[4][5]
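The following sketch illustrates the vector representation together with a value-iteration-style backup. To keep the vector set finite it performs the Bellman backup only at a fixed set of belief points, in the spirit of point-based methods; the toy model and the chosen belief points are assumptions made for illustration:

```python
import numpy as np

# Hypothetical toy model, same layout as the earlier sketches:
# T[s, a, s'], Z[a, s', o], R[s, a].
T = np.array([[[0.9, 0.1], [0.5, 0.5]],
              [[0.2, 0.8], [0.5, 0.5]]])
Z = np.array([[[0.85, 0.15], [0.15, 0.85]],
              [[0.5, 0.5], [0.5, 0.5]]])
R = np.array([[1.0, 0.0],
              [0.0, 1.0]])
gamma, nS, nA, nO = 0.95, 2, 2, 2

# The value function is a finite set of alpha-vectors: V(b) = max_alpha alpha . b,
# which is piecewise-linear and convex in the belief b.
def backup(alphas, beliefs):
    new_alphas = []
    for b in beliefs:
        best = None
        for a in range(nA):
            alpha_a = R[:, a].copy()
            for o in range(nO):
                # Back-projection: gamma * sum_{s'} O(o | s', a) T(s' | s, a) alpha(s')
                projections = [gamma * (T[:, a, :] @ (Z[a, :, o] * alpha))
                               for alpha in alphas]
                alpha_a = alpha_a + max(projections, key=lambda v: v @ b)
            if best is None or alpha_a @ b > best @ b:
                best = alpha_a
        new_alphas.append(best)
    return new_alphas

beliefs = [np.array([p, 1.0 - p]) for p in (0.0, 0.25, 0.5, 0.75, 1.0)]
alphas = [np.zeros(nS)]              # start from the all-zero value function
for _ in range(50):                  # repeated backups improve the value estimate
    alphas = backup(alphas, beliefs)
print(max(a @ np.array([0.5, 0.5]) for a in alphas))  # approximate V*([0.5, 0.5])
```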

## Approximate POMDP solutions

In practice, POMDPs are often computationally intractable to solve exactly, so computer scientists have developed methods that approximate solutions for POMDPs.

Grid-based algorithms[6] comprise one approximate solution technique. In this approach, the value function is computed for a set of points in the belief space, and interpolation is used to determine the optimal action for other belief states that are encountered but are not in the set of grid points. More recent work makes use of sampling techniques, generalization techniques, and exploitation of problem structure, and has extended POMDP solving into large domains with millions of states.[7][8] For example, point-based methods sample random reachable belief points to constrain the planning to relevant areas in the belief space.[9] Dimensionality reduction using PCA has also been explored.[10]
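A minimal sketch of the grid-based idea, again using the toy model from the earlier sketches: the value function is stored only at a fixed grid of beliefs, successor beliefs are mapped to the nearest grid point (a crude stand-in for interpolation), and the action at an arbitrary belief is read off from a one-step backup:

```python
import numpy as np

# Hypothetical toy model, same layout as the earlier sketches.
T = np.array([[[0.9, 0.1], [0.5, 0.5]],
              [[0.2, 0.8], [0.5, 0.5]]])
Z = np.array([[[0.85, 0.15], [0.15, 0.85]],
              [[0.5, 0.5], [0.5, 0.5]]])
R = np.array([[1.0, 0.0],
              [0.0, 1.0]])
gamma, nA, nO = 0.95, 2, 2

# Fixed grid over the 2-state belief simplex; the value is stored only at these points.
grid = np.array([[p, 1.0 - p] for p in np.linspace(0.0, 1.0, 11)])
V = np.zeros(len(grid))

def nearest(b):
    # Simplest possible "interpolation": the value of the closest grid point.
    return int(np.argmin(np.abs(grid[:, 0] - b[0])))

def backup(b):
    # One-step Bellman backup at belief b, looking up successor beliefs on the grid.
    q = np.zeros(nA)
    for a in range(nA):
        q[a] = b @ R[:, a]                        # r(b, a)
        predicted = b @ T[:, a, :]                # distribution over s' given b, a
        for o in range(nO):
            p_o = float(Z[a, :, o] @ predicted)   # Pr(o | b, a)
            if p_o > 0:
                b_next = Z[a, :, o] * predicted / p_o
                q[a] += gamma * p_o * V[nearest(b_next)]
    return q

for _ in range(200):                              # approximate value iteration on the grid
    V = np.array([backup(b).max() for b in grid])

print(int(np.argmax(backup(np.array([0.5, 0.5])))))  # greedy action at an off-grid belief
```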

## POMDP uses

POMDPs model many kinds of real-world problems. Notable works include the use of a POMDP in assistive technology for persons with dementia[7][8] and the conservation of the critically endangered and difficult-to-detect Sumatran tiger.[11]

## References

1. {{#invoke:Citation/CS1|citation |CitationClass=journal }}
2. Template:Cite thesis
3. {{#invoke:Citation/CS1|citation |CitationClass=journal }}
4. {{#invoke:Citation/CS1|citation |CitationClass=journal }}
5. {{#invoke:citation/CS1|citation |CitationClass=conference }}
6. {{#invoke:Citation/CS1|citation |CitationClass=journal }}
7. {{#invoke:citation/CS1|citation |CitationClass=conference }}
8. {{#invoke:Citation/CS1|citation |CitationClass=journal }}
9. {{#invoke:citation/CS1|citation |CitationClass=conference }}
10. {{#invoke:citation/CS1|citation |CitationClass=book }}
11. {{#invoke:Citation/CS1|citation |CitationClass=journal }}