%
% Dear author,
%
% All the comments below are available on our WEB sites:
% in the US: http://www.ams.sunysb.edu/~feinberg/MDP
% in Israel: http://www.ee.technion.ac.il/~adam/MDP
% under "information for authors".
%
% Below please find a LaTeX file with the basic definitions and notation
% for the MDP book. This is a second, revised version. It does not contain
% examples that are irrelevant to the book. However, it does have more
% background material that you can refer to in your paper.
%
% This file is for LaTeX2e (the new LaTeX). If you have an older version,
% please use the LaTeX2.09 file (called notat2.tex).
% If you run LaTeX on this file, you should produce 8 pages, including
% examples and notation.
%
% In the file we show how to use the "index" command, both for
% a standard entry, as well as for a sub-entry. This is not intended to be
% complete: just to show how to use it. Note that you will NOT SEE
% an index when you run LaTeX. If you want to see it:
% 1. Uncomment (remove the "%") before the "\printindex" command
% 2. Run LaTeX as usual
% 3. Now give the command makeindex "filename": here "filename" is the name
% of the file you are editing, WITHOUT the ".tex" extension.
% 4. Run LaTeX one more time. You will now see an index as the last page.
% If the program reports errors related to the "index" or "makeindex" options,
% you can comment out lines 3,4 and uncomment line 5. In this case you can
% still enter indexing commands, but the program will not process them. The
% advantage is that we will be able to use them in our final editing.
%
% Feel free to contact Adam with any questions.
%
% Best wishes,
% The editors.
%
% Notation file, version of November 29, 1999.
%
\documentclass[11pt]{report}
%
\usepackage{makeidx,showkeys}
\makeindex
% \renewcommand{\index}[1]{}
%
\iftrue %%%%%%% If you do not have the "amsmath" or "amsfonts" package:
% \iffalse %%%% Comment out the "iftrue" line, uncomment the "iffalse" line
\usepackage{amsmath}
\usepackage{amsfonts}
\newcommand{\Iref}[1]{Ref- #1{}}
\let\idef=\emph
%
% Definitions in case you have the ams packages
%
\newcommand{\field}[1]{\mathbb{#1}}
\DeclareMathOperator{\PR}{\field{P}} % Probability
\DeclareMathOperator{\E}{\field{E}} % Expectation
\def\N{\field{N}} % nonnegative integers
\def\R{\field{R}} % Blackboard R - Reals
\def\F{\field{F}} % Functions/stationary policies
\def\A{\field{A}} % Action space
\def\X{\field{X}} % State space
\def\cA{{\cal A}} % action sigma-field
\def\cX{{\cal X}} % state sigma-field
%
% Definitions in case you do not have the ams packages
%
\else
\def\PR{\mathop{\rm I\kern -0.20em P}\nolimits} % Probability
\def\E{\mathop{\rm I\kern -0.20em E}\nolimits} % Expectations
\def\N{\mathop{\rm I\kern -0.20em N}\nolimits} % Nonnegative integers
\def\R{\mathop{\rm I\kern -0.20em R}\nolimits} % Blackboard R - Reals
\def\F{\mathop{\rm I\kern -0.20em F}\nolimits} % Functions/stationary polices
\def\A{{A}} % Action space
\def\X{{X}} % State space
\def\cA{{\cal A}} % action sigma-field
\def\cX{{\cal X}} % state sigma-field
\fi
%
% Definitions that work with or without the packages
%
\def\PS{\Pi} % Policy space
%
% Theorem environments etc. This controls the number of theorems, lemmas, etc.
%
\newtheorem{theorem}{Theorem}
\newtheorem{corollary}[theorem]{Corollary}
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{proposition}[theorem]{Proposition}
\newtheorem{definition}[theorem]{Definition}
%
% How to start and end a proof:
%
\newcommand{\Pf}{\paragraph{{\bf Proof.}}} % Starting a proof
\newcommand{\blot}{\hfill{\vrule height .9ex width .8ex depth -.1ex }}
\newcommand{\EndPf}{\hfill $\blot$ \medskip} % End of proof symbol
%
% New definitions by Adam:
\newcommand\bydef{\stackrel{\rm def}{=}}
%
% Information about the authors
%
\author{Eugene Feinberg\\
SUNY Stony Brook\\
Stony Brook etc.\\
\and
Adam Shwartz\\
Electrical Engineering.\\
Technion Etc.}
%
\title{Definitions and notation}
%
\begin{document}
%
\maketitle
%
\chapter{Introduction}\label{ch:intro}
%
\begin{abstract}
In this chapter we describe the basic structure of a Markov Decision
Process. We introduce the notation and then give a quick tour of this volume.
At this point, we use this as information for authors, with examples.
\end{abstract}
\section{Introduction}
%
In this section we shall introduce the volume, its purpose, etc.
including a heuristic description of what an MDP is (in contrast to the
formal presentation of section~\ref{C1sec:notation}).
%
\section{Notation}\label{C1sec:notation}
%
Let $\N=\{0,1,\ldots\}$ and let $\R^n$ be an $n$-dimensional Euclidean space,
$\R=\R^1$.
%
A Markov Decision Process\index{Markov Decision Process|idef} (MDP)
is defined through the following objects:
a state space\index{state space|idef}\index{space!state|idef} $\X$;
an action space\index{action
space|idef}\index{space!action|idef}\index{action!space|idef} $\A$;
sets $\A(x)$ of available
actions\index{action!available|idef}\index{space!available actions|idef}
at states $x\in \X$;
transition probabilities,\index{transition probability|idef}
denoted by $p(Y | x,a)$;
reward functions $ r(x,a)$ denoting the one-step
reward\index{reward|idef}\index{reward!one step|idef}
using action $a$ in state $x$.
The above objects have the following meaning.
There is a stochastic system with a state space $\X$.
When the system is at state $x\in\X$, a decision-maker selects an action
$a$ from the set of actions $\A(x)$ available at state $x$.
After an action $a$ is selected, the system moves to the next state
according to the probability distribution $p(\cdot|x,a)$ and the
decision-maker collects a one-step reward $r(x,a)$.
The selection of an action $a$ may depend on the current state of the system,
the current time, and the available information about the history of the system.
At each step, the decision maker may select a particular action or, in a more
general way, a probability distribution on the set of available actions $\A(x)$.
Decisions of the first type are called nonrandomized and decisions of the
second type are called randomized.
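The dynamics just described are easy to make concrete. The following Python sketch (purely illustrative: the two-state model, the transition table \texttt{p}, and the rewards are our own hypothetical choices, not an example from this volume) simulates a trajectory $x_0a_0x_1a_1\ldots$ when actions are selected by a randomized decision rule that depends only on the current state:

```python
import random

# Hypothetical two-state, two-action MDP (states 0, 1; actions 'a', 'b').
# p[(x, act)] maps each next state to its probability; r[(x, act)] is the
# one-step reward collected when action act is used in state x.
p = {(0, 'a'): {0: 0.9, 1: 0.1}, (0, 'b'): {0: 0.2, 1: 0.8},
     (1, 'a'): {0: 0.5, 1: 0.5}, (1, 'b'): {1: 1.0}}
r = {(0, 'a'): 1.0, (0, 'b'): 0.0, (1, 'a'): 2.0, (1, 'b'): -1.0}

def simulate(x0, pi, n_steps, rng):
    """Return (history, total reward), where history is x_0 a_0 x_1 a_1 ... x_n
    and each action is drawn from pi[x], a dict action -> probability."""
    history, x = [x0], x0
    total_reward = 0.0
    for _ in range(n_steps):
        actions, probs = zip(*pi[x].items())
        act = rng.choices(actions, weights=probs)[0]   # randomized decision
        total_reward += r[(x, act)]                    # collect one-step reward
        states, q = zip(*p[(x, act)].items())
        x = rng.choices(states, weights=q)[0]          # move to the next state
        history += [act, x]
    return history, total_reward

rng = random.Random(0)
h, tot = simulate(0, {0: {'a': 0.5, 'b': 0.5}, 1: {'a': 1.0}}, 5, rng)
```

A nonrandomized decision rule is the special case in which each \texttt{pi[x]} puts probability one on a single action.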
\paragraph{Discrete MDPs.} An MDP is called finite if the state and action sets are
finite. We say that a set is discrete if it is finite or countable.
An MDP is called discrete if the state and action sets are discrete.
A significant part of research and applications related to
MDPs deals with discrete MDPs.
For discrete MDPs, we do not need additional measurability assumptions on the
major objects introduced above. Readers who are not familiar with measure
theory can still read % most of
the papers of this volume, since most of the papers deal with discrete MDPs:
for the other papers, the
results may be restricted to discrete state and action sets.
For a discrete
state space $\X$ we denote the transition probabilities by
$p(y | x,a)$ or $p_{xy}(a)$, and use (in addition to
$x,y$) also the letters $i,j,k $ etc.\ to denote states.
Unless mentioned otherwise, we always assume that $p(\X | x,a)=1$.
The time parameter is $t,s $ or $n\in\N$ and a
trajectory\index{trajectory|idef}
is a sequence $x_0a_0x_1a_1\ldots\ $.
The set of all trajectories is $H_\infty=(\X \times \A)^\infty $.
A trajectory of length $n$ is called a history\index{history|idef}, and denoted by
$h_n = x_0 a_0 \ldots x_{n-1} a_{n-1} x_n $.
Let $H_n = \X \times (\A \times \X)^{n}$ be the space of
histories\index{history} up to epoch $n\in\N$.
A nonrandomized policy\index{policy|idef} is a sequence of
mappings $\phi_n$,
$n\in \N$, from $H_n$ to $\A$ such that
$\phi_n (x_0 a_0 \ldots x_{n-1} a_{n-1} x_n) \in \A(x_n)$.
If for each $n$ this %function
mapping depends only on $x_n$, then the
policy $\phi$ is called Markov. In other words,
a Markov policy $\phi$ is defined by mappings
$\phi_n:\ \X \to \A$ such that $\phi_n (x) \in \A (x)$ for all
$x \in \X$, $n=0,1,\ldots\ $. A Markov policy $\phi$
is called stationary if the $\phi_n$ do not depend on $n$.
A stationary policy is therefore defined by a single mapping
$\phi:\ \X \to \A$ such that $\phi (x) \in \A (x)$ for all $x \in \X$.
%
We denote by $\PS$, $\PS^M,$ and $\PS^S$ the sets of all\index{policy
space|idef}\index{space!policy|idef}
nonrandomized, Markov, and stationary policies respectively.
We observe that $\PS^S \subseteq \PS^M \subseteq \PS$.
As mentioned above, by selecting actions randomly,
it is possible to expand the set of policies.
A randomized policy\index{policy!randomized|idef} $\pi$ is a sequence of
transition probabilities
$\pi_n ( a_n | h_n)$ from $H_n$
to $\A$, $n\in\N$, such that
$\pi_n ( \A (x_n)| x_0 a_0 \ldots x_{n-1} a_{n-1} x_n)=1$.
A policy $\pi$ is called randomized Markov if
$\pi_n (a_n | x_0 a_0 \ldots x_{n-1} a_{n-1} x_n) = \pi_n (a_n | x_n) $.
If $\pi_m (\cdot | x ) = \pi_n (\cdot | x)$ for
all $m,n\in\N$ and all $x\in\X$, then the randomized Markov policy $\pi$ is
called randomized stationary.
A randomized stationary policy $\pi$ is thus defined by a transition
probability $\pi$ from $\X$ to
$\A$ such that $\pi(\A (x) | x) = 1$ for all $x \in \X$.
%
We denote by $\PS^R$, $\PS^{RM}$, $\PS^{RS}$ the sets of
all randomized, randomized Markov, and randomized stationary policies
respectively. We have that $\PS^{RS} \subseteq \PS^{RM} \subseteq \PS^{R}$,
and in addition $\PS^S \subseteq \PS^{RS} $, $\PS^M \subseteq \PS^{RM}$,
and $\PS \subseteq \PS^R$.
Note that, while we try to be consistent with the above definitions,
there is no standard terminology for policies: in particular,
there is no general agreement as to whether ``stationary" implies
nonrandomized or, more generally, whether the ``default" should be randomized
(the more general case) or nonrandomized.
The following additional terms are sometimes also used:
pure\index{pure policy|idef} policy means nonrandomized;
deterministic\index{deterministic policy|idef}
policy means (nonrandomized) stationary.
The stochastic process evolves as follows.
If at time $n$ the process is in state $x $, having followed the history
$h_n $, then an action is chosen (perhaps randomly) according to the policy
$\pi $. If action $a$ ensued, then at time $ n+1 $ the process will be in
the state $y$ with probability $ p ( y | x , a ) $.
Given an initial state $x$ and a policy $\pi$, the ``evolution rule" described
above defines all finite-dimensional distributions
$ x_0, a_0,\ldots,x_n $, $n\in\N$.
Kolmogorov's extension theorem %, Neveu \cite[Section 5.1]{ne},
guarantees that any initial state $x$ and any policy $\pi$ define a stochastic
sequence $x_0a_0x_1a_1\ldots\ $. We denote by
$\PR_x^\pi$ and %(sometimes called the ``strategic" measure).
$\E_x^\pi$ respectively the probabilities and expectations related to this stochastic sequence; $\PR_x^\pi \{x_0=x\}=1$.
Any stationary policy $\phi$ defines a Markov chain with transition
probabilities
$p_{xy}(\phi)=p(y| x,\phi(x))$ on the state space $\X$.
A randomized stationary policy $\pi$ also defines a Markov chain with the state space $\X.$ In the latter case, the transition probabilities are
$ p_{xy}(\pi)=\sum_{a\in\A(x)} \pi (a|x)p(y| x,a)$.
We denote by $P(\pi)$
the transition matrix with elements $\{p_{xy}(\pi)\}$.
The limiting matrix
\begin{equation}
Q(\pi)=\lim_{N\to\infty} {\frac 1 N} \sum_{n=0}^{N-1} P^n(\pi)
\end{equation}
always exists, and it is stochastic if $\X$ is finite; see
Chung~\cite[Section 1.6]{chung}.
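For a small finite chain, this Ces\`aro limit can be approximated directly. The Python sketch below (the periodic two-state matrix is our own illustrative choice) averages the powers $P^0,\ldots,P^{N-1}$; note that for this periodic chain the plain powers $P^n$ do not converge, while their average does:

```python
# Approximate the limiting matrix Q(pi) as the Cesaro average of the powers
# P^0, ..., P^{N-1} of a stochastic matrix P (plain lists, no libraries).
def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def cesaro_average(P, N):
    n = len(P)
    S = [[0.0] * n for _ in range(n)]
    Pk = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # P^0 = I
    for _ in range(N):
        for i in range(n):
            for j in range(n):
                S[i][j] += Pk[i][j] / N
        Pk = mat_mul(Pk, P)
    return S

P = [[0.0, 1.0], [1.0, 0.0]]   # periodic chain: P^n alternates and never converges
Q = cesaro_average(P, 200)     # the average converges to [[0.5, 0.5], [0.5, 0.5]]
```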
%
Let $f$ be a terminal reward\index{reward!terminal}
function and $\beta$ be a discount factor.
We denote by $v_N(x,\pi,\beta,f)$ the
expected total reward\index{reward!expected total|idef} over the first $N$
steps, $N\in\N$:
%
\begin{equation}\label{C1e:DefFinRew}
v_N(x,\pi,\beta,f)=\E_x^\pi \left [ \sum_{n=0}^{N-1}
\beta^n r(x_n,a_n)+\beta^N f(x_N)\right ] ,
\end{equation}
whenever this expectation is well-defined.
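For a discrete MDP and a stationary policy $\phi$, the expectation in (\ref{C1e:DefFinRew}) can be computed exactly by backward induction: $v_0=f$ and $v_{k+1}(x)=r(x,\phi(x))+\beta\sum_y p(y|x,\phi(x))\,v_k(y)$. A minimal Python sketch, on a hypothetical two-state model of our own choosing:

```python
# Exact evaluation of v_N(x, phi, beta, f) for a stationary policy phi on a
# hypothetical two-state MDP, by backward induction:
#   v_0 = f,   v_{k+1}(x) = r(x, phi(x)) + beta * sum_y p(y | x, phi(x)) v_k(y).
p = {(0, 'a'): {0: 0.9, 1: 0.1}, (1, 'a'): {0: 0.5, 1: 0.5}}
r = {(0, 'a'): 1.0, (1, 'a'): 0.0}

def v(x, phi, beta, f, N):
    vk = dict(f)                                   # v_0 = terminal reward f
    for _ in range(N):
        vk = {s: r[(s, phi[s])]
                 + beta * sum(q * vk[y] for y, q in p[(s, phi[s])].items())
              for s in vk}
    return vk[x]

val = v(0, {0: 'a', 1: 'a'}, 0.5, {0: 0.0, 1: 0.0}, 3)   # v_3(0, phi, 0.5, 0)
```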
If $\beta\in [0,1)$ then we deal with expected total discounted
reward.\index{reward!discounted|idef} If $\beta=1$,
we deal with expected total undiscounted reward or simply total reward.
If the discount factor $\beta \in [0,1] $ is fixed,
we usually write $ v ( x , \pi ) $ instead of $ v ( x , \pi , \beta ) $.
The expected total reward over an infinite horizon is
\begin{equation}\label{C1e:DefDiscCost}
v ( x , \pi ) = v ( x , \pi , \beta ) = v_\infty ( x , \pi , \beta , 0 ) \, .
\end{equation}
%
Conditions for the total reward $ v ( x , \pi , 1 ) $ to be well-defined are
usually stronger than the conditions
that ensure that total discounted rewards
$ v ( x , \pi , \beta ) $, $ 0 \le \beta < 1 $, are well-defined.
The expected reward per unit time is
\begin{equation}\label{C1e:DefAvRew}
w ( x , \pi ) = \liminf_{N \to \infty } {\frac 1 N} v_N ( x , \pi , 1 , 0 ).
\end{equation}
If a performance measure $g(x,\pi)$ is defined for all policies $\pi$,
we denote
\begin{equation}\label{C1e:maxValue}
G ( x ) = \sup_{\pi \in \PS^R } g ( x , \pi ) .
\end{equation}
In terms of the performance measures defined above, this yields the values
\begin{align}
\label{C1e:DefFinValue}
V_N(x,\beta,f) & \bydef \sup_{\pi \in \PS^R } v_N(x,\pi,\beta,f) \, , \\
\label{C1e:DefDiscValue}
V(x) = V(x,\beta) & \bydef \sup_{\pi \in \PS^R } v ( x , \pi , \beta ) \, , \\
\label{C1e:DefAvValue}
W(x) & \bydef \sup_{\pi \in \PS^R } w ( x , \pi ) \, .
\end{align}
For $ \epsilon \ge 0 $, a policy $\pi$ is called $ \epsilon $-optimal for
criterion $g$ if
$ g ( x , \pi ) \ge G (x) - \epsilon $ for all $x \in \X$.
A $0$-optimal policy is called optimal.
For a function $f$ on $\X$, we consider
the reward operators:
\begin{align}
\label{C1e:TransOper}
P^a f(x) & \bydef \E [ f(x_1) \mid x_0 = x, a_0 = a ] \, , \\
\label{C1e:DiscOper}
T^a_\beta f (x) & \bydef r (x,a) + \beta P^a f (x) \,
\end{align}
and
the
optimality operators:
\begin{align}
\label{C1e:TransOptOper}
P f (x) & \bydef \sup_{a \in \A(x)} P^a f (x) \, , \\
\label{C1e:DiscOptOper}
T_\beta f(x) & \bydef \sup_{a \in \A(x)} T^a_\beta f(x) \, .
\end{align}
The finite horizon Optimality Equation\index{optimality
equation|idef}\index{optimality equation!finite horizon|idef} is
\begin{equation}
\label{C1e:FinHorOptEq}
V_{N+1} (x) = T_\beta V_N(x), \ \ \ \ \ \ \ \ \ N=0,1,\ldots \, .
\end{equation}
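On a discrete MDP, recursion (\ref{C1e:FinHorOptEq}) is straightforward to implement. The sketch below (a hypothetical two-state, two-action model of our own; not an example from the text) computes $V_1,V_2,V_3$ starting from the terminal reward $f=0$:

```python
# The finite horizon Optimality Equation V_{N+1} = T_beta V_N on a hypothetical
# two-state MDP: V_{N+1}(x) = max over a in A(x) of r(x,a) + beta * sum_y p(y|x,a) V_N(y).
p = {(0, 'a'): {0: 1.0}, (0, 'b'): {1: 1.0},
     (1, 'a'): {1: 1.0}, (1, 'b'): {0: 1.0}}
r = {(0, 'a'): 0.0, (0, 'b'): 1.0, (1, 'a'): 2.0, (1, 'b'): 0.0}
A = {0: ['a', 'b'], 1: ['a', 'b']}
beta = 0.9

def T(V):
    """One application of the optimality operator T_beta."""
    return {x: max(r[(x, a)] + beta * sum(q * V[y] for y, q in p[(x, a)].items())
                   for a in A[x])
            for x in V}

V = {0: 0.0, 1: 0.0}        # V_0 = f = 0
for _ in range(3):
    V = T(V)                # V is now V_3
```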
The discounted reward Optimality Equation\index{optimality
equation|idef}\index{optimality equation!discounted reward|idef} is
\begin{equation}
\label{C1e:DiscOptEq}
V (x) = T_\beta V(x) \, .
\end{equation}
An action $a\in \A(x)$ is called conserving at state $x$ for the $(N+1)$-step problem
if $T^a_\beta V_N(x)= T_\beta V_N(x).$ An action $a\in \A(x)$ is called conserving at state $x$
for the total discounted reward
if $T^a_\beta V(x)= T_\beta V(x).$
When $ \beta = 1 $ we denote $ T^a = T^a_1 $ and $ T = T_1 $. In particular,
\begin{equation}
\label{C1e:TotOptEq}
V (x) = TV(x)
\end{equation}
is the Optimality Equation for expected total undiscounted rewards.
For total reward criteria, value functions usually satisfy the optimality
equation.
In addition, the sets of conserving $n$-step actions, $n=1,\ldots,N+1$,
form the sets of optimal actions for $(N+1)$-step problems.
Under some additional conditions, the sets of conserving actions form the sets
of optimal actions for infinite horizon problems.
We shall consider these results in appropriate chapters.
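For a small discrete model, iterating $V \leftarrow T_\beta V$ converges to the solution of the discounted Optimality Equation, and the conserving actions can then be read off as the actions attaining the supremum. A hypothetical Python sketch (the model and $\beta=0.9$ are our own illustrative choices):

```python
# Value iteration for the discounted Optimality Equation V = T_beta V on a
# hypothetical two-state MDP, followed by extraction of the conserving actions
# (the actions a with T^a_beta V(x) = T_beta V(x)).
p = {(0, 'a'): {0: 1.0}, (0, 'b'): {1: 1.0},
     (1, 'a'): {1: 1.0}, (1, 'b'): {0: 1.0}}
r = {(0, 'a'): 0.0, (0, 'b'): 1.0, (1, 'a'): 2.0, (1, 'b'): 0.0}
A = {0: ['a', 'b'], 1: ['a', 'b']}
beta = 0.9

def Ta(x, a, V):
    """The reward operator T^a_beta applied to V at state x."""
    return r[(x, a)] + beta * sum(q * V[y] for y, q in p[(x, a)].items())

V = {0: 0.0, 1: 0.0}
for _ in range(2000):                 # contraction: error shrinks by beta each step
    V = {x: max(Ta(x, a, V) for a in A[x]) for x in V}

conserving = {x: [a for a in A[x] if abs(Ta(x, a, V) - V[x]) < 1e-6] for x in V}
# Here V(0) = 19, V(1) = 20; the conserving actions are 'b' at state 0 and 'a' at state 1.
```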
The average reward Optimality
Equations\index{optimality equation!average reward|idef} are
\begin{align}
\label{C1e:AvOptEq1}
W (x) & = P W (x) \\
\label{C1e:AvOptEq2}
W (x) + h (x) & = \sup_{a\in \A^\prime (x)} T^a h (x) \, , \\
\intertext{where}
\A^\prime (x) & = \left \{ a \in \A (x) : P^a W (x) = P W (x) \right \} \, .
\end{align}
Equation~(\ref{C1e:AvOptEq1}) is called the First Optimality Equation and
\index{optimality equation!first}equation~(\ref{C1e:AvOptEq2})
is called the Second Optimality Equation.\index{optimality equation!second}
Note that if $W(x) = W $, a constant, then
the First Optimality Equation holds and $ \A^\prime (x) = \A (x) $.
In this case, the Second Optimality Equation transforms into
\begin{equation}
\label{C1e:OptEq2}
W + h(x)= T h(x)
\end{equation}
which is often referred
to %a2
simply as the Optimality Equation for average rewards.
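When $W(x)=W$ is constant, equation (\ref{C1e:OptEq2}) can often be solved numerically by relative value iteration: iterate $h \leftarrow Th - (Th)(x^*)$ for a fixed reference state $x^*$, so that $(Th)(x^*)$ converges to $W$. This is one standard method, not the only one; the sketch below uses a hypothetical two-state model of our own:

```python
# Relative value iteration for the average-reward Optimality Equation
# W + h(x) = T h(x), on a hypothetical two-state MDP (state 0 is the
# reference state used for normalization).
p = {(0, 'a'): {0: 1.0}, (0, 'b'): {1: 1.0},
     (1, 'a'): {1: 1.0}, (1, 'b'): {0: 1.0}}
r = {(0, 'a'): 0.0, (0, 'b'): 1.0, (1, 'a'): 2.0, (1, 'b'): 0.0}
A = {0: ['a', 'b'], 1: ['a', 'b']}

def T(h):
    """Undiscounted optimality operator: (Th)(x) = max_a [ r(x,a) + sum_y p(y|x,a) h(y) ]."""
    return {x: max(r[(x, a)] + sum(q * h[y] for y, q in p[(x, a)].items())
                   for a in A[x])
            for x in h}

h, W = {0: 0.0, 1: 0.0}, 0.0
for _ in range(500):
    Th = T(h)
    W = Th[0]                        # estimate of the optimal average reward
    h = {x: Th[x] - W for x in h}    # re-center h at the reference state
# For this model W = 2: the best long-run behavior is to reach state 1 and stay there.
```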
We allow for the starting point $x$ to be defined by an initial probability
distribution $\mu$. In this case, we keep the above notation and definitions
but we replace the initial state $x$ with the initial distribution $\mu$. For example, we use
$\PR_\mu^\pi$, $\E_\mu^\pi$, $v(\mu,\pi)$, $V(\mu),$ $w(\mu,\pi),$ and $W(\mu)$.
We remark that, generally speaking, optimality and $\epsilon$-optimality with
respect to all initial distributions are stronger notions than optimality and
$\epsilon$-optimality with
respect to all initial states, as defined above.
However, in many natural cases these definitions are equivalent.
For example, it is true for total reward criteria.
%
A more general problem arises when there are multiple objectives. Suppose
there are $(K+1)$ reward functions $r_k(x,a), $ $k=0,\ldots,K$.
For finite horizon problems,
terminal rewards may also depend on $k$. In this case, we index by
$k=0,\ldots,K$ all functions
that describe rewards. For example, we use the notation
$w_k(x,\pi)$, $f_k(x)$, and $W_k(x)$.
For problems with multiple criteria, it is usually natural to fix an
initial state $x$.
It is also possible to fix an initial distribution $\mu$, with our
convention that
all definitions remain the same, but we write $\mu$ instead of $x$. %a
So, for simplicity,
we define optimal policies when the initial state $x$ (not a distribution)
is fixed.
If the performance\index{performance} of a policy $\pi$ is evaluated
by $(K+1)$ criteria
$g_k (x,\pi)$ then one goal may be to optimize
criterion $g_0$ subject to constraints on
$g_1,\ldots,g_K.$ Let $C_k,$ $k=1,\ldots,K,$ be given numbers.
We say that a policy $\pi$ is
feasible\index{feasible|idef}\index{policy!feasible|idef} if
\begin{equation}
\label{C1e:DefFeasible}
g_k (x , \pi ) \ge C_k , \qquad k = 1 , \ldots , K \, .
\end{equation}
A policy $\pi$ is called optimal\index{optimal|idef}\index{constrained
optimization!optimal|idef} for a constrained optimization problem
if it is feasible and
\begin{equation}
\label{C1e:DefOptConstr}
g_0 (x , \pi) \ge g_0 (x , \sigma) \quad
\text{for any feasible policy
$\sigma$} .
\end{equation}
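As a concrete (and entirely hypothetical) illustration of feasibility, the sketch below evaluates two discounted criteria, $g_0$ with reward $r_0$ and $g_1$ with reward $r_1$, for a fixed stationary policy, and checks the constraint $g_1(x,\pi)\ge C_1$; the model, the policy, and the constant $C_1$ are all our own choices:

```python
# Checking feasibility g_1(x, phi) >= C_1 for a stationary policy phi on a
# hypothetical MDP with two reward functions r0 (objective) and r1 (constraint),
# via iterative discounted policy evaluation v <- r + beta * P(phi) v.
p = {(0, 'a'): {0: 0.5, 1: 0.5}, (1, 'a'): {0: 0.5, 1: 0.5}}
r0 = {(0, 'a'): 1.0, (1, 'a'): 0.0}    # criterion to optimize
r1 = {(0, 'a'): 0.0, (1, 'a'): 1.0}    # constrained criterion
beta = 0.5

def evaluate(r, phi, n_iter=200):
    v = {0: 0.0, 1: 0.0}
    for _ in range(n_iter):
        v = {x: r[(x, phi[x])]
                + beta * sum(q * v[y] for y, q in p[(x, phi[x])].items())
             for x in v}
    return v

phi = {0: 'a', 1: 'a'}
g0, g1 = evaluate(r0, phi), evaluate(r1, phi)
C1 = 0.4
feasible = g1[0] >= C1     # feasibility at the fixed initial state x = 0
```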
%
\paragraph{Nondiscrete MDPs: general constructions.}
% {\bf Nondiscrete MDPs: general constructions.}
When the state space $\X$ or
the action space $\A$ is not discrete, the natural assumption is that they
are measurable spaces endowed with $\sigma$-fields $\cX$ and $\cA$ respectively.
When $\X$ or $\A$ are discrete, the corresponding $\sigma$-field is the set
of all subsets of the corresponding set.
It is also natural to assume that the sets
$\A(x) \in \cA$
of feasible actions are measurable,
for all states $x\in\X$. Of course,
this assumption always holds when $\A$ is discrete.
Unless we specify otherwise, we always consider the Borel $\sigma$-field
${\cal B}(\R )$ on $\R $: this is the minimal $\sigma$-field
containing all intervals.
For non-discrete MDPs, we also assume that $r$ is a measurable function on
$(\X\times\A, \cX\times\cA )$ and $p(Y|x,a)$ is a regular transition
probability from $(\X\times\A, \cX\times\cA )$ to $(\X, \cX )$.
Recall that given two measurable spaces
$(E_1,{\cal E}_1)$ and $ (E_2,{\cal E}_2)$,
we call $p$ a regular transition probability from $E_1$ to $E_2$
if the following two conditions hold:
(i) $ p(\cdot|e_1) $ is a probability measure on $(E_2,{\cal E}_2)$
for any $e_1\in E_1$, and
(ii) the function $p(B | \cdot )$ is measurable on $E_1$ for any
$B\in {\cal E}_2$.
In order to define policies in the general situation, we consider
$\sigma$-fields ${\cal H}_n = \cX \times (\cA \times \cX)^{n}$ on the sets of
histories $H_n = \X \times (\A \times \X)^{n}$.
Nonrandomized and randomized policies are defined in a way similar
to discrete MDPs, with standard and natural additional measurability conditions:
(a) nonrandomized policies are defined by mappings
$\phi_n$ which are measurable on $(H_n,{\cal H}_n)$, and
(b) stationary and Markov policies are defined by mappings which are
measurable on $\X.$ Similarly,
for randomized policies, $\pi_n$ are regular transition probabilities
from $(H_n,{\cal H}_n)$ to $(\A,\cA)$ and, for randomized Markov and stationary
policies, they are regular transition probabilities from
$(\X,\cX)$ to $(\A,\cA)$.
Let ${\cal H}_\infty=(\cX\times\cA)^\infty.$ The Ionescu-Tulcea theorem,
Neveu~\cite[Section 5.1]{ne}, implies that any initial state $x$ and any policy
$\pi$ define a unique probability measure on $(H_\infty,{\cal H}_\infty)$.
We denote this measure by $\PR_x^\pi$. Sometimes it is called the ``strategic" measure.
We denote by $\E_x^\pi$ expectations with respect to this measure.
We also notice that the Ionescu-Tulcea theorem implies that $\PR_x^\pi$ is a regular
transition probability from $(\X,\cX)$ to $(H_\infty,{\cal H}_\infty)$ and
this implies that the functions $v_n(x,\pi,\beta,f)$ and $v(x,\pi,\beta)$ are
measurable in $x$ for any policy $\pi$ (the terminal function $f$ is also
assumed to be measurable).
We remark that we use the Ionescu-Tulcea theorem instead of the better-known
Kolmogorov extension theorem primarily because the latter requires additional
assumptions about the structure of the state space (it has to be Borel), while
the former requires no such structural assumptions.
At the intuitive level, randomized decisions are more general than nonrandomized
ones; this means that any nonrandomized policy belongs to the class of
randomized policies. In addition, in order to avoid a trivial situation, an
MDP has to have at least one policy. In order to guarantee these two intuitive
properties, we always assume the following two mild conditions:
(i) all one-point sets $\{a\}$ are elements of $\cA,$ $a\in\A$;
(ii) there is at least one measurable function $\phi$ from $\X$ to $\A$
such that $\phi(x)\in\A(x)$ for all $x\in \X$.
The first assumption always holds for models with discrete state and action
spaces. The second assumption always holds for models with discrete state spaces.
For a measure $ \nu $ and a measurable function $f$ we use the equivalent notations
\begin{equation}
\nu ( f ) \bydef \int f ( \alpha ) \, d\nu (\alpha ) \bydef f ( \nu ) \, .
\end{equation}
If we denote $\pi_x(\cdot)=\pi(\cdot|x)$ for a randomized stationary
policy $\pi$ then, similarly to discrete MDPs, this policy defines a
Markov chain with transition probabilities $p(dy|x,\pi_x).$ If $\X$ is
discrete, this chain has transition matrix $P(\pi)$ with elements $p_{xy}(\pi_x)$.
Thus, an MDP, strategies, and objective functions can be defined under very
general conditions. However, very little can be done if one tries to analyze
MDPs with arbitrary measurable state spaces. The first complication is that the value functions $V$ may not
be measurable even for one-step models. The second complication is that an
important step in the analysis of MDPs is to construct an equivalent randomized
Markov policy for an arbitrary policy; see Chapter ???. This can be done by
constructing regular transition probabilities $\PR_x^\pi(da_n|x_n)$ which may
not exist for general state and action spaces. These two complications do
not exist if the state space is countable. These two complications can be
resolved if $\X$ and $\A$ are Borel spaces. In addition, at the current state
of our knowledge, there is no clear need to consider MDPs with arbitrary
measurable state spaces, because there is no clear motivation or practical need
for such objects.
For example, MDPs with Borel state spaces have applications to
statistics, control of models with incomplete information, and inventory
management. On the other hand, we are not aware of possible applications of
MDPs with state spaces having cardinality higher than the continuum.
\paragraph{Discrete state MDPs.}
In this case, the state space $\X$ is discrete and
the action space is a measurable space $(\A,\cA)$ such that all one-point
sets are measurable. The sets of feasible actions $\A(x)$ are also elements of
$\cA.$ Reward functions $r(x,a)$ and transition probabilities $p(y|x,a)$ are
assumed to be measurable in $a$; since $\X$ is discrete, this implies joint
measurability. All constructions described for discrete and
general MDPs go through with $\cX$ being the $\sigma$-field of all subsets of $\X.$
\paragraph{Classical Borel MDPs.} Though we do not follow any particular text,
all definitions, constructions, and statements related to Borel spaces that we
mention in this chapter can be found in Bertsekas and Shreve~\cite[Chapter 7]{bs};
see also Dynkin and Yushkevich~\cite{dy} and Kechris~\cite{ke}.
Two measurable spaces
$(E_1, {\cal E}_1)$ and $(E_2, {\cal E}_2)$ are called isomorphic if there is a
one-to-one measurable mapping $f$ of $(E_1, {\cal E}_1)$ onto
$(E_2, {\cal E}_2)$ such that $f^{-1}$ is measurable.
%
A Polish space is a complete separable metric space. Unless we specify
otherwise, we always consider the Borel $\sigma$-field ${\cal B}(E)$ on a metric
space $E$; ${\cal B}(E)$ is the minimal $\sigma$-field containing all open
subsets of $E$.
Of course, any measurable subset $E^\prime$ of a Polish space, endowed with
the $\sigma$-field consisting of the intersections of $E^\prime$ with Borel
subsets of the original space, forms a Borel space in the sense defined below.
A measurable space $(E, {\cal E})$ is called Borel if it is isomorphic to a
Polish space. Every Borel space is either finite, countable, or of the
cardinality of the continuum, and two Borel spaces with the same cardinality
are isomorphic. Therefore, all uncountable Borel spaces have the cardinality
of the continuum; they are isomorphic to each other and, in particular, to
$(\R ,{\cal B}(\R ))$ and $([0,1],{\cal B}([0,1]))$.
The assumptions for Borel MDPs are:
(i) $\X$ and $\A$ are Borel spaces and $\cX$ and $\cA$ are corresponding Borel $\sigma$-fields;
(ii) the graph
$${\rm Gr}\ \A(x)=\{(x,a)|\ x\in\X, a\in\A(x)\}$$
is a measurable subset
of $\X\times\A$ and there exists at least one measurable mapping
$\phi$ of $\X$ into $\A$ such that $\phi(x)\in\A(x)$ for all $x\in\X$;
(iii) the reward functions $r(x,a)$ are measurable on $\X\times\A$ and the
transition probabilities $p(\cdot|x,a)$ are regular transition probabilities
from $\X\times\A$ to $\X.$
Conditions (i) and (iii) are similar to the corresponding assumptions for
general models. The measurability of the graph in (ii) implies that the sets
$\A(x)$ are measurable. The existence of a measurable mapping (often called a
``selector") implies that $\A(x)\ne \emptyset$ for all $x$.
We remark that it is possible that the graph is Borel and all sets $\A(x)$
are non-empty, but the graph does not contain the graph of any Borel mapping.
Therefore, the second assumption in (ii) is essential for the existence
of at least one policy.
As was discussed above, the first real complication is that even for one-step
problems, the values $V$ may not be Borel measurable functions on $\X$.
However, conditions (i)-(iii) imply that these functions are universally
measurable for finite and infinite-horizon problems and therefore optimality
operators can be defined.
Here we explain the concepts of universally measurable sets and functions.
Let $(E,{\cal E})$ be a Borel space. For a given
probability measure $p$ on $(E,{\cal E})$, define a $\sigma$-field
${\cal E}_p$ which is the completion
of ${\cal E}$ with respect to the measure $p$. That is, ${\cal E}_p$ is the minimal
$\sigma$-field that contains
${\cal E}$ and all subsets $F$ of $E$ such that
$F\subset F^\prime$ for some $F^\prime\in {\cal E}$
with $p(F^\prime)=0$.
For example, if $(E,{\cal E})=([0,1],{\cal B}([0,1]))$
then we can consider the Lebesgue measure $m$ defined by $m([a,b])=|b-a|.$
Then ${\cal E}_m$ is the so-called Lebesgue
$\sigma$-field. Let ${\bf P}(E)$ be the set of all probability measures on $E$.
Then the intersection
of all $\sigma$-fields ${\cal E}_p$,
${\cal U}(E)=\cap_{\{p\in {\bf P}(E)\}} {\cal E}_p$, is a $\sigma$-field and it
is called the universal $\sigma$-field. This $\sigma$-field is also called the
$\sigma$-field of universally measurable sets and its elements are called
universally measurable subsets of $E$. A universally measurable function
on $E$ is a measurable mapping from $(E,{\cal U}(E))$ to $(\R ,{\cal B}(\R ))$,
where ${\cal U}(E)$ is the universal $\sigma$-field on $E$. Of course, any Borel
set and any Borel function are universally measurable.
Thus, optimality equations can be defined for Borel MDPs. However, there is
another complication for Borel models, which is annoying mostly for
aesthetic reasons: $\epsilon$-optimal policies may not exist
for small positive $\epsilon$,
even for one-step Borel MDPs with bounded reward functions.
The example constructed by David
Blackwell is based on the observation that the value function
is universally measurable but it may not be Borel.
However, for any policy, the expected one-step reward is a Borel function of
the initial state. Moreover, it is possible to show that for the Borel MDPs
described above, for any initial
measure $p$ on $\X$, and for any $\epsilon>0$, there exists a policy which
is $\epsilon$-optimal $p$-a.s.
Such policies are called $(p,\epsilon)$-optimal.
\paragraph{Universally measurable Borel MDPs.} If we expand the set of policies and consider universally measurable policies, $\epsilon$-optimal policies exist and
the concept of $(p,\epsilon)$-optimality is not needed. Moreover, once the set
of policies is expanded, the results and their proofs remain valid under
assumptions broader than (ii) and (iii).
Before we give formal definitions, we explain the concept of analytic sets.
Let $f$ be a measurable mapping of a Borel space $(E_1,{\cal E}_1)$ into a Borel
space $(E,{\cal E})$. If
$F\in {\cal E}$ then, by definition,
$f^{-1}(F)\in {\cal E}_1$. However, it is possible that
$f(F_1)\notin {\cal E}$
for some Borel set $F_1\in {\cal E}_1$. A subset $F$ of a Borel space
$(E,{\cal E})$ is called analytic if
there exists a Borel space $(E_1,{\cal E}_1)$ and a measurable mapping $f$ of
$E_1$ to $E$ such that
$F=f(F_1)$ for some $F_1\in{\cal E}_1$.
Since one can select $E_1=E$ and $f(e)=e$, every Borel set is analytic.
It is also possible to show that any analytic set is universally measurable.
It is also possible to consider the $\sigma$-field of analytically measurable
sets, which is the smallest $\sigma$-field containing all analytic subsets of a
Borel space. We remark that Borel and universally measurable $\sigma$-fields
consist respectively of Borel and universally measurable sets. The situation is
different for analytic sets and $\sigma$-fields of analytically measurable sets.
The complement of an analytic set may not be analytic. Therefore, the
$\sigma$-field of analytically measurable sets contains sets other than
analytic. We remark that there are many equivalent definitions of analytic
sets. For example, for Polish spaces they can be defined as continuous images
or even as projections of Borel sets.
If $(E,{\cal E})$ and $(E_1,{\cal E}_1)$ are two Borel spaces (Borel sets with
Borel $\sigma$-fields) then the mapping $f:E\to E_1$ is called universally
(analytically) measurable if $f^{-1}(B)$ belongs to the $\sigma$-field of
universally (analytically) measurable subsets of $E$ for any $B\in {\cal E}_1.$
The assumptions for universally measurable MDPs are:
(a) The state and action spaces $(\X,\cX)$ and $(\A,\cA)$ are Borel spaces;
(b) ${\rm Gr}\ \A(x)$ is an analytic subset of $\X\times\A$ and all sets
$\A(x)$ are not empty;
(c) The reward function $r(x,a)$ is an upper analytic function on $\X\times\A$,
that is, for any real number $c$, the set
$\{r\ge c\}$ is an analytic subset of $\X\times\A;$
(d) The transition function $p(\cdot |x,a)$ is a regular transition probability
from $(\X\times\A,\cX\times\cA)$ to $(\X,\cX)$.
Assumptions (a) and (d) coincide with similar assumptions for Borel MDPs.
According to the Jankov--von Neumann theorem, assumption (b) implies that there is
an analytically measurable mapping $\phi$ from $\X$ to $\A$ such that
$\phi(x)\in\A(x)$ for all $x\in\X.$ Of course, any analytically measurable
mapping is universally measurable. Assumption (c) is more general than the
assumption that $r(x,a)$ is Borel. This generality is unimportant. It is kept
in the literature just because the same proofs hold for upper analytic and
Borel reward functions.
The last important difference between Borel and universally measurable MDPs is
that policies are universally measurable for the latter ones. Nonrandomized
policies are universally measurable mappings $\phi_n$ of $H_n$ to $\A$ such
that $\phi_n(h_n)\in \A(x_n)$ for any $h_n=x_0a_0\ldots x_n\in H_n.$ Markov
(and stationary) policies are defined by universally measurable mappings
$\phi_n$ of $\X$ to $\A$ such that $\phi_n(x)\in\A(x)$ (respectively,
$\phi(x)\in\A(x)$) for all $x\in\X.$
Randomized, randomized Markov, and randomized stationary policies are regular
transition probabilities defined in the same way as for general models but the
sets $H_n$ and $\X$ are endowed with $\sigma$-fields of universally measurable
subsets that play the role of $\sigma$-field ${\cal E}_1$ in the definition of
regular transition probabilities given above. Condition (b) implies that there
exists at least one policy.
There are other versions of universally measurable MDPs. For example, one can
consider analytically measurable policies; see Bertsekas and Shreve \cite{bs}
for details. The important feature is that all definitions and notation
given for discrete MDPs also apply to universally measurable MDPs.
\section{What's in this volume}
To be written.
%
%EF
\begin{thebibliography}{999}
%
\bibitem[BertShr]{bs}
D.~P.~Bertsekas and S.~E.~Shreve, ``Stochastic Optimal Control: The Discrete-Time Case," Academic Press, New York, 1978 (republished by Athena Scientific, 1997).
%
\bibitem[Chung]{chung}
K.~L.~Chung, ``Markov Chains with Stationary Transition Probabilities,"
Springer-Verlag, Berlin, 1960.
%
\bibitem[DynYsh]{dy}
E.~B.~Dynkin and A.~A.~Yushkevich, ``Controlled Markov Processes,"
Springer-Verlag, New York, 1979 (translation from the 1975 Russian edition).
%
\bibitem[Kechris]{ke}
A.~S.~Kechris, ``Classical Descriptive Set Theory," Springer-Verlag, New York, 1995.
%
\bibitem[Ne]{ne}
J.~Neveu, ``Mathematical Foundations of the Calculus of Probability,"
Holden-Day, San Francisco, 1965.
\end{thebibliography}
%ef
\printindex
\end{document}
%
% END OF FILE