Policy Iteration

The Idea: Improve Directly on the Policy

Value Iteration works on the value function space. Policy Iteration works directly in policy space: start with any policy, evaluate it, improve it greedily, and repeat. Each iteration produces a strictly better policy until convergence.

Greedy Policy Improvement

Given a stationary policy $f$, define the one-step improvement $f^\ast$:

\[f^\ast(x) = \begin{cases} \arg\min_{a \in A}\!\left[c_i(a) + \alpha \sum_{j \in S} p_{i,j}(a) V_f(j)\right] & \text{if } x = i \\ f(x) & \text{otherwise} \end{cases}\]

The policy $f^\ast$ applies the optimal action at state $i$ (for a single chosen state) and follows $f$ everywhere else.

Lemma 5.6. $V_{f^\ast} \leq V_f$ (component-wise).

Proof. Apply the Bellman operator $B_{f^\ast}$:

\[(B_{f^\ast}(V_f))(i) = c_i(f^\ast(i)) + \alpha \sum_j p_{ij}(f^\ast(i)) V_f(j)\]

By definition of $f^\ast$, the action $f^\ast(i)$ minimizes $[c_i(a) + \alpha\sum_j p_{ij}(a) V_f(j)]$, so:

\[\leq c_i(f(i)) + \alpha\sum_j p_{ij}(f(i)) V_f(j) = (B_f(V_f))(i) = V_f(i)\]

So $B_{f^\ast}(V_f) \leq V_f$. Since $B_{f^\ast}$ is a contraction:

\[V_{f^\ast} = \lim_{n \to \infty} B_{f^\ast}^n(V_f) \leq V_f \quad \square\]

The core insight: improving the policy myopically at a single state always results in a globally better policy.

Algorithm 16: Policy Iteration (PI)

Require: MDP $M = (S, A, P, C, \alpha)$, initial policy $\pi_0$

Define:

$\text{IS}(\pi_0) = {s \in S : Q_{\pi_0}(s, \pi_0(s)) > \min_{a \in A} Q_{\pi_0}(s, a)}$ (improvable states)
$\text{IA}(\pi_0, s) = {a \in A : Q_{\pi_0}(s, a) < V_{\pi_0}(s)}$ (improvable actions at state $s$)

While $\text{IS} \neq \emptyset$ do:

For $s \in \text{IS}(\pi_0)$:
- Pick action $a’ \in \text{IA}(\pi_0, s)$
- $\pi_0(s) = a’$
Re-calculate $\text{IS}(\pi_0)$ and $\text{IA}(\pi_0, s)$ for all $s \in S$

End while

Return $\pi_0$

At each round, the algorithm identifies states where the current policy is suboptimal (IS set) and updates the policy to take a better action at those states.

Policy Improvement Theorem

Theorem 5.7. Consider MDP $M = (S, A, P, C, \alpha)$ and let $\Pi$ be the set of all stationary policies. Then:

If $\text{IS}(\pi) = \emptyset$, then $\pi$ is the optimal policy for $M$.
If $\pi’$ is obtained after one round of PI on $\pi$, then $V_{\pi’} \leq V_\pi$ (component-wise, with strict inequality if $\pi$ is not optimal).

Proof of Part 1. If $\text{IS}(\pi) = \emptyset$, then for all $s \in S$, $\pi(s)$ is already minimizing the Q-function. So:

\[\forall \pi' \in \Pi, \quad V^\pi \preceq B^{\pi'}[V^\pi]\]

Applying $B^{\pi’}$ repeatedly:

\[V^\pi \preceq B^{\pi'}[V^\pi] \preceq (B^{\pi'})^2[V^\pi] \preceq \ldots \preceq \lim_{l \to \infty} (B^{\pi'})^l [V^\pi] = V^{\pi'}\]

So $V^\pi \leq V^{\pi’}$ for all $\pi’$, meaning $\pi$ is optimal. $\square$

Proof of Part 2. After one PI round, we have $\pi’$ with $B^{\pi’}[V^\pi] \preceq V^\pi$ (from Lemma 5.6 applied to each improved state). Then:

\[V^{\pi'} = \lim_{l \to \infty} (B^{\pi'})^l[V^\pi] \preceq V^\pi \quad \square\]

Convergence of Policy Iteration

Since the state space $S$ and action space $A$ are finite, there are finitely many deterministic stationary policies ($\lvert A\rvert ^{\lvert S\rvert }$ total). Each PI round produces a strictly better policy (unless we have already reached the optimum). Therefore, PI terminates in at most $\lvert A\rvert ^{\lvert S\rvert }$ steps.

In practice, PI converges much faster: often in $O(\lvert S\rvert )$ iterations.

Variants of Policy Iteration

Howard’s Policy Iteration: Improves every improvable state simultaneously. Each round may change many states’ actions at once.

Simple Policy Iteration: Changes only one improvable state per round. Slower in practice but easier to analyze.

Modified Policy Improvement (M.S. Random PI): Randomly selects which improvable state to update. Useful for theoretical analysis.

Practical Example: Machine Replacement

Consider a machine with degradation states $S = {0, 1, 2, \ldots}$. At each state $i$:

Keep the machine (action 0): incur maintenance cost $L(i)$, transition to the next degradation state probabilistically.
Replace the machine (action 1): incur cost $R + L(0)$ (purchase price plus maintenance at state 0), and restart at state 0.

The cost structure: $C(i, 1) = R + L(0) \quad \text{(replace)} \quad C(i, 0) = L(i) \quad \text{(keep)}$

The Bellman equation is:

\[V^\ast(i) = \min\!\left\{L(i) + \alpha \sum_j P_{ij} V^\ast(j), \; R + L(0) + \sum_j P_{0j} V^\ast(j)\right\}\]

Claim 5.8. For an increasing function $h(\cdot)$, $\sum_{j=k}^\infty P_{ij} h(j)$ is increasing in $i$.

Lemma 5.9. The optimal value function $V^\ast(i)$ is increasing in $i$.

Proof by induction using Value Iteration: $V_1(i) = \min(R + L(0), L(i))$ is increasing (since $L$ is increasing). The Bellman update preserves monotonicity at each step, so $V^\ast$ is increasing.

Optimal policy structure: Since $L(i)$ increases and $R + L(0) + \sum_j P_{0j} V^\ast(j)$ is constant in $i$, there exists a threshold $i^\ast$ such that:

\[i^\ast = \max\!\left\{i : L(i) + \alpha \sum_j P_{ij} V^\ast(j) \leq R + L(0) + \sum_j P_{0j} V^\ast(j)\right\}\]

The optimal policy is: keep if $i \leq i^\ast$, replace if $i > i^\ast$. This is a threshold policy: once degradation crosses the threshold, replacement is optimal.