Linear Programming Approach and Q-Functions

The LP Formulation of MDPs

The Bellman optimality equations characterize $V^\ast$ as the fixed point of $B^\ast$. There is an equivalent view: $V^\ast$ is the solution to a linear program.

Key Lemma

Lemma 5.5. Consider the Bellman optimality operator $B^\ast$ with:

\[B^\ast(V(i)) = \min_{a \in A}\!\left[c_i(a) + \alpha \sum_{j \in S} p_{i,j}(a) \cdot F(j)\right]\]

If $(B^\ast(V))(i) \geq V(i)$ for all $i \in S$, then $V^\ast \geq V$ component-wise.

Proof. Suppose $B^\ast(V) \geq V$. Then applying $B^\ast$ repeatedly (which preserves the inequality since $B^\ast$ is monotone):

\[V \leq B^\ast(V) \leq (B^\ast)^2(V) \leq \ldots \leq \lim_{n \to \infty} (B^\ast)^n(V) = V^\ast\]

The limit is $V^\ast$ because $B^\ast$ is a contraction converging to the unique fixed point $V^\ast$. $\square$

The Linear Program

From Lemma 5.5, $V^\ast$ is the largest function $V$ satisfying $B^\ast(V) \geq V$ component-wise. That is:

\[V^\ast = \max \sum_{i \in S} V(i) \quad \text{s.t.} \quad V(i) \leq (B^\ast(V))(i) \quad \forall i \in S\]

Expanding $(B^\ast(V))(i) = \min_{a \in A}[\ldots]$: the constraint $V(i) \leq \min_a [\ldots]$ is equivalent to $V(i) \leq c_i(a) + \alpha \sum_j p_{ij}(a) V(j)$ for all $a \in A$.

\[\boxed{V^\ast = \max \sum_{i \in S} V(i) \quad \text{s.t.} \quad V(i) \leq c_i(a) + \alpha \sum_{j \in S} p_{ij}(a) V(j) \quad \forall i \in S, a \in A}\]

This is a linear program: linear objective, linear constraints. It has $\lvert S\rvert $ variables and $\lvert S\rvert \cdot \lvert A\rvert $ constraints.

Complexity: LP can be solved in polynomial time. For small MDPs, this gives an exact $V^\ast$ in one shot (no iteration needed). For large state spaces, the $\lvert S\rvert \cdot \lvert A\rvert $ constraint count becomes the bottleneck.

The State-Action Value Function (Q-Function)

The value function $V_\pi(i)$ gives the expected cost starting from state $i$ under policy $\pi$. The Q-function extends this to include the first action explicitly.

Definition (Q-Function). For policy $\pi$, the Q-function $Q_\pi: S \times A \to \mathbb R$ is:

\[Q_\pi(i, a) = c_i(a) + \alpha \sum_{j \in S} p_{ij}(a) V_\pi(j)\]

Interpretation: $Q_\pi(i, a)$ is the expected discounted cost if the agent starts at state $i$, takes action $a$ first, then follows policy $\pi$ thereafter.

Relationship to value function:

\[Q_\pi(i, \pi(i)) = V_\pi(i)\]

When the first action matches the policy, the Q-function reduces to the value function.

Optimal Q-Function

For the optimal policy:

\[Q^\ast(i, a) = c_i(a) + \alpha \sum_j p_{ij}(a) V^\ast(j)\]

The optimal value function satisfies:

\[V^\ast(i) = \min_a Q^\ast(i, a)\]

And the optimal policy can be read off directly:

\[\pi^\ast(i) = \arg\min_{a \in A} Q^\ast(i, a)\]

Why the Q-Function Matters

In reinforcement learning (where $P$ and $C$ are unknown), the Q-function is more directly useful than $V^\ast$:

To use $V^\ast$, you need to know $P$ to compute $\arg\min_a [c(i,a) + \alpha \sum_j p_{ij}(a) V^\ast(j)]$.
To use $Q^\ast$, you simply look up $\arg\min_a Q^\ast(i, a)$, with no model required.

This is why Q-learning (covered in the RL post) learns $Q^\ast$ directly without learning $P$ or $C$ explicitly.

LP Duality and Occupancy Measures

The dual of the MDP LP has a natural interpretation in terms of state-action occupancy measures $d^\pi(s, a)$: the discounted frequency of visiting state $s$ and taking action $a$ under policy $\pi$. The dual LP optimizes over these occupancy measures and is often used for constrained MDPs and inverse reinforcement learning.