Reinforcement Learning

The Unknown Model Problem

MDPs assumed full knowledge of $P$ (transition matrix) and $C$ (cost matrix). In reinforcement learning (RL), only $S$ and $A$ are known. The agent must learn the optimal policy by interacting with the environment and observing outcomes.

This is the hardest setting: the model itself must be discovered through exploration.

Two Axes of RL Algorithms

Learning paradigm:

Off-policy: The learning algorithm uses historical data (state, action, cost tuples) without actively choosing the policy during learning. Batch learning.
On-policy: The learning algorithm explores using some policy and refines it as data arrives. Online learning.

Model knowledge:

Model-based: Learn estimates $\hat{P}, \hat{C}$ of the transition and cost matrices, then use standard MDP planning (Value Iteration / Policy Iteration).
Model-free: Never explicitly learn $P$ or $C$. Directly learn the value function or Q-function.

Monte Carlo Approach (Model-Based, Off-Policy)

Given historical trajectories, estimate $P$ and $C$ using the Law of Large Numbers:

\[\hat p_{i,j}(a) = \frac{\sum \mathbf{1}\{X_t = i, A_t = a, X_{t+1} = j\}}{\sum \mathbf{1}\{X_t = i, A_t = a\}}\] \[\hat c_{i,j}(a) = \frac{\sum C(X_t, A_t) \cdot \mathbf{1}\{X_t = i, A_t = a, X_{t+1} = j\}}{\sum \mathbf{1}\{X_t = i, A_t = a\}}\]

Then run Value Iteration or Policy Iteration on the estimated MDP $(\hat{P}, \hat{C})$.

Assumptions needed:

Finite state and action spaces.
Irreducible MDP (every state reachable from every other state).

For on-policy Monte Carlo: Use an $\varepsilon_t$-optimal policy at each step (explore with probability $\varepsilon_t = 1/t$, exploit with $1 - \varepsilon_t$). This decaying schedule ensures each state-action pair is explored infinitely often almost surely.

Q-Learning (Model-Free, Off-Policy)

Q-learning directly estimates $Q^\ast$ without learning $P$ or $C$. The update rule:

\[\hat{Q}(X_t, A_t) \leftarrow \hat{Q}(X_t, A_t) + a_t\!\left[C_t + \alpha \min_{a'} \hat{Q}(X_{t+1}, a') - \hat{Q}(X_t, A_t)\right]\]

where $a_t$ is a step size (learning rate). The term in brackets is the temporal difference (TD) error: the difference between the target $C_t + \alpha \min_{a’} \hat{Q}(X_{t+1}, a’)$ and the current estimate $\hat{Q}(X_t, A_t)$.

Intuition: The update pushes $\hat{Q}(X_t, A_t)$ toward the Bellman target $C_t + \alpha \min_{a’} \hat{Q}(X_{t+1}, a’)$. As data accumulates, $\hat{Q}$ converges to $Q^\ast$.

Convergence: Q-learning converges to $Q^\ast$ almost surely if:

Every state-action pair is visited infinitely often.
Step sizes satisfy $\sum_t a_t = \infty$ and $\sum_t a_t^2 < \infty$ (e.g., $a_t = 1/t$).

SARSA (Model-Free, On-Policy)

SARSA is Q-learning’s on-policy counterpart. The name comes from the quintuple $(X_t, A_t, C_t, X_{t+1}, A_{t+1})$ used in the update:

\[\hat{Q}(X_t, A_t) \leftarrow \hat{Q}(X_t, A_t) + a_t\!\left[C_t + \alpha \hat{Q}(X_{t+1}, A_{t+1}) - \hat{Q}(X_t, A_t)\right]\]

The key difference from Q-learning: instead of using $\min_{a’} \hat{Q}(X_{t+1}, a’)$ (the greedy action), SARSA uses the actual next action $A_{t+1}$ taken by the current policy.

Action selection uses $\varepsilon$-greedy:

\[A_t = \begin{cases} \arg\min_{a'} \hat{Q}(X_t, a') & \text{w.p. } 1 - \varepsilon_t \\ \text{Random} & \text{w.p. } \varepsilon_t \end{cases}\]

SARSA learns the Q-function of the behavior policy (including its exploration). Q-learning learns $Q^\ast$ regardless of the behavior policy.

Deep Q-Learning (Neural Function Approximation)

For large or continuous state spaces, storing a full Q-table is infeasible. Deep Q-learning uses a neural network with parameters $\theta$ to approximate $Q^\ast$:

\[Q_\theta(x, a) \approx Q^\ast(x, a)\]

Training procedure:

Collect transitions $(X_t, A_t, C_t, X_{t+1})$ into a replay buffer (a mini-batch $B$).
Define the target: $(X_t, A_t) \mapsto C_t + \alpha \min_{a’} Q_{\theta}(X_{t+1}, a’)$.
Minimize the squared loss:

\[\mathcal{L}(\theta) = \sum_B \!\left[C_t + \alpha \min_{a'} Q_\theta(X_{t+1}, a') - Q_\theta(X_t, A_t)\right]^2\]

The instability problem: Both the input $Q_\theta(X_t, A_t)$ and the target $\min_{a’} Q_\theta(X_{t+1}, a’)$ are parameterized by the same $\theta$. Updating $\theta$ moves both simultaneously, causing instability (the target is a moving goalposts).

Two-time scale approximation (Target Network): Maintain two networks: the online network with parameters $\theta$ (updated frequently) and a target network with parameters $\tilde{\theta}$ (updated infrequently):

\[\mathcal{L}(\theta) = \sum_B \!\left[C_t + \alpha \min_{a'} Q_{\tilde{\theta}}(X_{t+1}, a') - Q_\theta(X_t, A_t)\right]^2\]

$\theta$ updates rapidly via gradient descent. $\tilde{\theta}$ is updated by copying from $\theta$ every $k$ steps. This decouples the target from the current estimate, stabilizing training.

TD Learning (Value Function Approximation)

Temporal Difference (TD) learning aims to learn the value function $V_\pi$ for a fixed policy $\pi$, without full model knowledge.

The TD update:

\[\hat{V}(X_t) \leftarrow \hat{V}(X_t) + a_t\!\left[C_t + \alpha \hat{V}(X_{t+1}) - \hat{V}(X_t)\right]\]

This is equivalent to Q-learning but for value functions (without the min over actions, since the policy is fixed).

TD(0): The version above, using only the immediate next step.

TD($\lambda$): Uses $\lambda$-weighted multi-step returns, interpolating between TD(0) (myopic, uses only next step) and Monte Carlo (uses full trajectory).

Summary

Algorithm	Policy	Model	Update target
Monte Carlo	Off-policy	Model-based	Estimated $\hat{P}, \hat{C}$, then VI/PI
Q-learning	Off-policy	Model-free	$C_t + \alpha \min_{a’} \hat{Q}(X_{t+1}, a’)$
SARSA	On-policy	Model-free	$C_t + \alpha \hat{Q}(X_{t+1}, A_{t+1})$
Deep Q-learning	Off-policy	Model-free	$C_t + \alpha \min_{a’} Q_{\tilde{\theta}}(X_{t+1}, a’)$
TD learning	On-policy	Model-free	$C_t + \alpha \hat{V}(X_{t+1})$

All methods converge under appropriate conditions (sufficient exploration, decaying step sizes), but Deep Q-learning requires the additional target network stabilization for neural approximation.