Stochastic MABs: Setup and Full-Info Setting

Switching from Adversarial to Stochastic

In the adversarial setting, the loss sequence was generated by an all-knowing opponent. We now transition to a stochastic setting: each arm $i \in [K]$ has a fixed reward distribution $\nu_i$, and rewards are drawn i.i.d. from it.

The MAB instance is $\nu = [\nu_1, \ldots, \nu_K]$. At each time $t$:

Algorithm pulls arm $A_t \in [K]$
Algorithm receives reward $X_t \sim \nu_{A_t}$

The goal is to maximize expected cumulative reward over horizon $n$. This is a much more benign setting than the adversarial case.

Key Quantities

Let $\mu_i = \mathbb E_{X \sim \nu_i}[X]$ be the mean reward of arm $i$. Define:

\[\mu^\ast = \max_{i \in [K]} \mu_i \quad \text{(optimal mean reward)}\] \[\Delta_i \stackrel{\text{def}}{=} \mu^\ast - \mu_i \quad \text{(sub-optimality gap of arm } i\text{)}\]

The regret of policy $\pi$ against instance $\nu$ over horizon $n$:

\[R_n(\pi, \nu) = n\mu^\ast - \mathbb{E}\!\left[\sum_{t=1}^n X_t\right]\]

Let $T_i(n) = \sum_{t=1}^n \mathbf{1}[A_t = i]$ be the number of pulls of arm $i$ by time $n$.

Regret Decomposition

Lemma 3.1. The regret can be written as:

\[R_n(\pi, \nu) = \sum_{i \in [K]} \Delta_i \cdot \mathbb E[T_i(n)]\]

Proof. Starting from the definition:

\[R_n = \sum_{t=1}^n \mathbb E[\mu^\ast - \mu_{A_t}] = \sum_{t=1}^n \mathbb E[\Delta_{A_t}]\]

Now $\Delta_{A_t} = \sum_{i=1}^K \mathbf{1}[A_t = i] \cdot \Delta_i$, so:

\[R_n = \sum_{t=1}^n \mathbb{E}\!\left[\sum_{i=1}^K \mathbf{1}[A_t = i] \cdot \Delta_i\right] = \sum_{i=1}^K \Delta_i \cdot \mathbb{E}\!\left[\sum_{t=1}^n \mathbf{1}[A_t = i]\right] = \sum_{i=1}^K \Delta_i \cdot \mathbb E[T_i(n)] \quad \square\]

Interpretation: Regret is the weighted count of sub-optimal arm pulls, weighted by how sub-optimal each arm is. To minimize regret, pull sub-optimal arms as few times as possible.

Note: the optimal arm $i^\ast$ may not be unique. All arms with $\Delta_i = 0$ are optimal and the framework still works.

The Full-Info Setting

Before tackling the true bandit problem, consider a warmup: the full-info setting, where after choosing any arm, the agent observes the rewards of all arms. This is not a bandit problem, but it illuminates the challenge.

At time $t$, the agent has observed $t$ i.i.d. samples of each arm’s reward. Empirical mean estimation and greedy selection should work.

Algorithm 6: Full-Info Greedy

Require: Number of arms $K$, horizon $n$

Initialize: Pull any arm and observe rewards of all arms.

For $t = 2$ to $n$ do:

For $j = 1$ to $K$: compute $\hat{\mu}_j(t-1)$ (empirical mean of arm $j$ from all $t-1$ samples)
Pick $A_t \in \arg\max_{i \in [K]} \hat{\mu}_i(t-1)$

End for

Regret Analysis for Full-Info

Claim: The regret of Algorithm 6 is $O(1)$.

Assume the arm rewards are 1-sub-Gaussian (reasonable since Bernoulli rewards satisfy this and Gaussian rewards with unit variance also satisfy it).

We want to bound $\mathbb{P}(A_t = i)$ for a sub-optimal arm $i$ (with $\Delta_i > 0$). The agent picks arm $i$ only if $\hat{\mu}_i(t-1) \geq \hat{\mu}_1(t-1)$.

Lemma 3.2. $X = \hat{\mu}_1(t-1) - \hat{\mu}_i(t-1)$ is $\sqrt{\frac{2}{t-1}}$-sub-Gaussian.

Proof. Each $\hat{\mu}_j(t)$ is the average of $t$ i.i.d. 1-sub-Gaussian variables. By the sub-Gaussian scaling property, $\hat{\mu}_j(t) \sim \frac{1}{\sqrt{t}}$-subG. The difference of two independent sub-Gaussians with parameters $\sigma_1, \sigma_2$ is $\sqrt{\sigma_1^2 + \sigma_2^2}$-subG. With $\sigma_1 = \sigma_2 = 1$, we get $X \sim \sqrt{\frac{2}{t-1}}$-subG. $\square$

Now, $\mathbb E[X] = \mu_1 - \mu_i = \Delta_i$. The agent picks arm $i$ when $X \leq 0$, i.e., $X - \mathbb E[X] \leq -\Delta_i$.

Using sub-Gaussian concentration:

\[\mathbb{P}(A_t = i) \leq \mathbb{P}(X - \mathbb E[X] \leq -\Delta_i) \leq e^{-\Delta_i^2(t-1)/4}\]

Therefore:

\[\mathbb E[T_i(n)] \leq 1 + \sum_{t=2}^n \mathbb{P}(A_t = i) \leq 1 + \sum_{t=2}^n e^{-\Delta_i^2(t-1)/4} \leq 1 + \sum_{t=0}^\infty e^{-\Delta_i^2 t/4} = O(1)\]

So for the Full-Info setting, the regret is:

\[R_n = \sum_i \Delta_i \cdot \mathbb E[T_i(n)] = O(1)\]

The regret is constant – it does not grow with the horizon at all.

Why Bandits Are Harder

In the full-info setting, pure exploitation works because we get samples from every arm at every step. In the true bandit setting, we only observe the reward of the arm we pulled. To estimate $\hat{\mu}_i$, we must actually pull arm $i$, which means pulling it instead of the (potentially better) optimal arm.

This forced trade-off between exploration and exploitation is the core difficulty of MABs. The next posts address how to navigate it systematically.