EXP3: Adversarial Multi-Armed Bandits

From Expert Advice to Bandits

In the expert advice setting, after choosing an action the agent sees the loss of every expert. This full information made exploitation straightforward.

In the bandit setting, the agent only observes the loss of the arm it chose. All other arms remain hidden. This forces a genuine exploration-exploitation tradeoff: to learn which arm is best, you must pull arms you might not otherwise want to pull.

The adversarial MAB assumes the loss sequence is generated by an oblivious adversary, just like before. There are $K$ arms, and at each step:

Agent picks arm $A_t \in [K]$
Agent observes only $Y_{A_t, t}$ (loss of chosen arm)
Losses $Y_{j,t}$ for $j \neq A_t$ remain hidden

The Estimation Problem

Since we only observe the chosen arm, we need to estimate the losses of unchosen arms. Define the importance-weighted estimator:

\[\hat Y_{j,t} = \frac{\mathbf{1}_{A_t = j} \cdot Y_t}{p_{j,t}}\]

where $p_{j,t} = \mathbb P[A_t = j \mid I_{t-1}]$ is the probability of choosing arm $j$ at time $t$.

This estimator is unbiased:

\[\mathbb E_t[\hat Y_{j,t} \mid I_{t-1}] = \frac{p_{j,t} \cdot Y_{j,t}}{p_{j,t}} + \sum_{i \neq j} \frac{p_{i,t} \cdot 0}{p_{j,t}} = Y_{j,t}\]

The issue: the variance can be very large when $p_{j,t}$ is small (rarely explored arms get high-variance estimates).

Loss-Based Estimator

Working with losses instead of rewards, define $Z_t = 1 - Y_t$ and $\hat Z_t = 1 - \hat Y_t$. The cumulative loss estimate for expert $i$ up to time $t$:

\[\hat L_{i,t} = \sum_{s=1}^t \hat Z_{i,s}\]

Unbiasedness: $\mathbb E[\hat L_{i,t}] = L_{i,t}$ (follows from linearity and the unbiased estimator above).

Algorithm 5: EXP3

EXP3 stands for Exponential algorithm for Exploration and Exploitation.

Require: Number of arms $K$, horizon $n$, learning rate $\eta \in (0, 1)$

Initialize: $\hat L_{j,0} = 0$ for all $j$

For $t = 1$ to $n$ do:

For $j = 1$ to $K$: compute

\[p_{j,t} = \frac{\exp(\eta \cdot \hat L_{j,t-1})}{\sum_{i=1}^K \exp(\eta \cdot \hat L_{i,t-1})}\]

Sample $A_t \sim (p_{1,t}, \ldots, p_{K,t})$
Observe $Y_{A_t, t}$
For $j = 1$ to $K$: update $\hat L_{j,t} = \hat L_{j,t-1} + \hat Y_{j,t}$

End for

EXP3 is exactly Algorithm 4 (Randomized EWMA) but using the estimated loss $\hat{L}$ instead of the true loss $L$ at each step.

Regret Bound

Theorem 2.1. The expected regret of EXP3 satisfies:

\[\mathbb{E}\!\left[R_n^{\text{EXP3}}\right] \leq \frac{\ln K}{\eta} + \eta n K\]

Setting $\eta = \sqrt{\frac{\ln K}{nK}}$:

\[\mathbb{E}\!\left[R_n^{\text{EXP3}}\right] \leq 2\sqrt{nK \ln K}\]

Compare this to REWMA’s bound of $\sqrt{\frac{n \ln K}{2}}$. EXP3 pays an extra factor of $\sqrt{K}$ due to the information loss from partial feedback.

Proof

We track the weight potential $w_t = e^{\eta n} \sum_{i=1}^K e^{-\eta \hat L_{i,t}}$.

Lower bound. For any expert $i$:

\[w_n \geq e^{\eta n} e^{-\eta \hat L_{i,n}}\]

Upper bound. Analyze the ratio $w_t / w_{t-1}$:

\[\frac{w_t}{w_{t-1}} = e^{\eta} \sum_{j=1}^K p_{j,t} e^{\eta(1 - \hat Y_{j,t})}\]

For $x \leq 1$, we have $1 + x \leq e^x \leq 1 + x + x^2$. Applying this with $x = \eta(1 - \hat Y_{j,t})$:

\[\frac{w_t}{w_{t-1}} \leq \exp\!\left\{\eta \sum_{j=1}^K p_{j,t}(1 - \hat Y_{j,t}) + \eta^2 \sum_{j=1}^K p_{j,t}(1 - \hat Y_{j,t})^2\right\}\]

Telescoping over $n$ steps and applying Lemma A.1 (which bounds $\mathbb{E}\sum p_{j,t}(1-\hat Y_{j,t})^2 \leq Kn$):

\[\frac{w_n}{K} \leq \exp\!\left\{\eta \sum_{t,j} p_{j,t}(1-\hat Y_{j,t}) + \eta^2 Kn\right\}\]

Combining with the lower bound and simplifying, using $\sum_j p_{j,t}(1-\hat Y_{j,t}) = n - \hat L_n$ and $n - \hat L_{i,n} \leq n - \hat L_n + \hat L_n - \hat L_{i,n}$:

\[\hat L_n - \hat L_{i,n} \leq \frac{\ln K}{\eta} + \eta Kn\]

Taking expectation (using $\mathbb E[\hat L_{i,n}] = L_{i,n}$ and $\mathbb E[\hat L_n] = \mathbb E[L_n]$):

\[\mathbb{E}\!\left[R_{i,n}^{\text{EXP3}}\right] = \hat L_n - \hat L_{i,n} \leq \frac{\ln K}{\eta} + \eta Kn \quad \square\]

Key Observations

The extra $\sqrt{K}$ factor is unavoidable. The minimax regret for adversarial bandits is $\Theta(\sqrt{nK})$, whereas for full-information experts it is $\Theta(\sqrt{n \ln K})$. The bandit setting is fundamentally harder.

EXP3 is not anytime (it requires knowing $n$ to set $\eta$). An anytime version can be obtained by decaying $\eta_t$.

No high-probability guarantee comes for free in the adversarial bandit case, since we need the estimated losses to concentrate. This requires more careful analysis than in the expert setting.