Pure Exploration: Fixed Budget
Pure Exploration: Fixed Budget
A Different Objective
All previous MAB algorithms aimed to maximize cumulative reward (minimize regret). Pure exploration problems have a different goal: identify the best arm, not maximize reward along the way.
In the fixed budget variant:
- You are given a total budget of $n$ arm pulls.
- After exhausting the budget, you must output an estimate $\hat{a}$ of the best arm $a^\ast = \arg\max_i \mu_i$.
- The metric is the error probability $p_e = \mathbb{P}{\hat{a} \neq a^\ast}$.
The aim is to minimize $p_e$ within the budget.
Algorithm 11: Uniform Exploration
The simplest approach: pull each arm equally and pick the empirically best.
Require: Number of arms $K$, budget $n$, arm distributions $\nu = {\nu_i}$
- Pull each arm $n/K$ times (assume $K \mid n$ for simplicity)
- Return $\hat{a} = \arg\max_i \hat{\mu}_i$
Error Analysis
For 1-sub-Gaussian arms, arm $i$ is incorrectly selected over arm 1 only if $\hat{\mu}_i \geq \hat{\mu}_1$. By a union bound:
\[p_e^{\text{UE}} = \sum_{i=2}^K \mathbb{P}(\hat{\mu}_i \geq \hat{\mu}_1)\] \[= \sum_{i=2}^K \mathbb{P}\!\left\{(\hat{\mu}_1 - \mu_1) + (\mu_i - \hat{\mu}_i) \leq -\Delta_i\right\}\]Splitting the deviation using union bound:
\[p_e^{\text{UE}} \leq \sum_{i=2}^K \left[\mathbb{P}\!\left(\hat{\mu}_1 - \mu_1 \leq \frac{-\Delta_i}{2}\right) + \mathbb{P}\!\left(\mu_i - \hat{\mu}_i \leq \frac{-\Delta_i}{2}\right)\right]\]Each term is bounded by $\exp(-\Delta_i^2 n / (8K))$ using Hoeffding’s inequality with $n/K$ samples. Therefore:
\[p_e^{\text{UE}} \leq 2\sum_{i=2}^K \exp\!\left\{-\frac{\Delta_i^2 n}{8K}\right\} \leq 2K\exp\!\left\{-\frac{\Delta_2^2 n}{8K}\right\}\]where $\Delta_2 = \mu_1 - \mu_2$ is the smallest sub-optimality gap. The error decays exponentially in $n$, at rate $\Delta_2^2 / (8K)$.
Key observation: Arms that are highly sub-optimal ($\Delta_i$ large) are easy to distinguish and contribute little to $p_e$. The bottleneck is the closest competitor, arm 2.
The Problem with Uniform Exploration
Uniform exploration gives every arm the same number of pulls regardless of how sub-optimal it is. If arm $K$ has $\Delta_K = 0.9$ (clearly bad), pulling it $n/K$ times is wasteful. We should concentrate on the close competition at the top.
Algorithm 12: Successive Rejects
Successive Rejects (SR) runs a tournament in phases. In each phase, every remaining arm gets some pulls, and the worst-performing arm is eliminated.
Require: Number of arms $K$, budget $n$, arm distributions $\nu = {\nu_i}$
Let $\bar{\log}(K) = \frac{1}{2} + \sum_{i=2}^K \frac{1}{i}$ and define the elimination schedule:
\[n_i = \left\lceil\frac{n - K}{\bar{\log}(K)(K - i + 1)}\right\rceil \quad \text{for } i \in [K-1]\](Total constraint: $\sum_{i=1}^{K-2} n_i + 2n_{K-1} \leq n$)
For $i = 1$ to $K-1$ do:
- Phase $i$ begins.
- Give each remaining arm $n_i - n_{i-1}$ additional pulls (so each arm has $n_i$ total pulls).
- Eliminate arm $j = \arg\min_j \hat{\mu}_j(n_i)$ (the worst-performing remaining arm).
End for
Return the sole surviving arm.
Why This Schedule?
$n_i \propto \frac{1}{K - i + 1}$: the number of pulls in phase $i$ is inversely proportional to the number of surviving arms. Early phases (many arms) get fewer pulls per arm; later phases (few arms remain) get more. This concentrates budget on the hard discrimination problem at the top.
If $n_1 = n_2 = \ldots = n_{K-1} = n/K$, this reduces to Uniform Exploration.
Error Analysis for Successive Rejects
Define error event $E_i$: the optimal arm 1 gets eliminated in phase $i$.
\[p_e^{\text{SR}} = \sum_{i=1}^{K-1} \mathbb{P}(E_i)\]Phase $i$ intuition: At phase $i$, exactly $i-1$ arms have been eliminated. So at least one of the worst $i$ arms ($K-i+1$ through $K$ in terms of rank) is still in the competition. If arm 1 gets eliminated, it lost to at least one of these worst $i$ arms.
\[\mathbb{P}(E_i) \leq \mathbb{P}\!\left(\bigcup_{j=K-i+1}^K \hat{\mu}_1(n_i) \leq \hat{\mu}_j(n_i)\right) \leq \sum_{j=K-i+1}^K \mathbb{P}(\hat{\mu}_1(n_i) \leq \hat{\mu}_j(n_i))\]Each term is bounded by $2\exp{-n_i \Delta_j^2 / 8}$. Since $\Delta_j \geq \Delta_{K-i+1}$ for $j \geq K-i+1$:
\[\mathbb{P}(E_i) \leq 2i \exp\!\left\{\frac{-n_i \Delta_{K-i+1}^2}{8}\right\}\]Summing over phases:
\[\boxed{p_e^{\text{SR}} \leq \sum_{i=1}^{K-1} 2i \exp\!\left\{-\frac{n_i \Delta_{K-i+1}^2}{8}\right\}}\]Substituting the Schedule
With the schedule $n_i = \lceil (n-K) / (\bar{\log}(K)(K-i+1))\rceil$, define:
\[H_2 = \max_{i \geq 2} \frac{i}{\Delta_i^2}\]Then:
\[p_e^{\text{SR}} \leq 2K^2 \exp\!\left\{-\frac{(n-K)}{8\bar{\log}(K) H_2}\right\}\]Comparison: Uniform Exploration vs Successive Rejects
| Uniform Exploration | Successive Rejects | |
|---|---|---|
| Pulls per arm | Equal ($n/K$) | Adaptive (more on close arms) |
| Error decay rate | $\Delta_2^2 / (8K)$ | $1 / (8\bar{\log}(K) H_2)$ |
| Key complexity | $\Delta_2$ (closest competitor) | $H_2 = \max_i i/\Delta_i^2$ |
SR has a faster error decay rate in most instances (when the $H_2$ complexity is small). The quantity $H_2$ captures the effective hardness of the best-arm identification problem.
The ratio $H_2 = \max_i i / \Delta_i^2$ penalizes instances where many arms have small gaps. If all gaps are equal ($\Delta_i = \Delta$ for all $i$), then $H_2 = K/\Delta^2$, comparable to the uniform exploration regime.