Pure Exploration: Fixed Confidence
Pure Exploration: Fixed Confidence
A Different Trade-off
In the fixed budget setting, the metric was error probability given a fixed pull count. The fixed confidence setting flips this: given a target confidence level, the algorithm decides adaptively when to stop and what the metric is expected stopping time $\mathbb E[\tau]$.
The algorithm outputs a triple: a sampling rule (which arm to pull), a stopping rule ($\tau$: when to stop), and a decision rule ($\hat{a}$: which arm to return at stopping).
Delta-Probably Correct
Definition. An algorithm is $\delta$-probably correct ($\delta$-PC) if:
\[\mathbb{P}\{\tau < \infty, \hat{a} \neq a^\ast\} \leq \delta\]That is, if the algorithm ever stops, the probability it returns the wrong arm is at most $\delta$.
Note: an algorithm that never stops is trivially $\delta$-PC (the probability of stopping and being wrong is 0). So the goal is to find a $\delta$-PC algorithm with small $\mathbb E[\tau]$.
Lower Bound on Stopping Time
Theorem 4.5. For a 1-Gaussian instance with means $\mu_1 \geq \mu_2 \geq \ldots \geq \mu_K$ (arm 1 optimal), any $\delta$-PC algorithm must satisfy:
\[\mathbb E[\tau] \geq 2\ln\!\left(\frac{1}{4\delta}\right) \sum_{i=1}^K \frac{1}{\Delta_i^2}\]where $\Delta_1 = \Delta_2 = \mu_1 - \mu_2$.
Proof sketch. For each arm $a \neq 1$, construct a perturbed instance $\mu^{[a]}$ where arm $a$ becomes optimal:
\[\mu_i^{[1]} = \begin{cases} \mu_2 - \epsilon & i = 1 \\ \mu_i & i \neq 1 \end{cases} \quad \mu_i^{[a]} = \begin{cases} \mu_a + \Delta_a + \epsilon & i = a \\ \mu_i & i \neq a \end{cases}\]For each $a$, define event $A = {\tau < \infty, \hat{a} \neq a^\ast_{\mu^{[a]}}}$.
Since the algorithm is $\delta$-PC, $\mathbb P_{\mu^{[a]}}(A) \leq \delta$.
Also, $\mathbb P_\mu(A^C) \leq \delta$ because in the original instance, returning arm $a$ as optimal would be wrong.
Applying BH inequality:
\[4\delta \geq \mathbb P_\mu(A^C) + \mathbb P_{\mu^{[a]}}(A) \geq \frac{1}{2}\exp\!\left\{-\mathbb E_\mu[N_a(\tau)] \cdot \text{KL}(\mu, \mu^{[a]})\right\}\]For 1-Gaussian arms with perturbation by $\epsilon \to 0$: $\text{KL}(\mu, \mu^{[a]}) \approx \Delta_a^2/2$.
So $\mathbb E_\mu[N_a(\tau)] \geq \frac{2\ln(1/(4\delta))}{\Delta_a^2}$.
Summing over all arms: $\mathbb E[\tau] = \sum_a \mathbb E[N_a(\tau)] \geq 2\ln!\left(\frac{1}{4\delta}\right)\sum_a \frac{1}{\Delta_a^2}$. $\square$
Algorithm 13: Action Elimination
Action Elimination samples arms in round-robin and eliminates arms whose UCB falls below the LCB of another arm.
Require: $K$, $\delta$, $\mu = {\mu_i}$
Initialize: Active set $A = [K]$, $t = 0$
For $t = 1, 2, \ldots$ do:
- Pull each arm in $A$ once
- For arm $a = 1$ to $K$: compute
\(\text{UCB}_a(t) = \hat\mu_a^t + \sqrt{\frac{2}{t}\ln\!\left(\frac{2Kt^2C}{\delta}\right)}, \quad \text{LCB}_a(t) = \hat\mu_a^t - \sqrt{\frac{2}{t}\ln\!\left(\frac{2Kt^2C}{\delta}\right)}\)
- Elimination set $E = {i \in A : \exists\, j \in A \text{ s.t. } \text{UCB}_i(t) \leq \text{LCB}_j(t)}$
- $A \leftarrow A \setminus E$
- If $\lvert A\rvert = 1$: return the sole arm and STOP
- Else $t \leftarrow t + 1$
End for
Algorithm 13 is $\delta$-PC
Claim 4.6. Algorithm 13 is $\delta$-PC.
Proof. Define the bad event $B_i(t) = {\lvert \hat{\mu}_i(t) - \mu_i\rvert > \alpha_t}$ where $\alpha_t = \sqrt{\frac{2}{t}\ln\frac{2Kt^2C}{\delta}}$.
\[\mathbb{P}(B_i(t)) \leq 2\exp\!\left\{\frac{-t\alpha_t^2}{2}\right\} = \frac{\delta}{Kt^2C}\]Taking a union bound over all arms and all time steps:
\[\mathbb{P}(B) = \mathbb{P}\!\left(\bigcup_{i \in [K]} \bigcup_{t=1}^\infty B_i(t)\right) \leq \frac{\delta}{C}\sum_{t=1}^\infty \frac{1}{t^2} \leq \delta \quad \text{for } C \geq \pi^2/6\]On $B^C$ (good event), all confidence intervals are valid. No correct arm gets eliminated since its LCB is above the UCB of no other arm. $\square$
Stopping Time Analysis
Arm $i$ gets eliminated when $\text{UCB}_i(t) \leq \text{LCB}_j(t)$ for some arm $j$. We know $\text{UCB}_i(t) - \text{LCB}_i(t) = 2\alpha_t$.
Once $\alpha_t \leq \Delta_i/4$, arm $i$ is guaranteed to be eliminated. This requires $T_i \sim O(1/\Delta_i^2)$ pulls. Total stopping time:
\[\tau \leq \sum_{i=1}^K T_i = O\!\left(\sum_i \frac{1}{\Delta_i^2} \ln\!\left[\frac{K}{\delta \Delta_i}\right]\right)\]Problem with Action Elimination: If multiple sub-optimal arms have the same distribution, their confidence intervals always overlap. The arm that knocked out the optimal arm might get eliminated, leaving only arms that cannot be distinguished from each other, and the algorithm never terminates. This is why LUCB is preferred.
Algorithm 14: LUCB
LUCB (Lower and Upper Confidence Bounds) addresses the termination problem by comparing only the top-two arms rather than all arms.
Define arm-dependent confidence width: $\alpha_{i,t} = \sqrt{\frac{1}{2N_i(t-1)}\ln!\left(\frac{cKt^4}{\delta}\right)}$
Require: $K$, $\delta$
Initialize: Active set $A = [K]$
For $t = 1, 2, \ldots$ do:
- Pull each arm once
- Define top arm: $a_\text{top} = \arg\max_{i \in [K]} \hat\mu_i(t)$
- Define competitor: $a_\text{comp} = \arg\max_{i \in [K],\, i \neq a_\text{top}} \hat\mu_i(t) + \alpha_{i,t}$
- Pull $a_\text{top}$ and $a_\text{comp}$ once each
- If $\text{LCB}{a\text{top}} > \text{UCB}{a\text{comp}}$: return $a_\text{top}$ and STOP
End for
Claims (Algorithm 14):
- Claim 4.7: Algorithm 14 stops with probability 1 (no termination failure).
- Claim 4.7: Algorithm 14 is $\delta$-PC.
Why LUCB Solves the Termination Problem
LUCB never eliminates arms. Instead, it identifies the best arm by asking: is the LCB of the top arm above the UCB of all competitors? If two arms have identical distributions, the top arm fluctuates between them, but eventually one will separate enough to stop the algorithm with probability 1.
Stopping Time Bound
Claim 4.8. For a variation of LUCB with $\alpha_{i,t} = \sqrt{\frac{2}{N_i(t-1)}\ln!\left(\frac{2KN_i^2(t-1)c}{\delta}\right)}$, arm $i$ is BAD if its confidence interval misbehaves. The algorithm stops when no arm is BAD:
\[\tau \leq \sum_{t=1}^\infty \sum_{i=1}^K \mathbf{1}\!\left\{(a_{\text{top}} = i \text{ or } a_{\text{comp}} = i) \text{ and } i \text{ is BAD}\right\} = \sum_{i=1}^K \tilde{T}_i\]Each $\tilde{T}_i$ is bounded by the number of pulls needed for arm $i$’s confidence interval to be valid with high probability.
Comparison of Fixed-Confidence Algorithms
| Algorithm | Termination guarantee | $\delta$-PC | Stopping time |
|---|---|---|---|
| Action Elimination | May not terminate (duplicate sub-optimal arms) | Yes | $O(\sum_i \frac{1}{\Delta_i^2} \ln \frac{K}{\delta\Delta_i})$ |
| LUCB | Always stops (w.p. 1) | Yes | Comparable to lower bound |
Lower bound: $\mathbb E[\tau] \geq 2\ln(1/(4\delta)) \sum_i 1/\Delta_i^2$
LUCB is near-optimal (matching the lower bound up to log factors), terminates with probability 1, and avoids the degenerate cases that trap Action Elimination.