Epsilon-Greedy
Epsilon-Greedy
The Simplest Adaptive Exploration Strategy
UCB achieves adaptive exploration through optimistic indices. Epsilon-Greedy takes a more direct approach: with probability $\epsilon_t$, explore uniformly at random; with probability $1 - \epsilon_t$, exploit the current best arm.
Algorithm 10: Epsilon-Greedy
Require: Number of arms $K$, horizon $n$, bias $\epsilon_0$
Initialize: Empirical mean of each arm $= 0$.
For $t = 1$ to $n$ do:
\[A_t = \begin{cases} \text{Uniform}([K]) & \text{w.p. } \epsilon_t \\ \arg\max_{i} \hat{\mu}_i & \text{w.p. } 1 - \epsilon_t \end{cases}\]Decay $\epsilon_t = \frac{1}{t}$
End for
The exploration probability $\epsilon_t = 1/t$ decays over time. Early on, the agent explores frequently. As estimates sharpen, exploitation dominates. The $1/t$ decay ensures each arm is explored infinitely often (almost surely), which is necessary for consistent estimation.
Intuition and Trade-offs
Why decay $\epsilon_t$? A fixed $\epsilon$ means the agent always wastes a constant fraction of pulls on random arms, giving linear regret. Decaying $\epsilon_t$ ensures the exploration cost is eventually negligible.
Why $1/t$ specifically? It is the boundary rate: slow enough to explore each arm infinitely often, but fast enough that the exploration cost ($\sum_t \epsilon_t$) diverges slowly (as $\ln n$).
Compared to UCB:
- UCB explores implicitly through uncertainty estimates; high-uncertainty arms get high indices automatically.
- $\epsilon$-greedy explores explicitly and uniformly among all arms, wasting pulls on clearly sub-optimal arms.
- UCB has provably logarithmic regret with matching constants. $\epsilon$-greedy requires more careful tuning to match UCB’s guarantees.
Regret Analysis (Sketch)
Performing a full regret analysis for $\epsilon$-greedy with $\epsilon_t = \epsilon_0/t$ (a more general form) involves:
-
Exploration cost: At each step $t$, with probability $\epsilon_t$ a random arm is pulled. The expected cost of random pulls is $\epsilon_t \cdot \bar{\Delta}$ where $\bar{\Delta}$ is the average sub-optimality gap. Total: $\sum_{t=1}^n \epsilon_t \cdot \bar{\Delta}$.
-
Exploitation cost: With probability $1 - \epsilon_t$, the empirically best arm is pulled. If estimates are good, this is the right arm. The probability of picking the wrong arm decays as $t$ grows.
The optimal $\epsilon_0$ depends on the instance ($\Delta$ values) and horizon $n$, making $\epsilon$-greedy harder to tune in practice.
Practical Considerations
$\epsilon$-greedy is simple to implement: no confidence interval computation, just a coin flip. This makes it popular in practice despite its weaker theoretical guarantees.
It handles non-stationarity better than UCB in some empirical settings, since the uniform exploration prevents the algorithm from getting stuck.
The decay schedule matters: $\epsilon_t = c/t$ for well-chosen $c$ can achieve logarithmic regret. The regret analysis parallels the ETC analysis but with the exploration interleaved throughout.
Relationship to Other Algorithms
| Algorithm | Exploration mechanism | Exploitation | Regret |
|---|---|---|---|
| ETC | Fixed upfront (first $mK$ steps) | Pure (last $n - mK$ steps) | $O(\ln n / \Delta)$ with known $\Delta$ |
| UCB | Implicit (high UCB $=$ high uncertainty) | Greedy on UCB index | $O(\ln n / \Delta)$ instance-independent |
| $\epsilon$-Greedy | Explicit random (w.p. $\epsilon_t$) | Greedy on empirical mean | $O(\ln n / \Delta)$ with tuned $\epsilon_0$ |
All three achieve logarithmic regret in the stochastic setting, but UCB does so with the weakest assumptions and no tuning on the instance parameters.