Online Learning and Optimization (EE6106)
The central question throughout: can a learning agent compete with the best fixed strategy in hindsight, and how quickly? The answer depends on what the agent observes and what the adversary knows.
Part 1: Prediction via Expert Advice
The agent observes $K$ experts at each step and must predict a sequence against an adversary. The challenge is combining expert advice optimally without knowing which expert is best.
| Lecture | Topic | Date |
|---|---|---|
| 1 | Majority Algorithm | 7 Jan 2026 |
| 2 | Weighted Majority Algorithm | 9 Jan 2026 |
| 3 | Exponential WMA | 14 Jan 2026 |
| 4 | Randomized Exponential WMA | 16 Jan 2026 |
Key result: Sub-linear regret $O(\sqrt{n \ln K})$ is achievable with convex loss or a randomized algorithm. A deterministic algorithm against a fully adaptive adversary with $0$-$1$ loss cannot escape linear regret.
Part 2: Multi-Armed Bandits
The agent pulls one arm per step and observes only its reward. Partial feedback forces a genuine exploration-exploitation tradeoff.
Adversarial Bandits
| Lecture | Topic | Date |
|---|---|---|
| 5 | EXP3: Adversarial Bandits | 21 Jan 2026 |
Stochastic Bandits
| Lecture | Topic | Date |
|---|---|---|
| 6 | Setup and Full-Info Setting | 23 Jan 2026 |
| 7 | Explore-then-Commit (ETC) | 28 Jan 2026 |
| 8 | Upper Confidence Bound (UCB) | 30 Jan 2026 |
| 9 | Anytime UCB | 4 Feb 2026 |
| 10 | Epsilon-Greedy | 11 Feb 2026 |
Lower Bounds and Pure Exploration
| Lecture | Topic | Date |
|---|---|---|
| 11 | Information-Theoretic Lower Bounds | 13, 18 Feb 2026 |
| 12 | Pure Exploration: Fixed Budget | 11 Mar 2026 |
| 13 | Pure Exploration: Fixed Confidence | 13 Mar 2026 |
Key results: Logarithmic regret $O(\ln n / \Delta)$ is achievable and optimal for stochastic MABs. Adversarial bandits pay an extra $\sqrt{K}$ factor. Minimax regret is $\Omega(\sqrt{nK})$.
Part 3: Markov Decision Processes
Actions affect future states. The agent must plan, not just react. When the model is unknown, this becomes reinforcement learning.
| Lecture | Topic | Date |
|---|---|---|
| 14 | MDP Basics and Value Functions | 20 Mar 2026 |
| 15 | Bellman Equations | 25 Mar 2026 |
| 16 | Bellman Operators and Value Iteration | 27 Mar 2026 |
| 17 | LP Approach and Q-Functions | 1 Apr 2026 |
| 18 | Policy Iteration | 3 Apr 2026 |
| 19 | Expected Time-Average MDPs | 8 Apr 2026 |
| 20 | Reinforcement Learning | 10 Apr 2026 |
Key results: Discounted MDPs always have an optimal stationary deterministic policy. Value Iteration and Policy Iteration converge via the Bellman contraction. The undiscounted (ETA) case requires additional structural conditions (irreducibility, bounded hitting times).
Course Arc
| Setting | Feedback | Adversary | Best achievable regret |
|---|---|---|---|
| Expert Advice | Full | Oblivious | $O(\sqrt{n \ln K})$ |
| Adversarial MAB | Partial | Oblivious | $O(\sqrt{nK \ln K})$ |
| Stochastic MAB | Partial | Stochastic | $O(\ln n / \Delta)$ |
| MDP | Full state | Stochastic | Planning (Value/Policy Iteration) |
| RL | Full state | Stochastic, unknown | Exploration + planning |