Online Learning and Optimization (EE6106)

The central question throughout: can a learning agent compete with the best fixed strategy in hindsight, and how quickly? The answer depends on what the agent observes and what the adversary knows.


Part 1: Prediction via Expert Advice

The agent observes $K$ experts at each step and must predict a sequence against an adversary. The challenge is combining expert advice optimally without knowing which expert is best.

Lecture Topic Date
1 Majority Algorithm 7 Jan 2026
2 Weighted Majority Algorithm 9 Jan 2026
3 Exponential WMA 14 Jan 2026
4 Randomized Exponential WMA 16 Jan 2026

Key result: Sub-linear regret $O(\sqrt{n \ln K})$ is achievable with convex loss or a randomized algorithm. A deterministic algorithm against a fully adaptive adversary with $0$-$1$ loss cannot escape linear regret.


Part 2: Multi-Armed Bandits

The agent pulls one arm per step and observes only its reward. Partial feedback forces a genuine exploration-exploitation tradeoff.

Adversarial Bandits

Lecture Topic Date
5 EXP3: Adversarial Bandits 21 Jan 2026

Stochastic Bandits

Lecture Topic Date
6 Setup and Full-Info Setting 23 Jan 2026
7 Explore-then-Commit (ETC) 28 Jan 2026
8 Upper Confidence Bound (UCB) 30 Jan 2026
9 Anytime UCB 4 Feb 2026
10 Epsilon-Greedy 11 Feb 2026

Lower Bounds and Pure Exploration

Lecture Topic Date
11 Information-Theoretic Lower Bounds 13, 18 Feb 2026
12 Pure Exploration: Fixed Budget 11 Mar 2026
13 Pure Exploration: Fixed Confidence 13 Mar 2026

Key results: Logarithmic regret $O(\ln n / \Delta)$ is achievable and optimal for stochastic MABs. Adversarial bandits pay an extra $\sqrt{K}$ factor. Minimax regret is $\Omega(\sqrt{nK})$.


Part 3: Markov Decision Processes

Actions affect future states. The agent must plan, not just react. When the model is unknown, this becomes reinforcement learning.

Lecture Topic Date
14 MDP Basics and Value Functions 20 Mar 2026
15 Bellman Equations 25 Mar 2026
16 Bellman Operators and Value Iteration 27 Mar 2026
17 LP Approach and Q-Functions 1 Apr 2026
18 Policy Iteration 3 Apr 2026
19 Expected Time-Average MDPs 8 Apr 2026
20 Reinforcement Learning 10 Apr 2026

Key results: Discounted MDPs always have an optimal stationary deterministic policy. Value Iteration and Policy Iteration converge via the Bellman contraction. The undiscounted (ETA) case requires additional structural conditions (irreducibility, bounded hitting times).


Course Arc

Setting Feedback Adversary Best achievable regret
Expert Advice Full Oblivious $O(\sqrt{n \ln K})$
Adversarial MAB Partial Oblivious $O(\sqrt{nK \ln K})$
Stochastic MAB Partial Stochastic $O(\ln n / \Delta)$
MDP Full state Stochastic Planning (Value/Policy Iteration)
RL Full state Stochastic, unknown Exploration + planning