Online Learning and Optimization (EE6106)

The central question throughout: can a learning agent compete with the best fixed strategy in hindsight, and how quickly? The answer depends on what the agent observes and what the adversary knows.

Part 1: Prediction via Expert Advice

The agent observes $K$ experts at each step and must predict a sequence against an adversary. The challenge is combining expert advice optimally without knowing which expert is best.

Lecture	Topic	Date
1	Majority Algorithm	7 Jan 2026
2	Weighted Majority Algorithm	9 Jan 2026
3	Exponential WMA	14 Jan 2026
4	Randomized Exponential WMA	16 Jan 2026

Key result: Sub-linear regret $O(\sqrt{n \ln K})$ is achievable with convex loss or a randomized algorithm. A deterministic algorithm against a fully adaptive adversary with $0$-$1$ loss cannot escape linear regret.

Part 2: Multi-Armed Bandits

The agent pulls one arm per step and observes only its reward. Partial feedback forces a genuine exploration-exploitation tradeoff.

Adversarial Bandits

Lecture	Topic	Date
5	EXP3: Adversarial Bandits	21 Jan 2026

Stochastic Bandits

Lecture	Topic	Date
6	Setup and Full-Info Setting	23 Jan 2026
7	Explore-then-Commit (ETC)	28 Jan 2026
8	Upper Confidence Bound (UCB)	30 Jan 2026
9	Anytime UCB	4 Feb 2026
10	Epsilon-Greedy	11 Feb 2026

Lower Bounds and Pure Exploration

Lecture	Topic	Date
11	Information-Theoretic Lower Bounds	13, 18 Feb 2026
12	Pure Exploration: Fixed Budget	11 Mar 2026
13	Pure Exploration: Fixed Confidence	13 Mar 2026

Key results: Logarithmic regret $O(\ln n / \Delta)$ is achievable and optimal for stochastic MABs. Adversarial bandits pay an extra $\sqrt{K}$ factor. Minimax regret is $\Omega(\sqrt{nK})$.

Part 3: Markov Decision Processes

Actions affect future states. The agent must plan, not just react. When the model is unknown, this becomes reinforcement learning.

Lecture	Topic	Date
14	MDP Basics and Value Functions	20 Mar 2026
15	Bellman Equations	25 Mar 2026
16	Bellman Operators and Value Iteration	27 Mar 2026
17	LP Approach and Q-Functions	1 Apr 2026
18	Policy Iteration	3 Apr 2026
19	Expected Time-Average MDPs	8 Apr 2026
20	Reinforcement Learning	10 Apr 2026

Key results: Discounted MDPs always have an optimal stationary deterministic policy. Value Iteration and Policy Iteration converge via the Bellman contraction. The undiscounted (ETA) case requires additional structural conditions (irreducibility, bounded hitting times).

Course Arc

Setting	Feedback	Adversary	Best achievable regret
Expert Advice	Full	Oblivious	$O(\sqrt{n \ln K})$
Adversarial MAB	Partial	Oblivious	$O(\sqrt{nK \ln K})$
Stochastic MAB	Partial	Stochastic	$O(\ln n / \Delta)$
MDP	Full state	Stochastic	Planning (Value/Policy Iteration)
RL	Full state	Stochastic, unknown	Exploration + planning