Contextual Bandits and Chapter 2 Summary

Chapter 2 — Multi-Armed Bandits

Book: Reinforcement Learning: An Introduction (Sutton & Barto, 2nd ed) Pages: 41–44


Associative Search (Contextual Bandits)

The k-armed bandit has no state — the optimal action is the same every time step.

Associative search (contextual bandits): different situations → different optimal actions. The agent receives a context (state) x_t before selecting an action, and must learn a mapping: context → best action.

Context x_t observed → A_t = π(x_t) → Reward R_t

Position in the RL Spectrum

                ← Stateless ─────────────────── Full RL →
k-armed bandit    Contextual bandits    Markov Decision Process
  (no state)      (state, no dynamics)   (state + dynamics)

Contextual bandits are the intermediate case: state exists (so the agent can generalize across situations), but actions don't affect future states (no dynamics). The agent only has to learn what to do in each state, not the consequences of its actions over time.

Examples:

- Clinical trials: patient features (age, sex, biomarkers) → treatment choice
- News article recommendation: user profile → article to show
- Ad placement: user context → which ad to display
- DynamICCL: current network state → NCCL configuration
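The interaction protocol is just observe-context, act, observe-reward. A minimal ε-greedy sketch for discrete contexts, assuming a hypothetical env with context() and pull(a) methods (neither interface is from the book):

    import numpy as np

    def run_contextual_bandit(env, n_contexts, n_actions,
                              steps=10_000, eps=0.1, alpha=0.1):
        # One row of action-value estimates per context
        Q = np.zeros((n_contexts, n_actions))
        rng = np.random.default_rng(0)
        for _ in range(steps):
            x = env.context()                     # observe context x_t
            if rng.random() < eps:
                a = int(rng.integers(n_actions))  # explore uniformly
            else:
                a = int(np.argmax(Q[x]))          # exploit: A_t = argmax_a Q(x, a)
            r = env.pull(a)                       # reward depends on (x, a) only
            Q[x, a] += alpha * (r - Q[x, a])      # incremental update
        return Q

With continuous or high-dimensional contexts, the table Q is replaced by a function approximator, which is where LinUCB and the neural variants below come in.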

The Full RL Distinction

Problem             State?          Actions affect future states?
k-armed bandit      No              No
Contextual bandit   Yes (context)   No
MDP (full RL)       Yes             Yes

In full RL, the agent must also learn the dynamics — how actions affect future states and thus future rewards. This requires the additional machinery of value functions and Bellman equations (Ch.3+).


Algorithms for Contextual Bandits

LinUCB (Linear UCB)

Assumes reward is linear in context: E[R | x, a] = x^T θ_a

A_t = argmax_a [x_t^T θ_a + α √(x_t^T A_a^{-1} x_t)]

where A_a = X_a^T X_a + I is the regularized design (Gram) matrix built from X_a, the matrix of contexts observed when action a was chosen. The second term is the width of the confidence ellipsoid in the direction of x_t.

Achieves O(d √(T log T)) regret for d-dimensional contexts.
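A minimal sketch of the disjoint-models variant (one θ_a per action), with the matrix inverse recomputed each step for clarity rather than maintained incrementally:

    import numpy as np

    class LinUCB:
        # Disjoint LinUCB: a separate ridge-regression model per action (sketch)
        def __init__(self, n_actions, d, alpha=1.0):
            self.alpha = alpha
            self.A = [np.eye(d) for _ in range(n_actions)]    # A_a = I + Σ x x^T
            self.b = [np.zeros(d) for _ in range(n_actions)]  # b_a = Σ r x

        def select(self, x):
            scores = []
            for A_a, b_a in zip(self.A, self.b):
                A_inv = np.linalg.inv(A_a)
                theta = A_inv @ b_a                 # ridge estimate of θ_a
                width = np.sqrt(x @ A_inv @ x)      # confidence ellipsoid radius
                scores.append(x @ theta + self.alpha * width)
            return int(np.argmax(scores))

        def update(self, a, x, r):
            self.A[a] += np.outer(x, x)
            self.b[a] += r * x

In practice A_a^{-1} is usually updated incrementally (e.g., via the Sherman-Morrison formula) instead of being inverted from scratch each step.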

Thompson Sampling (Bayesian Approach)

Maintain a posterior P(θ | data). At each step:

  1. Sample θ̃ from the posterior
  2. Select A_t = argmax_a E[R | x_t, a, θ̃]
  3. Observe the reward and update the posterior

For Gaussian rewards + Gaussian prior: posterior is also Gaussian (conjugate).

Properties:

- Achieves O(√(dT log T)) regret (comparable to UCB)
- Often outperforms UCB empirically
- Naturally handles nonstationarity through continual posterior updating
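A sketch of the conjugate linear-Gaussian case, assuming unit observation noise (a simplifying assumption, not from the text):

    import numpy as np

    class LinearThompson:
        # Per-action Gaussian posterior: N(A_a^{-1} b_a, A_a^{-1}) under unit noise
        def __init__(self, n_actions, d):
            self.A = [np.eye(d) for _ in range(n_actions)]    # posterior precision
            self.b = [np.zeros(d) for _ in range(n_actions)]

        def select(self, x, rng):
            scores = []
            for A_a, b_a in zip(self.A, self.b):
                cov = np.linalg.inv(A_a)
                theta_tilde = rng.multivariate_normal(cov @ b_a, cov)  # 1. sample θ̃
                scores.append(x @ theta_tilde)                         # 2. score actions
            return int(np.argmax(scores))

        def update(self, a, x, r):                                     # 3. posterior update
            self.A[a] += np.outer(x, x)
            self.b[a] += r * x

The randomness in the θ̃ sample does the exploring: actions with wide posteriors occasionally draw optimistic samples and get tried.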

Neural Contextual Bandits (NeuralUCB, NeuralTS)

Replace the linear model with a neural network f(x, a; θ). Confidence estimates come from the network itself:

- NeuralUCB: a UCB-style bonus computed from the network's gradient (neural tangent) features
- NeuralTS: Thompson sampling from an approximate posterior (e.g., a Laplace or variational approximation)

These are active research areas connecting bandits to deep RL.


Chapter 2 Summary: The Four Key Methods

Method            Core idea                        Regret     Nonstationary?        Tuning
ε-greedy          Fixed random exploration         O(T)       With EWMA             ε
Optimistic init   Explore until estimates settle   O(√T)      No                    Initial value
UCB               Explore uncertain actions        O(log T)   Harder                c
Gradient bandit   Gradient ascent on preferences   O(√T)      With sliding window   α

Theoretical ranking: UCB > Gradient ≈ Optimistic > ε-greedy in regret guarantees.

Practical ranking: highly problem-dependent. Thompson Sampling often best in practice.
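To make the comparison concrete, here are the four selection rules as code (a sketch; Q, N, and H are the running value estimates, action counts, and preferences, and all constants are illustrative):

    import numpy as np
    rng = np.random.default_rng(0)

    def eps_greedy(Q, eps=0.1):
        # Fixed random exploration with probability ε
        return int(rng.integers(len(Q))) if rng.random() < eps else int(np.argmax(Q))

    # Optimistic init uses the same greedy rule but starts Q = np.full(k, Q0)
    # with a large Q0, so every arm looks good until its estimate settles.

    def ucb(Q, N, t, c=2.0):
        # Uncertainty bonus shrinks as N[a] grows; untried arms get a huge bonus
        return int(np.argmax(Q + c * np.sqrt(np.log(t + 1) / np.maximum(N, 1e-12))))

    def gradient_bandit(H):
        # Sample from a softmax over preferences H, which are updated by
        # gradient ascent against a reward baseline
        pi = np.exp(H - H.max())
        pi /= pi.sum()
        return int(rng.choice(len(H), p=pi))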


The Exploration-Exploitation Spectrum Revisited

Pure Exploitation          Principled Exploration        Pure Exploration
      │                            │                            │
   Greedy              UCB / Thompson Sampling             Random (ε=1)
   (ε=0)              (confidence-based)
                              │
                      ε-greedy + Optimistic Init
                    (simple but widely used in RL)

Key lessons from Ch.2:

  1. Greedy is not enough: without exploration, you converge to suboptimal actions.

  2. Random exploration is wasteful: ε-greedy doesn’t direct exploration intelligently.

  3. UCB principle — “optimism in the face of uncertainty” — is theoretically optimal and intuitively satisfying. The principle extends beyond bandits to all of RL.

  4. Baselines matter: gradient bandit requires a reward baseline to reduce variance. This previews the importance of value function baselines in actor-critic methods.

  5. Nonstationary problems require recency weighting: a constant step size α, rather than the sample-average step 1/n, keeps recent rewards dominant in environments that change over time (sketched below).
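A minimal comparison of the two update rules; the constant-α estimate is an exponential recency-weighted average, while the sample average effectively freezes as n grows:

    import numpy as np

    def track(rewards, alpha=0.1):
        q_const, q_avg = 0.0, 0.0
        for n, r in enumerate(rewards, start=1):
            q_const += alpha * (r - q_const)  # constant α: the k-th most recent reward gets weight α(1-α)^k
            q_avg   += (r - q_avg) / n        # sample average (step 1/n): all rewards weighted equally
        return q_const, q_avg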


Connection to Full RL

Each bandit technique has a direct RL counterpart:

Bandit (Ch.2)                 Full RL equivalent
ε-greedy action selection     ε-greedy in Q-learning, SARSA
Optimistic initialization     Optimistic Q-value initialization
UCB action selection          Count-based exploration bonuses (Ch.8+)
Gradient bandit preferences   Policy gradient / REINFORCE (Ch.13)
Baseline R̄_t                  Value function baseline V(s) in A2C
Contextual bandit             Full MDP (add dynamics)

The bandit setting strips away the complexity of credit assignment across time, letting us focus purely on the exploration-exploitation tradeoff. Mastering these algorithms builds intuition that directly transfers to the full RL setting.


Connection to DynamICCL

DynamICCL uses PPO (policy gradient with value baseline):

- Softmax policy over NCCL configurations: the gradient bandit (this chapter) is the prototype
- Value baseline: V(s) in PPO plays exactly the role of the baseline R̄ from gradient bandits, extended to state-conditioned values
- Exploration: PPO's entropy regularization is roughly a KL penalty toward the uniform policy, preventing the policy from becoming too greedy too quickly
- Contextual structure: DynamICCL's state (network topology, traffic) is exactly a "context" in the contextual bandit sense; PPO adds temporal credit assignment on top
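To show how these pieces fit together, a hedged sketch of the per-batch loss terms in a PPO-style objective (illustrative pseudocode, not DynamICCL's actual implementation; the clip range and entropy coefficient are assumed defaults):

    import numpy as np

    def ppo_loss(logp_new, logp_old, values, returns, probs,
                 clip=0.2, ent_coef=0.01):
        adv = returns - values                   # V(s) as baseline: the R̄ of Ch.2
        ratio = np.exp(logp_new - logp_old)      # importance ratio π_new / π_old
        policy_loss = -np.minimum(
            ratio * adv,
            np.clip(ratio, 1 - clip, 1 + clip) * adv).mean()
        entropy = -(probs * np.log(probs + 1e-12)).sum(axis=-1).mean()
        return policy_loss - ent_coef * entropy  # entropy bonus delays greediness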