Gradient Bandit Algorithms
Chapter 2 — Multi-Armed Bandits
Book: Reinforcement Learning: An Introduction (Sutton & Barto, 2nd ed) Pages: 37–41
Motivation: Learning Preferences, Not Values
All previous methods (ε-greedy, UCB) estimate action values Q(a) ≈ q*(a).
Alternative: learn action preferences H_t(a) — not directly interpretable as values, but used to compute a probability distribution over actions.
Why preferences?
- Sometimes relative preference is enough (absolute values are not needed)
- Gradient ascent on expected reward provides a principled learning rule
- Connects to policy gradient methods (Ch.13 preview)
Softmax Action Selection
Convert preferences H_t(a) to action probabilities via softmax:
π_t(a) = P(A_t = a) = exp(H_t(a)) / Σ_{b=1}^{k} exp(H_t(b))
Properties:
- π_t(a) ∈ (0, 1) for all a (every action has positive probability)
- Σ_a π_t(a) = 1
- Larger H_t(a) → higher probability; smaller H_t(a) → lower probability
- Only relative differences matter: H_t(a) + c gives the same probabilities for any constant c
Temperature interpretation: the standard softmax has “temperature” τ = 1. τ → 0: deterministic greedy; τ → ∞: uniform random. Gradient bandits implicitly use τ = 1 and learn H to achieve the right distribution.
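The normalization and shift-invariance properties are easy to check numerically. A minimal sketch (NumPy assumed; `softmax` is an illustrative helper, not from the text):

```python
import numpy as np

def softmax(h):
    """Numerically stable softmax: subtract max(h) before exponentiating.
    The shift leaves the probabilities unchanged (only differences matter)."""
    e = np.exp(h - h.max())
    return e / e.sum()

h = np.array([1.0, 2.0, 0.5])
p = softmax(h)
p_shifted = softmax(h + 100.0)   # add a constant c to every preference
# p equals p_shifted and p sums to 1: only relative differences in H matter
```

Subtracting the max inside `softmax` is the standard way to avoid overflow in `exp`; by the shift-invariance property it cannot change the resulting distribution.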
Stochastic Gradient Ascent on Expected Reward
Objective: maximize expected reward J(H) = Σ_a π(a) · q*(a)
Taking the gradient with respect to H_t(a):
∂J / ∂H_t(a) = Σ_b q*(b) · ∂π_t(b)/∂H_t(a)
Key identity for softmax gradients:
∂π_t(b)/∂H_t(a) = π_t(b) [1_{a=b} - π_t(a)]
Therefore:
∂J/∂H_t(a) = Σ_b q*(b) π_t(b) [1_{a=b} - π_t(a)]
= Σ_b [q*(b) - X] π_t(b) [1_{a=b} - π_t(a)]
for any constant X (the baseline term Σ_b π_t(b) [1_{a=b} - π_t(a)] = 0, so X cancels).
Setting X = R̄_t (a baseline), and using E[R_t | A_t = b] = q*(b), the π_t(b)-weighted sum over b becomes an expectation over the sampled action A_t:
∂J/∂H_t(a) = E[(R_t - R̄_t)(1_{A_t=a} - π_t(a))]
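The softmax gradient identity above can be verified numerically by comparing the analytic Jacobian with central finite differences (a sketch; NumPy assumed, preference values are arbitrary):

```python
import numpy as np

def softmax(h):
    e = np.exp(h - h.max())
    return e / e.sum()

h = np.array([0.3, -1.2, 0.8, 0.0])
pi = softmax(h)

# Analytic Jacobian from the identity: J[b, a] = pi[b] * (1_{a=b} - pi[a]),
# which in matrix form is diag(pi) - pi pi^T.
J_analytic = np.diag(pi) - np.outer(pi, pi)

# Central finite differences: perturb each preference H(a) by +/- eps
eps = 1e-6
J_numeric = np.zeros((4, 4))
for a in range(4):
    h_plus = h.copy();  h_plus[a] += eps
    h_minus = h.copy(); h_minus[a] -= eps
    J_numeric[:, a] = (softmax(h_plus) - softmax(h_minus)) / (2 * eps)
# J_analytic and J_numeric agree to finite-difference precision
```

Each column of the Jacobian sums to zero, reflecting that the probabilities always sum to 1 regardless of H.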
The Update Rule
Stochastic gradient ascent replaces the expectation with a sample:
H_{t+1}(A_t) = H_t(A_t) + α(R_t - R̄_t)(1 - π_t(A_t)) ← chosen action
H_{t+1}(a) = H_t(a) - α(R_t - R̄_t) π_t(a) ← for a ≠ A_t
Intuition:
- If R_t > R̄_t (reward above baseline): increase H_t(A_t), raising the probability of repeating A_t
- If R_t < R̄_t (reward below baseline): decrease H_t(A_t), lowering the probability of repeating A_t
- Non-selected actions move in the opposite direction, in proportion to their current probability
Compact form
H_{t+1}(a) = H_t(a) + α(R_t - R̄_t)(1_{A_t=a} - π_t(a)) for all a
This single update handles both the selected and non-selected actions.
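The compact form translates directly into one vectorized update. A sketch, assuming NumPy (`gradient_bandit_update` is an illustrative name, not from the text):

```python
import numpy as np

def gradient_bandit_update(H, a_t, r_t, r_bar, alpha):
    """One compact update: H(a) += alpha*(R_t - R_bar)*(1_{A_t=a} - pi(a)),
    applied to all actions at once. Returns the new preference vector."""
    e = np.exp(H - H.max())
    pi = e / e.sum()
    one_hot = np.zeros_like(H)
    one_hot[a_t] = 1.0                  # the indicator 1_{A_t=a}
    return H + alpha * (r_t - r_bar) * (one_hot - pi)
```

Note that the update vector (one_hot - pi) sums to zero, so the sum of preferences is preserved; consistent with shift invariance, only their differences evolve.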
The Baseline R̄_t
The baseline R̄_t can be any value that doesn’t depend on A_t. Standard choice: running average of past rewards.
R̄_{t+1} = R̄_t + (1/t)(R_t - R̄_t) ← incremental sample mean
or with constant stepsize β:
R̄_{t+1} = R̄_t + β(R_t - R̄_t)
Effect of baseline on variance:
- R̄_t = 0: update is unbiased but high variance (gradient estimates are noisy)
- R̄_t ≈ E[R_t]: update is unbiased and lower variance (the deviation signal is centered)
- The optimal baseline minimizes the variance of the gradient estimate
The baseline does not introduce bias: it subtracts a term whose expectation over A_t ~ π_t is zero, since E[1_{A_t=a} - π_t(a)] = π_t(a) - π_t(a) = 0 for every a. This is a critical property: baselines reduce variance without changing the expected direction of gradient ascent.
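Both claims can be checked by Monte Carlo: the baseline term has zero mean under A_t ~ π_t, and subtracting a baseline shrinks the variance of the sampled gradient when all mean rewards are positive (illustrative numbers; NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
H = np.array([0.5, -0.2, 1.0])
pi = np.exp(H - H.max()); pi /= pi.sum()

# Sample many actions and form the per-sample term (1_{A_t=a} - pi(a)).
n = 200_000
actions = rng.choice(3, size=n, p=pi)
indicator = np.eye(3)[actions]            # row t is the one-hot vector 1_{A_t=a}
score = indicator - pi
mean_term = score.mean(axis=0)            # ~0 for every a: no bias from a baseline

# All-positive mean rewards (hypothetical values) make the contrast stark:
q_star = np.array([4.0, 5.0, 4.5])
rewards = q_star[actions] + rng.normal(size=n)
g_no_base = rewards[:, None] * score                    # baseline = 0
g_base = (rewards - rewards.mean())[:, None] * score    # sample-mean baseline
# Same expected gradient, but much lower variance with the baseline
```

Because the score term averages to zero, any baseline independent of A_t leaves the mean gradient unchanged while it can remove most of the variance contributed by the absolute reward level.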
Pseudocode: Gradient Bandit
Initialize:
H(a) = 0 for all a ∈ {1,...,k} ← zero preferences
R_bar = 0 ← baseline estimate
t = 0
Loop:
t ← t + 1
Compute π(a) = exp(H(a)) / Σ_b exp(H(b)) for all a
Select A_t ~ π (sample from distribution)
Observe reward R_t
R_bar ← R_bar + (1/t)(R_t - R_bar) ← update baseline
H(A_t) ← H(A_t) + α(R_t - R_bar)(1 - π(A_t))
For all a ≠ A_t:
H(a) ← H(a) - α(R_t - R_bar) π(a)
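The pseudocode above maps almost line-for-line onto a runnable sketch (NumPy assumed; `run_gradient_bandit` and the arm values below are illustrative, not from the book):

```python
import numpy as np

def run_gradient_bandit(q_star, steps=1000, alpha=0.1, seed=0):
    """Gradient bandit with a running-average baseline.
    q_star holds the true mean reward of each arm; rewards are N(q*(a), 1)."""
    rng = np.random.default_rng(seed)
    k = len(q_star)
    H = np.zeros(k)                      # zero initial preferences
    r_bar = 0.0                          # baseline estimate
    for t in range(1, steps + 1):
        e = np.exp(H - H.max())
        pi = e / e.sum()                 # softmax action probabilities
        a = rng.choice(k, p=pi)          # sample A_t ~ pi
        r = rng.normal(q_star[a], 1.0)   # observe reward R_t
        r_bar += (r - r_bar) / t         # incremental-mean baseline update
        H += alpha * (r - r_bar) * ((np.arange(k) == a) - pi)  # compact update
    return H

H = run_gradient_bandit(np.array([4.0, 4.5, 5.5, 4.2]), steps=5000)
```

The compact vectorized update replaces the two-case pseudocode branch: the boolean one-hot `(np.arange(k) == a)` supplies the indicator for the chosen action.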
Performance and Comparison
On 10-Armed Testbed (q*(a) ~ N(4, 1)):
| Method | Notes |
|---|---|
| Gradient bandit, α=0.1, with baseline | Excellent — finds near-optimal quickly |
| Gradient bandit, α=0.1, no baseline | Poor — baseline is critical when mean rewards are far from zero |
| Gradient bandit, α=0.4, with baseline | Good but more noisy |
Key insight: without a baseline, if q*(a) > 0 for all actions (as in this case), gradient updates are dominated by the absolute reward level rather than relative performance. The baseline removes this bias and makes learning problem-scale-invariant.
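This effect can be reproduced in a few lines: with all-positive mean rewards, runs without a baseline tend to lock onto whichever arm happens to be pulled early, while the baselined version concentrates on the best arm (an illustrative experiment with hypothetical arm values, not the book's figure; NumPy assumed):

```python
import numpy as np

def run(use_baseline, q_star, steps=2000, alpha=0.1, seed=0):
    """Train one gradient bandit; return the final pi(best arm)."""
    rng = np.random.default_rng(seed)
    k = len(q_star)
    H = np.zeros(k)
    r_bar = 0.0
    for t in range(1, steps + 1):
        pi = np.exp(H - H.max()); pi /= pi.sum()
        a = rng.choice(k, p=pi)
        r = rng.normal(q_star[a], 1.0)
        if use_baseline:
            r_bar += (r - r_bar) / t        # running-average baseline; else 0
        H += alpha * (r - r_bar) * ((np.arange(k) == a) - pi)
    pi = np.exp(H - H.max()); pi /= pi.sum()
    return pi[np.argmax(q_star)]

q = np.array([4.0, 4.3, 5.0, 4.1])          # all-positive means, as on the shifted testbed
avg_with = np.mean([run(True, q, seed=s) for s in range(20)])
avg_without = np.mean([run(False, q, seed=s) for s in range(20)])
# Averaged over runs, the baselined version should place more probability
# on the best arm; without a baseline, updates ride the +4 reward level.
```

Without a baseline every pull raises the pulled arm's preference by roughly α·4, so the policy hardens quickly around early choices; the baseline centers the signal so only above-average rewards reinforce an arm.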
Convergence Analysis
The gradient bandit performs stochastic gradient ascent on:
J(H) = E_π[R_t] = Σ_a π(a) q*(a)
With appropriate stepsizes (Robbins-Monro conditions), H_t converges to a local maximum of J. For a unimodal bandit (one clearly best arm), this is the global optimum.
Policy gradient theorem (previewed here, proven in Ch.13): The update rule for H is an unbiased estimate of ∇J. This generalizes to full MDPs as REINFORCE.
Connection to Policy Gradient Methods (Ch.13 Preview)
The gradient bandit is a simplified version of the REINFORCE algorithm:
| Bandit | Full RL (Ch.13) |
|---|---|
| H_t(a) (action preferences) | θ, the policy parameters (logits of π) |
| π_t(a) = softmax(H) | π(a|s; θ) = softmax(linear(s)) |
| Update ∝ (R - R̄)(1_{A=a} - π_t(a)) | Update ∝ (G_t - b(s_t)) ∇ log π(A_t|s_t; θ) |
| R̄: baseline reduces variance | b(s_t) = V(s_t): state-value baseline |
The bandit case (no state, single step) isolates the policy gradient mechanism from the credit assignment complexity of full RL.
Connection to DynamICCL
DynamICCL’s PPO algorithm uses the same softmax policy + policy gradient mechanism. Understanding gradient bandits provides intuition for why baselines (value functions in actor-critic) are essential — they reduce variance in policy gradient estimates. In the NCCL tuning context, the “preference” H_t corresponds to the policy logits over NCCL algorithm choices.