Gradient Bandit Algorithms

Chapter 2 — Multi-Armed Bandits

Book: Reinforcement Learning: An Introduction (Sutton & Barto, 2nd ed.), pages 37–41


Motivation: Learning Preferences, Not Values

All previous methods (ε-greedy, UCB) estimate action values Q(a) ≈ q*(a).

Alternative: learn action preferences H_t(a) — not directly interpretable as values, but used to compute a probability distribution over actions.

Why preferences?
- Sometimes knowing relative preference is enough (absolute values are not needed)
- Gradient ascent on expected reward provides a principled learning rule
- Connects to policy gradient methods (Ch. 13 preview)


Softmax Action Selection

Convert preferences H_t(a) to action probabilities via softmax:

π_t(a) = P(A_t = a) = exp(H_t(a)) / Σ_{b=1}^{k} exp(H_t(b))

Properties:
- π_t(a) ∈ (0, 1) for all a (every action keeps positive probability)
- Σ_a π_t(a) = 1
- Larger H_t(a) → higher probability; smaller H_t(a) → lower probability
- Only relative differences matter: H_t(a) + c gives the same probabilities for any constant c

Temperature interpretation: the standard softmax has “temperature” τ = 1. τ → 0: deterministic greedy; τ → ∞: uniform random. Gradient bandits implicitly use τ = 1 and learn H to achieve the right distribution.
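
A minimal NumPy sketch of this conversion (the max-subtraction inside exp exploits the shift-invariance noted above and avoids overflow; the helper name is illustrative):

```python
import numpy as np

def softmax_policy(H):
    """Convert preferences H (shape [k]) into action probabilities pi_t(a)."""
    # Subtracting max(H) adds a constant to every preference, which leaves the
    # probabilities unchanged but keeps exp() from overflowing.
    z = np.exp(H - np.max(H))
    return z / z.sum()

H = np.array([1.0, 2.0, 0.5])
print(softmax_policy(H))          # probabilities over the 3 actions, sum to 1
print(softmax_policy(H + 10.0))   # identical: only relative differences matter
```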


Stochastic Gradient Ascent on Expected Reward

Objective: maximize expected reward J(H) = Σ_a π(a) · q*(a)

Taking the gradient with respect to H_t(a):

∂J / ∂H_t(a) = Σ_b q*(b) · ∂π_t(b)/∂H_t(a)

Key identity for softmax gradients:

∂π_t(b)/∂H_t(a) = π_t(b) [1_{a=b} - π_t(a)]

Therefore:

∂J/∂H_t(a) = Σ_b q*(b) π_t(b) [1_{a=b} - π_t(a)]
            = Σ_b [q*(b) - X] π_t(b) [1_{a=b} - π_t(a)]

for any constant X (the baseline term Σ_b π_t(b) [1_{a=b} - π_t(a)] = 0, so X cancels).

Setting X = R̄_t (a baseline), and writing the π_t(b)-weighted sum as an expectation over A_t ~ π_t with R_t an unbiased sample of q*(A_t):

∂J/∂H_t(a) = E[(R_t - R̄_t)(1_{A_t=a} - π_t(a))]
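
A quick numerical check of the two facts used above (the softmax gradient identity and the cancellation of the constant X), using a random preference vector and stand-in values for q*; everything here is illustrative:

```python
import numpy as np

def softmax(H):
    z = np.exp(H - H.max())
    return z / z.sum()

rng = np.random.default_rng(0)
k = 5
H = rng.normal(size=k)
q = rng.normal(size=k)                 # stand-in for q*(b)
pi = softmax(H)

# Analytic gradient of J = sum_b pi(b) q(b), via the softmax identity:
#   dJ/dH(a) = sum_b q(b) pi(b) [1_{a=b} - pi(a)]
grad = np.array([np.sum(q * pi * ((np.arange(k) == a) - pi[a])) for a in range(k)])

# Finite-difference gradient of J for comparison
eps = 1e-6
grad_fd = np.zeros(k)
for a in range(k):
    Hp, Hm = H.copy(), H.copy()
    Hp[a] += eps
    Hm[a] -= eps
    grad_fd[a] = (softmax(Hp) @ q - softmax(Hm) @ q) / (2 * eps)
print(np.max(np.abs(grad - grad_fd)))  # agrees to numerical precision

# Subtracting any constant X from q leaves the gradient unchanged
grad_X = np.array([np.sum((q - 3.7) * pi * ((np.arange(k) == a) - pi[a])) for a in range(k)])
print(np.max(np.abs(grad - grad_X)))   # ~0
```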

The Update Rule

Stochastic gradient ascent replaces the expectation with a sample:

H_{t+1}(A_t) = H_t(A_t) + α(R_t - R̄_t)(1 - π_t(A_t))    ← chosen action
H_{t+1}(a)   = H_t(a)   - α(R_t - R̄_t) π_t(a)            ← for a ≠ A_t

Intuition:
- If R_t > R̄_t (reward above baseline): increase H_t(A_t) → higher probability of repeating A_t
- If R_t < R̄_t (reward below baseline): decrease H_t(A_t) → lower probability of repeating A_t
- Non-selected actions move in the opposite direction, in proportion to their current probability

Compact form

H_{t+1}(a) = H_t(a) + α(R_t - R̄_t)(1_{A_t=a} - π_t(a))   for all a

This single update handles both the selected and non-selected actions.
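
In NumPy, the compact form is a one-line update over the whole preference vector (a sketch; the function name is illustrative):

```python
import numpy as np

def gradient_bandit_update(H, pi, A_t, R_t, R_bar, alpha):
    """H <- H + alpha * (R_t - R_bar) * (1_{A_t=a} - pi(a)), applied to every a."""
    one_hot = np.zeros_like(H)
    one_hot[A_t] = 1.0                 # indicator of the selected action
    return H + alpha * (R_t - R_bar) * (one_hot - pi)
```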


The Baseline R̄_t

The baseline R̄_t can be any value that doesn’t depend on A_t. Standard choice: running average of past rewards.

R̄_{t+1} = R̄_t + (1/t)(R_t - R̄_t)    ← incremental sample mean

or with constant stepsize β:

R̄_{t+1} = R̄_t + β(R_t - R̄_t)

Effect of baseline on variance:
- R̄_t = 0: the update is unbiased but has high variance (gradient estimates are noisy)
- R̄_t ≈ E[R_t]: the update is still unbiased, with lower variance (the deviation signal is centered)
- The optimal baseline is the one that minimizes the variance of the gradient estimate

The baseline does not introduce bias: it subtracts a term whose expectation over A_t ~ π_t is zero, since Σ_b π_t(b)[1_{b=a} - π_t(a)] = π_t(a) - π_t(a) = 0. This is a critical property: baselines reduce variance without changing the expected update, i.e. without affecting the direction of gradient ascent.
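
A small Monte Carlo sketch of this claim, under assumed values (a fixed uniform policy over three arms with q* near +4, as in the shifted testbed): both choices of baseline estimate the same gradient, but the centered baseline shrinks the variance of the sampled update.

```python
import numpy as np

rng = np.random.default_rng(1)
k = 3
q_star = np.array([4.2, 4.0, 3.8])           # all mean rewards well above zero
pi = np.full(k, 1.0 / k)                     # fixed uniform policy (H = 0)

def sampled_updates(baseline, n=200_000):
    """Draw n samples of (R_t - baseline) * (1_{A_t=a} - pi(a)) for every a."""
    A = rng.choice(k, size=n, p=pi)
    R = rng.normal(q_star[A], 1.0)
    one_hot = np.eye(k)[A]
    return (R - baseline)[:, None] * (one_hot - pi)

g0 = sampled_updates(baseline=0.0)           # no baseline
gb = sampled_updates(baseline=q_star @ pi)   # baseline near E[R_t]

print(g0.mean(axis=0))                       # both means estimate the same gradient
print(gb.mean(axis=0))
print(gb.var(axis=0) / g0.var(axis=0))       # variance ratio well below 1
```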


Pseudocode: Gradient Bandit

Initialize:
  H(a) = 0 for all a ∈ {1,...,k}    ← zero preferences
  R_bar = 0                          ← baseline estimate
  t = 0

Loop:
  t ← t + 1
  Compute π(a) = exp(H(a)) / Σ_b exp(H(b))   for all a
  Select A_t ~ π (sample from distribution)
  Observe reward R_t

  R_bar ← R_bar + (1/t)(R_t - R_bar)          ← update baseline

  H(A_t) ← H(A_t) + α(R_t - R_bar)(1 - π(A_t))
  For all a ≠ A_t:
    H(a) ← H(a) - α(R_t - R_bar) π(a)
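
A self-contained NumPy sketch of this loop on a shifted Gaussian testbed (q*(a) ~ N(+4, 1), unit-variance rewards); the function name, defaults, and the returned metric are illustrative, not from the book:

```python
import numpy as np

def run_gradient_bandit(k=10, steps=1000, alpha=0.1, use_baseline=True, seed=0):
    """Gradient bandit following the pseudocode above; returns the fraction of
    steps on which the optimal action was selected."""
    rng = np.random.default_rng(seed)
    q_star = rng.normal(4.0, 1.0, size=k)        # true action values, shifted to +4
    best = np.argmax(q_star)

    H = np.zeros(k)                              # zero preferences
    R_bar = 0.0                                  # baseline estimate
    optimal = 0

    for t in range(1, steps + 1):
        z = np.exp(H - H.max())
        pi = z / z.sum()                         # softmax over preferences
        A = rng.choice(k, p=pi)                  # sample action from pi
        R = rng.normal(q_star[A], 1.0)           # observe reward

        if use_baseline:
            R_bar += (R - R_bar) / t             # incremental sample mean of rewards

        one_hot = np.zeros(k)
        one_hot[A] = 1.0
        H += alpha * (R - R_bar) * (one_hot - pi)  # compact update, all actions at once

        optimal += (A == best)

    return optimal / steps

print("with baseline:   ", run_gradient_bandit(use_baseline=True))
print("without baseline:", run_gradient_bandit(use_baseline=False))
```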

Performance and Comparison

On the 10-armed testbed with shifted means (q*(a) ~ N(+4, 1)):

- Gradient bandit, α = 0.1, with baseline: excellent; finds the near-optimal action quickly
- Gradient bandit, α = 0.1, without baseline: poor; the baseline is critical when mean rewards are far from zero
- Gradient bandit, α = 0.4, with baseline: good, but noisier

Key insight: without a baseline, when q*(a) > 0 for all actions (as in this testbed), the sampled updates are dominated by the absolute reward level rather than by relative performance, so the preference of whichever action was selected tends to rise regardless of how good it was. The baseline centers the updates, making learning invariant to a constant shift in the rewards and far less noisy.


Convergence Analysis

The gradient bandit performs stochastic gradient ascent on:

J(H) = E_π[R_t] = Σ_a π(a) q*(a)

With appropriately decreasing step sizes (Robbins-Monro conditions), the algorithm inherits the robust convergence properties of stochastic gradient ascent: the expected update always points up the gradient of J. When a single arm has the strictly highest q*, following that gradient drives π_t toward placing all of its probability on the optimal arm.

Policy gradient theorem (previewed here, proven in Ch.13): The update rule for H is an unbiased estimate of ∇J. This generalizes to full MDPs as REINFORCE.


Connection to Policy Gradient Methods (Ch.13 Preview)

The gradient bandit is a simplified version of the REINFORCE algorithm:

Bandit → Full RL (Ch. 13):
- Preference H_t(a) ↔ action preference h(s, a; θ)
- π_t(a) = softmax(H_t) ↔ π(a|s; θ) = softmax over h(s, ·; θ) (e.g., linear in state features)
- Update ∝ (R_t - R̄_t)(1_{A_t=a} - π_t(a)) ↔ update ∝ (G_t - b(S_t)) ∇ log π(A_t|S_t; θ)
- Baseline R̄_t reduces variance ↔ baseline b(S_t) ≈ V(S_t), a learned state-value estimate

The bandit case (no state, single step) isolates the policy gradient mechanism from the credit assignment complexity of full RL.


Connection to DynamICCL

DynamICCL’s PPO algorithm uses the same softmax policy + policy gradient mechanism. Understanding gradient bandits provides intuition for why baselines (value functions in actor-critic) are essential — they reduce variance in policy gradient estimates. In the NCCL tuning context, the “preference” H_t corresponds to the policy logits over NCCL algorithm choices.