The k-Armed Bandit Problem and Action Values
Chapter 2 — Multi-Armed Bandits
Book: Reinforcement Learning: An Introduction (Sutton & Barto, 2nd ed.), pp. 25–36
The k-Armed Bandit Problem
You repeatedly face a choice among k different options (actions). After each choice you receive a numerical reward chosen from a stationary probability distribution that depends on the selected action. Your objective: maximize total reward over some time period T.
Named after a slot machine (“one-armed bandit”); k machines = k-armed bandit.
Why study this? Bandits capture the exploration-exploitation tradeoff in its purest form — no state transitions, just repeated action selection. Understanding bandits provides foundation for full RL.
Formal Definition
k actions: a ∈ {1, 2, ..., k}
True value of action a: q*(a) = E[R_t | A_t = a]
Optimal value: q*(a*) = max_a q*(a)
Optimal action: a* = argmax_a q*(a)
The agent does not know q*(a) — must estimate from experience.
Action-Value Estimates
Sample-Mean Estimator
Q_t(a) = (sum of rewards when a taken before t) / (number of times a taken before t)
= [Σ_{i=1}^{t-1} R_i · 1[A_i = a]] / [Σ_{i=1}^{t-1} 1[A_i = a]]
By the law of large numbers: Q_t(a) → q*(a) as number of pulls of a → ∞.
Notation: Q_t(a) is the estimate at time t; q*(a) is the true (unknown) value.
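The sample-mean formula above can be sketched directly. This is a minimal illustration (the function name and the `(action, reward)` history format are my own, not from the book); actions never taken get a default estimate of 0:

```python
def sample_mean_estimates(history, k):
    """Compute Q_t(a) as the mean of rewards received when action a
    was taken (default 0.0 for actions never taken)."""
    totals = [0.0] * k
    counts = [0] * k
    for action, reward in history:
        totals[action] += reward
        counts[action] += 1
    return [totals[a] / counts[a] if counts[a] > 0 else 0.0
            for a in range(k)]

# Action 0 tried three times (rewards 1, 2, 3), action 1 once (reward 5)
history = [(0, 1.0), (0, 2.0), (1, 5.0), (0, 3.0)]
print(sample_mean_estimates(history, k=2))  # [2.0, 5.0]
```

As the number of pulls of each arm grows, these estimates converge to q*(a) by the law of large numbers.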
Regret Framework
Regret = opportunity loss from not always playing the optimal action:
Regret_T = Σ_{t=1}^{T} [q*(a*) - q*(A_t)]
E[Regret_T] = Σ_a Δ_a · E[N_T(a)]
where:
- Δ_a = q*(a*) - q*(a) = “gap” for action a (0 for the optimal action)
- N_T(a) = number of times action a has been selected by time T
Goal: minimize regret ≡ maximize total reward.
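Both forms of the regret (the per-step sum and the gap decomposition) are easy to check numerically. A minimal sketch, with an illustrative `regret` helper of my own naming:

```python
def regret(q_star, actions):
    """Cumulative regret of an action sequence against the best arm.

    q_star  : list of true action values q*(a)
    actions : list of selected actions A_1..A_T
    """
    best = max(q_star)
    # Direct definition: sum of per-step gaps q*(a*) - q*(A_t)
    direct = sum(best - q_star[a] for a in actions)
    # Gap decomposition: sum over arms of Delta_a * N_T(a)
    decomposed = sum((best - q_star[a]) * actions.count(a)
                     for a in set(actions))
    assert abs(direct - decomposed) < 1e-9  # the two forms agree
    return direct

# Arms with values 1, 5, 3; the sequence pulls arms 0, 2, 2, 1
print(regret([1.0, 5.0, 3.0], [0, 2, 2, 1]))  # 4 + 2 + 2 + 0 = 8.0
```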
Lower bound (Lai & Robbins, 1985):
For every suboptimal arm a:
lim inf_{T→∞} E[N_T(a)] / log(T) ≥ 1 / KL(p_a ‖ p_{a*})
where p_a denotes the reward distribution of arm a. Any consistent algorithm must pull each suboptimal arm Ω(log T) times; logarithmic regret is unavoidable.
Exploration vs. Exploitation: The Core Tension
- Exploit: choose A_t = argmax_a Q_t(a) (greedy) — maximize immediate expected reward
- Explore: choose a non-greedy A_t — reduce uncertainty about q*(a), potentially improving future decisions
Why you can’t always exploit: if you’ve only tried arm 1 a few times, Q_t(1) may be a poor estimate. Exploiting prematurely locks in a suboptimal arm.
Why you can’t always explore: pure exploration wastes pulls on clearly inferior arms and never cashes in on what has been learned.
Balancing this tradeoff is the central theme of bandit algorithms — and all of RL.
Greedy Action Selection
A_t = argmax_a Q_t(a)
Problems:
1. If the argmax is tied, a tie-breaking rule is needed (usually random)
2. Never explores — can lock onto a suboptimal arm if initial estimates are wrong
3. Permanently ignores actions whose initial estimates happen to be low
Example: 3-armed bandit with q*(1)=1, q*(2)=5, q*(3)=3. If the initial estimates are Q_1 = [6, 2, 4], greedy pulls arm 1 until its inflated estimate decays, then settles on arm 3; arm 2’s estimate never rises above 2, so greedy never discovers that arm 2 is optimal.
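The failure mode in this example can be reproduced in a few lines. This is a minimal sketch with deterministic rewards (noise omitted so the trap is visible; arms are 0-indexed here, so the optimal arm is index 1):

```python
q_star = [1.0, 5.0, 3.0]       # true values; index 1 (q* = 5) is optimal
Q = [6.0, 2.0, 4.0]            # wrong initial estimates
N = [1, 1, 1]                  # count each initial estimate as one sample
picks = [0, 0, 0]

for t in range(1000):
    a = Q.index(max(Q))        # greedy selection, ties broken by lowest index
    r = q_star[a]              # deterministic reward (noise omitted)
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]  # incremental sample-mean update
    picks[a] += 1

print(picks)  # [2, 0, 998]: the optimal arm is never pulled
```

Arm 0’s estimate decays toward 1, arm 2’s stays above 3, and arm 1’s estimate sits frozen at 2, so greedy never touches the best arm.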
The Nonstationary Case
In many real problems, q*(a) changes over time (e.g., network conditions in DynamICCL).
For nonstationary bandits:
- The sample mean (equal weighting) is suboptimal — old observations count too much
- Recency weighting is needed: exponential moving average (covered in file 2)
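The recency-weighted idea, in its simplest form, is a constant step-size update Q ← Q + α(R − Q), which weights recent rewards exponentially more. A minimal sketch (details are left to file 2; the step size α = 0.1 is illustrative):

```python
def update(q, r, alpha=0.1):
    # Constant step size: recent rewards get exponentially more weight,
    # so the estimate can track a drifting q*(a).
    return q + alpha * (r - q)

q = 0.0
for r in [1.0, 1.0, 1.0]:   # the arm's true value has shifted up to ~1
    q = update(q, r)
print(round(q, 3))  # 0.271: moving toward 1, unlike a frozen old average
```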
Worked Example: 10-Armed Testbed
S&B’s standard benchmark:
- k = 10 actions
- q*(a) drawn from N(0, 1) for each problem instance
- Actual reward R_t ~ N(q*(A_t), 1) (Gaussian noise)
Baseline performance:
- Best possible (knows q*): average reward ≈ 1.55 (expected max of 10 standard normals)
- Pure random: average reward ≈ 0 (the mean of the q*(a) values)
- ε-greedy performance tested in file 2
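Both baselines can be estimated by Monte Carlo over freshly drawn testbed instances. A minimal sketch (function names are my own; the run count of 2000 matches S&B’s usual averaging but is otherwise arbitrary):

```python
import random

random.seed(1)

def make_testbed(k=10):
    """One instance of the 10-armed testbed: q*(a) ~ N(0, 1)."""
    return [random.gauss(0.0, 1.0) for _ in range(k)]

# Average the two baselines over many independent problem instances
n_runs = 2000
best, rand = 0.0, 0.0
for _ in range(n_runs):
    q_star = make_testbed()
    best += max(q_star)                   # oracle that knows q*
    rand += q_star[random.randrange(10)]  # uniformly random arm
print(round(best / n_runs, 2), round(rand / n_runs, 2))
# sample estimates of the two baselines (close to 1.55 and 0)
```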
Connection to DynamICCL
In DynamICCL, the “bandit” setting arises at decision boundaries: given a fixed network state, should we exploit the current best-known NCCL configuration or explore alternatives? The k-armed bandit captures this sub-problem within the full RL formulation. Understanding bandit solutions motivates UCB (Ch.2) and posterior sampling approaches used in exploration-heavy RL deployments.