The k-Armed Bandit Problem and Action Values
Chapter 2 — Multi-Armed Bandits
Book: Reinforcement Learning: An Introduction (Sutton & Barto, 2nd ed.), pp. 25–36
The k-Armed Bandit Problem
You repeatedly face a choice among k different options (actions). After each choice you receive a numerical reward chosen from a stationary probability distribution that depends on the selected action. Your objective: maximize total reward over some time period T.
Named after a slot machine (“one-armed bandit”); k machines = k-armed bandit.
Why study this? Bandits capture the exploration-exploitation tradeoff in its purest form — no state transitions, just repeated action selection. Understanding bandits provides foundation for full RL.
Formal Definition
k actions: a ∈ {1, 2, ..., k}
True value of action a: q*(a) = E[R_t | A_t = a]
Optimal value: q*(a*) = max_a q*(a)
Optimal action: a* = argmax_a q*(a)
The agent does not know q*(a) — must estimate from experience.
Action-Value Estimates
Sample-Mean Estimator
Q_t(a) = (sum of rewards when a taken before t) / (number of times a taken before t)
= [Σ_{i=1}^{t-1} R_i · 1[A_i = a]] / [Σ_{i=1}^{t-1} 1[A_i = a]]
By the law of large numbers: Q_t(a) → q*(a) as number of pulls of a → ∞.
Notation: Q_t(a) is the estimate at time t; q*(a) is the true (unknown) value.
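The sample-mean formula above can be sketched directly. This is a minimal illustration (the function name and the `(action, reward)` history format are my own, not from the book); actions never taken get a default estimate of 0:

```python
def sample_mean_estimates(history, k):
    """Compute Q_t(a) as the mean of rewards received when action a
    was taken (default 0.0 for actions never taken)."""
    totals = [0.0] * k
    counts = [0] * k
    for action, reward in history:
        totals[action] += reward
        counts[action] += 1
    return [totals[a] / counts[a] if counts[a] > 0 else 0.0
            for a in range(k)]

# Action 0 tried three times (rewards 1, 2, 3), action 1 once (reward 5)
history = [(0, 1.0), (0, 2.0), (1, 5.0), (0, 3.0)]
print(sample_mean_estimates(history, k=2))  # [2.0, 5.0]
```

As the number of pulls of each arm grows, these estimates converge to q*(a) by the law of large numbers.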
Regret Framework
Regret = opportunity loss from not always playing the optimal action:
Regret_T = Σ_{t=1}^{T} [q*(a*) - q*(A_t)]
E[Regret_T] = Σ_a Δ_a · E[N_T(a)]
where:
- Δ_a = q*(a*) - q*(a) = “gap” for action a (0 for the optimal action)
- N_T(a) = number of times action a has been selected by time T
Goal: minimize regret ≡ maximize total reward.
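Both forms of the regret (the per-step sum and the gap decomposition) are easy to check numerically. A minimal sketch, with an illustrative `regret` helper of my own naming:

```python
def regret(q_star, actions):
    """Cumulative regret of an action sequence against the best arm.

    q_star  : list of true action values q*(a)
    actions : list of selected actions A_1..A_T
    """
    best = max(q_star)
    # Direct definition: sum of per-step gaps q*(a*) - q*(A_t)
    direct = sum(best - q_star[a] for a in actions)
    # Gap decomposition: sum over arms of Delta_a * N_T(a)
    decomposed = sum((best - q_star[a]) * actions.count(a)
                     for a in set(actions))
    assert abs(direct - decomposed) < 1e-9  # the two forms agree
    return direct

# Arms with values 1, 5, 3; the sequence pulls arms 0, 2, 2, 1
print(regret([1.0, 5.0, 3.0], [0, 2, 2, 1]))  # 4 + 2 + 2 + 0 = 8.0
```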
Lower bound (Lai & Robbins, 1985):
For every suboptimal arm a:
lim inf_{T→∞} E[N_T(a)] / log(T) ≥ 1 / KL(p_a ‖ p_{a*})
where p_a denotes the reward distribution of arm a. Any consistent algorithm must pull each suboptimal arm Ω(log T) times; logarithmic regret is unavoidable.
Exploration vs. Exploitation: The Core Tension
- Exploit: choose A_t = argmax_a Q_t(a) (greedy) — maximize immediate expected reward
- Explore: choose a non-greedy A_t — reduce uncertainty about q*(a), potentially improving future decisions
Why you can’t always exploit: if you’ve only tried arm 1 a few times, Q_t(1) may be a poor estimate. Exploiting prematurely locks in a suboptimal arm.
Why you can’t always explore: pure exploration wastes pulls on clearly inferior arms and never cashes in on what has been learned.
Balancing this tradeoff is the central theme of bandit algorithms — and all of RL.
Greedy Action Selection
A_t = argmax_a Q_t(a)
Problems:
1. If the argmax is tied, a tie-breaking rule is needed (usually random)
2. Never explores — can lock onto a suboptimal arm if initial estimates are wrong
3. Permanently ignores actions whose initial estimates happen to be low
Example: 3-armed bandit with q*(1)=1, q*(2)=5, q*(3)=3. If the initial estimates are Q_1 = [6, 2, 4], greedy pulls arm 1 until its inflated estimate decays, then settles on arm 3; arm 2’s estimate never rises above 2, so greedy never discovers that arm 2 is optimal.
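The failure mode in this example can be reproduced in a few lines. This is a minimal sketch with deterministic rewards (noise omitted so the trap is visible; arms are 0-indexed here, so the optimal arm is index 1):

```python
q_star = [1.0, 5.0, 3.0]       # true values; index 1 (q* = 5) is optimal
Q = [6.0, 2.0, 4.0]            # wrong initial estimates
N = [1, 1, 1]                  # count each initial estimate as one sample
picks = [0, 0, 0]

for t in range(1000):
    a = Q.index(max(Q))        # greedy selection, ties broken by lowest index
    r = q_star[a]              # deterministic reward (noise omitted)
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]  # incremental sample-mean update
    picks[a] += 1

print(picks)  # [2, 0, 998]: the optimal arm is never pulled
```

Arm 0’s estimate decays toward 1, arm 2’s stays above 3, and arm 1’s estimate sits frozen at 2, so greedy never touches the best arm.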
The Nonstationary Case
In many real problems, q*(a) changes over time (e.g., network conditions in DynamICCL).
For nonstationary bandits:
- The sample mean (equal weighting) is suboptimal — old observations count too much
- Recency weighting is needed: exponential moving average (covered in file 2)
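The recency-weighted idea, in its simplest form, is a constant step-size update Q ← Q + α(R − Q), which weights recent rewards exponentially more. A minimal sketch (details are left to file 2; the step size α = 0.1 is illustrative):

```python
def update(q, r, alpha=0.1):
    # Constant step size: recent rewards get exponentially more weight,
    # so the estimate can track a drifting q*(a).
    return q + alpha * (r - q)

q = 0.0
for r in [1.0, 1.0, 1.0]:   # the arm's true value has shifted up to ~1
    q = update(q, r)
print(round(q, 3))  # 0.271: moving toward 1, unlike a frozen old average
```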
Worked Example: 10-Armed Testbed
S&B’s standard benchmark:
- k = 10 actions
- q*(a) drawn from N(0, 1) for each problem instance
- Actual reward R_t ~ N(q*(A_t), 1) (Gaussian noise)
Baseline performance:
- Best possible (knows q*): average reward ≈ 1.55 (expected max of 10 standard normals)
- Pure random: average reward ≈ 0 (the mean of the q*(a) values)
- ε-greedy performance tested in file 2
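Both baselines can be estimated by Monte Carlo over freshly drawn testbed instances. A minimal sketch (function names are my own; the run count of 2000 matches S&B’s usual averaging but is otherwise arbitrary):

```python
import random

random.seed(1)

def make_testbed(k=10):
    """One instance of the 10-armed testbed: q*(a) ~ N(0, 1)."""
    return [random.gauss(0.0, 1.0) for _ in range(k)]

# Average the two baselines over many independent problem instances
n_runs = 2000
best, rand = 0.0, 0.0
for _ in range(n_runs):
    q_star = make_testbed()
    best += max(q_star)                   # oracle that knows q*
    rand += q_star[random.randrange(10)]  # uniformly random arm
print(round(best / n_runs, 2), round(rand / n_runs, 2))
# sample estimates of the two baselines (close to 1.55 and 0)
```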
Connection to DynamICCL
In DynamICCL, the “bandit” setting arises at decision boundaries: given a fixed network state, should we exploit the current best-known NCCL configuration or explore alternatives? The k-armed bandit captures this sub-problem within the full RL formulation. Understanding bandit solutions motivates UCB (Ch.2) and posterior sampling approaches used in exploration-heavy RL deployments.