ε-Greedy Action Selection and Incremental Updates
Chapter 2 — Multi-Armed Bandits
Book: Reinforcement Learning: An Introduction (Sutton & Barto, 2nd ed) Pages: 27–33
ε-Greedy Algorithm
Simplest effective method for balancing exploration and exploitation:
With probability (1-ε): A_t = argmax_a Q_t(a) ← exploit
With probability ε: A_t = uniform random action ← explore
Properties:
- ε = 0: pure greedy (no exploration)
- ε = 1: pure random (no exploitation)
- ε = 0.1: explore 10% of the time, exploit 90%
Convergence guarantee: for stationary problems, ε-greedy converges to near-optimal behavior because every action is tried infinitely often, so Q_t(a) → q*(a) for all a.
Asymptotically: fraction of time optimal action chosen → (1-ε) + ε/k = 1 - ε(k-1)/k
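Worked example from the formula above: with k = 10 and ε = 0.1, the limit is (1 - 0.1) + 0.1/10 = 0.91, so the optimal action is chosen about 91% of the time asymptotically.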
Performance on 10-Armed Testbed
| ε | Early performance | Asymptotic performance |
|---|---|---|
| 0 (greedy) | Best initially | Worst (stuck) |
| 0.01 | Slow to find best | Nearly optimal (~99% optimal) |
| 0.1 | Finds best fastest | Caps at ~91% optimal (keeps exploring) |
Lesson: ε = 0.1 improves faster early on; ε = 0.01 achieves better asymptotic reward. Which is better is problem-dependent.
Adaptive ε: Annealing
ε_t = ε_0 / t (or 1/log(t), or any schedule → 0)
Explore heavily early, exploit as confidence grows. Well-chosen schedules can achieve low regret; UCB gets O(log T) regret without any hand-tuned schedule (see file 3).
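A minimal sketch of an annealed schedule (the function name and the floor value `eps_min` are illustrative choices, not from the book):

```python
def annealed_epsilon(t, eps0=1.0, eps_min=0.01):
    """1/t annealing with a floor; eps0 and eps_min are arbitrary illustrative values."""
    return max(eps_min, eps0 / t)

# annealed_epsilon(1) -> 1.0, annealed_epsilon(100) -> 0.01 (hits the floor)
```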
Incremental Update Rule
Problem: storing all rewards and recomputing Q_t(a) as their sample mean costs O(n) memory and time per action.
Solution: update incrementally:
Derivation
Let Q_n denote the estimate after n-1 rewards from action a (the estimate in hand when the nth reward arrives):
Q_n = (R_1 + R_2 + ... + R_{n-1}) / (n-1)
We receive a new reward R_n. The new estimate:
Q_{n+1} = (R_1 + ... + R_{n-1} + R_n) / n
= (n-1)/n · Q_n + (1/n) · R_n
= Q_n + (1/n)(R_n - Q_n)
General form:
Q_{n+1} = Q_n + α_n (R_n - Q_n)
where:
- α_n = 1/n gives the sample mean (stepsize shrinks over time)
- R_n - Q_n is the prediction error (the same form as the TD error later in the book) — how much the new reward surprised us
- Q_n is the old estimate; Q_n + α · error is the updated estimate
This is the fundamental update pattern of RL. The term (R_n - Q_n) is the target minus the current estimate.
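A quick sanity check that the incremental rule reproduces the sample mean (a sketch; variable names are my own):

```python
import random

rewards = [random.gauss(0.0, 1.0) for _ in range(1000)]

q = 0.0
for n, r in enumerate(rewards, start=1):
    q += (1.0 / n) * (r - q)          # Q_{n+1} = Q_n + (1/n)(R_n - Q_n)

# incremental estimate matches the batch mean up to floating-point error
assert abs(q - sum(rewards) / len(rewards)) < 1e-9
```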
Pseudocode: ε-Greedy Bandit
```
Initialize:
    Q(a) = 0 for all a ∈ {1,...,k}
    N(a) = 0 for all a ∈ {1,...,k}
Loop for each step t = 1, 2, ...:
    With prob (1-ε): A = argmax_a Q(a)   [ties broken randomly]
    With prob ε:     A = random action from {1,...,k}
    Observe reward R from environment
    N(A) ← N(A) + 1
    Q(A) ← Q(A) + (1/N(A)) · (R - Q(A))   ← incremental update
```
Space: O(k). Time per step: O(k) for argmax.
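A runnable Python version of the pseudocode, using a stationary Gaussian testbed in the spirit of the book's 10-armed testbed (function and variable names are my own, not the book's):

```python
import random

def run_bandit(k=10, eps=0.1, steps=1000, seed=0):
    """epsilon-greedy with incremental sample-average updates on a stationary testbed."""
    rng = random.Random(seed)
    q_star = [rng.gauss(0.0, 1.0) for _ in range(k)]   # true action values
    Q = [0.0] * k                                       # value estimates
    N = [0] * k                                         # action counts

    for _ in range(steps):
        if rng.random() < eps:
            a = rng.randrange(k)                                        # explore
        else:
            best = max(Q)
            a = rng.choice([i for i, q in enumerate(Q) if q == best])   # exploit, ties broken randomly
        r = rng.gauss(q_star[a], 1.0)                   # reward ~ N(q*(a), 1)
        N[a] += 1
        Q[a] += (1.0 / N[a]) * (r - Q[a])               # incremental update
    return Q, q_star

Q, q_star = run_bandit()
```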
Nonstationary Problems: Constant Stepsize
For nonstationary q*(a) (values change over time), use constant stepsize α ∈ (0, 1]:
Q_{n+1} = Q_n + α(R_n - Q_n)
= (1-α)Q_n + αR_n
Unrolling the recursion:
Q_{n+1} = α R_n + (1-α) α R_{n-1} + (1-α)² α R_{n-2} + ... + (1-α)^n Q_1
= (1-α)^n Q_1 + Σ_{i=1}^{n} α(1-α)^{n-i} R_i
This is an exponentially weighted moving average (EWMA) — recent rewards count more:
Weight on R_i = α(1-α)^{n-i}
Weight on Q_1 = (1-α)^n → 0 as n → ∞
Since 0 < (1-α) < 1, weights decay geometrically. Half-life ≈ log(0.5)/log(1-α) steps.
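A small check of the exponential weighting (sketch): the weights on R_1,...,R_n plus the residual weight on Q_1 sum to exactly 1.

```python
alpha, n = 0.1, 50
w_rewards = [alpha * (1 - alpha) ** (n - i) for i in range(1, n + 1)]
w_q1 = (1 - alpha) ** n

# alpha * sum_{j=0}^{n-1} (1-alpha)^j + (1-alpha)^n = 1
assert abs(sum(w_rewards) + w_q1 - 1.0) < 1e-12
print(w_rewards[-1])   # weight on the newest reward R_n: alpha = 0.1
print(w_rewards[0])    # weight on the oldest reward R_1: alpha*(1-alpha)^(n-1) ≈ 0.00057
print(w_q1)            # residual weight on Q_1: (1-alpha)^n ≈ 0.0052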
Convergence Conditions for Stepsize α_n
Robbins-Monro conditions for convergence:
Σ_{n=1}^{∞} α_n = ∞ (steps are large enough to overcome initial conditions)
Σ_{n=1}^{∞} α_n² < ∞ (steps become small enough to converge)
| Stepsize | Σ α_n = ∞? | Σ α_n² < ∞? | Use case |
|---|---|---|---|
| α_n = 1/n | Yes | Yes | Stationary (sample mean) |
| α = const | Yes | No (= ∞) | Nonstationary (EWMA) |
| α_n = 1/n² | No (< ∞) | Yes | Fails the first condition — may stop learning before reaching q*(a) |
Practical takeaway: constant α is preferred in RL even for stationary problems because environments often drift slightly. α = 0.1 is a common choice.
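A small simulation of the trade-off on a drifting action value (a sketch; the random-walk drift model and all numbers are illustrative, not the book's testbed):

```python
import random

rng = random.Random(0)
q_true, q_mean, q_ewma = 0.0, 0.0, 0.0
alpha, n = 0.1, 0

for _ in range(10_000):
    q_true += rng.gauss(0.0, 0.01)          # slow random-walk drift in the true value
    r = rng.gauss(q_true, 1.0)
    n += 1
    q_mean += (1.0 / n) * (r - q_mean)      # sample average: weighs all history equally
    q_ewma += alpha * (r - q_ewma)          # constant alpha: weighs recent rewards more

print(q_true, q_mean, q_ewma)               # q_ewma typically ends up closer to q_true
```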
Bias in Initial Estimates
With constant α, initial Q_1 bias decays exponentially:
Q_{n+1} = (1-α)^n Q_1 + Σ weighted rewards
After log(0.01)/log(1-α) ≈ -log(0.01)/α = 4.6/α steps, initial bias reduced to 1% of original.
For α = 0.1: ~46 steps to reduce initial bias by 99%.
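The same arithmetic for a few step sizes (sketch; exact counts, slightly below the 4.6/α rule of thumb):

```python
import math

for alpha in (0.01, 0.1, 0.5):
    n = math.log(0.01) / math.log(1 - alpha)   # exact step count for 1% residual bias
    print(alpha, round(n))                      # e.g. ~44 for alpha = 0.1
```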
Optimistic initial values exploit this bias beneficially — see file 3.
Comparison: Stationary vs Nonstationary Setting
| | Sample Mean (α = 1/n) | EWMA (constant α) |
|---|---|---|
| Convergence (stationary) | Yes, guaranteed | No (keeps fluctuating with recent rewards) |
| Tracks changes (nonstationary) | No (forgets too slowly) | Yes |
| Initial-value bias | Gone after the first update (α_1 = 1) | Decays exponentially, never exactly 0 |
| DynamICCL relevance | Only appropriate if network conditions are static | Preferred (network conditions change) |
Connection to DynamICCL
In DynamICCL, network conditions change as training progresses (bandwidth contention, topology changes, varying batch sizes). EWMA with constant α is the right estimator because Q-values should track recent performance, not average over all history. The constant stepsize naturally forgets stale observations about previous network states.