ε-Greedy Action Selection and Incremental Updates

Chapter 2 — Multi-Armed Bandits

Book: Reinforcement Learning: An Introduction (Sutton & Barto, 2nd ed) Pages: 27–33


ε-Greedy Algorithm

Simplest effective method for balancing exploration and exploitation:

With probability (1-ε): A_t = argmax_a Q_t(a)     ← exploit
With probability ε:     A_t = uniform random action ← explore

Properties:
- ε = 0: pure greedy (no exploration)
- ε = 1: pure random (no exploitation)
- ε = 0.1: explore 10% of the time, exploit 90%

Convergence guarantee: with stationary reward distributions, ε-greedy reaches near-optimal behavior because every action is tried infinitely often, so Q_t(a) → q*(a) for all a.

Asymptotically: fraction of time optimal action chosen → (1-ε) + ε/k = 1 - ε(k-1)/k
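
A minimal Python sketch of the selection rule (the names Q, eps, and rng are illustrative, not from the book):

import numpy as np

def epsilon_greedy(Q, eps, rng):
    """Exploit argmax Q with probability 1-eps, otherwise explore uniformly."""
    if rng.random() < eps:
        return rng.integers(len(Q))                  # explore: uniform over all k arms
    return rng.choice(np.flatnonzero(Q == Q.max()))  # exploit: argmax, ties broken randomly

# Check of the asymptotic fraction: for k = 10, eps = 0.1,
# (1 - 0.1) + 0.1/10 = 0.91 of steps pick the optimal action once Q has converged.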

Performance on 10-Armed Testbed

ε            Early performance         Asymptotic performance
0 (greedy)   Best initially            Worst (gets stuck on a suboptimal arm)
0.01         Slow to find the best     Nearly optimal (~99% optimal actions)
0.1          Finds the best fastest    Suboptimal (10% of steps spent exploring)

Lesson: ε = 0.1 learns faster; ε = 0.01 achieves better asymptotic reward. Which is better is problem-dependent.

Adaptive ε: Annealing

ε_t = ε_0 / t   (or ε_0 / log t, or any schedule → 0 slowly enough that every action is still tried infinitely often)

Explore heavily early, exploit more as confidence grows. Schedules with provably low regret exist; alternatively, UCB achieves Θ(log T) regret without any explicit ε schedule (see file 3).
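
A small sketch of an annealed schedule (eps0 and the floor eps_min are illustrative choices, not prescribed by the book):

def annealed_epsilon(t, eps0=1.0, eps_min=0.01):
    """eps_t = eps0 / t, clipped below at eps_min; t starts at 1."""
    return max(eps0 / t, eps_min)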


Incremental Update Rule

Problem: storing all rewards and recomputing Q_t(a) = mean is O(n) memory and time.

Solution: update incrementally:

Derivation

Let Q_n denote the estimate after action a has been selected n-1 times, i.e., the average of its first n-1 rewards:

Q_n = (R_1 + R_2 + ... + R_{n-1}) / (n-1)

We receive a new reward R_n. The new estimate:

Q_{n+1} = (R_1 + ... + R_{n-1} + R_n) / n
         = (n-1)/n · Q_n + (1/n) · R_n
         = Q_n + (1/n)(R_n - Q_n)

General form:

Q_{n+1} = Q_n + α_n (R_n - Q_n)

where:
- α_n = 1/n gives the sample mean (stepsize shrinks over time)
- (R_n - Q_n) is the prediction error (the TD-error analogue): how much the new reward surprised us
- Q_n is the old estimate; Q_n + α·(error) is the updated estimate

This is the fundamental update pattern of RL. The term (R_n - Q_n) is the target minus the current estimate.
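
A quick numerical check that the incremental rule with α_n = 1/n reproduces the batch sample mean (the reward values are arbitrary, for illustration only):

import numpy as np

rewards = np.array([1.0, 0.0, 2.0, 1.5, 0.5])   # arbitrary rewards for one action

Q = 0.0
for n, R in enumerate(rewards, start=1):
    Q = Q + (1.0 / n) * (R - Q)                  # incremental update, alpha_n = 1/n

assert abs(Q - rewards.mean()) < 1e-12           # matches the batch sample mean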

Pseudocode: ε-Greedy Bandit

Initialize:
  Q(a) = 0 for all a ∈ {1,...,k}
  N(a) = 0 for all a ∈ {1,...,k}

Loop for each step t = 1, 2, ...:
  With prob (1-ε): A = argmax_a Q(a)   [ties broken randomly]
  With prob ε:     A = random action from {1,...,k}

  Observe reward R from environment

  N(A) ← N(A) + 1
  Q(A) ← Q(A) + (1/N(A)) · (R - Q(A))    ← incremental update

Space: O(k). Time per step: O(k) for argmax.
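
The pseudocode translates directly to Python. Below is a sketch on a simulated stationary testbed; the Gaussian reward model and all parameter values are assumptions chosen to mirror the 10-armed testbed, not code from the book:

import numpy as np

def run_bandit(k=10, eps=0.1, steps=1000, seed=0):
    rng = np.random.default_rng(seed)
    q_star = rng.normal(0.0, 1.0, k)        # true action values (hidden from the agent)
    Q = np.zeros(k)                         # value estimates
    N = np.zeros(k, dtype=int)              # action counts
    rewards = np.zeros(steps)

    for t in range(steps):
        if rng.random() < eps:              # explore
            A = rng.integers(k)
        else:                               # exploit, ties broken randomly
            A = rng.choice(np.flatnonzero(Q == Q.max()))
        R = rng.normal(q_star[A], 1.0)      # reward ~ N(q*(A), 1)
        N[A] += 1
        Q[A] += (R - Q[A]) / N[A]           # incremental sample-mean update
        rewards[t] = R
    return Q, N, rewards

Q, N, rewards = run_bandit()                # with eps = 0.1 the best arm is usually found quickly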


Nonstationary Problems: Constant Stepsize

For nonstationary q*(a) (values change over time), use constant stepsize α ∈ (0, 1]:

Q_{n+1} = Q_n + α(R_n - Q_n)
         = (1-α)Q_n + αR_n

Unrolling the recursion:

Q_{n+1} = α R_n + (1-α) α R_{n-1} + (1-α)² α R_{n-2} + ... + (1-α)^n Q_1
         = (1-α)^n Q_1 + Σ_{i=1}^{n} α(1-α)^{n-i} R_i

This is an exponentially weighted moving average (EWMA) — recent rewards count more:

Weight on R_i = α(1-α)^{n-i}

Weight on Q_1 = (1-α)^n → 0 as n → ∞

Since 0 < (1-α) < 1, weights decay geometrically. Half-life ≈ log(0.5)/log(1-α) steps.
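
A small numerical check (α and n are arbitrary example values) that the reward weights plus the weight on Q_1 form a convex combination, and of the half-life formula:

import numpy as np

alpha, n = 0.1, 50
weights = alpha * (1 - alpha) ** (n - np.arange(1, n + 1))    # weight on R_i, i = 1..n
assert abs(weights.sum() + (1 - alpha) ** n - 1.0) < 1e-12    # all weights sum to 1

half_life = np.log(0.5) / np.log(1 - alpha)                   # ≈ 6.6 steps for alpha = 0.1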

Convergence Conditions for Stepsize α_n

Robbins-Monro conditions for convergence:

Σ_{n=1}^{∞} α_n = ∞     (steps are large enough to overcome initial conditions)
Σ_{n=1}^{∞} α_n² < ∞    (steps become small enough to converge)

Stepsize      Σ α_n = ∞?    Σ α_n² < ∞?   Use case
α_n = 1/n     Yes           Yes           Stationary (sample mean); satisfies both conditions
α = const     Yes           No (= ∞)      Nonstationary (EWMA); never fully converges, keeps tracking
α_n = 1/n²    No (sum < ∞)  Yes           Academic example: fails the first condition, steps shrink too fast to overcome initial conditions

Practical takeaway: constant α is preferred in RL even for stationary problems because environments often drift slightly. α = 0.1 is a common choice.
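
To see why constant α is preferred when values drift, the sketch below tracks a slowly moving q* with both stepsizes; the random-walk drift model and all constants are assumptions for illustration:

import numpy as np

rng = np.random.default_rng(0)
q_true, Q_mean, Q_ewma, alpha = 0.0, 0.0, 0.0, 0.1

for n in range(1, 10_001):
    q_true += rng.normal(0.0, 0.01)         # true value drifts as a random walk
    R = q_true + rng.normal(0.0, 1.0)       # noisy reward
    Q_mean += (R - Q_mean) / n              # sample mean: weights all history equally
    Q_ewma += alpha * (R - Q_ewma)          # EWMA: emphasizes recent rewards

# After the drift, |Q_ewma - q_true| is typically much smaller than |Q_mean - q_true|.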


Bias in Initial Estimates

With constant α, initial Q_1 bias decays exponentially:

Q_{n+1} = (1-α)^n Q_1 + Σ weighted rewards

After log(0.01)/log(1-α) steps (≈ -log(0.01)/α = 4.6/α for small α, using natural logs), the initial bias is reduced to 1% of its original value.

For α = 0.1: ≈ 44 steps exactly (the 4.6/α approximation gives ~46) to reduce the initial bias by 99%.
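
A two-line check of that arithmetic (natural logs; α = 0.1 as the example value):

import math

alpha = 0.1
exact = math.log(0.01) / math.log(1 - alpha)   # ≈ 43.7 steps
approx = -math.log(0.01) / alpha               # ≈ 46.1 steps (small-alpha approximation)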

Optimistic initial values exploit this bias beneficially — see file 3.


Comparison: Stationary vs Nonstationary Setting

                                 Sample mean (α = 1/n)            EWMA (constant α)
Convergence (stationary)         Yes, guaranteed                  No (fluctuates around the mean)
Tracks changes (nonstationary)   No (forgets too slowly)          Yes
Bias from Q_1                    Eliminated after first sample    Persists, decaying exponentially
DynamICCL relevance              Only if the network is constant  Preferred (network changes)

Connection to DynamICCL

In DynamICCL, network conditions change as training progresses (bandwidth contention, topology changes, varying batch sizes). EWMA with constant α is the right estimator because Q-values should track recent performance, not average over all history. The constant stepsize naturally forgets stale observations about previous network states.