Generalized Advantage Estimation and TD(λ) Preview

Chapter 7 — n-Step Bootstrapping

Book: Reinforcement Learning: An Introduction (Sutton & Barto, 2nd ed.), pages 155–165 (preview; full treatment in Ch.12)


The λ-Return

Instead of picking a single n, use a weighted average of all n-step returns:

G_t^λ = (1-λ) Σ_{n=1}^{∞} λ^{n-1} G_t^{(n)}

Weights: (1-λ), (1-λ)λ, (1-λ)λ², … (these sum to 1 for λ < 1)

Interpretation: λ controls how much we weight short-horizon vs long-horizon returns:
- λ = 0: G_t^0 = G_t^{(1)} = R_{t+1} + γV(S_{t+1}) — pure TD(0)
- λ = 1: G_t^1 = G_t^{(T)} = G_t — pure MC
- λ = 0.9: exponentially decaying weights; ~10-step effective horizon

Update rule (forward view of TD(λ)):

V(S_t) ← V(S_t) + α [G_t^λ - V(S_t)]

Problem: G_t^λ requires future rewards (like MC) — must wait until episode end to compute.
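A minimal forward-view sketch makes the problem concrete, assuming a complete episode is already in hand (rewards and values are hypothetical arrays with rewards[t] = R_{t+1}, values[t] = V(S_t), and V(S_T) = 0 at termination); note that nothing can be computed until the episode is over:

import numpy as np

def lambda_return_forward(rewards, values, gamma=0.99, lam=0.9):
    # Episodic form: G_t^λ = (1-λ) Σ_{n<T-t} λ^{n-1} G_t^{(n)} + λ^{T-t-1} G_t
    T = len(rewards)
    G = np.zeros(T)
    for t in range(T):
        n_step, total, weight = 0.0, 0.0, 1.0
        for n in range(1, T - t + 1):
            n_step += gamma ** (n - 1) * rewards[t + n - 1]   # rewards part of G_t^{(n)}
            if n < T - t:
                bootstrap = gamma ** n * values[t + n]        # cut off with V(S_{t+n})
                total += (1 - lam) * weight * (n_step + bootstrap)
                weight *= lam                                 # weight is λ^{n-1} at step n
            else:
                total += weight * n_step    # residual weight λ^{T-t-1} on the full MC return
        G[t] = total
    return G

This is written O(T²) for clarity; the backward recursion shown later in the GAE section recovers the same G_t^λ (as advantages + values) in O(T).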

Equivalent backward view (eligibility traces):

e_t(s) = γλ e_{t-1}(s) + 1[S_t = s]    ← update trace
V(s) ← V(s) + α δ_t e_t(s)              ← all states simultaneously

This is TD(λ). It produces the same updates as the forward view when the updates are accumulated and applied offline at episode end; applied online, step by step, it approximates the forward view (the Ch.12 derivation makes the equivalence precise).
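A tabular sketch of one episode of backward-view TD(λ) with accumulating traces (assumed setup: states are integers indexing V, states[t] = S_t, rewards[t] = R_{t+1}, and the episode terminates after the last reward, so the final bootstrap is 0):

import numpy as np

def td_lambda_episode(V, states, rewards, alpha=0.1, gamma=0.99, lam=0.9):
    e = np.zeros_like(V)                              # eligibility trace per state
    for t in range(len(rewards)):
        s = states[t]
        next_v = V[states[t + 1]] if t + 1 < len(states) else 0.0
        delta = rewards[t] + gamma * next_v - V[s]    # δ_t, the 1-step TD error
        e *= gamma * lam                              # decay every trace
        e[s] += 1.0                                   # bump the trace of the visited state
        V += alpha * delta * e                        # one δ_t updates all states at once
    return V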


Generalized Advantage Estimation (GAE)

GAE (Schulman et al., 2015) applies the λ-return idea to advantage estimation for policy gradient methods:

Advantage function:

A^π(s_t, a_t) = Q^π(s_t, a_t) - V^π(s_t)

TD residuals (δ_t):

δ_t = R_{t+1} + γV(S_{t+1}) - V(S_t)    ← the 1-step TD error

Note: δ_t is an estimator of A^π(S_t, A_t) — unbiased if V = V^π.

n-Step advantage:

Â_t^{(n)} = Σ_{l=0}^{n-1} γ^l δ_{t+l} = G_t^{(n)} - V(S_t)

GAE(λ) = λ-weighted average of n-step advantages:

Â_t^{GAE(λ)} = Σ_{l=0}^{∞} (γλ)^l δ_{t+l}
              = δ_t + γλ δ_{t+1} + (γλ)² δ_{t+2} + ...

Key property: this is a geometrically weighted sum of TD errors — exponentially down-weights distant steps.
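A brute-force sketch of this definition, assuming a hypothetical array deltas of TD errors (the sum truncates at the trajectory end, i.e. δ_{t+l} = 0 beyond it):

import numpy as np

def gae_direct(deltas, gamma=0.99, lam=0.95):
    # Â_t = Σ_l (γλ)^l δ_{t+l}, computed term by term for clarity
    T = len(deltas)
    decay = (gamma * lam) ** np.arange(T)     # (γλ)^0, (γλ)^1, ...
    return np.array([np.sum(decay[:T - t] * deltas[t:]) for t in range(T)])

Given the same δ's, the O(T) backward recursion in the "Practical Computation" section below produces the same values.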


GAE Derivation

Proof that GAE = λ-return:

G_t^{(n)} - V(S_t) = Σ_{l=0}^{n-1} γ^l δ_{t+l}

(Proof: expand and telescope. G_t^{(n)} - V(S_t) = [R_{t+1} + γV(S_{t+1}) - V(S_t)] + γ[R_{t+2} + γV(S_{t+2}) - V(S_{t+1})] + …; each intermediate V(S_{t+l}) appears once with a + and once with a −, so only the δ terms survive, giving Σ_{l=0}^{n-1} γ^l δ_{t+l}.)

Therefore:

Â_t^{GAE(λ)} = Σ_{l=0}^{∞} (γλ)^l δ_{t+l}
              = (1-λ) Σ_{n=1}^{∞} λ^{n-1} G_t^{(n)} - V(S_t)
              = G_t^λ - V(S_t)

GAE = λ-return minus value baseline = variance-reduced advantage estimate.
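A quick numerical check of the telescoping identity on random data (all arrays hypothetical; values[t+n] plays the role of the bootstrap V(S_{t+n})):

import numpy as np

rng = np.random.default_rng(0)
gamma, t, n = 0.99, 2, 5
rewards = rng.normal(size=12)     # rewards[k] = R_{k+1}
values = rng.normal(size=13)      # values[k] = V(S_k)

# left side: G_t^{(n)} - V(S_t)
G_n = sum(gamma**l * rewards[t + l] for l in range(n)) + gamma**n * values[t + n]
lhs = G_n - values[t]

# right side: Σ_{l=0}^{n-1} γ^l δ_{t+l}
deltas = rewards + gamma * values[1:] - values[:-1]
rhs = sum(gamma**l * deltas[t + l] for l in range(n))

assert np.isclose(lhs, rhs)       # the intermediate V(S_{t+l}) terms cancel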


Practical Computation of GAE

For a finite trajectory of length T:

Â_{T-1} = δ_{T-1}              ← last step: no future
Â_{T-2} = δ_{T-2} + γλ Â_{T-1}
Â_{T-3} = δ_{T-3} + γλ Â_{T-2}
...
Â_t     = δ_t + γλ Â_{t+1}     ← backward recursion

Pseudocode:

import numpy as np

def compute_gae(rewards, values, last_value, done, gamma=0.99, lam=0.95):
    # rewards[t] = R_{t+1}, values[t] = V(S_t) for t = 0..T-1
    # last_value = V(S_T) bootstrap; done = True if the trajectory ended at step T
    T = len(rewards)
    advantages = np.zeros(T)
    last_gae = 0.0
    for t in reversed(range(T)):
        if t == T - 1:
            next_value = 0.0 if done else last_value   # no bootstrap past a terminal state
        else:
            next_value = values[t + 1]
        delta = rewards[t] + gamma * next_value - values[t]          # δ_t
        advantages[t] = last_gae = delta + gamma * lam * last_gae    # backward recursion
    returns = advantages + values    # = G_t^λ, the regression target for the critic
    return advantages, returns

This is exactly what PPO, A2C, and DynamICCL implement for advantage estimation.


Bias-Variance of GAE

Bias[Â_t^{(n)}]     ≤ γ^n ||V - V^π||_∞   (up to a baseline term at S_t that cancels in the policy gradient)
Bias[Â_t^{GAE(λ)}]  = the (1-λ)λ^{n-1}-weighted mix of the above; vanishes as λ → 1
Var[Â_t^{GAE(λ)}]   decreases as λ decreases (heavier bootstrapping, shorter effective horizon)

λ controls bias-variance tradeoff:

λ      Bias                    Variance      When to use
0      High (1-step TD only)   Low           When V is accurate; short horizon
0.9    Medium                  Medium        Most practical settings
0.95   Low                     Medium-high   Long-horizon tasks
1.0    Zero                    High (MC)     When V inaccurate; short episodes

PPO defaults: γ=0.99, λ=0.95. This gives ~20-step effective horizon (1/(1-0.95)=20) — captures multi-step credit assignment without excessive variance.
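The "~20 steps" figure can be checked directly from the weights (plain arithmetic on the definitions above, no extra assumptions):

import numpy as np

lam = 0.95
n = np.arange(1, 500)
w = (1 - lam) * lam ** (n - 1)     # weight on the n-step return in G_t^λ
print(w.sum())                     # ≈ 1.0: the weights are normalized
print((w * n).sum())               # ≈ 20.0: mean lookahead = 1/(1-λ)
print(np.searchsorted(np.cumsum(w), 0.95) + 1)   # = 59: steps holding 95% of the weight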


Unified Picture: All Return Estimates

G_t^{(n)}:   n=1 (TD)    n=2 (2-step)   n=3 (3-step)   ...   n=∞ (MC)
G_t^λ:       λ=0 (TD)    λ=0.5          λ=0.9          ...   λ=1 (MC)

The λ-return (and GAE) is the continuous dial between pure TD and pure MC.

TD(λ) / GAE(λ,γ) subsumes everything (the two endpoint reductions are checked in the sketch below):
- λ = 1, γ = 1: REINFORCE (undiscounted MC policy gradient)
- λ = 0: one-step TD actor-critic
- λ = 0.95, γ = 0.99: PPO defaults
- λ = 1, γ = 0.99: roughly MC with discounting
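The endpoint reductions can be sanity-checked against the compute_gae sketch above (toy numbers; done=True, so the final step is terminal):

import numpy as np

rewards = np.array([1.0, 0.5, 2.0])
values = np.array([0.3, 0.1, 0.4])

# λ = 0: Â_t collapses to the 1-step TD error δ_t
adv0, _ = compute_gae(rewards, values, last_value=0.0, done=True, gamma=0.99, lam=0.0)
deltas = rewards + 0.99 * np.append(values[1:], 0.0) - values
assert np.allclose(adv0, deltas)

# λ = 1: Â_t collapses to the MC advantage G_t - V(S_t)
adv1, _ = compute_gae(rewards, values, last_value=0.0, done=True, gamma=0.99, lam=1.0)
G = np.array([rewards[t:] @ 0.99 ** np.arange(len(rewards) - t) for t in range(len(rewards))])
assert np.allclose(adv1, G - values)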


Connection to DynamICCL

DynamICCL uses PPO with GAE(λ=0.95, γ=0.99):

Advantage for NCCL step t:
  Â_t = δ_t + 0.99·0.95·δ_{t+1} + (0.99·0.95)²·δ_{t+2} + ...
       ≈ Σ_{l=0}^{~20} (0.9405)^l δ_{t+l}

Where δ_t = R_{t+1} + γV(S_{t+1}) - V(S_t), with R_{t+1} the measured throughput at step t+1; informally, realized throughput minus the critic's prediction.

This means DynamICCL’s advantage signal captures approximately 20 steps of future NCCL performance when evaluating each configuration decision — appropriate for capturing how a configuration choice affects throughput over a multi-step scheduling window.
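A toy illustration of the signal described above, reusing the compute_gae sketch (the throughput numbers and critic values are invented for illustration, not taken from DynamICCL):

import numpy as np

rewards = np.array([42.0, 44.5, 43.8, 47.2])   # hypothetical measured throughput per step
values  = np.array([43.0, 43.5, 44.0, 45.0])   # hypothetical critic predictions V(S_t)

adv, _ = compute_gae(rewards, values, last_value=45.5, done=False, gamma=0.99, lam=0.95)
print(adv)   # positive entries: the config choice beat the critic's expectation over the window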