Generalized Advantage Estimation and TD(λ) Preview
Chapter 7 — n-Step Bootstrapping
Book: Reinforcement Learning: An Introduction (Sutton & Barto, 2nd ed) Pages: 155–165 (preview; full treatment in Ch.12)
The λ-Return
Instead of picking a single n, use a weighted average of all n-step returns:
G_t^λ = (1-λ) Σ_{n=1}^{∞} λ^{n-1} G_t^{(n)}
Weights: (1-λ), (1-λ)λ, (1-λ)λ², …, (sum = 1 for λ < 1)
Interpretation: λ controls how much we weight short-horizon vs long-horizon returns:
- λ = 0: G_t^0 = G_t^{(1)} = R_{t+1} + γV(S_{t+1}) — pure TD(0)
- λ = 1: G_t^1 = G_t (the full return) — pure MC
- λ = 0.9: exponentially decaying weights; ~10-step effective horizon
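A quick numeric check of the weighting (a minimal sketch; the infinite sum is truncated purely for illustration):

    import numpy as np

    lam = 0.9
    n = np.arange(1, 200)                    # n-step indices, truncated for illustration
    weights = (1 - lam) * lam ** (n - 1)     # (1-λ), (1-λ)λ, (1-λ)λ², ...
    print(weights[:3])                       # [0.1, 0.09, 0.081]
    print(weights.sum())                     # ≈ 1.0 for λ < 1 (exactly 1 as the sum runs to ∞)
    print(1 / (1 - lam))                     # 10.0, the effective horizon for λ = 0.9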
Update rule (forward view of TD(λ)):
V(S_t) ← V(S_t) + α [G_t^λ - V(S_t)]
Problem: G_t^λ requires future rewards (like MC) — must wait until episode end to compute.
Equivalent backward view (eligibility traces):
e_t(s) = γλ e_{t-1}(s) + 1[S_t = s] ← update trace
V(s) ← V(s) + α δ_t e_t(s) ← all states simultaneously
This is TD(λ): for offline updating it produces the same total update as the forward view; the online versions only match approximately unless true online TD(λ) is used (Ch.12 derivation).
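A minimal tabular sketch of the backward view with accumulating traces; the env.reset()/env.step() interface and the policy function are assumptions for illustration, not the book's code:

    import numpy as np

    def td_lambda_episode(env, policy, V, alpha=0.1, gamma=0.99, lam=0.9):
        """Run one episode of tabular TD(λ), backward view, accumulating traces."""
        e = np.zeros_like(V)                  # one eligibility trace per state
        s = env.reset()
        done = False
        while not done:
            s_next, r, done = env.step(policy(s))
            delta = r + (0.0 if done else gamma * V[s_next]) - V[s]   # one-step TD error δ_t
            e *= gamma * lam                  # decay every trace
            e[s] += 1.0                       # accumulate the trace of the visited state
            V += alpha * delta * e            # update all states at once, weighted by traces
            s = s_next
        return V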
Generalized Advantage Estimation (GAE)
GAE (Schulman et al., 2015) applies the λ-return idea to advantage estimation for policy gradient:
Advantage function:
A^π(s_t, a_t) = Q^π(s_t, a_t) - V^π(s_t)
TD residuals (δ_t):
δ_t = R_{t+1} + γV(S_{t+1}) - V(S_t) = 1-step TD error
Note: δ_t is an estimator of A^π(S_t, A_t) — unbiased if V = V^π.
n-Step advantage:
Â_t^{(n)} = Σ_{l=0}^{n-1} γ^l δ_{t+l} = G_t^{(n)} - V(S_t)
GAE(λ) = λ-weighted average of n-step advantages:
Â_t^{GAE(λ)} = Σ_{l=0}^{∞} (γλ)^l δ_{t+l}
= δ_t + γλ δ_{t+1} + (γλ)² δ_{t+2} + ...
Key property: this is a geometrically weighted sum of TD errors — exponentially down-weights distant steps.
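This can be evaluated directly from the definition (a sketch; deltas is assumed to be an array holding δ_t, …, δ_{T-1} for one finite trajectory):

    import numpy as np

    def gae_from_deltas(deltas, gamma=0.99, lam=0.95):
        """Â_t = Σ_l (γλ)^l δ_{t+l}, computed naively from the definition."""
        T = len(deltas)
        discounts = (gamma * lam) ** np.arange(T)
        return np.array([np.sum(discounts[:T - t] * deltas[t:]) for t in range(T)])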
GAE Derivation
Proof that GAE equals the λ-return minus the value baseline:
G_t^{(n)} - V(S_t) = Σ_{l=0}^{n-1} γ^l δ_{t+l}
(Proof: G_t^{(n)} - V(S_t) = R_{t+1} + γV(S_{t+1}) - V(S_t) + γ[R_{t+2} + γV(S_{t+2}) - V(S_{t+1})] + … = Σ γ^l δ_{t+l})
Therefore:
Â_t^{GAE(λ)} = Σ_{l=0}^{∞} (γλ)^l δ_{t+l}
= (1-λ) Σ_{n=1}^{∞} λ^{n-1} [G_t^{(n)} - V(S_t)]    (swap the summation order: the coefficient of γ^l δ_{t+l} is (1-λ) Σ_{n>l} λ^{n-1} = λ^l)
= (1-λ) Σ_{n=1}^{∞} λ^{n-1} G_t^{(n)} - V(S_t)      (the weights sum to 1, so the baseline factors out)
= G_t^λ - V(S_t)
GAE = λ-return minus value baseline = variance-reduced advantage estimate.
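A small numerical check of this identity on a random finite trajectory (a sketch; the trajectory is treated as terminating at step T, so V(S_T) = 0 and the tail weight λ^{T-1} goes to the full return):

    import numpy as np

    rng = np.random.default_rng(0)
    T, gamma, lam = 6, 0.99, 0.95
    rewards = rng.normal(size=T)                  # R_1 .. R_T
    values = np.append(rng.normal(size=T), 0.0)   # V(S_0) .. V(S_{T-1}), plus V(S_T) = 0 (terminal)

    deltas = rewards + gamma * values[1:] - values[:-1]
    gae_0 = sum((gamma * lam) ** l * deltas[l] for l in range(T))       # Â_0^{GAE(λ)}

    def n_step_return(n):                         # G_0^{(n)} = Σ γ^l R_{l+1} + γ^n V(S_n)
        return sum(gamma ** l * rewards[l] for l in range(n)) + gamma ** n * values[n]

    G_lambda_0 = ((1 - lam) * sum(lam ** (n - 1) * n_step_return(n) for n in range(1, T))
                  + lam ** (T - 1) * n_step_return(T))                  # truncated λ-return G_0^λ
    assert np.isclose(gae_0, G_lambda_0 - values[0])                    # Â_0^{GAE} = G_0^λ - V(S_0)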
Practical Computation of GAE
For a finite trajectory of length T:
Â_{T-1} = δ_{T-1} ← last step: no future
Â_{T-2} = δ_{T-2} + γλ Â_{T-1}
Â_{T-3} = δ_{T-3} + γλ Â_{T-2}
...
Â_t = δ_t + γλ Â_{t+1} ← backward recursion
Pseudocode (as a runnable sketch: rewards holds R_1..R_T, values holds V(S_0)..V(S_T), and done=True means the episode truly terminated at step T, so there is no bootstrap):

    import numpy as np

    def compute_gae(rewards, values, done, gamma=0.99, lam=0.95):
        T = len(rewards)
        advantages = np.zeros(T)
        returns = np.zeros(T)
        last_gae = 0.0
        for t in reversed(range(T)):
            # bootstrap from V(S_{t+1}); zero it at the last step if the episode terminated
            next_value = 0.0 if (t == T - 1 and done) else values[t + 1]
            delta = rewards[t] + gamma * next_value - values[t]     # δ_t
            last_gae = delta + gamma * lam * last_gae               # Â_t = δ_t + γλ Â_{t+1}
            advantages[t] = last_gae
            returns[t] = last_gae + values[t]                       # = G_t^λ, the value target
        return advantages, returns
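For example, on a toy 4-step trajectory (the numbers are arbitrary, just to show the shapes):

    rewards = np.array([1.0, 0.5, 0.0, 2.0])        # R_1 .. R_4
    values = np.array([0.8, 0.9, 0.7, 1.1, 0.0])    # V(S_0) .. V(S_4)
    adv, ret = compute_gae(rewards, values, done=True)
    # PPO-style code then typically standardizes: adv = (adv - adv.mean()) / (adv.std() + 1e-8)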
This is exactly what PPO, A2C, and DynamICCL implement for advantage estimation.
Bias-Variance of GAE
Each n-step advantage is biased only through its bootstrap term: |Bias[Â_t^{(n)}]| ≤ γ^n ||V - V^π||_∞. Averaging with weights (1-λ)λ^{n-1} gives
|Bias[Â_t^{GAE(λ)}]| ≤ γ(1-λ)/(1-γλ) · ||V - V^π||_∞, which shrinks to 0 as λ → 1
Var[Â_t^{GAE(λ)}] grows with λ (more reward terms enter the sum) and is lowest at λ = 0
λ controls bias-variance tradeoff:
| λ | Bias | Variance | When to use |
|---|---|---|---|
| 0 | High (1-step TD only) | Low | When V is accurate; short horizon |
| 0.9 | Medium | Medium | Most practical settings |
| 0.95 | Low | Medium-high | Long-horizon tasks |
| 1.0 | Zero | High (MC) | When V inaccurate; short episodes |
PPO defaults: γ = 0.99, λ = 0.95. This gives a ~20-step effective horizon (1/(1-0.95) = 20), capturing multi-step credit assignment without excessive variance.
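A quick check of what these defaults imply for the (γλ)^l weighting on the δ's (plain arithmetic, shown as a sketch):

    gamma, lam = 0.99, 0.95
    gl = gamma * lam                 # 0.9405
    print(1 / (1 - lam))             # 20.0  : λ-based effective horizon
    print(1 / (1 - gl))              # ≈ 16.8: horizon of the (γλ)^l weights on the δ's
    print(gl ** 20)                  # ≈ 0.29: relative weight still placed on δ_{t+20}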
Unified Picture: All Return Estimates
|                    | n = 1      | n = 2   | n = 3   | … | n = ∞      |
|--------------------|------------|---------|---------|---|------------|
| n-step G_t^{(n)}   | TD(0)      | 2-step  | 3-step  | … | MC         |
| λ-return G_t^λ     | λ = 0 (TD) | λ = 0.5 | λ = 0.9 | … | λ = 1 (MC) |
The λ-return (and GAE) is the continuous dial between pure TD and pure MC.
TD(λ) / GAE(λ, γ) subsumes everything (see the sketch below):
- λ = 1, γ = 1: REINFORCE (undiscounted MC policy gradient)
- λ = 0.95, γ = 0.99: PPO defaults
- λ = 1, γ = 0.99: MC with discounting
- λ = 0: one-step TD advantage (δ_t only)
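These corner cases fall out of the compute_gae sketch above by choosing λ and γ (reusing the toy rewards/values from the earlier usage example):

    adv_td0, _ = compute_gae(rewards, values, done=True, lam=0.0)    # Â_t = δ_t: one-step TD advantage
    adv_mc, _ = compute_gae(rewards, values, done=True, gamma=1.0, lam=1.0)
    # With γ = 1, λ = 1: Â_t = Σ_k R_k - V(S_t), the undiscounted MC return minus the baseline (REINFORCE-style)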
Connection to DynamICCL
DynamICCL uses PPO with GAE(λ=0.95, γ=0.99):
Advantage for NCCL step t:
Â_t = δ_t + 0.99·0.95·δ_{t+1} + (0.99·0.95)²·δ_{t+2} + ...
≈ Σ_{l=0}^{~20} (0.9405)^l δ_{t+l}
Where δ_t = R_{t+1} + γV(S_{t+1}) - V(S_t), with R_{t+1} the throughput measured after the step-t configuration decision and V the critic's predicted value.
This means DynamICCL’s advantage signal captures approximately 20 steps of future NCCL performance when evaluating each configuration decision — appropriate for capturing how a configuration choice affects throughput over a multi-step scheduling window.
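As a purely illustrative tie-in (the throughput rewards and critic values below are made-up placeholders, not DynamICCL's actual signals or interfaces), the same compute_gae sketch applies directly:

    # Hypothetical per-step rewards: throughput measured after each NCCL configuration decision
    throughput = np.array([42.0, 45.5, 44.8, 47.2, 47.9])
    values_hat = np.array([41.0, 43.0, 44.0, 45.0, 46.0, 0.0])   # critic estimates V(S_0) .. V(S_5)
    adv, _ = compute_gae(throughput, values_hat, done=True, gamma=0.99, lam=0.95)
    # adv[t] aggregates the (γλ)-discounted future throughput surprise for the decision at step t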