n-Step TD Prediction
Chapter 7 — n-Step Bootstrapping
Book: Reinforcement Learning: An Introduction (Sutton & Barto, 2nd ed) Pages: 141–155
The n-Step Return
Problem: TD(0) uses 1 step before bootstrapping → low variance, biased. MC uses T steps → unbiased, high variance. Can we interpolate?
n-step return: look ahead n steps, then bootstrap:
G_t^{(n)} = R_{t+1} + γR_{t+2} + ... + γ^{n-1}R_{t+n} + γ^n V(S_{t+n})
= Σ_{k=0}^{n-1} γ^k R_{t+k+1} + γ^n V(S_{t+n})
Special cases:
- n = 1: G_t^{(1)} = R_{t+1} + γV(S_{t+1}), the TD(0) target
- n = 2: G_t^{(2)} = R_{t+1} + γR_{t+2} + γ²V(S_{t+2})
- n ≥ T − t: G_t^{(n)} = R_{t+1} + γR_{t+2} + … + γ^{T−t−1}R_T, the MC return (terminal V = 0)
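A minimal Python sketch of this computation; the helper name `n_step_return` and its argument list are illustrative, not from the book.

```python
def n_step_return(rewards, v_bootstrap, gamma, n, episode_ended):
    """n-step return G_t^{(n)} from the rewards observed after time t.

    rewards       -- [R_{t+1}, ..., R_{t+n}] (shorter if the episode ended earlier)
    v_bootstrap   -- current estimate V(S_{t+n}); ignored if the episode ended
    episode_ended -- True if a terminal state was reached within these n steps
    """
    g = sum(gamma ** k * r for k, r in enumerate(rewards))
    if not episode_ended:
        g += gamma ** n * v_bootstrap   # bootstrap on the current value estimate
    return g

# With one reward this reduces to the TD(0) target;
# with a full episode's rewards and episode_ended=True it reduces to the MC return.
```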
n-step TD update:
V(S_t) ← V(S_t) + α [G_t^{(n)} - V(S_t)]
Wait n steps before updating. At step t+n: have all rewards R_{t+1},…,R_{t+n} and can compute G_t^{(n)}.
Algorithm: n-Step TD
Input: π, n, α ∈ (0,1], γ
Loop (episodes):
    Initialize S_0 and store it in S[0]
    T ← ∞   (terminal time; unknown until reached)
    For t = 0, 1, 2, ...:
        If t < T:
            Take A_t ~ π(·|S_t)
            Observe R_{t+1}, S_{t+1}; store them in R[t+1], S[t+1]
            If S_{t+1} is terminal: T ← t + 1
        τ ← t - n + 1   (time step whose estimate is updated)
        If τ ≥ 0:
            G ← Σ_{i=τ+1}^{min(τ+n, T)} γ^{i-τ-1} R[i]   (sum of stored rewards)
            If τ + n < T:
                G ← G + γ^n V(S[τ+n])   (bootstrap if S_{τ+n} is not terminal)
            V(S[τ]) ← V(S[τ]) + α [G - V(S[τ])]
        Until τ = T - 1   (every state visited in the episode has been updated)
Delay: update for S_t happens n steps later (at time t+n). This requires storing the last n rewards and states.
Memory: O(n) — need to buffer n steps of (S, A, R).
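A runnable sketch of the tabular algorithm above. The `env_reset`/`env_step`/`policy` interface and integer state indices are assumptions made for illustration, not part of the book's pseudocode.

```python
import numpy as np

def n_step_td(env_reset, env_step, policy, n, alpha, gamma,
              num_states, num_episodes):
    """Tabular n-step TD prediction, following the pseudocode above.

    env_reset()     -> initial state S_0 (an integer index)
    env_step(s, a)  -> (reward, next_state, done)
    policy(s)       -> action sampled from pi(.|s)
    Returns the estimated state-value array V.
    """
    V = np.zeros(num_states)
    for _ in range(num_episodes):
        states = {0: env_reset()}                  # S[t]
        rewards = {}                               # R[t]
        T = float('inf')                           # terminal time, unknown until reached
        t = 0
        while True:
            if t < T:
                a = policy(states[t])
                r, s_next, done = env_step(states[t], a)
                states[t + 1] = s_next
                rewards[t + 1] = r
                if done:
                    T = t + 1
            tau = t - n + 1                        # time step whose estimate is updated
            if tau >= 0:
                G = sum(gamma ** (i - tau - 1) * rewards[i]
                        for i in range(tau + 1, int(min(tau + n, T)) + 1))
                if tau + n < T:
                    G += gamma ** n * V[states[tau + n]]   # bootstrap on V(S_{tau+n})
                s_tau = states[tau]
                V[s_tau] += alpha * (G - V[s_tau])
            if tau == T - 1:                       # every state in the episode updated
                break
            t += 1
    return V
```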
Intermediate n is Often Best
Empirical rule of thumb (Watkins, 1989): an intermediate n* typically minimizes prediction error, with n* roughly on the order of 1/(1-γ).
Intuition:
- Small n: low variance, but biased by early bootstrapping; misses long-range dependencies
- Large n: captures long-range dependencies, but high variance from many random steps
- Intermediate n: balances bias and variance
19-state random walk example (S&B Figure 7.2):
RMS error after 10 episodes vs n:
n=1: RMSE ≈ 0.15 (TD — too much bias)
n=4: RMSE ≈ 0.10 ← optimal around n=4
n=16: RMSE ≈ 0.12
n=T: RMSE ≈ 0.18 (MC — too much variance)
The intermediate n consistently outperforms both extremes.
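A sketch of that comparison, reusing `n_step_td` from the sketch above. The environment encoding, the single step size α = 0.4, and the number of runs are illustrative choices, so the exact RMSE values will differ from the figure.

```python
import numpy as np

# 19-state random walk: states 1..19, start at 10; stepping off the left end
# (state 0) gives reward -1, off the right end (state 20) gives +1, else 0.
N_STATES = 21                       # indices 0 and 20 act as terminal states
START = 10

def env_reset():
    return START

def env_step(s, a):                 # a is -1 (left) or +1 (right)
    s_next = s + a
    if s_next == 0:
        return -1.0, s_next, True
    if s_next == 20:
        return +1.0, s_next, True
    return 0.0, s_next, False

def random_policy(s):               # equiprobable left/right
    return np.random.choice([-1, +1])

true_v = np.array([s / 10.0 - 1.0 for s in range(N_STATES)])   # analytic values

# Sweep n, reusing n_step_td from the sketch above (gamma = 1 for this task).
for n in [1, 2, 4, 8, 16]:
    errors = []
    for run in range(100):
        np.random.seed(run)
        V = n_step_td(env_reset, env_step, random_policy, n=n, alpha=0.4,
                      gamma=1.0, num_states=N_STATES, num_episodes=10)
        errors.append(np.sqrt(np.mean((V[1:20] - true_v[1:20]) ** 2)))
    print(f"n={n:2d}   mean RMSE over 100 runs: {np.mean(errors):.3f}")
```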
n-Step Control: SARSA
n-Step SARSA target (on-policy):
G_t^{(n)} = R_{t+1} + γR_{t+2} + ... + γ^{n-1}R_{t+n} + γ^n Q(S_{t+n}, A_{t+n})
Uses the actual next actions A_{t+1}, …, A_{t+n} — maintains on-policy nature.
Update:
Q(S_t, A_t) ← Q(S_t, A_t) + α [G_t^{(n)} - Q(S_t, A_t)]
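A minimal sketch of the n-step SARSA target; the function name is illustrative and `Q` is assumed to be a 2-D array indexed by state and action.

```python
def n_step_sarsa_target(rewards, Q, s_n, a_n, gamma, n, done):
    """n-step SARSA target: discounted rewards plus gamma^n * Q(S_{t+n}, A_{t+n}).

    rewards  -- [R_{t+1}, ..., R_{t+n}] (shorter if the episode ended earlier)
    s_n, a_n -- state reached and action selected n steps later (unused if done)
    done     -- True if the episode terminated within these n steps
    """
    g = sum(gamma ** k * r for k, r in enumerate(rewards))
    if not done:
        g += gamma ** n * Q[s_n, a_n]
    return g

# The update mirrors n-step TD, applied to Q instead of V:
#   Q[s_t, a_t] += alpha * (n_step_sarsa_target(...) - Q[s_t, a_t])
```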
n-Step Q(σ): unifies SARSA (σ = 1, sample the action actually taken) and Expected SARSA / tree backup (σ = 0, take the expectation under π). Each step of the backup mixes the two:
G_t^{(n)} = R_{t+1} + γ [(1-σ_{t+1}) Σ_a π(a|S_{t+1}) Q(S_{t+1}, a) + σ_{t+1} Q(S_{t+1}, A_{t+1})] + ...
with the same σ-weighted mixing applied at steps t+2, …, t+n; S&B §7.6 gives the exact recursive form, including the importance-sampling corrections needed off-policy.
Off-Policy n-Step TD with Importance Sampling
Off-Policy n-step return: importance sampling ratio for n-step trajectory:
G_t^{(n)} = R_{t+1} + ... + γ^{n-1}R_{t+n} + γ^n V(S_{t+n})
ρ_{t:t+n-1} = Π_{k=t}^{min(t+n-1, T-1)} π(A_k|S_k) / b(A_k|S_k)
(for state-value updates the product starts at k = t; for action-value updates it starts at k = t+1, since A_t is already taken)
Off-policy n-step update:
V(S_t) ← V(S_t) + α ρ_{t:t+n-1} [G_t^{(n)} - V(S_t)]
Problem: ρ_{t:t+n-1} can be very large (as in off-policy MC with n → T), and its variance grows with n.
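A sketch of the truncated ratio and its use in the update; `pi` and `b` are assumed to be callables returning action probabilities, and the names are illustrative.

```python
def importance_ratio(states, actions, pi, b, t, n, T):
    """Truncated importance-sampling ratio rho_{t:t+n-1} for n-step off-policy TD.

    pi(a, s) and b(a, s) return target and behavior policy probabilities.
    For action-value (Q) updates, start the product at k = t + 1 instead,
    since A_t is already taken.
    """
    rho = 1.0
    for k in range(t, min(t + n - 1, T - 1) + 1):
        rho *= pi(actions[k], states[k]) / b(actions[k], states[k])
    return rho

# Off-policy n-step update of V:
#   rho = importance_ratio(states, actions, pi, b, tau, n, T)
#   V[s_tau] += alpha * rho * (G - V[s_tau])
```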
n-Step tree backup (no IS needed): instead of sampling, take expected values at each step:
G_t^{(n)} = R_{t+1} + γ Σ_{a ≠ A_{t+1}} π(a|S_{t+1}) Q(S_{t+1}, a) + γ π(A_{t+1}|S_{t+1}) G_{t+1}^{(n-1)}
with base case G_t^{(1)} = R_{t+1} + γ Σ_a π(a|S_{t+1}) Q(S_{t+1}, a): at each step the actions not taken contribute their current Q estimates weighted by π, while the action actually taken is followed one level deeper, weighted by π(A_{t+1}|S_{t+1}).
Tree backup uses expected values → no IS needed → lower variance for off-policy learning.
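A recursive sketch of the tree-backup return, using the same stored-episode convention as the n-step TD sketch above; the names and the `pi(a, s)` callable are assumptions for illustration.

```python
def tree_backup_return(rewards, states, actions, Q, pi, gamma, tau, n, T):
    """n-step tree-backup return starting at time tau, computed recursively.

    rewards[i], states[i], actions[i] are the stored episode arrays indexed by
    time step; pi(a, s) is the target policy's probability of a in s; Q is a
    2-D array indexed by (state, action).
    """
    num_actions = Q.shape[1]

    def G(t, h):
        if t == T - 1:                   # R_T is the last reward; nothing left to back up
            return rewards[t + 1]
        s1, a1 = states[t + 1], actions[t + 1]
        if t == h - 1:                   # horizon reached: expectation over all actions
            return rewards[t + 1] + gamma * sum(
                pi(a, s1) * Q[s1, a] for a in range(num_actions))
        # actions not taken contribute their Q values; the taken action is followed deeper
        leaves = sum(pi(a, s1) * Q[s1, a]
                     for a in range(num_actions) if a != a1)
        return (rewards[t + 1] + gamma * leaves
                + gamma * pi(a1, s1) * G(t + 1, h))

    return G(tau, tau + n)
```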
Bias-Variance of n-Step Returns
Variance grows with n (roughly):
Var[G_t^{(n)}] ≈ Σ_{k=0}^{n-1} γ^{2k} Var[R_{t+k+1}] + γ^{2n} Var[V(S_{t+n})]
For constant reward variance σ²_R:
Var[G_t^{(n)}] ≈ σ²_R (1 - γ^{2n}) / (1 - γ²) + γ^{2n} Var[V(S_{t+n})]
As n → ∞: Var → σ²_R/(1-γ²) + 0 = MC variance.
Bias of G_t^{(n)}:
E[G_t^{(n)}] - V^π(S_t) = γ^n E[V(S_{t+n}) - V^π(S_{t+n})]
Bias = γ^n × (current value function error at S_{t+n}). As n → ∞: bias → 0 (MC).
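A quick numeric illustration of the two formulas above; the chosen values of γ, σ²_R, and Var[V] are made up purely to show the trend.

```python
# Illustrative numbers only: reward variance 1, value-estimate variance 0.5.
gamma, sigma2_R, var_V = 0.95, 1.0, 0.5
for n in [1, 2, 4, 8, 16, 64]:
    var_G = sigma2_R * (1 - gamma ** (2 * n)) / (1 - gamma ** 2) \
            + gamma ** (2 * n) * var_V
    bias_factor = gamma ** n   # multiplies the current value error at S_{t+n}
    print(f"n={n:3d}   Var[G] ~= {var_G:6.2f}   bias factor gamma^n = {bias_factor:.3f}")
```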
Choosing n in Practice
| Problem type | Recommended n |
|---|---|
| Short episodes (T ≤ 20) | n = T (MC) or n = T/2 |
| Long episodes (T ≥ 100) | n = 5-20 |
| Continuing tasks | n = 10-100 |
| Sparse rewards | Large n (need more steps to see reward) |
| Dense rewards | Small n (can bootstrap frequently) |
In deep RL practice: PPO uses GAE with λ ∈ [0.9, 0.99], A3C/A2C use n = 5 to 20 step returns, and DQN uses n = 1.
Connection to DynamICCL
The Generalized Advantage Estimation (GAE) used in DynamICCL is exactly a weighted combination of n-step returns:
GAE(λ): Â_t = Σ_{l=0}^{∞} (γλ)^l δ_{t+l}, where δ_t = R_{t+1} + γV(S_{t+1}) - V(S_t) is the 1-step TD error.
This is a geometric mixture of all n-step TD errors, with λ controlling the effective n-step horizon:
- λ = 0: Â_t = δ_t, the 1-step TD error (n = 1)
- λ = 1: Â_t = G_t - V(S_t), the MC return minus a baseline (n = T)
- λ = 0.95: effective horizon 1/(1 - 0.95) = 20, i.e. heavy weight on roughly 20-step returns
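A minimal sketch of GAE computed with the standard backward recursion Â_t = δ_t + γλÂ_{t+1}; the function name and the single-segment assumption (no terminations inside the segment) are illustrative.

```python
import numpy as np

def gae_advantages(rewards, values, gamma, lam, last_value=0.0, done=True):
    """GAE(gamma, lambda) for one trajectory segment, via the backward
    recursion A_t = delta_t + gamma * lam * A_{t+1}.

    rewards    -- [R_1, ..., R_T] for the segment
    values     -- [V(S_0), ..., V(S_{T-1})] current value estimates
    last_value -- V(S_T), used to bootstrap if the segment was truncated (done=False)
    """
    values = np.append(values, 0.0 if done else last_value)
    adv = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # 1-step TD error
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

# lam=0.0 recovers the 1-step TD errors; lam=1.0 recovers MC return minus baseline.
```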
For NCCL tuning with long job horizons, λ=0.95-0.99 gives the right balance: captures enough temporal context (job completion trends) while keeping variance manageable.