n-Step TD Prediction
Chapter 7 — n-Step Bootstrapping
Book: Reinforcement Learning: An Introduction (Sutton & Barto, 2nd ed) Pages: 141–155
The n-Step Return
Problem: TD(0) uses 1 step before bootstrapping → low variance, biased. MC uses T steps → unbiased, high variance. Can we interpolate?
n-step return: look ahead n steps, then bootstrap:
G_t^{(n)} = R_{t+1} + γR_{t+2} + ... + γ^{n-1}R_{t+n} + γ^n V(S_{t+n})
= Σ_{k=0}^{n-1} γ^k R_{t+k+1} + γ^n V(S_{t+n})
Special cases:
- n = 1: G_t^{(1)} = R_{t+1} + γV(S_{t+1}), the TD(0) target
- n = 2: G_t^{(2)} = R_{t+1} + γR_{t+2} + γ²V(S_{t+2})
- n ≥ T − t: G_t^{(n)} = R_{t+1} + γR_{t+2} + … + γ^{T−t−1}R_T, the MC return (terminal V = 0)
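A minimal Python sketch of this computation; the helper name `n_step_return` and its argument list are illustrative, not from the book.

```python
def n_step_return(rewards, v_bootstrap, gamma, n, episode_ended):
    """n-step return G_t^{(n)} from the rewards observed after time t.

    rewards       -- [R_{t+1}, ..., R_{t+n}] (shorter if the episode ended earlier)
    v_bootstrap   -- current estimate V(S_{t+n}); ignored if the episode ended
    episode_ended -- True if a terminal state was reached within these n steps
    """
    g = sum(gamma ** k * r for k, r in enumerate(rewards))
    if not episode_ended:
        g += gamma ** n * v_bootstrap   # bootstrap on the current value estimate
    return g

# With one reward this reduces to the TD(0) target;
# with a full episode's rewards and episode_ended=True it reduces to the MC return.
```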
n-step TD update:
V(S_t) ← V(S_t) + α [G_t^{(n)} - V(S_t)]
Wait n steps before updating. At step t+n: have all rewards R_{t+1},…,R_{t+n} and can compute G_t^{(n)}.
Algorithm: n-Step TD
Input: π, n, α ∈ (0,1], γ
Loop (episodes):
    Initialize S_0 and store it in S[0]
    T ← ∞   (terminal time; unknown until reached)
    For t = 0, 1, 2, ...:
        If t < T:
            Take A_t ~ π(·|S_t)
            Observe R_{t+1}, S_{t+1}; store them in R[t+1], S[t+1]
            If S_{t+1} is terminal: T ← t + 1
        τ ← t - n + 1   (time step whose estimate is updated)
        If τ ≥ 0:
            G ← Σ_{i=τ+1}^{min(τ+n, T)} γ^{i-τ-1} R[i]   (sum of stored rewards)
            If τ + n < T:
                G ← G + γ^n V(S[τ+n])   (bootstrap if S_{τ+n} is not terminal)
            V(S[τ]) ← V(S[τ]) + α [G - V(S[τ])]
        Until τ = T - 1   (every state visited in the episode has been updated)
Delay: update for S_t happens n steps later (at time t+n). This requires storing the last n rewards and states.
Memory: O(n) — need to buffer n steps of (S, A, R).
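A runnable sketch of the tabular algorithm above. The `env_reset`/`env_step`/`policy` interface and integer state indices are assumptions made for illustration, not part of the book's pseudocode.

```python
import numpy as np

def n_step_td(env_reset, env_step, policy, n, alpha, gamma,
              num_states, num_episodes):
    """Tabular n-step TD prediction, following the pseudocode above.

    env_reset()     -> initial state S_0 (an integer index)
    env_step(s, a)  -> (reward, next_state, done)
    policy(s)       -> action sampled from pi(.|s)
    Returns the estimated state-value array V.
    """
    V = np.zeros(num_states)
    for _ in range(num_episodes):
        states = {0: env_reset()}                  # S[t]
        rewards = {}                               # R[t]
        T = float('inf')                           # terminal time, unknown until reached
        t = 0
        while True:
            if t < T:
                a = policy(states[t])
                r, s_next, done = env_step(states[t], a)
                states[t + 1] = s_next
                rewards[t + 1] = r
                if done:
                    T = t + 1
            tau = t - n + 1                        # time step whose estimate is updated
            if tau >= 0:
                G = sum(gamma ** (i - tau - 1) * rewards[i]
                        for i in range(tau + 1, int(min(tau + n, T)) + 1))
                if tau + n < T:
                    G += gamma ** n * V[states[tau + n]]   # bootstrap on V(S_{tau+n})
                s_tau = states[tau]
                V[s_tau] += alpha * (G - V[s_tau])
            if tau == T - 1:                       # every state in the episode updated
                break
            t += 1
    return V
```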
Intermediate n is Often Best
Empirical rule of thumb (Watkins, 1989): an intermediate n* typically minimizes prediction error, with n* roughly on the order of 1/(1-γ).
Intuition:
- Small n: low variance, but biased by early bootstrapping; misses long-range dependencies
- Large n: captures long-range dependencies, but high variance from many random steps
- Intermediate n: balances bias and variance
19-state random walk example (S&B Figure 7.2):
RMS error after 10 episodes vs n:
n=1: RMSE ≈ 0.15 (TD — too much bias)
n=4: RMSE ≈ 0.10 ← optimal around n=4
n=16: RMSE ≈ 0.12
n=T: RMSE ≈ 0.18 (MC — too much variance)
The intermediate n consistently outperforms both extremes.
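A sketch of that comparison, reusing `n_step_td` from the sketch above. The environment encoding, the single step size α = 0.4, and the number of runs are illustrative choices, so the exact RMSE values will differ from the figure.

```python
import numpy as np

# 19-state random walk: states 1..19, start at 10; stepping off the left end
# (state 0) gives reward -1, off the right end (state 20) gives +1, else 0.
N_STATES = 21                       # indices 0 and 20 act as terminal states
START = 10

def env_reset():
    return START

def env_step(s, a):                 # a is -1 (left) or +1 (right)
    s_next = s + a
    if s_next == 0:
        return -1.0, s_next, True
    if s_next == 20:
        return +1.0, s_next, True
    return 0.0, s_next, False

def random_policy(s):               # equiprobable left/right
    return np.random.choice([-1, +1])

true_v = np.array([s / 10.0 - 1.0 for s in range(N_STATES)])   # analytic values

# Sweep n, reusing n_step_td from the sketch above (gamma = 1 for this task).
for n in [1, 2, 4, 8, 16]:
    errors = []
    for run in range(100):
        np.random.seed(run)
        V = n_step_td(env_reset, env_step, random_policy, n=n, alpha=0.4,
                      gamma=1.0, num_states=N_STATES, num_episodes=10)
        errors.append(np.sqrt(np.mean((V[1:20] - true_v[1:20]) ** 2)))
    print(f"n={n:2d}   mean RMSE over 100 runs: {np.mean(errors):.3f}")
```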
n-Step Control: SARSA
n-Step SARSA target (on-policy):
G_t^{(n)} = R_{t+1} + γR_{t+2} + ... + γ^{n-1}R_{t+n} + γ^n Q(S_{t+n}, A_{t+n})
Uses the actual next actions A_{t+1}, …, A_{t+n} — maintains on-policy nature.
Update:
Q(S_t, A_t) ← Q(S_t, A_t) + α [G_t^{(n)} - Q(S_t, A_t)]
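A minimal sketch of the n-step SARSA target; the function name is illustrative and `Q` is assumed to be a 2-D array indexed by state and action.

```python
def n_step_sarsa_target(rewards, Q, s_n, a_n, gamma, n, done):
    """n-step SARSA target: discounted rewards plus gamma^n * Q(S_{t+n}, A_{t+n}).

    rewards  -- [R_{t+1}, ..., R_{t+n}] (shorter if the episode ended earlier)
    s_n, a_n -- state reached and action selected n steps later (unused if done)
    done     -- True if the episode terminated within these n steps
    """
    g = sum(gamma ** k * r for k, r in enumerate(rewards))
    if not done:
        g += gamma ** n * Q[s_n, a_n]
    return g

# The update mirrors n-step TD, applied to Q instead of V:
#   Q[s_t, a_t] += alpha * (n_step_sarsa_target(...) - Q[s_t, a_t])
```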
n-Step Q(σ): unifies SARSA (σ = 1, sample the action actually taken) and Expected SARSA / tree backup (σ = 0, take the expectation under π). Each step of the backup mixes the two:
G_t^{(n)} = R_{t+1} + γ [(1-σ_{t+1}) Σ_a π(a|S_{t+1}) Q(S_{t+1}, a) + σ_{t+1} Q(S_{t+1}, A_{t+1})] + ...
with the same σ-weighted mixing applied at steps t+2, …, t+n; S&B §7.6 gives the exact recursive form, including the importance-sampling corrections needed off-policy.
Off-Policy n-Step TD with Importance Sampling
Off-Policy n-step return: importance sampling ratio for n-step trajectory:
G_t^{(n)} = R_{t+1} + ... + γ^{n-1}R_{t+n} + γ^n V(S_{t+n})
ρ_{t:t+n-1} = Π_{k=t}^{min(t+n-1, T-1)} π(A_k|S_k) / b(A_k|S_k)
(for state-value updates the product starts at k = t; for action-value updates it starts at k = t+1, since A_t is already taken)
Off-policy n-step update:
V(S_t) ← V(S_t) + α ρ_{t:t+n-1} [G_t^{(n)} - V(S_t)]
Problem: ρ_{t:t+n-1} can be very large (as in off-policy MC with n → T), and its variance grows with n.
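A sketch of the truncated ratio and its use in the update; `pi` and `b` are assumed to be callables returning action probabilities, and the names are illustrative.

```python
def importance_ratio(states, actions, pi, b, t, n, T):
    """Truncated importance-sampling ratio rho_{t:t+n-1} for n-step off-policy TD.

    pi(a, s) and b(a, s) return target and behavior policy probabilities.
    For action-value (Q) updates, start the product at k = t + 1 instead,
    since A_t is already taken.
    """
    rho = 1.0
    for k in range(t, min(t + n - 1, T - 1) + 1):
        rho *= pi(actions[k], states[k]) / b(actions[k], states[k])
    return rho

# Off-policy n-step update of V:
#   rho = importance_ratio(states, actions, pi, b, tau, n, T)
#   V[s_tau] += alpha * rho * (G - V[s_tau])
```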
n-Step tree backup (no IS needed): instead of sampling, take expected values at each step:
G_t^{(n)} = R_{t+1} + γ Σ_{a ≠ A_{t+1}} π(a|S_{t+1}) Q(S_{t+1}, a) + γ π(A_{t+1}|S_{t+1}) G_{t+1}^{(n-1)}
with base case G_t^{(1)} = R_{t+1} + γ Σ_a π(a|S_{t+1}) Q(S_{t+1}, a): at each step the actions not taken contribute their current Q estimates weighted by π, while the action actually taken is followed one level deeper, weighted by π(A_{t+1}|S_{t+1}).
Tree backup uses expected values → no IS needed → lower variance for off-policy learning.
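A recursive sketch of the tree-backup return, using the same stored-episode convention as the n-step TD sketch above; the names and the `pi(a, s)` callable are assumptions for illustration.

```python
def tree_backup_return(rewards, states, actions, Q, pi, gamma, tau, n, T):
    """n-step tree-backup return starting at time tau, computed recursively.

    rewards[i], states[i], actions[i] are the stored episode arrays indexed by
    time step; pi(a, s) is the target policy's probability of a in s; Q is a
    2-D array indexed by (state, action).
    """
    num_actions = Q.shape[1]

    def G(t, h):
        if t == T - 1:                   # R_T is the last reward; nothing left to back up
            return rewards[t + 1]
        s1, a1 = states[t + 1], actions[t + 1]
        if t == h - 1:                   # horizon reached: expectation over all actions
            return rewards[t + 1] + gamma * sum(
                pi(a, s1) * Q[s1, a] for a in range(num_actions))
        # actions not taken contribute their Q values; the taken action is followed deeper
        leaves = sum(pi(a, s1) * Q[s1, a]
                     for a in range(num_actions) if a != a1)
        return (rewards[t + 1] + gamma * leaves
                + gamma * pi(a1, s1) * G(t + 1, h))

    return G(tau, tau + n)
```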
Bias-Variance of n-Step Returns
Variance grows with n (roughly):
Var[G_t^{(n)}] ≈ Σ_{k=0}^{n-1} γ^{2k} Var[R_{t+k+1}] + γ^{2n} Var[V(S_{t+n})]
For constant reward variance σ²_R:
Var[G_t^{(n)}] ≈ σ²_R (1 - γ^{2n}) / (1 - γ²) + γ^{2n} Var[V(S_{t+n})]
As n → ∞: Var → σ²_R/(1-γ²) + 0 = MC variance.
Bias of G_t^{(n)}:
E[G_t^{(n)}] - V^π(S_t) = γ^n E[V(S_{t+n}) - V^π(S_{t+n})]
Bias = γ^n × (current value function error at S_{t+n}). As n → ∞: bias → 0 (MC).
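A quick numeric illustration of the two formulas above; the chosen values of γ, σ²_R, and Var[V] are made up purely to show the trend.

```python
# Illustrative numbers only: reward variance 1, value-estimate variance 0.5.
gamma, sigma2_R, var_V = 0.95, 1.0, 0.5
for n in [1, 2, 4, 8, 16, 64]:
    var_G = sigma2_R * (1 - gamma ** (2 * n)) / (1 - gamma ** 2) \
            + gamma ** (2 * n) * var_V
    bias_factor = gamma ** n   # multiplies the current value error at S_{t+n}
    print(f"n={n:3d}   Var[G] ~= {var_G:6.2f}   bias factor gamma^n = {bias_factor:.3f}")
```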
Choosing n in Practice
| Problem type | Recommended n |
|---|---|
| Short episodes (T ≤ 20) | n = T (MC) or n = T/2 |
| Long episodes (T ≥ 100) | n = 5-20 |
| Continuing tasks | n = 10-100 |
| Sparse rewards | Large n (need more steps to see reward) |
| Dense rewards | Small n (can bootstrap frequently) |
In deep RL practice: PPO uses GAE with λ ∈ [0.9, 0.99], A3C/A2C use n = 5 to 20 step returns, and DQN uses n = 1.
Connection to DynamICCL
The Generalized Advantage Estimation (GAE) used in DynamICCL is exactly a weighted combination of n-step returns:
GAE(λ): Â_t = Σ_{l=0}^{∞} (γλ)^l δ_{t+l}, where δ_t = R_{t+1} + γV(S_{t+1}) - V(S_t) is the 1-step TD error.
This is a geometric mixture of all n-step TD errors, with λ controlling the effective n-step horizon:
- λ = 0: Â_t = δ_t, the 1-step TD error (n = 1)
- λ = 1: Â_t = G_t - V(S_t), the MC return minus a baseline (n = T)
- λ = 0.95: effective horizon 1/(1 - 0.95) = 20, i.e. heavy weight on roughly 20-step returns
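A minimal sketch of GAE computed with the standard backward recursion Â_t = δ_t + γλÂ_{t+1}; the function name and the single-segment assumption (no terminations inside the segment) are illustrative.

```python
import numpy as np

def gae_advantages(rewards, values, gamma, lam, last_value=0.0, done=True):
    """GAE(gamma, lambda) for one trajectory segment, via the backward
    recursion A_t = delta_t + gamma * lam * A_{t+1}.

    rewards    -- [R_1, ..., R_T] for the segment
    values     -- [V(S_0), ..., V(S_{T-1})] current value estimates
    last_value -- V(S_T), used to bootstrap if the segment was truncated (done=False)
    """
    values = np.append(values, 0.0 if done else last_value)
    adv = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # 1-step TD error
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

# lam=0.0 recovers the 1-step TD errors; lam=1.0 recovers MC return minus baseline.
```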
For NCCL tuning with long job horizons, λ=0.95-0.99 gives the right balance: captures enough temporal context (job completion trends) while keeping variance manageable.