Per-Decision Methods and Chapter 7 Summary

Chapter 7 — n-Step Bootstrapping

Book: Reinforcement Learning: An Introduction (Sutton & Barto, 2nd ed) Pages: 148–158


Per-Decision Importance Sampling

Problem with standard n-step off-policy IS:

For behavior policy b and target policy π, the n-step IS ratio:

ρ_{t:t+n-1} = Π_{k=t}^{t+n-1} π(A_k|S_k) / b(A_k|S_k)

Applying this single ratio to the entire n-step return creates high variance: the product of n ratios can blow up, and its variance grows exponentially with n.

Per-decision IS: apply IS ratio to each reward separately:

G_{t|π}^{(n)} = Σ_{k=t}^{t+n-1} γ^{k-t} ρ_{t:k} R_{k+1} + γ^n ρ_{t:t+n-1} V(S_{t+n})

where ρ_{t:k} = Π_{j=t}^{k} π(A_j|S_j)/b(A_j|S_j)

This has lower variance than applying a single IS ratio to the whole return, while remaining unbiased.

Practical implementation: used in V-trace (IMPALA) and ACER for distributed RL systems where workers collect data under different policy versions.
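
A minimal NumPy sketch of the per-decision return above, assuming the per-step probabilities π(A_k|S_k) and b(A_k|S_k) along the trajectory are available; function and array names are illustrative, not from the book:

```python
import numpy as np

def per_decision_nstep_return(rewards, pi_probs, b_probs, v_bootstrap, gamma, n):
    """Per-decision importance-sampled n-step return from time t = 0.

    rewards[k]  = R_{k+1},  pi_probs[k] = pi(A_k|S_k),  b_probs[k] = b(A_k|S_k),
    v_bootstrap = V(S_n) from the current value estimate.  Inputs are 1-D arrays.
    """
    ratios = pi_probs[:n] / b_probs[:n]        # per-step ratios pi/b
    cum_ratios = np.cumprod(ratios)            # rho_{t:k}: product of ratios up to action A_k
    discounts = gamma ** np.arange(n)          # gamma^{k-t}
    # Each reward R_{k+1} is weighted only by the ratios of actions taken before it
    # is received (A_t ... A_k), never by actions taken afterwards.
    corrected_rewards = np.sum(discounts * cum_ratios * rewards[:n])
    # The bootstrap term carries the full n-step ratio rho_{t:t+n-1}.
    return corrected_rewards + gamma ** n * cum_ratios[-1] * v_bootstrap

# Toy usage with made-up probabilities:
G = per_decision_nstep_return(
    rewards=np.array([1.0, 0.0, 1.0]),
    pi_probs=np.array([0.9, 0.5, 0.8]),
    b_probs=np.array([0.5, 0.5, 0.5]),
    v_bootstrap=0.7, gamma=0.99, n=3)
print(G)
```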


V-trace: Production Off-Policy n-Step Return

V-trace (Espeholt et al., 2018) — used in IMPALA (distributed actor-learner):

V-trace target for state S_t:
  v_t = V(S_t) + Σ_{k=t}^{t+n-1} γ^{k-t} (Π_{j=t}^{k-1} c_j) δ_k

where:
  δ_k = ρ_k [R_{k+1} + γ V(S_{k+1}) - V(S_k)]      ← IS-corrected TD error
  ρ_k = min(ρ̄, π(A_k|S_k)/b(A_k|S_k))             ← clipped IS ratio
  c_j = min(c̄, π(A_j|S_j)/b(A_j|S_j))             ← trace coefficient

Two IS clips:
- ρ̄: clips the ratio inside the TD error, so it determines whose value function the targets converge to (typically ρ̄ = 1.0)
- c̄: clips the trace coefficients, so it controls how far corrections propagate backward and hence the variance and speed of convergence (typically c̄ = 1.0)

Relationship to PPO: V-trace truncates the IS ratio from above (min with c̄ or ρ̄) when forming value targets; PPO clamps the ratio to [1-ε, 1+ε] inside its surrogate policy objective. Both are approximate off-policy corrections that bound the influence of any single importance weight.
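
A minimal NumPy sketch of the V-trace targets defined above, computed over a whole trajectory (so n equals the remaining horizon) with the usual backward recursion; names and default clip values are illustrative, not lifted from the IMPALA codebase:

```python
import numpy as np

def vtrace_targets(rewards, values, bootstrap_value, pi_probs, b_probs,
                   gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """V-trace value targets v_t for a trajectory of length T.

    rewards[k] = R_{k+1}, values[k] = V(S_k), bootstrap_value = V(S_T),
    pi_probs/b_probs = probabilities of the executed actions under pi and b.
    All inputs are 1-D NumPy arrays (bootstrap_value is a scalar).
    """
    T = len(rewards)
    ratios = pi_probs / b_probs
    rhos = np.minimum(rho_bar, ratios)         # clipped ratio inside the TD error
    cs = np.minimum(c_bar, ratios)             # clipped trace coefficients
    values_next = np.append(values[1:], bootstrap_value)
    deltas = rhos * (rewards + gamma * values_next - values)

    v_targets = np.zeros(T)
    acc = 0.0
    # Backward recursion: (v_t - V(S_t)) = delta_t + gamma * c_t * (v_{t+1} - V(S_{t+1}))
    for t in reversed(range(T)):
        acc = deltas[t] + gamma * cs[t] * acc
        v_targets[t] = values[t] + acc
    return v_targets
```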


n-Step Return with Function Approximation

When V(s) is approximated by V(s; θ) (neural net, linear):

θ ← θ + α [G_t^{(n)} - V(S_t; θ)] ∇_θ V(S_t; θ)

The n-step return G_t^{(n)} uses V(S_{t+n}; θ) for the bootstrap value — semi-gradient update.

Why semi-gradient? The “target” V(S_{t+n}; θ) depends on θ, but we do not differentiate through it (a stop-gradient on the bootstrap term). This mirrors DQN’s use of a frozen target network and is what keeps TD-style training stable in practice.

Convergence: tabular n-step TD converges; with function approximation it depends on the setting (on-policy linear: converges to near the best linear approximation; off-policy or nonlinear: not guaranteed, but empirically it usually works).
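
A minimal sketch of the semi-gradient update above with a linear value function V(s; θ) = θᵀx(s); feature and array names are illustrative:

```python
import numpy as np

def semi_gradient_nstep_update(theta, features, rewards, gamma, alpha, n, t):
    """One semi-gradient n-step TD update of theta at time t, with V(s; theta) = theta @ x(s).

    features[k] = feature vector x(S_k), rewards[k] = R_{k+1}.
    The bootstrap value V(S_{t+n}; theta) is treated as a constant
    (no gradient flows through it), which is what "semi-gradient" means.
    """
    v = lambda k: theta @ features[k]                        # linear value estimate
    G = sum(gamma ** (k - t) * rewards[k] for k in range(t, t + n))
    G += gamma ** n * v(t + n)                               # bootstrap term, held fixed
    grad = features[t]                                       # gradient of V(S_t; theta) for linear V
    return theta + alpha * (G - v(t)) * grad

# Toy usage: 4 states with 2-D features, updating from t = 0 with n = 2.
theta = np.zeros(2)
features = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [0.0, 0.0]])
rewards = np.array([0.0, 1.0, 0.0])
theta = semi_gradient_nstep_update(theta, features, rewards, gamma=0.9, alpha=0.1, n=2, t=0)
```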


Comparison: n-Step Methods Summary Table

| Method | Update | Bias | Variance | Online? | Notes |
|---|---|---|---|---|---|
| TD(0) (n=1) | 1-step | High (bootstrap) | Low (1 sample) | Yes | Fast, simple |
| 2-step TD | 2-step | Medium | Low-medium | Yes | |
| n-step TD | n-step | Decreasing in n | Increasing in n | Yes, with n-step delay | Optimal n exists |
| TD(λ) (G_t^λ) | Weighted average of all n | Controlled by λ | Controlled by λ | Yes (traces) | Best of all worlds |
| MC (n=T) | Full episode | Zero | High | No | Unbiased |

Performance ordering (generally):

Well-tuned intermediate n ≈ well-tuned TD(λ) > either extreme (TD(0) or MC)  (for prediction accuracy)

But the optimal n and λ are problem-dependent; in practice, TD(λ) with λ=0.9-0.95 is the most robust choice.
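
To make the interpolation concrete, here is a small sketch that evaluates G_t^{(n)} for several n on one toy trajectory, with n=1 reproducing the TD(0) target and n=T the Monte Carlo return; all numbers are made up:

```python
def nstep_return(rewards, values, gamma, t, n):
    """G_t^{(n)} = sum_{k=0}^{n-1} gamma^k R_{t+k+1} + gamma^n V(S_{t+n}).

    rewards[k] = R_{k+1}, values[k] = V(S_k).  If t + n reaches the end of the
    episode, the bootstrap term is dropped and this is the Monte Carlo return.
    """
    T = len(rewards)
    n = min(n, T - t)
    G = sum(gamma ** k * rewards[t + k] for k in range(n))
    if t + n < T:
        G += gamma ** n * values[t + n]
    return G

rewards = [0.0, 0.0, 1.0]                     # toy 3-step episode ending in a terminal state
values = [0.2, 0.4, 0.6]                      # current estimates V(S_0), V(S_1), V(S_2)
for n in (1, 2, 3):                           # n=1 is the TD(0) target, n=3 is the MC return
    print(n, nstep_return(rewards, values, gamma=0.9, t=0, n=n))
```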


The n-Step TD Family Tree

TD(0) ──────────────────────────────── MC
  │                                     │
  │   n-step returns                    │
  │   G_t^{(n)} = Σ γ^k R_{t+k+1} + γ^n V(S_{t+n})
  │                                     │
  └──────── TD(λ) ─────────────────────┘
              │
              │  λ-weighted average of all n
              G_t^λ = (1-λ) Σ_{n=1}^∞ λ^{n-1} G_t^{(n)}
              │
              │  with function approximation
              └── GAE (Generalized Advantage Estimation)
                  Â_t^λ = Σ (γλ)^l δ_{t+l}  ← forward view
                       = eligibility traces ← backward view
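
A minimal sketch of the GAE estimator from the diagram, computed with the standard backward recursion Â_t = δ_t + γλ Â_{t+1}; array names are illustrative:

```python
import numpy as np

def gae_advantages(rewards, values, bootstrap_value, gamma=0.99, lam=0.95):
    """GAE forward view: A_t = sum_l (gamma*lam)^l * delta_{t+l}.

    rewards[k] = R_{k+1}, values[k] = V(S_k), bootstrap_value = V(S_T);
    rewards and values are 1-D NumPy arrays, bootstrap_value is a scalar.
    """
    values_next = np.append(values[1:], bootstrap_value)
    deltas = rewards + gamma * values_next - values          # one-step TD errors delta_t
    advantages = np.zeros_like(deltas)
    acc = 0.0
    for t in reversed(range(len(rewards))):                  # backward recursion A_t = delta_t + gamma*lam*A_{t+1}
        acc = deltas[t] + gamma * lam * acc
        advantages[t] = acc
    return advantages
```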

Chapter 7 Key Takeaways

  1. n-step returns interpolate TD (n=1) and MC (n=T) — optimal n exists
  2. λ-return (TD(λ)) takes weighted average of all n — typically optimal in practice
  3. GAE = λ-return advantage estimation — used in PPO, A2C, A3C
  4. Per-decision IS reduces variance of off-policy n-step methods
  5. V-trace is the production version for distributed off-policy RL (IMPALA)
  6. Bias-variance tradeoff: small n (low variance, bias from V estimate); large n (low bias, high variance)

Practical advice:
- If V is well-estimated: use small n or small λ (trust the bootstrap)
- If V is poorly estimated (early training): use large n or large λ (trust the returns)
- Default starting point: λ=0.95, γ=0.99 (PPO defaults)


Connection to DynamICCL

| Ch.7 Concept | DynamICCL Implementation |
|---|---|
| n-step return | GAE with λ=0.95, γ=0.99 |
| G_t^λ | Returns = advantages + V(S_t) for the value loss |
| Per-decision IS | PPO’s clipped ratio (approximate, bounded) |
| V-trace | Used if DynamICCL distributes collection across multiple GPU clusters |
| Optimal n | ~20 steps at λ=0.95; captures NCCL config effects over a multi-step window |
| Bootstrap value | V(S_{t+n}; θ) from PPO’s value network |
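
A minimal sketch (not DynamICCL's actual code) of the last two rows of the table: value-loss targets recovered as advantages + V(S_t), and PPO's clipped ratio standing in for an explicit IS correction; all names are illustrative:

```python
import numpy as np

def ppo_losses(advantages, rollout_values, new_values,
               old_log_probs, new_log_probs, clip_eps=0.2):
    """Clipped PPO surrogate plus a value loss with lambda-return targets.

    advantages        = GAE estimates A_t (e.g. from gae_advantages above)
    rollout_values    = V(S_t) recorded when the data was collected (constants)
    new_values        = V(S_t; theta) from the current value network
    old/new_log_probs = log-probabilities of the executed actions under the
                        rollout policy and the current policy (1-D arrays)
    """
    returns = advantages + rollout_values             # lambda-return targets for the value loss
    ratio = np.exp(new_log_probs - old_log_probs)     # pi / pi_old, a one-step IS ratio
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    policy_loss = -np.mean(np.minimum(ratio * advantages, clipped * advantages))
    value_loss = np.mean((returns - new_values) ** 2)
    return policy_loss, value_loss
```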