Per-Decision Methods and Chapter 7 Summary
Chapter 7 — n-Step Bootstrapping
Book: Reinforcement Learning: An Introduction (Sutton & Barto, 2nd ed) Pages: 148–158
Per-Decision Importance Sampling
Problem with standard n-step off-policy IS:
For behavior policy b and target policy π, the n-step IS ratio:
ρ_{t:t+n-1} = Π_{k=t}^{t+n-1} π(A_k|S_k) / b(A_k|S_k)
Applying this single ratio to the entire n-step return produces high variance; the variance of the product grows exponentially with n.
Per-decision IS: apply IS ratio to each reward separately:
G_{t|π}^{(n)} = Σ_{k=t}^{t+n-1} γ^{k-t} ρ_{t:k} R_{k+1} + γ^n ρ_{t:t+n-1} V(S_{t+n})
where ρ_{t:k} = Π_{j=t}^{k} π(A_j|S_j)/b(A_j|S_j)
This has lower variance than applying a single IS ratio to the whole return, while remaining unbiased.
Practical implementation: used in V-trace (IMPALA) and ACER for distributed RL systems where workers collect data under different policy versions.
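A minimal numpy sketch of the per-decision return above, next to the single-ratio version it replaces (function and variable names are illustrative, not from any particular codebase):

```python
import numpy as np

def per_decision_is_return(rewards, ratios, v_bootstrap, gamma):
    """Per-decision IS n-step return (lower variance, still unbiased).

    rewards[k]  = R_{t+k+1}                                  for k = 0..n-1
    ratios[k]   = pi(A_{t+k}|S_{t+k}) / b(A_{t+k}|S_{t+k})
    v_bootstrap = V(S_{t+n})
    """
    n = len(rewards)
    g = 0.0
    rho = 1.0                                   # running product rho_{t:t+k}
    for k in range(n):
        rho *= ratios[k]                        # now rho = rho_{t:t+k}
        g += (gamma ** k) * rho * rewards[k]    # each reward gets only its own prefix of ratios
    g += (gamma ** n) * rho * v_bootstrap       # bootstrap weighted by rho_{t:t+n-1}
    return g

def single_ratio_is_return(rewards, ratios, v_bootstrap, gamma):
    """Ordinary IS: one full ratio multiplies the whole return (variance grows with n)."""
    n = len(rewards)
    g = sum((gamma ** k) * r for k, r in enumerate(rewards)) + (gamma ** n) * v_bootstrap
    return float(np.prod(ratios)) * g
```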
V-trace: Production Off-Policy n-Step Return
V-trace (Espeholt et al., 2018) — used in IMPALA (distributed actor-learner):
V-trace target for s_t:
v_s = V(s_t) + Σ_{k=t}^{t+n-1} γ^{k-t} (Π_{j=t}^{k-1} c_j) δ_k
where:
δ_k = ρ_k [R_{k+1} + γV(s_{k+1}) - V(s_k)] ← IS-corrected TD error
ρ_k = min(ρ̄, π(A_k|S_k)/b(A_k|S_k)) ← clipped IS ratio
c_j = min(c̄, π(A_j|S_j)/b(A_j|S_j)) ← trace coefficient
Two IS clips:
- ρ̄: clips the ratio inside the TD error, controlling which value function the targets converge to (typically ρ̄ = 1.0)
- c̄: clips the ratios in the trace product, controlling how far credit propagates back and the variance of the correction (typically c̄ = 1.0)
Relationship to PPO: V-trace truncates the IS ratio from above (min with c̄ or ρ̄); PPO clips the ratio to the interval [1-ε, 1+ε] inside its surrogate objective. Both are approximate, bounded off-policy corrections.
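A sketch of the V-trace target computation for one length-n trajectory, assuming value estimates and behavior/target action log-probabilities are already available (array names are assumptions, not the IMPALA implementation):

```python
import numpy as np

def vtrace_targets(rewards, values, bootstrap_value, log_rhos,
                   gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """V-trace targets v_s for every state in a length-n trajectory.

    rewards[k]      = R_{k+1}
    values[k]       = V(s_k) under the current parameters
    bootstrap_value = V(s_n)
    log_rhos[k]     = log( pi(A_k|s_k) / b(A_k|s_k) )
    """
    rewards = np.asarray(rewards, dtype=float)
    log_rhos = np.asarray(log_rhos, dtype=float)
    n = len(rewards)

    rhos = np.minimum(rho_bar, np.exp(log_rhos))     # clipped IS ratio rho_k
    cs = np.minimum(c_bar, np.exp(log_rhos))         # trace coefficient c_k
    values_ext = np.append(np.asarray(values, dtype=float), bootstrap_value)
    deltas = rhos * (rewards + gamma * values_ext[1:] - values_ext[:-1])  # IS-corrected TD errors

    # Backward recursion: v_s - V(s_s) = delta_s + gamma * c_s * (v_{s+1} - V(s_{s+1}))
    vs_minus_v = np.zeros(n + 1)
    for s in reversed(range(n)):
        vs_minus_v[s] = deltas[s] + gamma * cs[s] * vs_minus_v[s + 1]
    return values_ext[:-1] + vs_minus_v[:-1]
```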
n-Step Return with Function Approximation
When V(s) is approximated by V(s; θ) (neural net, linear):
θ ← θ + α [G_t^{(n)} - V(S_t; θ)] ∇_θ V(S_t; θ)
The n-step return G_t^{(n)} uses V(S_{t+n}; θ) for the bootstrap value — semi-gradient update.
Why semi-gradient? The target V(S_{t+n}; θ) depends on θ, but we do not differentiate through it; the bootstrap value is treated as a fixed target. This is analogous to DQN's target network, which also holds the bootstrap target fixed for stability.
Convergence: tabular n-step TD converges; with function approximation it depends on the setting (on-policy linear: converges; nonlinear or off-policy: not guaranteed, though it often works in practice).
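A minimal sketch of one semi-gradient n-step TD update with a linear value function V(s; θ) = θᵀx(s); the feature vectors and episode bookkeeping are assumed to come from elsewhere:

```python
import numpy as np

def semi_gradient_nstep_td_update(theta, x_t, rewards, x_bootstrap, done,
                                  gamma, alpha):
    """One semi-gradient update of a linear value function V(s; theta) = theta @ x(s).

    x_t         = feature vector x(S_t)
    rewards     = [R_{t+1}, ..., R_{t+n}]
    x_bootstrap = feature vector x(S_{t+n}); ignored if the episode ended in the window
    """
    n = len(rewards)
    g = sum((gamma ** k) * r for k, r in enumerate(rewards))   # discounted reward sum
    if not done:
        # Bootstrap term: it depends on theta, but we do NOT differentiate through it.
        g += (gamma ** n) * float(theta @ x_bootstrap)
    td_error = g - float(theta @ x_t)
    # For a linear approximator, grad_theta V(S_t; theta) is just x(S_t).
    return theta + alpha * td_error * x_t
```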
Comparison: n-Step Methods Summary Table
| Method | Update | Bias | Variance | Online? | Notes |
|---|---|---|---|---|---|
| TD(0) (n=1) | 1-step | High (bootstrap) | Low (1 sample) | Yes | Fast, simple |
| 2-step TD | 2-step | Medium | Low-medium | Yes | |
| n-step TD | n-step | Decreasing | Increasing | n-delay | Optimal n exists |
| G_t^λ (TD(λ)) | Weighted all n | Controlled by λ | Controlled by λ | Yes (traces) | Best of all worlds |
| MC (n=T) | Full episode | Zero | High | No | Unbiased |
Performance ordering (generally):
An intermediate n, or TD(λ) with an intermediate λ, outperforms both extremes (TD(0) and MC) for prediction accuracy; how TD(0) and MC rank against each other is problem-dependent.
The best n and λ are also problem-dependent; in practice, TD(λ) with λ = 0.9-0.95 is the most robust default.
The n-Step TD Family Tree
TD(0) ──────────────────────────────── MC
  │                                     │
  │   n-step returns                    │
  │   G_t^{(n)} = Σ_{k=0}^{n-1} γ^k R_{t+k+1} + γ^n V(S_{t+n})
  │                                     │
  └──────── TD(λ) ─────────────────────┘
                │
                │  λ-weighted average over all n
                │  G_t^λ = (1-λ) Σ_{n=1}^∞ λ^{n-1} G_t^{(n)}
                │
                │  with function approximation
                └── GAE (Generalized Advantage Estimation)
                      Â_t^λ = Σ_{l=0}^∞ (γλ)^l δ_{t+l}   ← forward view
                            = eligibility traces         ← backward view
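A sketch of the GAE computation at the bottom of the tree, using the standard backward recursion Â_t = δ_t + γλ Â_{t+1} (array layout and names are assumptions):

```python
import numpy as np

def gae(rewards, values, bootstrap_value, dones, gamma=0.99, lam=0.95):
    """Forward-view GAE, A_hat_t = sum_l (gamma*lam)^l * delta_{t+l}, via backward recursion.

    rewards[t], values[t] = V(S_t), dones[t] = True if the episode ended after step t,
    bootstrap_value = V(S_T) used to bootstrap the final partial trajectory.
    """
    T = len(rewards)
    values_ext = np.append(np.asarray(values, dtype=float), bootstrap_value)
    advantages = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - float(dones[t])
        delta = rewards[t] + gamma * values_ext[t + 1] * nonterminal - values_ext[t]
        running = delta + gamma * lam * nonterminal * running
        advantages[t] = running
    returns = advantages + values_ext[:-1]   # lambda-returns, used as value-loss targets
    return advantages, returns
```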
Chapter 7 Key Takeaways
- n-step returns interpolate between TD(0) (n=1) and MC (n=T); an optimal intermediate n typically exists
- λ-return (TD(λ)) takes weighted average of all n — typically optimal in practice
- GAE = λ-return advantage estimation — used in PPO, A2C, A3C
- Per-decision IS reduces variance of off-policy n-step methods
- V-trace is the production version for distributed off-policy RL (IMPALA)
- Bias-variance tradeoff: small n (low variance, bias from V estimate); large n (low bias, high variance)
Practical advice:
- If V is well-estimated: use small n or small λ (trust the bootstrap)
- If V is poorly estimated (early training): use large n or large λ (trust the returns)
- Default starting point: λ = 0.95, γ = 0.99 (PPO defaults); see the sketch below for what this λ means as an effective horizon
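A quick check of what a given λ means in practice: the λ-return weights each n-step return by (1-λ)λ^(n-1), so the effective horizon is roughly 1/(1-λ) steps (illustrative snippet):

```python
import numpy as np

def lambda_return_weights(lam, max_n=500):
    """Weight (1 - lam) * lam**(n-1) that the lambda-return places on the n-step return."""
    n = np.arange(1, max_n + 1)
    return (1 - lam) * lam ** (n - 1)

for lam in (0.5, 0.9, 0.95):
    w = lambda_return_weights(lam)
    # First n at which half of the total weight has accumulated.
    half_n = int(np.searchsorted(np.cumsum(w), 0.5)) + 1
    print(f"lambda={lam}: half of the weight sits on the first {half_n} n-step returns "
          f"(effective horizon ~ {1 / (1 - lam):.0f} steps)")
```

At λ = 0.95 the effective horizon is about 20 steps, which matches the "~20 steps at λ=0.95" row in the table below.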
Connection to DynamICCL
| Ch.7 Concept | DynamICCL Implementation |
|---|---|
| n-step return | GAE with λ=0.95, γ=0.99 |
| G_t^λ | Returns = advantages + V(S_t) for value loss |
| Per-decision IS | PPO’s clipped ratio (approximate, bounded) |
| V-trace | Used if DynamICCL distributes collection across multiple GPU clusters |
| Optimal n | ~20 steps at λ=0.95 — captures NCCL config effects over multi-step window |
| Bootstrap value | V(S_{t+n}; θ) from PPO’s value network |