Off-Policy Methods with Function Approximation
Chapter 11 — Off-Policy Methods with Approximation
Book: Reinforcement Learning: An Introduction (Sutton & Barto, 2nd ed) Pages: 265–288
The Challenge of Off-Policy + Function Approximation
Off-policy learning (e.g., Q-learning) is powerful: it lets us learn about the target (e.g., optimal) policy while following a different, exploratory behavior policy. But combining it with function approximation creates a risk of instability.
The Deadly Triad (S&B §11.3): three factors that together cause divergence:
1. Function approximation (generalization across states)
2. Bootstrapping (TD-style targets built from current value estimates)
3. Off-policy learning (data distribution ≠ target policy distribution)
Any two are fine; all three together → potential divergence.
Baird's Counterexample (S&B §11.2): a 7-state MDP on which semi-gradient off-policy TD(0) with linear FA diverges to infinity. It is not a degenerate example; it illustrates a fundamental instability.
Semi-Gradient Q-Learning with FA
Update rule:
θ ← θ + α [R_{t+1} + γ max_{a'} Q(S_{t+1}, a'; θ) - Q(S_t, A_t; θ)] ∇_θ Q(S_t, A_t; θ)
Problem: this is not true gradient descent; the target max_{a'} Q(S_{t+1}, a'; θ) itself depends on θ, but the update ignores that dependence (no gradient flows through the target), hence "semi-gradient".
Baird's counterexample in detail: seven states and two actions (dashed and solid). The behavior policy takes the dashed action with probability 6/7 (jumping to one of the six upper states at random) and the solid action with probability 1/7 (jumping to the seventh state); the target policy always takes the solid action. Semi-gradient off-policy TD(0) diverges for any positive step size, and the same instability threatens semi-gradient Q-learning.
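To make the instability concrete, here is a minimal NumPy sketch of Baird's counterexample, assuming the standard setup from S&B Figure 11.1 (8 weights, zero rewards, γ = 0.99, α = 0.01, the usual (1,1,1,1,1,1,10,1) initialization); the weight norm grows without bound under semi-gradient off-policy TD(0):

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, alpha = 0.99, 0.01

# Feature vectors (S&B Fig. 11.1): upper state i has value 2*w[i] + w[7];
# the seventh (lower) state has value w[6] + 2*w[7].
X = np.zeros((7, 8))
for i in range(6):
    X[i, i] = 2.0
    X[i, 7] = 1.0
X[6, 6] = 1.0
X[6, 7] = 2.0

w = np.array([1., 1., 1., 1., 1., 1., 10., 1.])  # standard initialization
s = rng.integers(7)

for _ in range(1000):
    # Behavior policy: dashed with prob 6/7 (-> random upper state),
    # solid with prob 1/7 (-> seventh state). Target policy: always solid.
    if rng.random() < 6 / 7:
        s_next, rho = rng.integers(6), 0.0   # target policy never takes dashed
    else:
        s_next, rho = 6, 7.0                 # rho = 1 / (1/7)
    delta = 0.0 + gamma * X[s_next] @ w - X[s] @ w   # reward is always zero
    w += alpha * rho * delta * X[s]          # semi-gradient off-policy TD(0)
    s = s_next

print("||w|| after 1000 steps:", np.linalg.norm(w))  # keeps growing with more steps
```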
Gradient TD Methods: True Stability
Gradient TD methods perform true stochastic gradient descent on a modified objective.
Mean Squared Projected Bellman Error (MSPBE):
MSPBE(θ) = ||V_θ - Π T^π V_θ||²_μ
where Π is the projection onto the space of representable (linear) value functions, weighted by the state distribution μ, and T^π is the Bellman operator for the target policy π.
Key difference from VE: MSPBE measures how much the approximation violates the Bellman equation in the projected (function approximation) subspace.
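For linear FA the MSPBE also has a convenient closed form in terms of expectations, which is the quantity the gradient-TD methods below actually descend; following S&B §11.4, with expectations under the behavior distribution and ρ_t the importance sampling ratio:

```latex
\overline{\mathrm{PBE}}(\theta)
  = \mathbb{E}\!\left[\rho_t \delta_t \phi_t\right]^{\top}
    \mathbb{E}\!\left[\phi_t \phi_t^{\top}\right]^{-1}
    \mathbb{E}\!\left[\rho_t \delta_t \phi_t\right]
```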
GTD2 (Gradient TD, version 2), with ρ_t = π(A_t|S_t)/b(A_t|S_t) the importance sampling ratio:
δ_t = R_{t+1} + γ θ^T φ_{t+1} - θ^T φ_t (TD error)
w ← w + β ρ_t (δ_t - w^T φ_t) φ_t (secondary weight vector w)
θ ← θ + α ρ_t (φ_t - γ φ_{t+1}) (w^T φ_t) (main weight update)
TDC (TD with Correction):
δ_t = R_{t+1} + γ θ^T φ_{t+1} - θ^T φ_t
w ← w + β ρ_t (δ_t - w^T φ_t) φ_t
θ ← θ + α ρ_t (δ_t φ_t - γ φ_{t+1} (w^T φ_t)) (corrected semi-gradient update)
Properties of GTD2/TDC (a code sketch follows this list):
- True stochastic gradient descent on the MSPBE → convergence guaranteed with linear FA
- Requires a second weight vector w (double the memory of standard TD) and a second step size β
- Typically learns more slowly than semi-gradient TD in practice
- Converges even with function approximation + bootstrapping + off-policy data
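A minimal sketch of one TDC update with linear features, matching the notation above (the step sizes and the feature dimension in the usage loop are illustrative):

```python
import numpy as np

def tdc_step(theta, w, phi, phi_next, reward, rho, gamma=0.99,
             alpha=0.005, beta=0.05):
    """One TDC (gradient-TD) update with linear features.
    theta: value weights, w: secondary weights, phi/phi_next: feature
    vectors, rho: importance sampling ratio pi(a|s)/b(a|s)."""
    delta = reward + gamma * phi_next @ theta - phi @ theta          # TD error
    # Corrected update for the main weights (descends the MSPBE).
    theta = theta + alpha * rho * (delta * phi - gamma * phi_next * (w @ phi))
    # Secondary weights: LMS estimate of E[delta | phi].
    w = w + beta * rho * (delta - w @ phi) * phi
    return theta, w

# Illustrative usage on random features (hypothetical dimension d = 8).
d = 8
rng = np.random.default_rng(0)
theta, w = np.zeros(d), np.zeros(d)
for _ in range(100):
    phi, phi_next = rng.random(d), rng.random(d)
    theta, w = tdc_step(theta, w, phi, phi_next, reward=rng.random(), rho=1.0)
```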
Emphatic TD
Alternative stable off-policy method: reweight state distribution to match on-policy distribution.
Emphasis (followon trace, one-step case): M_t = γ ρ_{t-1} M_{t-1} + I_t, where I_t ≥ 0 is the interest in state S_t and M_{-1} = 0.
Emphatic TD(0):
M_t = γ ρ_{t-1} M_{t-1} + 1 (with uniform interest I_t = 1)
θ ← θ + α M_t ρ_t δ_t φ_t
where ρ_t = π(A_t|S_t)/b(A_t|S_t) is the IS ratio.
Emphatic TD multiplies each update by the emphasis M_t, effectively “restoring” the on-policy distribution weighting even when learning off-policy.
Convergence: the expected update converges for linear FA; in practice the emphasis weights can have very high variance, so learning is often slow.
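A minimal sketch of one Emphatic TD(0) update with linear features, following the equations above (the feature dimension and step size in the usage loop are illustrative):

```python
import numpy as np

def emphatic_td0_step(theta, M_prev, rho_prev, phi, phi_next, reward, rho,
                      interest=1.0, gamma=0.99, alpha=0.01):
    """One Emphatic TD(0) update. rho_prev is the IS ratio from the previous
    step; start an episode with M_prev = 0."""
    M = gamma * rho_prev * M_prev + interest           # emphasis (followon trace)
    delta = reward + gamma * phi_next @ theta - phi @ theta
    theta = theta + alpha * M * rho * delta * phi      # emphasis-weighted update
    return theta, M

# Illustrative usage (hypothetical feature dimension d = 4).
d = 4
rng = np.random.default_rng(0)
theta, M, rho_prev = np.zeros(d), 0.0, 1.0
for _ in range(100):
    phi, phi_next = rng.random(d), rng.random(d)
    rho = 1.0   # would be pi(A|S)/b(A|S) when learning off-policy
    theta, M = emphatic_td0_step(theta, M, rho_prev, phi, phi_next,
                                 reward=rng.random(), rho=rho)
    rho_prev = rho
```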
DQN: Pragmatic Solution to the Deadly Triad
DQN (Mnih et al., 2015) doesn’t fix the deadly triad theoretically — it works around it with engineering:
Experience Replay
Replay buffer D = {(s,a,r,s')} (capacity N = 10^6)
At each step:
Store (s_t, a_t, r_t, s_{t+1}) in D
Sample random minibatch {(s_j, a_j, r_j, s'_j)} from D
Compute target: y_j = r_j + γ max_a Q(s'_j, a; θ^-)
Update: θ ← θ + α (y_j - Q(s_j, a_j; θ)) ∇_θ Q(s_j, a_j; θ)
Benefits:
- Breaks temporal correlations (consecutive experiences are highly correlated → unstable training)
- Each experience can be reused multiple times → sample efficient
- Random minibatches ≈ i.i.d. samples → closer to standard SGD assumptions
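A minimal sketch of the replay mechanism itself (class and method names are illustrative, not from any particular library); the Q-network update then consumes the sampled minibatch exactly as in the pseudocode above:

```python
import random
from collections import deque

class ReplayBuffer:
    """Uniform experience replay, DQN-style (a sketch)."""
    def __init__(self, capacity=1_000_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are evicted

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        # Uniform random minibatch: approximately i.i.d., breaking the
        # temporal correlation between consecutive transitions.
        batch = random.sample(self.buffer, batch_size)
        s, a, r, s_next, done = zip(*batch)
        return s, a, r, s_next, done

    def __len__(self):
        return len(self.buffer)
```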
Target Network
θ^-: target parameters (frozen)
Every C = 10,000 steps: θ^- ← θ (hard update)
Or soft update (Polyak averaging):
θ^- ← τθ + (1-τ)θ^- (τ = 0.005 typically)
Benefits: stabilizes the bootstrap target; θ^- changes slowly → TD target is more stationary.
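A minimal sketch of both target-update schemes, with parameters represented as dicts of NumPy arrays (names and shapes are illustrative):

```python
import numpy as np

def hard_update(target_params, online_params):
    """theta^- <- theta, applied every C steps."""
    for k in target_params:
        target_params[k] = online_params[k].copy()

def soft_update(target_params, online_params, tau=0.005):
    """Polyak averaging: theta^- <- tau * theta + (1 - tau) * theta^-."""
    for k in target_params:
        target_params[k] = tau * online_params[k] + (1 - tau) * target_params[k]

# Illustrative usage with a single weight matrix.
online = {"W": np.ones((2, 2))}
target = {"W": np.zeros((2, 2))}
soft_update(target, online)   # target["W"] moves slightly toward online["W"]
```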
Double DQN
Standard DQN target: R + γ max_a Q(S’, a; θ^-)
Problem: max over Q(S’, a; θ^-) is biased high (maximization bias from Ch.6).
Double DQN target:
y = R + γ Q(S', argmax_a Q(S', a; θ); θ^-)
Use current network θ to select action, target network θ^- to evaluate.
This decouples action selection from action evaluation, which reduces (though does not fully eliminate) the maximization bias.
Empirical result (van Hasselt et al., 2016): Double DQN outperforms DQN on 41/49 Atari games, especially those where overestimation causes instability.
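A minimal sketch of computing Double DQN targets on a batch, assuming NumPy arrays of Q-values from the online and target networks (function and argument names are illustrative):

```python
import numpy as np

def double_dqn_targets(r, q_next_online, q_next_target, done, gamma=0.99):
    """Select actions with the online network (theta), evaluate them with the
    target network (theta^-). q_next_* have shape [batch, num_actions]."""
    best_actions = np.argmax(q_next_online, axis=1)              # selection
    evaluated = q_next_target[np.arange(len(r)), best_actions]   # evaluation
    return r + gamma * (1.0 - done) * evaluated

# Illustrative usage: batch of 4 transitions, 3 actions.
rng = np.random.default_rng(0)
y = double_dqn_targets(r=rng.random(4),
                       q_next_online=rng.random((4, 3)),
                       q_next_target=rng.random((4, 3)),
                       done=np.zeros(4))
```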
Dueling DQN
Decompose Q into value V and advantage A:
Q(s, a; θ) = V(s; θ_V) + A(s, a; θ_A) - mean_a' A(s, a'; θ_A)
(Subtracting the mean advantage ensures identifiability: the centered advantages sum to zero.)
Benefits:
- The value stream V(s) receives learning signal on every update, not only for the chosen action
- Better for tasks where many actions have similar Q-values
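A minimal sketch of the dueling aggregation applied to the outputs of the two streams (the network producing V and A is omitted; shapes are illustrative):

```python
import numpy as np

def dueling_q(value, advantage):
    """Q(s, a) = V(s) + A(s, a) - mean_a' A(s, a').
    value: [batch, 1], advantage: [batch, num_actions]."""
    return value + advantage - advantage.mean(axis=1, keepdims=True)

# Illustrative usage: 2 states, 3 actions.
V = np.array([[1.0], [2.0]])
A = np.array([[0.5, 0.0, -0.5], [1.0, -1.0, 0.0]])
Q = dueling_q(V, A)   # per row, the centered advantages sum to zero
```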
Prioritized Experience Replay (PER)
Insight: replay transitions with large TD error more frequently.
Priority: p_i = |δ_i| + ε (TD error magnitude)
Sample probability: P(i) ∝ p_i^α (α=0.6 typical)
IS correction: w_i = (1 / (N P(i)))^β (corrects for the non-uniform sampling; β is annealed toward 1 over training)
Benefits: more learning signal per sample; faster convergence.
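A minimal sketch of proportional PER (a flat-list version for clarity; a production implementation would use a sum-tree for O(log N) sampling; class and method names are illustrative):

```python
import numpy as np

class PrioritizedReplay:
    """Proportional prioritized replay (sketch, O(N) sampling)."""
    def __init__(self, capacity=100_000, alpha=0.6, eps=1e-6):
        self.capacity, self.alpha, self.eps = capacity, alpha, eps
        self.data, self.priorities = [], []

    def add(self, transition, td_error):
        if len(self.data) >= self.capacity:          # evict the oldest entry
            self.data.pop(0); self.priorities.pop(0)
        self.data.append(transition)
        self.priorities.append((abs(td_error) + self.eps) ** self.alpha)

    def sample(self, batch_size, beta=0.4):
        p = np.asarray(self.priorities)
        probs = p / p.sum()                                  # P(i) ∝ p_i^alpha
        idx = np.random.choice(len(self.data), batch_size, p=probs)
        weights = (len(self.data) * probs[idx]) ** (-beta)   # IS correction
        weights /= weights.max()                             # normalize for stability
        return [self.data[i] for i in idx], idx, weights

    def update_priorities(self, idx, td_errors):
        for i, d in zip(idx, td_errors):
            self.priorities[i] = (abs(d) + self.eps) ** self.alpha
```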
Chapter 11 Summary
- Deadly triad: FA + bootstrapping + off-policy → divergence risk
- Gradient TD (GTD2, TDC): true gradient descent → stable, but complex and slower
- Emphatic TD: re-weights samples to approximate on-policy distribution
- DQN engineering tricks: experience replay + target network avoid divergence in practice
- Double DQN: fixes maximization bias → better performance
- Dueling DQN: decomposes Q into V + A → better credit assignment
- PER: prioritized experience replay → more efficient use of data
Connection to DynamICCL
| Ch.11 Concept | DynamICCL Relevance |
|---|---|
| Deadly triad | PPO is on-policy → avoids deadly triad by design |
| DQN + replay buffer | DQN-based NCCL optimizer alternative to PPO |
| Target network | Stabilizes Q-target for NCCL Q-learning |
| Double DQN | NCCL throughput measurements are noisy, so maximization bias is a real risk |
| PER | Replay NCCL experiences with large TD error more often |
| Off-policy advantage | Could reuse old NCCL experience from past training runs |
DynamICCL uses PPO (on-policy), which avoids the deadly triad. If switching to an off-policy method such as DQN or SAC, the engineering techniques from Ch.11 (replay, target network, double Q, PER) become critical for stable training.