Chapter 11 — Off-Policy Methods with Approximation

Book: Reinforcement Learning: An Introduction (Sutton & Barto, 2nd ed.), pages 265–288


The Challenge of Off-Policy + Function Approximation

Off-policy learning (e.g., Q-learning) is powerful: learn about the optimal policy while following an exploratory behavior policy. But combining it with function approximation creates instability.

The Deadly Triad (S&B §11.3): three factors that together cause divergence:

  1. Function approximation (generalization across states)
  2. Bootstrapping (TD-style target uses estimated V)
  3. Off-policy learning (data distribution ≠ target policy distribution)

Any two are fine; all three together → potential divergence.

Baird’s Counterexample (S&B §11.2): a 7-state MDP on which semi-gradient off-policy TD with linear FA diverges to infinity (semi-gradient Q-learning and even semi-gradient DP show the same instability). Not a degenerate example; it illustrates a fundamental instability.


Semi-Gradient Q-Learning with FA

Update rule:

θ ← θ + α [R_{t+1} + γ max_{a'} Q(S_{t+1}, a'; θ) - Q(S_t, A_t; θ)] ∇_θ Q(S_t, A_t; θ)

Problem: this is not true gradient descent. The target R_{t+1} + γ max_{a'} Q(S_{t+1}, a'; θ) depends on θ, but the semi-gradient update differentiates only through Q(S_t, A_t; θ) and treats the target as a constant.
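
A minimal sketch of this update with linear features, where Q(s, a; θ) = θ·φ(s, a). The feature function phi, the transition variables, and the step sizes below are illustrative placeholders, not from the book:

import numpy as np

def semi_gradient_q_update(theta, phi, s, a, r, s_next, actions,
                           alpha=0.01, gamma=0.99):
    """One semi-gradient Q-learning step; for linear FA, grad_theta Q = phi(s, a)."""
    q_sa = theta @ phi(s, a)
    # The max over a' is treated as a constant: its dependence on theta is
    # ignored, which is exactly what makes this a *semi*-gradient method.
    # (For a terminal s_next this term should be 0; omitted for brevity.)
    q_next = max(theta @ phi(s_next, a2) for a2 in actions)
    td_error = r + gamma * q_next - q_sa
    return theta + alpha * td_error * phi(s, a)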

Baird’s counterexample: 7 states (6 upper plus 1 lower), all rewards zero, γ = 0.99. The behavior policy visits all states with equal frequency, while the target policy always takes the action leading to the lower state; semi-gradient off-policy TD diverges for any positive step size.
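
A minimal numpy simulation of this divergence, following the setup of S&B Figure 11.1 (8 weights for 7 states, θ_7 initialized to 10, γ = 0.99, α = 0.01); variable names are mine:

import numpy as np

# Feature matrix from S&B Figure 11.1: rows are states, columns are weights.
X = np.zeros((7, 8))
for i in range(6):          # six upper states: x(s_i) = 2*e_i + e_8
    X[i, i] = 2.0
    X[i, 7] = 1.0
X[6, 6] = 1.0               # lower state: x(s_7) = e_7 + 2*e_8
X[6, 7] = 2.0

gamma, alpha = 0.99, 0.01
theta = np.array([1, 1, 1, 1, 1, 1, 10, 1], dtype=float)
rng = np.random.default_rng(0)

s = int(rng.integers(7))
for step in range(1000):
    dashed = rng.random() < 6 / 7            # behavior policy: dashed w.p. 6/7
    s_next = int(rng.integers(6)) if dashed else 6
    rho = 0.0 if dashed else 7.0             # target policy always takes solid
    delta = gamma * X[s_next] @ theta - X[s] @ theta   # all rewards are zero
    theta += alpha * rho * delta * X[s]      # semi-gradient off-policy TD(0)
    s = s_next

print(np.linalg.norm(theta))   # grows with the number of steps: divergence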


Gradient TD Methods: True Stability

Gradient TD methods perform true stochastic gradient descent on a modified objective.

Mean Squared Projected Bellman Error (MSPBE):

MSPBE(θ) = ||V_θ - Π T^π V_θ||²_μ

where Π is the projection onto the space of representable (here, linear) value functions, weighted by the state distribution μ, and T^π is the Bellman operator for the target policy.

Key difference from V̄E (the mean squared value error): MSPBE measures how far the approximation is from satisfying the Bellman equation after the Bellman target is projected back into the representable subspace. For linear FA its minimum is exactly zero, attained at the TD fixed point.
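
For linear FA the MSPBE also has a closed form (Sutton, Szepesvári & Maei, 2009; S&B §11.7):

MSPBE(θ) = E[ρ_t δ_t φ_t]^T E[φ_t φ_t^T]^{-1} E[ρ_t δ_t φ_t]

The inverse matrix is why naive sampling fails: a product of three expectations cannot be estimated from a single transition. Gradient TD methods work around this by learning a second weight vector w ≈ E[φφ^T]^{-1} E[ρδφ] on a faster timescale, and this is exactly the w that appears in GTD2 and TDC below.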

GTD2 (Gradient TD, version 2):

δ_t = R_{t+1} + γθ^T φ_{t+1} - θ^T φ_t    ← TD error
w ← w + β_w (δ_t - w^T φ_t) φ_t           ← second weight vector w
θ ← θ + α_θ (φ_t - γ φ_{t+1}) w^T φ_t    ← gradient update

TDC (TD with Correction):

δ_t = R_{t+1} + γθ^T φ_{t+1} - θ^T φ_t
w ← w + β(δ_t - w^T φ_t) φ_t
θ ← θ + α(δ_t φ_t - γ φ_{t+1} w^T φ_t)  ← corrected update

Properties of GTD2/TDC:
- True stochastic gradient descent on the MSPBE → convergence guaranteed
- Require a second weight vector w (double the memory of standard TD) and a second step size
- Typically learn more slowly than semi-gradient TD when the latter is stable, and cost extra computation per step
- Converge even with function approximation + bootstrapping + off-policy learning (the full deadly triad)
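
A sketch of one TDC step with linear features, in numpy. The book's off-policy form multiplies both updates by ρ_t, which is included here; the step sizes are illustrative and the function name is mine:

import numpy as np

def tdc_update(theta, w, phi_t, phi_next, r, rho,
               alpha=0.005, beta=0.05, gamma=0.99):
    """One off-policy TDC step. w chases E[phi phi^T]^{-1} E[rho delta phi]
    via LMS, so beta should exceed alpha (two-timescale updates)."""
    delta = r + gamma * theta @ phi_next - theta @ phi_t
    # Semi-gradient term plus a correction that restores the missing
    # part of the true MSPBE gradient.
    theta = theta + alpha * rho * (delta * phi_t - gamma * phi_next * (w @ phi_t))
    w = w + beta * rho * (delta - w @ phi_t) * phi_t
    return theta, w

GTD2 differs only in the θ update, which becomes θ + α ρ (φ_t - γ φ_{t+1}) (w^T φ_t).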


Emphatic TD

Alternative stable off-policy method: reweight state distribution to match on-policy distribution.

Emphasis (followon trace): M_t = γ ρ_{t-1} M_{t-1} + I_t, where I_t is the interest in state S_t (how much we care about accuracy there) and ρ is the IS ratio.

Emphatic TD(0):

M_t = γ ρ_{t-1} M_{t-1} + 1        (with uniform interest I_t = 1)
δ_t = R_{t+1} + γθ^T φ_{t+1} - θ^T φ_t
θ ← θ + α M_t ρ_t δ_t φ_t

where ρ_t = π(A_t|S_t)/b(A_t|S_t) is the IS ratio.

Emphatic TD multiplies each update by the emphasis M_t, effectively “restoring” the on-policy distribution weighting even when learning off-policy.

Convergence: stability has been proven for linear FA, but the emphasis trace can have high variance, so emphatic TD is often slower than standard TD in practice.
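
A sketch of one Emphatic TD(0) step under these equations; the caller threads the emphasis M and the previous step's IS ratio through time, and the names and step size are mine:

import numpy as np

def emphatic_td0_update(theta, M_prev, rho_prev, phi_t, phi_next, r, rho,
                        interest=1.0, alpha=0.005, gamma=0.99):
    """One Emphatic TD(0) step with linear features (S&B §11.8)."""
    M = gamma * rho_prev * M_prev + interest   # emphasis (followon trace)
    delta = r + gamma * theta @ phi_next - theta @ phi_t
    theta = theta + alpha * M * rho * delta * phi_t
    return theta, M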


DQN: Pragmatic Solution to the Deadly Triad

DQN (Mnih et al., 2015) doesn’t fix the deadly triad theoretically — it works around it with engineering:

Experience Replay

Replay buffer D = {(s,a,r,s')}  (capacity N = 10^6)

At each step:
  Store (s_t, a_t, r_t, s_{t+1}) in D
  Sample a random minibatch {(s_j, a_j, r_j, s'_j)} from D
  Compute target: y_j = r_j                             if s'_j is terminal
                  y_j = r_j + γ max_a Q(s'_j, a; θ^-)   otherwise
  Update: θ ← θ + α (y_j - Q(s_j, a_j; θ)) ∇_θ Q(s_j, a_j; θ), averaged over the minibatch

Benefits:
- Breaks temporal correlations (consecutive experiences are highly correlated → unstable training)
- Each experience is used multiple times → sample efficient
- A random minibatch ≈ i.i.d. samples → closer to standard SGD assumptions
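
A minimal replay buffer sketch in Python; the class and method names are mine, not from the DQN paper:

import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity FIFO store of (s, a, r, s', done) transitions."""
    def __init__(self, capacity=1_000_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions fall out

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniform sampling breaks the temporal correlation between
        # consecutive transitions.
        return random.sample(self.buffer, batch_size)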

Target Network

θ^-: target parameters (frozen)
Every C = 10,000 steps: θ^- ← θ    ← hard update

Or soft update (Polyak averaging):

θ^- ← τθ + (1-τ)θ^-    (τ = 0.005 typically)

Benefits: stabilizes the bootstrap target; θ^- changes slowly → TD target is more stationary.
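
PyTorch sketches of both update styles, assuming online and target are networks of identical architecture; the function names are mine:

import torch.nn as nn

def hard_update(target: nn.Module, online: nn.Module) -> None:
    """Copy the online weights into the target network (done every C steps)."""
    target.load_state_dict(online.state_dict())

def soft_update(target: nn.Module, online: nn.Module, tau: float = 0.005) -> None:
    """Polyak averaging: theta^- <- tau*theta + (1-tau)*theta^-, every step."""
    for tp, op in zip(target.parameters(), online.parameters()):
        tp.data.mul_(1.0 - tau).add_(tau * op.data)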


Double DQN

Standard DQN target: y = R + γ max_a Q(S', a; θ^-)

Problem: the max over Q(S', a; θ^-) is biased high (maximization bias from Ch. 6).

Double DQN target:

y = R + γ Q(S', argmax_a Q(S', a; θ); θ^-)

Use current network θ to select action, target network θ^- to evaluate.

This decorrelates selection and evaluation → removes maximization bias.
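
A PyTorch sketch of this target for a batch of transitions; q_online and q_target are assumed to map a state batch to per-action values of shape [B, |A|], and r and done are float tensors of shape [B]:

import torch

@torch.no_grad()
def double_dqn_target(q_online, q_target, r, s_next, done, gamma=0.99):
    """y = r + gamma * Q(s', argmax_a Q(s', a; theta); theta^-), batch-wise."""
    best_a = q_online(s_next).argmax(dim=1, keepdim=True)   # select with theta
    q_next = q_target(s_next).gather(1, best_a).squeeze(1)  # evaluate with theta^-
    return r + gamma * (1.0 - done) * q_next                # no bootstrap at terminals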

Empirical result (van Hasselt et al., 2016): Double DQN outperforms DQN on 41/49 Atari games, especially those where overestimation causes instability.


Dueling DQN

Decompose Q into value V and advantage A:

Q(s, a; θ) = V(s; θ_V) + A(s, a; θ_A) - mean_a' A(s, a'; θ_A)

(Subtracting the mean advantage ensures identifiability: the advantages sum to zero across actions, so V and A are uniquely determined.)

Benefits:
- V(s) is updated for every action (not just the chosen one)
- Better for tasks where many actions have similar Q-values
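
A PyTorch sketch of the dueling head on top of a shared feature trunk; layer shapes and names are placeholders:

import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    """Q(s,a) = V(s) + A(s,a) - mean_a' A(s,a') over shared features."""
    def __init__(self, feat_dim: int, n_actions: int):
        super().__init__()
        self.value = nn.Linear(feat_dim, 1)              # V stream
        self.advantage = nn.Linear(feat_dim, n_actions)  # A stream

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        v = self.value(features)                     # [B, 1]
        a = self.advantage(features)                 # [B, |A|]
        return v + a - a.mean(dim=1, keepdim=True)   # broadcasts V over actions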


Prioritized Experience Replay (PER)

Insight: replay transitions with large TD error more frequently.

Priority: p_i = |δ_i| + ε            (TD error magnitude; ε keeps every transition sampleable)
Sample probability: P(i) ∝ p_i^α     (α = 0.6 typical)
IS correction: w_i = (1 / (N P(i)))^β   (corrects the sampling bias; β is annealed toward 1)

Benefits: more learning signal per sample; faster convergence.
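
A numpy sketch of the sampling and IS-weight computation; Schaul et al. (2016) anneal β toward 1 over training and normalize the weights by their maximum, as done here (the function name is mine):

import numpy as np

def per_sample(priorities, batch_size, alpha=0.6, beta=0.4, rng=None):
    """Sample indices with P(i) proportional to p_i^alpha and return the
    IS weights w_i = (N * P(i))^(-beta) that correct the resulting bias."""
    if rng is None:
        rng = np.random.default_rng()
    n = len(priorities)
    probs = np.asarray(priorities, dtype=float) ** alpha
    probs /= probs.sum()
    idx = rng.choice(n, size=batch_size, p=probs)
    weights = (n * probs[idx]) ** (-beta)
    weights /= weights.max()   # normalize so update scales stay <= 1
    return idx, weights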


Chapter 11 Summary

  1. Deadly triad: FA + bootstrapping + off-policy → divergence risk
  2. Gradient TD (GTD2, TDC): true gradient descent → stable, but complex and slower
  3. Emphatic TD: re-weights samples to approximate on-policy distribution
  4. DQN engineering tricks: experience replay + target network avoid divergence in practice
  5. Double DQN: fixes maximization bias → better performance
  6. Dueling DQN: decomposes Q into V + A → better credit assignment
  7. PER: prioritized experience replay → more efficient use of data

Connection to DynamICCL

Ch.11 Concept        | DynamICCL Relevance
---------------------|-------------------------------------------------------------------
Deadly triad         | PPO is on-policy → avoids the deadly triad by design
DQN + replay buffer  | A DQN-based NCCL optimizer is an alternative to PPO
Target network       | Stabilizes the Q-target for NCCL Q-learning
Double DQN           | Important: NCCL throughput measurements are noisy → maximization bias is real
PER                  | Replay NCCL experiences with large TD error more often
Off-policy advantage | Could reuse old NCCL experience from past training runs

DynamICCL uses PPO (on-policy), which avoids the deadly triad by design. If it switched to DQN or SAC (off-policy), the engineering techniques from Ch. 11 (replay, target network, double Q, PER) would become critical for stable training.