Linear TD Convergence and On-Policy Approximation Theory
Chapter 9 — On-Policy Prediction with Approximation
Book: Reinforcement Learning: An Introduction (Sutton & Barto, 2nd ed) Pages: 208–228
The Linear TD Fixed Point
Semi-gradient TD(0) with linear FA converges to a unique fixed point θ_TD that satisfies:
A θ_TD = b
where:
A = E_μ[φ(S_t)(φ(S_t) - γφ(S_{t+1}))^T]
b = E_μ[R_{t+1} φ(S_t)]
The TD fixed point θ_TD is NOT the minimum of VE(θ) in general. It’s the solution to a different (but related) equation.
Key theorem (Tsitsiklis & Van Roy, 1997):
If: 1. the features are linearly independent (Φ has full column rank, d ≤ |S|) 2. μ = on-policy (stationary) distribution under π 3. γ < 1 (together, these conditions make A positive definite)
Then: - The TD fixed point θ_TD exists and is unique - VE(θ_TD) ≤ 1/(1-γ) · min_θ VE(θ)
The “1/(1-γ) bound” means TD can be at most 1/(1-γ) times worse than the best linear approximation. For γ=0.99: up to 100× worse!
Interpretation: the bound is loose in practice — TD typically performs close to the best approximation. But for γ near 1 and with bootstrapping, errors can compound.
Why TD ≠ Gradient Descent on VE
VE minimum: θ* = argmin_θ VE(θ) = argmin Σ_s μ(s)[V^π(s) - θ^T φ(s)]²
This is just linear regression: θ* = (Φ^T D Φ)^{-1} Φ^T D V^π
where Φ = matrix of feature vectors [φ(s₁), φ(s₂), …]^T, D = diag(μ).
TD fixed point: θ_TD zeroes the expected TD update: Φ^T D (r^π + γ P Φ θ_TD - Φ θ_TD) = 0, where r^π is the vector of expected one-step rewards
Rearranging: Φ^T D (I - γP) Φ θ_TD = Φ^T D r^π, which is exactly A θ_TD = b from above, and a different linear system from the VE minimum (that one involves V^π, not r^π).
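Both solutions can be computed in closed form on a toy MRP. A minimal sketch (the transition matrix, rewards, and features below are hypothetical numbers, chosen only so that span(Φ) cannot represent V^π exactly):

```python
import numpy as np

# Toy 3-state MRP (hypothetical numbers) with 2 features, so span(Phi)
# cannot represent V^pi exactly and the two solutions differ.
P = np.array([[0.1, 0.8, 0.1],
              [0.2, 0.2, 0.6],
              [0.5, 0.3, 0.2]])        # state-transition matrix under pi
r = np.array([0.0, 1.0, -1.0])        # expected one-step rewards r^pi
gamma = 0.9
Phi = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [0.0, 1.0]])          # feature matrix, rank 2 < |S| = 3

# On-policy distribution mu = stationary distribution of P (left eigenvector).
evals, evecs = np.linalg.eig(P.T)
mu = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
mu /= mu.sum()
D = np.diag(mu)

V_pi = np.linalg.solve(np.eye(3) - gamma * P, r)                # true value function
theta_VE = np.linalg.solve(Phi.T @ D @ Phi, Phi.T @ D @ V_pi)   # VE minimizer (projection)
A = Phi.T @ D @ (np.eye(3) - gamma * P) @ Phi                   # A from the TD fixed point
b = Phi.T @ D @ r                                               # b from the TD fixed point
theta_TD = np.linalg.solve(A, b)
print(theta_VE, theta_TD)
```

For this chain the two parameter vectors come out different, illustrating that TD(0) is not gradient descent on VE.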
Geometric interpretation:
VE minimum: orthogonal projection of V^π onto subspace span(Φ)
TD fixed point: "oblique projection" that accounts for bootstrap structure
The TD fixed point satisfies a “Bellman consistency” condition within the subspace: Φθ_TD is the unique fixed point of the projected Bellman operator, Φθ_TD = Π T^π(Φθ_TD), where T^π is the Bellman operator and Π is the μ-weighted projection onto span(Φ).
Convergence Rate and Step Size
Rate of convergence (constant stepsize α):
||θ_t - θ_TD||² → C · α / (2 λ_min(A)) as t → ∞ (heuristic; a constant α leaves a residual noise floor rather than exact convergence)
where λ_min(A) is the smallest eigenvalue of (the symmetric part of) A, and C is a problem-dependent constant.
To achieve error ε: - Step-size requirement: α ≤ ε · 2 λ_min(A) / C - Convergence time: O(1/α) steps to reach the ε-neighborhood
Practical step size selection: - Too large α: oscillates around θ_TD, never settles - Too small α: converges accurately but slowly - Rule of thumb: α = 0.1 / E[||φ(s)||²] (scale by the typical feature magnitude)
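The step-size trade-off is easy to see empirically. A sketch of semi-gradient TD(0) with constant α on a toy 3-state MRP (all numbers hypothetical, for illustration only):

```python
import numpy as np

# Semi-gradient TD(0) with constant step size on a toy 3-state MRP.
rng = np.random.default_rng(0)
P = np.array([[0.1, 0.8, 0.1],
              [0.2, 0.2, 0.6],
              [0.5, 0.3, 0.2]])        # transition matrix under pi
r = np.array([0.0, 1.0, -1.0])        # expected one-step rewards
gamma = 0.9
Phi = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [0.0, 1.0]])          # 3 states, 2 features

def td0(alpha, n_steps):
    theta = np.zeros(2)
    s = 0
    for _ in range(n_steps):
        s_next = rng.choice(3, p=P[s])
        delta = r[s] + gamma * Phi[s_next] @ theta - Phi[s] @ theta  # TD error
        theta += alpha * delta * Phi[s]  # semi-gradient: target is not differentiated
        s = s_next
    return theta

theta_slow = td0(alpha=0.01, n_steps=100_000)  # small alpha: accurate but slow
theta_fast = td0(alpha=0.2, n_steps=100_000)   # large alpha: hovers in a wider noise ball
print(theta_slow, theta_fast)
```

With the small step size, θ ends up in a tight neighborhood of θ_TD; with the large one it keeps fluctuating around it.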
Least Squares TD (LSTD)
LSTD computes the TD fixed point directly by solving A θ = b:
Â_t = (1/t) Σ_{k=0}^{t-1} φ(S_k)(φ(S_k) - γφ(S_{k+1}))^T
b̂_t = (1/t) Σ_{k=0}^{t-1} R_{k+1} φ(S_k)
θ_LSTD = Â_t^{-1} b̂_t
Properties: - More sample efficient: uses all data optimally (no step-size tuning needed) - More expensive: O(d³) per update (matrix inversion) vs O(d) for TD(0) - Memory: O(d²) for storing Â_t
Incremental (Sherman-Morrison update):
Â_t^{-1} ← Â_{t-1}^{-1} - [Â_{t-1}^{-1} φ_t (φ_t - γφ_t')^T Â_{t-1}^{-1}] / [1 + (φ_t - γφ_t')^T Â_{t-1}^{-1} φ_t]
O(d²) per update — better than O(d³) from scratch.
LSTD vs TD(0): - LSTD: more sample-efficient, but computationally heavier - TD(0): simpler, scales to large d (neural networks) - In practice: TD(0) for deep learning, LSTD for linear FA with moderate d (up to roughly 10³)
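The incremental form can be sketched in a few lines. This is a hand-rolled illustration, not the book's pseudocode: the `epsilon` parameter initializes A = εI so the early inverses are well defined, and plain sums replace the (1/t) averages since that factor cancels in θ = Â^{-1} b̂:

```python
import numpy as np

class LSTD:
    """Incremental LSTD(0): maintains A^{-1} via Sherman-Morrison in O(d^2)."""

    def __init__(self, d, gamma, epsilon=1e-2):
        self.gamma = gamma
        self.A_inv = np.eye(d) / epsilon   # inverse of the initial A = epsilon * I
        self.b = np.zeros(d)

    def update(self, phi, reward, phi_next):
        u = phi - self.gamma * phi_next    # (phi_t - gamma * phi_{t+1})
        Av = self.A_inv @ phi              # A^{-1} phi_t
        # Sherman-Morrison rank-1 update of the inverse of (A + phi u^T)
        self.A_inv -= np.outer(Av, u @ self.A_inv) / (1.0 + u @ Av)
        self.b += reward * phi

    def theta(self):
        return self.A_inv @ self.b

# Usage on a synthetic 2-feature transition stream (hypothetical data):
rng = np.random.default_rng(1)
phis = rng.normal(size=(500, 2))
rewards = rng.normal(size=499)
est = LSTD(d=2, gamma=0.9)
for t in range(499):
    est.update(phis[t], rewards[t], phis[t + 1])
print(est.theta())
```

Each update costs O(d²) time and memory; no step size appears anywhere, which is the practical appeal of LSTD.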
Coarse Coding and Generalization
Coarse coding: each binary feature is active over a region of state space (e.g., a circle). When V(s;θ) is updated at s, every state that shares an active feature with s is updated too.
Example: 2D state space s = (x, y)
Tiles: circles of radius r centered at grid points
φ_i(s) = 1 if s is inside circle i, else 0
Generalization width = feature width r: - Large r: broad generalization (nearby states receive similar updates) - Small r: narrow generalization, but finer resolution
Receptive field: each feature’s receptive field is the set of states that activate it. For a state s, the active features are those whose receptive fields contain s; an update at s affects every state covered by those receptive fields.
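The circle example above, as a minimal sketch (the grid of centers and the radius are hypothetical, not tied to any particular task):

```python
import numpy as np

# Coarse coding: binary circular receptive fields over a 2D unit square.
def coarse_features(s, centers, r):
    """phi_i(s) = 1 if state s lies inside circle i of radius r, else 0."""
    return (np.linalg.norm(centers - np.asarray(s, dtype=float), axis=1) <= r).astype(float)

xs = np.linspace(0.0, 1.0, 5)
centers = np.array([(x, y) for x in xs for y in xs])   # 25 circle centers on a grid

phi_a = coarse_features((0.5, 0.5), centers, r=0.3)
phi_b = coarse_features((0.6, 0.5), centers, r=0.3)    # a nearby state
# The two states share several active circles, so updating V at one
# also moves the other's value: that overlap IS the generalization.
print(phi_a.sum(), (phi_a * phi_b).sum())
```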
Tile coding: multiple overlapping tilings (each a partition of the space into tiles), offset from one another:
total features = n_tilings × tiles_per_dim^d (tiles per tiling)
Practical rule: use n_tilings ≈ 4d (ideally a power of 2 at least four times the state dimensionality); tile widths chosen for the desired resolution.
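A hand-rolled tile-coding sketch (Sutton’s tiles3 software does this with hashing; the index scheme here is a simplified stand-in). Each of the n_tilings grids covers the state box, shifted by a fraction of a tile width, and exactly one tile per tiling is active:

```python
import numpy as np

# Minimal tile coding: n_tilings offset grids over the box [low, high]^d.
def tile_indices(s, low, high, tiles_per_dim, n_tilings):
    """Return the one active feature index per tiling for state s."""
    s, low, high = (np.asarray(v, dtype=float) for v in (s, low, high))
    scaled = (s - low) / (high - low) * tiles_per_dim    # state in tile units
    active = []
    for k in range(n_tilings):
        offset = k / n_tilings                           # shift, in tile units
        coords = np.floor(scaled + offset).astype(int)
        coords = np.clip(coords, 0, tiles_per_dim - 1)   # edge tiles absorb overflow
        idx = k                                          # flatten (tiling, coords) -> index
        for c in coords:
            idx = idx * tiles_per_dim + c
        active.append(idx)
    return active

# Mountain-Car-sized example: 8 tilings of 8x8 tiles over (position, velocity)
# gives 8 * 8^2 = 512 binary features, one active per tiling.
a = tile_indices((-0.5, 0.0), low=(-1.2, -0.07), high=(0.6, 0.07),
                 tiles_per_dim=8, n_tilings=8)
b = tile_indices((-0.49, 0.001), low=(-1.2, -0.07), high=(0.6, 0.07),
                 tiles_per_dim=8, n_tilings=8)
print(len(set(a) & set(b)))   # nearby states share most active tiles
```

Because the tilings are mutually offset, neighboring states differ in only a few of their active tiles, giving graded generalization from purely binary features.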
Radial Basis Functions (RBFs)
φ_i(s) = exp(-||s - c_i||² / (2σ_i²))
Center c_i, width σ_i. V(s;θ) = θ^T φ(s) = smooth function over state space.
Compared to tile coding: - RBFs: differentiable, smooth graded generalization - Tile coding: piecewise-constant, but an efficient binary representation
In neural networks: first layer = learned RBF-like features.
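A one-dimensional RBF feature map as a sketch (the centers and width below are hypothetical):

```python
import numpy as np

# Gaussian RBF features on a 1D state space.
def rbf_features(s, centers, sigma):
    """phi_i(s) = exp(-(s - c_i)^2 / (2 sigma^2)): smooth, graded activation."""
    return np.exp(-((s - centers) ** 2) / (2.0 * sigma ** 2))

centers = np.linspace(0.0, 1.0, 10)   # 10 evenly spaced centers
phi = rbf_features(0.42, centers, sigma=0.1)

# Unlike tile coding's 0/1 features, every phi_i responds a little,
# so V(s; theta) = theta @ phi varies smoothly with s.
theta = np.ones(10)
print(theta @ phi)
```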
The Mountain Car Example (S&B 9.6.2)
Setup: - A car in a valley must drive up the right hill, but its engine is too weak to climb directly - State: (position, velocity) ∈ [-1.2, 0.6] × [-0.07, 0.07] - Actions: full throttle left, coast, full throttle right - Reward: -1 per step; episode terminates at the goal position (x ≥ 0.5) - γ = 1; typical episode ≈ 100-1000 steps
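The dynamics fit in a few lines. The update rule below follows Sutton & Barto's mountain car specification (engine term ±0.001, gravity term -0.0025 cos(3x)), with the state bounds listed above:

```python
import math

# One-step Mountain Car dynamics (per Sutton & Barto's specification;
# the 0.6 right bound matches the state ranges listed above).
def step(x, v, a):
    """a in {-1, 0, +1}: throttle left, coast, throttle right."""
    v = v + 0.001 * a - 0.0025 * math.cos(3 * x)   # engine + gravity
    v = max(-0.07, min(0.07, v))
    x = max(-1.2, min(0.6, x + v))
    if x == -1.2:
        v = 0.0                                    # left wall stops the car
    return x, v, -1.0, x >= 0.5                    # reward -1 per step; goal at x >= 0.5

# The classic energy-pumping behavior: push in the direction of motion.
x, v = -0.5, 0.0
steps, done = 0, False
while not done and steps < 1000:
    x, v, r, done = step(x, v, 1 if v >= 0 else -1)
    steps += 1
print(steps, done)
```

Pushing in the direction of motion adds mechanical energy every swing, which is exactly the rocking behavior the learned policy discovers.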
Linear FA with tile coding (8 tilings, 8×8 grid per tiling): - 8 × 8² = 512 features total - Converges to a near-optimal policy in ~400 episodes - Better than tabular methods, which would need a fine discrete grid and explore far more slowly
Policy quality (under converged policy): - Car learns to rock back and forth to gain momentum - Reaches goal in ~120 steps (near-optimal)
Connection to neural networks: tile coding features = fixed first layer; neural networks learn the features adaptively.
Chapter 9 Summary
- VE = weighted MSE between V(s;θ) and V^π(s), weighted by on-policy distribution μ
- Semi-gradient TD(0) converges for linear FA; not guaranteed for neural networks
- Linear TD fixed point θ_TD ≠ VE minimum; bounded error VE(θ_TD) ≤ VE*/(1-γ)
- LSTD: solves for θ_TD directly; sample efficient but O(d²) per step
- Tile coding: practical feature design for low-dimensional continuous state spaces
- Generalization: function approximation provides automatic generalization; tile size controls its scale
Connection to DynamICCL
| Ch.9 Concept | DynamICCL Implementation |
|---|---|
| VE minimization | PPO value loss MSE minimization |
| Semi-gradient TD | Semi-gradient PPO value update (detach target) |
| μ(s) on-policy | States visited during NCCL rollouts |
| Tile coding features | Discretized bandwidth × topology feature map |
| Neural FA | MLP: φ(s) = embedded(network_state) |
| LSTD | Not used (d too large for matrix inversion) |
| Generalization | NN generalizes across similar network states/configurations |