Linear TD Convergence and On-Policy Approximation Theory
Chapter 9 — On-Policy Prediction with Approximation
Book: Reinforcement Learning: An Introduction (Sutton & Barto, 2nd ed) Pages: 208–228
The Linear TD Fixed Point
Semi-gradient TD(0) with linear FA converges to a unique fixed point θ_TD that satisfies:
A θ_TD = b
where:
A = E_μ[φ(S_t)(φ(S_t) - γφ(S_{t+1}))^T]
b = E_μ[R_{t+1} φ(S_t)]
The TD fixed point θ_TD is NOT the minimum of VE(θ) in general. It’s the solution to a different (but related) equation.
Key theorem (Tsitsiklis & Van Roy, 1997):
If: 1. the features are linearly independent (Φ has full column rank, d ≤ |S|) 2. μ = on-policy (stationary) distribution under π 3. γ < 1 (together, these conditions make A positive definite)
Then: - The TD fixed point θ_TD exists and is unique - VE(θ_TD) ≤ 1/(1-γ) · min_θ VE(θ)
The “1/(1-γ) bound” means TD can be at most 1/(1-γ) times worse than the best linear approximation. For γ=0.99: up to 100× worse!
Interpretation: the bound is loose in practice — TD typically performs close to the best approximation. But for γ near 1 and with bootstrapping, errors can compound.
Why TD ≠ Gradient Descent on VE
VE minimum: θ* = argmin_θ VE(θ) = argmin Σ_s μ(s)[V^π(s) - θ^T φ(s)]²
This is just linear regression: θ* = (Φ^T D Φ)^{-1} Φ^T D V^π
where Φ = matrix of feature vectors [φ(s₁), φ(s₂), …]^T, D = diag(μ).
TD fixed point: θ_TD zeroes the expected TD update: Φ^T D (r^π + γ P Φ θ_TD - Φ θ_TD) = 0, where r^π is the vector of expected one-step rewards
Rearranging: Φ^T D (I - γP) Φ θ_TD = Φ^T D r^π, which is exactly A θ_TD = b from above, and a different linear system from the VE minimum (that one involves V^π, not r^π).
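Both solutions can be computed in closed form on a toy MRP. A minimal sketch (the transition matrix, rewards, and features below are hypothetical numbers, chosen only so that span(Φ) cannot represent V^π exactly):

```python
import numpy as np

# Toy 3-state MRP (hypothetical numbers) with 2 features, so span(Phi)
# cannot represent V^pi exactly and the two solutions differ.
P = np.array([[0.1, 0.8, 0.1],
              [0.2, 0.2, 0.6],
              [0.5, 0.3, 0.2]])        # state-transition matrix under pi
r = np.array([0.0, 1.0, -1.0])        # expected one-step rewards r^pi
gamma = 0.9
Phi = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [0.0, 1.0]])          # feature matrix, rank 2 < |S| = 3

# On-policy distribution mu = stationary distribution of P (left eigenvector).
evals, evecs = np.linalg.eig(P.T)
mu = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
mu /= mu.sum()
D = np.diag(mu)

V_pi = np.linalg.solve(np.eye(3) - gamma * P, r)                # true value function
theta_VE = np.linalg.solve(Phi.T @ D @ Phi, Phi.T @ D @ V_pi)   # VE minimizer (projection)
A = Phi.T @ D @ (np.eye(3) - gamma * P) @ Phi                   # A from the TD fixed point
b = Phi.T @ D @ r                                               # b from the TD fixed point
theta_TD = np.linalg.solve(A, b)
print(theta_VE, theta_TD)
```

For this chain the two parameter vectors come out different, illustrating that TD(0) is not gradient descent on VE.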
Geometric interpretation:
VE minimum: orthogonal projection of V^π onto subspace span(Φ)
TD fixed point: "oblique projection" that accounts for bootstrap structure
The TD fixed point satisfies a “Bellman consistency” condition within the subspace: Φθ_TD is the unique fixed point of the projected Bellman operator, Φθ_TD = Π T^π(Φθ_TD), where T^π is the Bellman operator and Π is the μ-weighted projection onto span(Φ).
Convergence Rate and Step Size
Rate of convergence (constant stepsize α):
||θ_t - θ_TD||² → C · α / (2 λ_min(A)) as t → ∞ (heuristic; a constant α leaves a residual noise floor rather than exact convergence)
where λ_min(A) is the smallest eigenvalue of (the symmetric part of) A, and C is a problem-dependent constant.
To achieve error ε: - Step-size requirement: α ≤ ε · 2 λ_min(A) / C - Convergence time: O(1/α) steps to reach the ε-neighborhood
Practical step size selection: - Too large α: oscillates around θ_TD, never settles - Too small α: converges accurately but slowly - Rule of thumb: α = 0.1 / E[||φ(s)||²] (scale by the typical feature magnitude)
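The step-size trade-off is easy to see empirically. A sketch of semi-gradient TD(0) with constant α on a toy 3-state MRP (all numbers hypothetical, for illustration only):

```python
import numpy as np

# Semi-gradient TD(0) with constant step size on a toy 3-state MRP.
rng = np.random.default_rng(0)
P = np.array([[0.1, 0.8, 0.1],
              [0.2, 0.2, 0.6],
              [0.5, 0.3, 0.2]])        # transition matrix under pi
r = np.array([0.0, 1.0, -1.0])        # expected one-step rewards
gamma = 0.9
Phi = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [0.0, 1.0]])          # 3 states, 2 features

def td0(alpha, n_steps):
    theta = np.zeros(2)
    s = 0
    for _ in range(n_steps):
        s_next = rng.choice(3, p=P[s])
        delta = r[s] + gamma * Phi[s_next] @ theta - Phi[s] @ theta  # TD error
        theta += alpha * delta * Phi[s]  # semi-gradient: target is not differentiated
        s = s_next
    return theta

theta_slow = td0(alpha=0.01, n_steps=100_000)  # small alpha: accurate but slow
theta_fast = td0(alpha=0.2, n_steps=100_000)   # large alpha: hovers in a wider noise ball
print(theta_slow, theta_fast)
```

With the small step size, θ ends up in a tight neighborhood of θ_TD; with the large one it keeps fluctuating around it.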
Least Squares TD (LSTD)
LSTD computes the TD fixed point directly by solving A θ = b:
Â_t = (1/t) Σ_{k=0}^{t-1} φ(S_k)(φ(S_k) - γφ(S_{k+1}))^T
b̂_t = (1/t) Σ_{k=0}^{t-1} R_{k+1} φ(S_k)
θ_LSTD = Â_t^{-1} b̂_t
Properties: - More sample efficient: uses all data optimally (no step-size tuning needed) - More expensive: O(d³) per update (matrix inversion) vs O(d) for TD(0) - Memory: O(d²) for storing Â_t
Incremental (Sherman-Morrison update):
Â_t^{-1} ← Â_{t-1}^{-1} - [Â_{t-1}^{-1} φ_t (φ_t - γφ_t')^T Â_{t-1}^{-1}] / [1 + (φ_t - γφ_t')^T Â_{t-1}^{-1} φ_t]
O(d²) per update — better than O(d³) from scratch.
LSTD vs TD(0): - LSTD: more sample-efficient, but computationally heavier - TD(0): simpler, scales to large d (neural networks) - In practice: TD(0) for deep learning, LSTD for linear FA with moderate d (up to roughly 10³)
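The incremental form can be sketched in a few lines. This is a hand-rolled illustration, not the book's pseudocode: the `epsilon` parameter initializes A = εI so the early inverses are well defined, and plain sums replace the (1/t) averages since that factor cancels in θ = Â^{-1} b̂:

```python
import numpy as np

class LSTD:
    """Incremental LSTD(0): maintains A^{-1} via Sherman-Morrison in O(d^2)."""

    def __init__(self, d, gamma, epsilon=1e-2):
        self.gamma = gamma
        self.A_inv = np.eye(d) / epsilon   # inverse of the initial A = epsilon * I
        self.b = np.zeros(d)

    def update(self, phi, reward, phi_next):
        u = phi - self.gamma * phi_next    # (phi_t - gamma * phi_{t+1})
        Av = self.A_inv @ phi              # A^{-1} phi_t
        # Sherman-Morrison rank-1 update of the inverse of (A + phi u^T)
        self.A_inv -= np.outer(Av, u @ self.A_inv) / (1.0 + u @ Av)
        self.b += reward * phi

    def theta(self):
        return self.A_inv @ self.b

# Usage on a synthetic 2-feature transition stream (hypothetical data):
rng = np.random.default_rng(1)
phis = rng.normal(size=(500, 2))
rewards = rng.normal(size=499)
est = LSTD(d=2, gamma=0.9)
for t in range(499):
    est.update(phis[t], rewards[t], phis[t + 1])
print(est.theta())
```

Each update costs O(d²) time and memory; no step size appears anywhere, which is the practical appeal of LSTD.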
Coarse Coding and Generalization
Coarse coding: each binary feature is active over a region of state space (e.g., a circle). When V(s;θ) is updated at s, every state that shares an active feature with s is updated too.
Example: 2D state space s = (x, y)
Tiles: circles of radius r centered at grid points
φ_i(s) = 1 if s is inside circle i, else 0
Generalization width = feature width r: - Large r: broad generalization (nearby states receive similar updates) - Small r: narrow generalization, but finer resolution
Receptive field: each feature’s receptive field is the set of states that activate it. For a state s, the active features are those whose receptive fields contain s; an update at s affects every state covered by those receptive fields.
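The circle example above, as a minimal sketch (the grid of centers and the radius are hypothetical, not tied to any particular task):

```python
import numpy as np

# Coarse coding: binary circular receptive fields over a 2D unit square.
def coarse_features(s, centers, r):
    """phi_i(s) = 1 if state s lies inside circle i of radius r, else 0."""
    return (np.linalg.norm(centers - np.asarray(s, dtype=float), axis=1) <= r).astype(float)

xs = np.linspace(0.0, 1.0, 5)
centers = np.array([(x, y) for x in xs for y in xs])   # 25 circle centers on a grid

phi_a = coarse_features((0.5, 0.5), centers, r=0.3)
phi_b = coarse_features((0.6, 0.5), centers, r=0.3)    # a nearby state
# The two states share several active circles, so updating V at one
# also moves the other's value: that overlap IS the generalization.
print(phi_a.sum(), (phi_a * phi_b).sum())
```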
Tile coding: multiple overlapping tilings (each a partition of the space into tiles), offset from one another:
total features = n_tilings × tiles_per_dim^d (tiles per tiling)
Practical rule: use n_tilings ≈ 4d (ideally a power of 2 at least four times the state dimensionality); tile widths chosen for the desired resolution.
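A hand-rolled tile-coding sketch (Sutton’s tiles3 software does this with hashing; the index scheme here is a simplified stand-in). Each of the n_tilings grids covers the state box, shifted by a fraction of a tile width, and exactly one tile per tiling is active:

```python
import numpy as np

# Minimal tile coding: n_tilings offset grids over the box [low, high]^d.
def tile_indices(s, low, high, tiles_per_dim, n_tilings):
    """Return the one active feature index per tiling for state s."""
    s, low, high = (np.asarray(v, dtype=float) for v in (s, low, high))
    scaled = (s - low) / (high - low) * tiles_per_dim    # state in tile units
    active = []
    for k in range(n_tilings):
        offset = k / n_tilings                           # shift, in tile units
        coords = np.floor(scaled + offset).astype(int)
        coords = np.clip(coords, 0, tiles_per_dim - 1)   # edge tiles absorb overflow
        idx = k                                          # flatten (tiling, coords) -> index
        for c in coords:
            idx = idx * tiles_per_dim + c
        active.append(idx)
    return active

# Mountain-Car-sized example: 8 tilings of 8x8 tiles over (position, velocity)
# gives 8 * 8^2 = 512 binary features, one active per tiling.
a = tile_indices((-0.5, 0.0), low=(-1.2, -0.07), high=(0.6, 0.07),
                 tiles_per_dim=8, n_tilings=8)
b = tile_indices((-0.49, 0.001), low=(-1.2, -0.07), high=(0.6, 0.07),
                 tiles_per_dim=8, n_tilings=8)
print(len(set(a) & set(b)))   # nearby states share most active tiles
```

Because the tilings are mutually offset, neighboring states differ in only a few of their active tiles, giving graded generalization from purely binary features.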
Radial Basis Functions (RBFs)
φ_i(s) = exp(-||s - c_i||² / (2σ_i²))
Center c_i, width σ_i. V(s;θ) = θ^T φ(s) = smooth function over state space.
Compared to tile coding: - RBFs: differentiable, smooth graded generalization - Tile coding: piecewise-constant, but an efficient binary representation
In neural networks: first layer = learned RBF-like features.
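A one-dimensional RBF feature map as a sketch (the centers and width below are hypothetical):

```python
import numpy as np

# Gaussian RBF features on a 1D state space.
def rbf_features(s, centers, sigma):
    """phi_i(s) = exp(-(s - c_i)^2 / (2 sigma^2)): smooth, graded activation."""
    return np.exp(-((s - centers) ** 2) / (2.0 * sigma ** 2))

centers = np.linspace(0.0, 1.0, 10)   # 10 evenly spaced centers
phi = rbf_features(0.42, centers, sigma=0.1)

# Unlike tile coding's 0/1 features, every phi_i responds a little,
# so V(s; theta) = theta @ phi varies smoothly with s.
theta = np.ones(10)
print(theta @ phi)
```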
The Mountain Car Example (S&B 9.6.2)
Setup: - A car in a valley must drive up the right hill, but its engine is too weak to climb directly - State: (position, velocity) ∈ [-1.2, 0.6] × [-0.07, 0.07] - Actions: full throttle left, coast, full throttle right - Reward: -1 per step; episode terminates at the goal position (x ≥ 0.5) - γ = 1; typical episode ≈ 100-1000 steps
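The dynamics fit in a few lines. The update rule below follows Sutton & Barto's mountain car specification (engine term ±0.001, gravity term -0.0025 cos(3x)), with the state bounds listed above:

```python
import math

# One-step Mountain Car dynamics (per Sutton & Barto's specification;
# the 0.6 right bound matches the state ranges listed above).
def step(x, v, a):
    """a in {-1, 0, +1}: throttle left, coast, throttle right."""
    v = v + 0.001 * a - 0.0025 * math.cos(3 * x)   # engine + gravity
    v = max(-0.07, min(0.07, v))
    x = max(-1.2, min(0.6, x + v))
    if x == -1.2:
        v = 0.0                                    # left wall stops the car
    return x, v, -1.0, x >= 0.5                    # reward -1 per step; goal at x >= 0.5

# The classic energy-pumping behavior: push in the direction of motion.
x, v = -0.5, 0.0
steps, done = 0, False
while not done and steps < 1000:
    x, v, r, done = step(x, v, 1 if v >= 0 else -1)
    steps += 1
print(steps, done)
```

Pushing in the direction of motion adds mechanical energy every swing, which is exactly the rocking behavior the learned policy discovers.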
Linear FA with tile coding (8 tilings, 8×8 grid per tiling): - 8 × 8² = 512 features total - Converges to a near-optimal policy in ~400 episodes - Better than tabular methods, which would need a fine discrete grid and explore far more slowly
Policy quality (under converged policy): - Car learns to rock back and forth to gain momentum - Reaches goal in ~120 steps (near-optimal)
Connection to neural networks: tile coding features = fixed first layer; neural networks learn the features adaptively.
Chapter 9 Summary
- VE = weighted MSE between V(s;θ) and V^π(s), weighted by on-policy distribution μ
- Semi-gradient TD(0) converges for linear FA; not guaranteed for neural networks
- Linear TD fixed point θ_TD ≠ VE minimum; bounded error VE(θ_TD) ≤ VE*/(1-γ)
- LSTD: solves for θ_TD directly; sample efficient but O(d²) per step
- Tile coding: practical feature design for low-dimensional continuous state spaces
- Generalization: function approximation provides automatic generalization; tile size controls its scale
Connection to DynamICCL
| Ch.9 Concept | DynamICCL Implementation |
|---|---|
| VE minimization | PPO value loss MSE minimization |
| Semi-gradient TD | Semi-gradient PPO value update (detach target) |
| μ(s) on-policy | States visited during NCCL rollouts |
| Tile coding features | Discretized bandwidth × topology feature map |
| Neural FA | MLP: φ(s) = embedded(network_state) |
| LSTD | Not used (d too large for matrix inversion) |
| Generalization | NN generalizes across similar network states/configurations |