Value Function Approximation
Chapter 9 — On-Policy Prediction with Approximation
Book: Reinforcement Learning: An Introduction (Sutton & Barto, 2nd ed) Pages: 195–212
Why Function Approximation?
Tabular methods (Ch. 1–8) store V(s) or Q(s,a) in a table.
- Requires |S| or |S|×|A| entries
- Fails when |S| is large (e.g., ~10^47 chess states) or continuous
Function approximation: represent V(s; θ) where θ is a parameter vector (weights):
V(s; θ) ≈ V^π(s) for all s ∈ S
Benefits:
1. Generalization: updating V(s; θ) also changes V(s'; θ) for similar s', so learning transfers across states
2. Scalability: parameter count |θ| << |S|
3. Continuous states: works for s ∈ ℝ^d
Types of function approximators:
- Linear: V(s; θ) = θ^T φ(s), with hand-designed features φ(s)
- Neural network: V(s; θ) = NeuralNet(s; θ)
- Decision trees, radial basis functions, etc.
The Prediction Objective: VE (Value Error)
Goal: find θ to minimize the Value Error (VE):
VE(θ) = Σ_s μ(s) [V^π(s) - V(s; θ)]²
where μ(s) = state distribution under policy π (on-policy distribution).
Why weight by μ(s)? We care more about getting V(s) right for states we actually visit. Rarely-visited states contribute little to overall performance.
μ(s) for episodic tasks: the fraction of time spent in state s under π:
μ(s) = η(s) / Σ_{s'} η(s')

where η(s) = expected number of visits to s per episode (counting both episodes that start in s and visits reached from predecessor states under π).
For continuing tasks: μ(s) = stationary distribution of the Markov chain under π.
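For a continuing task, μ can be computed directly once the transition matrix under π is known. A minimal sketch using power iteration; the 3-state chain `P` is a made-up example, not from the book:

```python
import numpy as np

# Hypothetical 3-state Markov chain induced by some fixed policy pi:
# P[i, j] = probability of moving from state i to state j.
P = np.array([
    [0.5, 0.5, 0.0],
    [0.1, 0.6, 0.3],
    [0.2, 0.0, 0.8],
])

def stationary_distribution(P, iters=5_000):
    """Power iteration: repeatedly apply mu <- mu P until it stops changing."""
    mu = np.full(P.shape[0], 1.0 / P.shape[0])   # start from the uniform distribution
    for _ in range(iters):
        mu = mu @ P
    return mu

mu = stationary_distribution(P)   # the on-policy distribution: satisfies mu = mu P
```

For an irreducible, aperiodic chain (as here, thanks to the self-loops), the iteration converges to the unique stationary distribution regardless of the starting point.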
Stochastic Gradient Descent for VE
Apply SGD to minimize VE(θ):
θ_{t+1} = θ_t - (α/2) ∇_θ [V^π(S_t) - V(S_t; θ_t)]²
= θ_t + α [V^π(S_t) - V(S_t; θ_t)] ∇_θ V(S_t; θ_t)
Problem: V^π(S_t) is unknown. Replace with an estimate U_t:
θ_{t+1} = θ_t + α [U_t - V(S_t; θ_t)] ∇_θ V(S_t; θ_t)
Different choices for U_t:
| U_t | Method | Bias | Variance | Notes |
|---|---|---|---|---|
| G_t (full return) | Monte Carlo FA | 0 | High | True gradient descent |
| R_{t+1} + γV(S_{t+1}; θ) | TD(0) FA | Yes | Low | Semi-gradient |
| G_t^{(n)} (n-step) | n-step TD FA | Less | Medium | Semi-gradient |
| G_t^λ (λ-return) | TD(λ) FA | Controlled | Controlled | Semi-gradient |
Semi-Gradient Descent: Why Not Full Gradient?
The true negative gradient of VE (up to a factor of 2) is:

-½ ∇_θ VE(θ) = Σ_s μ(s) [V^π(s) - V(s; θ)] ∇_θ V(s; θ)
If we use U_t = R_{t+1} + γV(S_{t+1}; θ) as the target, the target itself depends on θ:

∇_θ [R_{t+1} + γV(S_{t+1}; θ) - V(S_t; θ)]²
  = 2 [R_{t+1} + γV(S_{t+1}; θ) - V(S_t; θ)] [γ ∇_θ V(S_{t+1}; θ) - ∇_θ V(S_t; θ)]
Semi-gradient: ignore the ∂V(S_{t+1})/∂θ term — treat the target as a constant:
θ ← θ + α [R_{t+1} + γV(S_{t+1}; θ) - V(S_t; θ)] ∇_θ V(S_t; θ)
Why semi-gradient?
1. The full gradient includes ∇_θ V(S_{t+1}; θ): chasing a "moving target" creates instability
2. Semi-gradient works well in practice (and has convergence proofs for the linear case)
3. True-gradient TD methods exist (GTD, GTD2, TDC) but are more complex
Gradient MC is true gradient descent (U_t = G_t does not depend on θ), so it converges to a local minimum of VE.
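The gradient-MC case can be sketched in a few lines for linear features. A minimal illustration, assuming a finished episode is given as (state, reward) pairs; the step size, features, and episode below are made up:

```python
import numpy as np

def gradient_mc_update(theta, episode, phi, alpha=0.05, gamma=1.0):
    """One gradient-MC sweep over a finished episode.

    episode: list of (state, reward) pairs; reward is received on leaving the state.
    phi: feature function mapping a state to a d-vector.
    """
    G = 0.0
    # Walk backwards so G accumulates the return from each state onward.
    for state, reward in reversed(episode):
        G = reward + gamma * G
        v = theta @ phi(state)
        theta = theta + alpha * (G - v) * phi(state)   # true SGD step on VE
    return theta

# One-hot features over 3 states; a single episode ending with reward 1.
phi = lambda s: np.eye(3)[s]
theta = gradient_mc_update(np.zeros(3), [(0, 0.0), (1, 0.0), (2, 1.0)], phi)
```

Because the target G is computed from observed rewards only, nothing in the update is held fixed artificially: this is plain SGD on the squared value error.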
Algorithm: Semi-Gradient TD(0)
Input: policy π, step size α, differentiable V(·; θ) (e.g., linear in features φ: S → ℝ^d)
Initialize: θ = 0 (or small random values)
Loop (for each episode):
    Initialize S
    While S is not terminal:
        A ← π(S)
        Take action A; observe R, S'
        δ ← R + γ V(S'; θ) - V(S; θ)      ← TD error (define V(terminal; θ) = 0)
        θ ← θ + α δ ∇_θ V(S; θ)           ← semi-gradient update
        S ← S'
For linear approximation V(s; θ) = θ^T φ(s):
∇_θ V(S_t; θ) = φ(S_t)
θ ← θ + α δ φ(S_t)
The update is just a rescaled feature vector step!
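The linear update can be exercised on a toy chain. A sketch, assuming a deterministic 2-state episodic task (state 0 → state 1 with reward 0, state 1 → terminal with reward 1) that is replayed repeatedly; all constants are illustrative:

```python
import numpy as np

def td0_step(theta, phi_s, phi_s_next, reward, alpha, gamma, terminal=False):
    """One linear semi-gradient TD(0) update: theta <- theta + alpha * delta * phi(S_t).

    V(terminal) is defined to be 0, so phi_s_next is ignored when terminal=True.
    """
    v_next = 0.0 if terminal else theta @ phi_s_next
    delta = reward + gamma * v_next - theta @ phi_s   # TD error
    return theta + alpha * delta * phi_s              # gradient of V is just phi(S_t)

phi = np.eye(2)          # one-hot features, so theta[i] = V(state i)
theta = np.zeros(2)
for _ in range(200):     # replay the episode until the values settle
    theta = td0_step(theta, phi[0], phi[1], 0.0, alpha=0.1, gamma=1.0)
    theta = td0_step(theta, phi[1], None, 1.0, alpha=0.1, gamma=1.0, terminal=True)
```

With one-hot features this reduces exactly to tabular TD(0), so both components converge to the true values V(0) = V(1) = 1.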
Linear Function Approximation
Representation: V(s; θ) = θ^T φ(s) = Σ_i θ_i φ_i(s)
Feature vector φ(s) ∈ ℝ^d: encodes relevant properties of state s.
Semi-gradient TD(0) update for linear FA:
θ ← θ + α [R_{t+1} + γ θ^T φ(S_{t+1}) - θ^T φ(S_t)] φ(S_t)
Matrix form: at convergence, θ satisfies the linear TD fixed-point equation:

A θ = b,   i.e.,   θ_TD = A^{-1} b

where: A = E[φ(S_t)(φ(S_t) - γφ(S_{t+1}))^T]   (d×d matrix)
       b = E[R_{t+1} φ(S_t)]                    (d-vector)
Convergence theorem (linear TD): with on-policy sampling and standard step-size conditions, semi-gradient TD(0) with linear FA converges to the unique fixed point θ_TD, which satisfies:

VE(θ_TD) ≤ 1/(1-γ) · min_θ VE(θ)

The asymptotic error is at most 1/(1-γ) times the smallest error achievable within the feature class; this is the classic approximation-error bound for linear TD.
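The fixed point can also be estimated directly from data: accumulate sample averages of A and b over observed transitions and solve the linear system (the idea behind least-squares TD). A minimal sketch; the ridge term `reg` and the 2-state example are assumptions for illustration:

```python
import numpy as np

def lstd(transitions, gamma=0.9, reg=1e-6):
    """Estimate A and b from sampled transitions, then solve A theta = b.

    transitions: list of (phi_s, reward, phi_s_next) feature-vector triples;
    phi_s_next should be the zero vector for terminal transitions.
    """
    d = len(transitions[0][0])
    A = reg * np.eye(d)                 # small ridge term keeps A invertible
    b = np.zeros(d)
    for phi_s, r, phi_next in transitions:
        A += np.outer(phi_s, phi_s - gamma * phi_next)
        b += r * phi_s
    return np.linalg.solve(A, b)

# Same 2-state chain as before: 0 -> 1 (reward 0), 1 -> terminal (reward 1).
e = np.eye(2)
transitions = [(e[0], 0.0, e[1]), (e[1], 1.0, np.zeros(2))]
theta_td = lstd(transitions, gamma=1.0)
```

Solving once replaces many incremental TD sweeps, at the cost of a d×d system (O(d³) naive, O(d²) with incremental inverse updates).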
Feature Engineering for Linear FA
Good features → good approximation. Examples:
Polynomial features (§9.5.1)

φ(s) = [1, s_1, s_2, s_1², s_1 s_2, s_2², ..., s_1^{c_1} s_2^{c_2}, ...]

For state s = (s_1, s_2) ∈ [0,1]²: with each exponent c_i ranging over 0..n, this gives (n+1)² features.
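A short sketch generating these features for arbitrary dimension d; the exponent enumeration via `itertools.product` is one reasonable way to do it:

```python
import itertools
import numpy as np

def poly_features(s, n):
    """All products s_1^{c_1} * ... * s_d^{c_d} with each exponent c_i in 0..n.

    For d = 2 this yields the (n+1)^2 features described above.
    """
    d = len(s)
    feats = []
    for exps in itertools.product(range(n + 1), repeat=d):
        feats.append(np.prod([s[i] ** e for i, e in enumerate(exps)]))
    return np.array(feats)
```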
Fourier basis
φ_k(s) = cos(π k^T s) for multi-index k = (k_1, ..., k_d)
Often performs better than the polynomial basis in practice; assumes each state component has been rescaled into [0, 1].
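The order-n Fourier basis can be generated the same way, enumerating integer coefficient vectors k. A minimal sketch under the assumption that states are already scaled into [0, 1]:

```python
import itertools
import numpy as np

def fourier_features(s, n):
    """Order-n Fourier basis: phi_k(s) = cos(pi * k . s), k in {0, ..., n}^d.

    Each component of s is assumed to lie in [0, 1].
    """
    s = np.asarray(s)
    ks = itertools.product(range(n + 1), repeat=len(s))
    return np.array([np.cos(np.pi * np.dot(k, s)) for k in ks])
```

The k = 0 feature is the constant 1, and larger ||k|| gives higher-frequency features, so per-feature step sizes are often scaled down by ||k||.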
Tile Coding (most useful in practice)
Partition state space into overlapping tilings:
For each tiling i:
Map s → tile index t_i(s)
φ_{i,j}(s) = 1 if t_i(s) = j, else 0
Total features = num_tilings × num_tiles_per_tiling.
- Coarse coding (few large tiles): good generalization, poor resolution
- Fine coding (many small tiles): good resolution, poor generalization
- Tile coding (overlapping tilings at different offsets): gets both!
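A 1-D sketch of the scheme above, assuming uniform tile widths and evenly staggered offsets (real implementations often use hashing and asymmetric offsets):

```python
import numpy as np

def tile_indices(s, num_tilings=4, tiles_per_dim=8, low=0.0, high=1.0):
    """One active tile index per tiling, for a 1-D state s in [low, high).

    Each tiling is shifted by a fraction of a tile width, so nearby states
    share some (but not all) active tiles.
    """
    width = (high - low) / tiles_per_dim
    active = []
    for t in range(num_tilings):
        offset = t * width / num_tilings               # stagger tiling t
        idx = int((s - low + offset) / width)
        idx = min(idx, tiles_per_dim)                  # offset can spill into one extra tile
        active.append(t * (tiles_per_dim + 1) + idx)   # globally unique (tiling, tile) index
    return active

def tile_features(s, num_tilings=4, tiles_per_dim=8):
    """Binary feature vector with exactly num_tilings ones."""
    phi = np.zeros(num_tilings * (tiles_per_dim + 1))
    phi[tile_indices(s, num_tilings, tiles_per_dim)] = 1.0
    return phi
```

Nearby states activate overlapping tile sets (generalization), while the offsets let the combined code resolve differences finer than a single tile width.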
Neural Network Approximation
Replace linear θ^T φ(s) with multi-layer neural network:
V(s; θ) = W_L · σ(W_{L-1} · σ(... σ(W_1 · s + b_1) ...) + b_{L-1}) + b_L
Gradient: computed via backpropagation.
Semi-gradient TD with NN:
θ ← θ + α [R + γV(S'; θ) - V(S; θ)] ∇_θ V(S; θ)
No convergence guarantee for nonlinear FA with TD — but works in practice for DQN, PPO.
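A minimal sketch of semi-gradient TD with a one-hidden-layer tanh network and hand-written backprop; the architecture, seed, step size, and toy one-transition task are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny value network: V(s; theta) = w2 . tanh(w1 * s + b1) + b2, scalar state s.
H = 16
w1 = rng.normal(0.0, 0.5, H); b1 = np.zeros(H)
w2 = rng.normal(0.0, 0.5, H); b2 = 0.0

def value(s):
    h = np.tanh(w1 * s + b1)
    return w2 @ h + b2, h

def nn_td_step(s, r, s_next, alpha=0.01, gamma=0.9, terminal=False):
    """Semi-gradient TD(0): r + gamma * V(s') is treated as a constant target."""
    global w1, b1, w2, b2
    v, h = value(s)
    v_next = 0.0 if terminal else value(s_next)[0]   # no gradient flows through this
    delta = r + gamma * v_next - v                   # TD error
    grad_h = w2 * (1.0 - h ** 2)                     # backprop through tanh
    w2 = w2 + alpha * delta * h                      # theta += alpha * delta * grad V(s)
    b2 = b2 + alpha * delta
    w1 = w1 + alpha * delta * grad_h * s
    b1 = b1 + alpha * delta * grad_h

# Fit a single transition: from s = 0.5, reward 1, episode ends.
for _ in range(3000):
    nn_td_step(0.5, 1.0, None, terminal=True)
```

The key line is `v_next`: it is computed with the current parameters but excluded from the gradient, exactly the semi-gradient rule; autodiff frameworks express the same thing with a stop-gradient / detach on the target.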
DQN improvements for stability (Mnih et al., 2015):
1. Experience replay: sample random (s, a, r, s') transitions from a buffer instead of consecutive steps
2. Target network: use θ⁻ (a frozen copy, refreshed every C steps) to compute the TD target
3. Gradient clipping: clip gradients (or the TD error) to prevent large updates
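The first two stabilizers fit in a small skeleton. Illustrative shapes only, not Mnih et al.'s actual code: the buffer size, parameter vector, refresh period C, and the elided update step are all assumptions:

```python
import random
from collections import deque

import numpy as np

buffer = deque(maxlen=10_000)     # 1. experience replay buffer
theta = np.zeros(8)               # online parameters (stand-in for network weights)
theta_target = theta.copy()       # 2. frozen copy used when computing TD targets
C = 100                           # refresh the target network every C steps

def remember(s, a, r, s_next, done):
    buffer.append((s, a, r, s_next, done))

def train_step(step, batch_size=32):
    """Sample a decorrelated batch; refresh theta_target every C steps."""
    global theta_target
    if len(buffer) < batch_size:
        return None
    batch = random.sample(list(buffer), batch_size)   # breaks temporal correlation
    # ... compute TD targets with theta_target (not theta) and update theta here ...
    if step % C == 0:
        theta_target = theta.copy()                   # periodic hard update
    return batch

for i in range(50):
    remember(i, 0, 0.0, i + 1, False)
batch = train_step(step=100)
```

Both tricks attack the same problem from the semi-gradient discussion: replay decorrelates the data distribution, and the frozen θ⁻ keeps the "moving target" still between refreshes.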
Convergence: Tabular vs Linear vs Neural
| Method | Convergence | Condition |
|---|---|---|
| Tabular TD(0) | V^π exactly | All states visited infinitely often; decaying step size |
| Linear TD(0) | θ_TD (near V^π, bounded by 1/(1-γ)) | Standard conditions |
| Linear MC | Local min of VE | Standard SGD conditions |
| Neural TD (semi-grad) | Not guaranteed | May diverge (deadly triad) |
| Neural MC | Local min of VE | Standard SGD conditions |
Connection to DynamICCL
DynamICCL uses neural network value function V(s; θ) with semi-gradient TD updates:
PPO value loss: L_V(θ) = E[(V(S_t; θ) - (Â_t + V(S_t; θ_old)))²]
Gradient: θ ← θ - α ∇_θ L_V(θ)
This is semi-gradient MC (using the GAE-based return U_t = Â_t + V(S_t; θ_old)) with:
- V(s; θ): a neural network (MLP on network-state features)
- Features φ(s): embedded representation of bandwidth, topology, queue depth, etc.
- Update: semi-gradient (don't differentiate through the target)
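The fixed-target structure of that loss can be shown with a linear stand-in for the value network. A generic PPO-style sketch, not DynamICCL's actual code; the features, advantages, and old values below are hypothetical:

```python
import numpy as np

def value_loss_grad(theta, phi_s, adv, v_old):
    """Gradient of mean 0.5 * (V(s; theta) - target)^2, with target = adv + v_old held fixed."""
    target = adv + v_old              # constants w.r.t. theta: the semi-gradient rule
    v = phi_s @ theta                 # current value estimates for the batch
    return phi_s.T @ (v - target) / len(target)

phi_s = np.eye(3)                     # 3 states with one-hot features
theta = np.zeros(3)
adv = np.array([0.5, -0.2, 0.1])      # hypothetical GAE advantages
v_old = np.array([1.0, 1.0, 1.0])     # values under the old parameters
for _ in range(500):
    theta -= 0.5 * value_loss_grad(theta, phi_s, adv, v_old)
```

Because the target is formed from θ_old and then frozen, gradient descent drives V(s; θ) toward Â + V(s; θ_old) without differentiating through it, which is the same stop-gradient pattern as semi-gradient TD.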