Value Function Approximation

Chapter 9 — On-Policy Prediction with Approximation

Book: Reinforcement Learning: An Introduction (Sutton & Barto, 2nd ed) Pages: 195–212


Why Function Approximation?

Tabular methods (Ch. 1–8) store V(s) or Q(s,a) in a table:
- Requires |S| or |S|×|A| entries
- Fails when |S| is enormous (e.g., ~10^47 chess states) or continuous

Function approximation: represent V(s; θ) where θ is a parameter vector (weights):

V(s; θ) ≈ V^π(s)    for all s ∈ S

Benefits:
1. Generalization: updating V(s; θ) also changes V(s'; θ) for similar s' — automatic generalization across states
2. Scalability: parameter count |θ| << |S|
3. Continuous states: works for s ∈ ℝ^d

Types of function approximators:
- Linear: V(s; θ) = θ^T φ(s) (features φ(s) hand-designed)
- Neural network: V(s; θ) = NeuralNet(s; θ)
- Decision trees, radial basis functions, etc.


The Prediction Objective: VE (Value Error)

Goal: find θ to minimize the Value Error (VE):

VE(θ) = Σ_s μ(s) [V^π(s) - V(s; θ)]²

where μ(s) = state distribution under policy π (on-policy distribution).

Why weight by μ(s)? We care more about getting V(s) right for states we actually visit. Rarely-visited states contribute little to overall performance.

μ(s) for episodic tasks: the fraction of time spent in state s under π:

μ(s) = η(s) / Σ_{s'} η(s')
where η(s) = expected number of visits to s per episode
           = Pr(episode starts in s) + expected visits entering s from predecessor states under π
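
A quick made-up check: suppose a three-state episodic task with η(s_1) = 1, η(s_2) = 2, η(s_3) = 1 expected visits per episode. Then

μ(s_1) = 1/4,   μ(s_2) = 2/4 = 1/2,   μ(s_3) = 1/4

so VE weights errors at s_2 twice as heavily as errors at s_1 or s_3.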

For continuing tasks: μ(s) = stationary distribution of the Markov chain under π.


Stochastic Gradient Descent for VE

Apply SGD to minimize VE(θ):

θ_{t+1} = θ_t - (α/2) ∇_θ [V^π(S_t) - V(S_t; θ_t)]²
         = θ_t + α [V^π(S_t) - V(S_t; θ_t)] ∇_θ V(S_t; θ_t)

Problem: V^π(S_t) is unknown. Replace with an estimate U_t:

θ_{t+1} = θ_t + α [U_t - V(S_t; θ_t)] ∇_θ V(S_t; θ_t)

Different choices for U_t:

| U_t                       | Method         | Bias            | Variance        | Notes                 |
|---------------------------|----------------|-----------------|-----------------|-----------------------|
| G_t (full return)         | Monte Carlo FA | None            | High            | True gradient descent |
| R_{t+1} + γV(S_{t+1}; θ)  | TD(0) FA       | Yes             | Low             | Semi-gradient         |
| G_t^{(n)} (n-step return) | n-step TD FA   | Less            | Medium          | Semi-gradient         |
| G_t^λ (λ-return)          | TD(λ) FA       | Controlled by λ | Controlled by λ | Semi-gradient         |
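
A minimal sketch of the first row (gradient MC with U_t = G_t) for linear V(s; θ) = θ^T φ(s); the generate_episode and phi arguments are placeholders, not from the book:

import numpy as np

def gradient_mc_prediction(generate_episode, phi, d, num_episodes,
                           alpha=0.01, gamma=0.99):
    # generate_episode() -> [(S_0, R_1), (S_1, R_2), ..., (S_{T-1}, R_T)]
    # phi(s)             -> feature vector of length d
    theta = np.zeros(d)
    for _ in range(num_episodes):
        episode = generate_episode()
        G = 0.0
        # Walk the episode backwards, accumulating the return G_t.
        for S_t, R_next in reversed(episode):
            G = R_next + gamma * G
            x = phi(S_t)
            # True-gradient update: the target G does not depend on theta.
            theta += alpha * (G - theta @ x) * x
    return theta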

Semi-Gradient Descent: Why Not Full Gradient?

The negative gradient of VE with respect to θ (the factor of 2 is absorbed into α, as above):

-(1/2) ∇_θ VE = Σ_s μ(s) [V^π(s) - V(s;θ)] ∇_θ V(s;θ)

If we use U_t = R_{t+1} + γV(S_{t+1}; θ) as the target, the target depends on θ:

∇_θ [R_{t+1} + γV(S_{t+1}; θ) - V(S_t; θ)]²
= 2 [R_{t+1} + γV(S_{t+1}; θ) - V(S_t; θ)] [γ ∇_θ V(S_{t+1}; θ) - ∇_θ V(S_t; θ)]

Semi-gradient: ignore the ∂V(S_{t+1})/∂θ term — treat the target as a constant:

θ ← θ + α [R_{t+1} + γV(S_{t+1}; θ) - V(S_t; θ)] ∇_θ V(S_t; θ)

Why semi-gradient?
1. The full gradient includes a term through V(S_{t+1}; θ) — differentiating through this "moving target" complicates the update
2. Semi-gradient works in practice (and has convergence proofs for the linear case)
3. Full-gradient TD methods exist (gradient-TD: GTD2, TDC) but are more complex

Gradient MC is true gradient descent (U_t = G_t doesn't depend on θ) — it converges to a local minimum of VE (the global minimum for linear FA, where VE is quadratic in θ).


Algorithm: Semi-Gradient TD(0)

Input: policy π, stepsize α, feature function φ: S → ℝ^d

Initialize: θ = 0 (or small random)

Loop (for each episode):
  Initialize S_t ← S_0
  Loop (for each step, until S_t is terminal):
    A ← π(S_t)
    Take A; observe R_{t+1}, S_{t+1}
    δ ← R_{t+1} + γ V(S_{t+1}; θ) - V(S_t; θ)   ← TD error  (V(terminal; θ) ≡ 0)
    θ ← θ + α δ ∇_θ V(S_t; θ)                    ← semi-gradient update
    S_t ← S_{t+1}

For linear approximation V(s; θ) = θ^T φ(s):

∇_θ V(S_t; θ) = φ(S_t)
θ ← θ + α δ φ(S_t)

The update is just a rescaled feature vector step!
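
A minimal sketch of the algorithm for the linear case, assuming a Gym-style env.reset()/env.step() interface; env, policy, and phi are placeholders, not part of the book's pseudocode:

import numpy as np

def semi_gradient_td0(env, policy, phi, d, num_episodes,
                      alpha=0.05, gamma=0.99):
    # Semi-gradient TD(0) prediction with V(s; theta) = theta @ phi(s).
    theta = np.zeros(d)
    for _ in range(num_episodes):
        S, done = env.reset(), False
        while not done:
            A = policy(S)
            S_next, R, done, _ = env.step(A)
            v_next = 0.0 if done else theta @ phi(S_next)   # V(terminal) = 0
            delta = R + gamma * v_next - theta @ phi(S)      # TD error
            theta += alpha * delta * phi(S)                  # grad V = phi(S) for linear FA
            S = S_next
    return theta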


Linear Function Approximation

Representation: V(s; θ) = θ^T φ(s) = Σ_i θ_i φ_i(s)

Feature vector φ(s) ∈ ℝ^d: encodes relevant properties of state s.

Semi-gradient TD(0) update for linear FA:

θ ← θ + α [R_{t+1} + γ θ^T φ(S_{t+1}) - θ^T φ(S_t)] φ(S_t)

Matrix form: in expectation, these updates converge to the linear TD fixed point equation:

A θ = b
where: A = E[φ(S_t)(φ(S_t) - γφ(S_{t+1}))^T]    (d×d matrix)
       b = E[R_{t+1} φ(S_t)]                       (d-vector)

Convergence theorem (linear TD): semi-gradient TD(0) with linear FA converges to the unique solution θ_TD where:

VE(θ_TD) ≤ 1/(1-γ) min_θ VE(θ)

The error is at most 1/(1-γ) times the best possible approximation error — this is the approximation error bound for TD.
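
The fixed-point view also suggests a batch alternative: estimate A and b from sampled transitions and solve directly (the idea behind LSTD). A sketch, assuming each transition is a (φ(S_t), R_{t+1}, φ(S_{t+1})) triple with φ(S_{t+1}) = 0 for terminal transitions; the small ridge term is my assumption to keep A invertible:

import numpy as np

def td_fixed_point(transitions, d, gamma=0.99, ridge=1e-3):
    # Estimate A = E[phi(S)(phi(S) - gamma*phi(S'))^T] and b = E[R*phi(S)]
    # from samples, then solve A theta = b.
    A = ridge * np.eye(d)
    b = np.zeros(d)
    for x, r, x_next in transitions:
        A += np.outer(x, x - gamma * x_next)
        b += r * x
    return np.linalg.solve(A, b)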


Feature Engineering for Linear FA

Good features → good approximation. Examples:

Polynomial features (Ch. 9.5.1)

φ(s) = [1, s_1, s_2, s_1², s_1 s_2, s_2², ..., s_1^n_1 · s_2^n_2 · ...]

For state s = (s_1, s_2) ∈ [0,1]²: polynomial basis up to degree n → (n+1)² features.

Fourier basis

φ_k(s) = cos(π k^T s)    for multi-index k = (k_1, ..., k_d)

Better than polynomial for many problems (global support).
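
A sketch of both bases for a low-dimensional state s ∈ [0,1]^d, following the (n+1)^d counting above (the function names are mine, not from the book):

import numpy as np
from itertools import product

def fourier_features(s, n):
    # Order-n Fourier basis: phi_k(s) = cos(pi * k.s) for all k in {0..n}^d.
    s = np.asarray(s, dtype=float)
    ks = np.array(list(product(range(n + 1), repeat=len(s))))
    return np.cos(np.pi * ks @ s)          # (n+1)^d features

def polynomial_features(s, n):
    # Products s_1^{c_1} * ... * s_d^{c_d} for all exponents c_i in {0..n}.
    s = np.asarray(s, dtype=float)
    cs = np.array(list(product(range(n + 1), repeat=len(s))))
    return np.prod(s ** cs, axis=1)        # (n+1)^d features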

Tile Coding (most useful in practice)

Partition state space into overlapping tilings:

For each tiling i:
  Map s → tile index t_i(s)
  φ_{i,j}(s) = 1 if t_i(s) = j, else 0

Total features = num_tilings × num_tiles_per_tiling.
- Coarse coding (few large tiles): good generalization, poor resolution
- Fine coding (many small tiles): good resolution, poor generalization
- Tile coding (overlapping tilings at different offsets): gets both (see the sketch below)
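
A minimal tile-coding sketch for s ∈ [0,1]^d with uniformly offset grid tilings; production implementations (e.g., Sutton's tiles3) use hashing, so treat this as illustrative only:

import numpy as np

def tile_coding_features(s, num_tilings=8, tiles_per_dim=8):
    # One active (binary) tile per tiling; total features = num_tilings * tiles_per_dim^d.
    s = np.asarray(s, dtype=float)
    d = len(s)
    tiles_per_tiling = tiles_per_dim ** d
    phi = np.zeros(num_tilings * tiles_per_tiling)
    for i in range(num_tilings):
        offset = i / (num_tilings * tiles_per_dim)   # shift each tiling by a fraction of a tile width
        coords = np.floor((s + offset) * tiles_per_dim).astype(int)
        coords = np.clip(coords, 0, tiles_per_dim - 1)
        tile_index = int(np.ravel_multi_index(tuple(coords), (tiles_per_dim,) * d))
        phi[i * tiles_per_tiling + tile_index] = 1.0
    return phi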


Neural Network Approximation

Replace linear θ^T φ(s) with multi-layer neural network:

V(s; θ) = W_L · σ(W_{L-1} · σ(... σ(W_1 · s + b_1) ...) + b_{L-1}) + b_L

Gradient: computed via backpropagation.

Semi-gradient TD with NN:

θ ← θ + α [R + γV(S'; θ) - V(S; θ)] ∇_θ V(S; θ)

No convergence guarantee for nonlinear FA with TD — but works in practice for DQN, PPO.
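
A sketch of this update in PyTorch for a small value network (the 4-dimensional input and layer sizes are arbitrary assumptions); the key point is that the bootstrapped target is computed under torch.no_grad(), so gradients flow only through V(S; θ):

import torch
import torch.nn as nn

value_net = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))
optimizer = torch.optim.SGD(value_net.parameters(), lr=1e-3)
gamma = 0.99

def semi_gradient_td_step(S, R, S_next, done):
    # S, S_next: float tensors of shape (4,); R: float reward;
    # done: 1.0 if S_next is terminal else 0.0.
    with torch.no_grad():                       # target treated as a constant
        target = R + gamma * value_net(S_next) * (1.0 - done)
    td_error = target - value_net(S)
    loss = td_error.pow(2).mean()
    optimizer.zero_grad()
    loss.backward()                             # gradient flows only through V(S; theta)
    optimizer.step()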

DQN improvements for stability (Mnih et al., 2015) — the first two are sketched below:
1. Experience replay: sample random (s, a, r, s') transitions from a buffer instead of consecutive steps
2. Target network: use θ^- (a frozen copy, updated every C steps) for the TD target
3. Gradient clipping: clip gradients to prevent large updates
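
Continuing the value_net sketch above, a compressed illustration of points 1 and 2 (buffer size, batch size, and sync period C are placeholder values):

import copy, random
from collections import deque

replay_buffer = deque(maxlen=100_000)      # stores (s, a, r, s_next, done) tuples
target_net = copy.deepcopy(value_net)      # frozen copy used for TD targets
C = 1_000                                  # target-network sync period (steps)

def dqn_style_update(step, batch_size=32):
    if len(replay_buffer) < batch_size:
        return
    batch = random.sample(replay_buffer, batch_size)         # 1. break temporal correlations
    # ... compute TD targets for the batch with target_net (not value_net),
    #     then apply the semi-gradient step from the previous sketch ...
    if step % C == 0:
        target_net.load_state_dict(value_net.state_dict())   # 2. refresh the frozen copy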


Convergence: Tabular vs Linear vs Neural

| Method                | Converges to                              | Condition                           |
|-----------------------|-------------------------------------------|-------------------------------------|
| Tabular TD(0)         | V^π exactly                               | All states visited infinitely often |
| Linear TD(0)          | θ_TD (near V^π, within the 1/(1-γ) bound) | Standard step-size conditions       |
| Linear MC             | Global min of VE (quadratic in θ)         | Standard SGD conditions             |
| Neural TD (semi-grad) | Not guaranteed                            | May diverge (deadly triad)          |
| Neural MC             | Local min of VE                           | Standard SGD conditions             |

Connection to DynamICCL

DynamICCL uses neural network value function V(s; θ) with semi-gradient TD updates:

PPO value loss: L_V(θ) = E[(V(S_t; θ) - (Â_t + V(S_t; θ_old)))²]
Gradient:      θ ← θ - α ∇_θ L_V(θ)

This is a semi-gradient update with GAE-based returns as the target, U_t = Â_t + V(S_t; θ_old), with:
- V(s; θ) = neural network (MLP on network state features)
- Features φ(s) = embedded representation of bandwidth, topology, queue depth, etc.
- Update: semi-gradient (don't differentiate through the target)
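
A hypothetical sketch of that value loss in PyTorch — value_net, states, advantages, and old_values are placeholders, not DynamICCL's actual code:

import torch

def ppo_value_loss(value_net, states, advantages, old_values):
    # Semi-gradient: the target A_hat + V_old is held constant via detach().
    targets = (advantages + old_values).detach()     # U_t = A_hat_t + V(S_t; theta_old)
    values = value_net(states).squeeze(-1)
    return (values - targets).pow(2).mean()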