On-Policy Control with Function Approximation
Chapter 10 — On-Policy Control with Approximation
Book: Reinforcement Learning: An Introduction (Sutton & Barto, 2nd ed) Pages: 243–264
From Prediction to Control with Approximation
Prediction (Ch.9): estimate V^π with V(s; θ) for a fixed policy π. Control: find an optimal policy π*, using function approximation for the action values Q(s, a; θ) ≈ Q^π(s, a).
GPI with function approximation:
Policy Evaluation: update Q(s,a; θ) toward Q^π (using TD, MC, n-step)
Policy Improvement: π(s) = argmax_a Q(s,a; θ), softened to ε-greedy for exploration
The GPI machinery is the same as tabular; only the value function representation changes.
Episodic Semi-Gradient SARSA
Q-function approximation: Q(s, a; θ) ≈ Q^π(s, a)
SARSA update with function approximation:
θ ← θ + α [R_{t+1} + γQ(S_{t+1}, A_{t+1}; θ) - Q(S_t, A_t; θ)] ∇_θ Q(S_t, A_t; θ)
Algorithm: Episodic Semi-Gradient SARSA
Initialize θ = 0
Loop (episodes):
    S ← initial state
    A ← ε-greedy(Q(S, ·; θ))
    Loop (steps):
        Execute A; observe R, S'
        If S' is terminal:
            θ ← θ + α[R - Q(S,A;θ)] ∇_θ Q(S,A;θ)    ← terminal update
            break
        A' ← ε-greedy(Q(S', ·; θ))
        θ ← θ + α[R + γQ(S',A';θ) - Q(S,A;θ)] ∇_θ Q(S,A;θ)    ← SARSA update
        S ← S'; A ← A'
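A minimal Python sketch of this loop with linear Q(s,a;θ) = θᵀx(s,a). The environment interface (env.reset(), env.step(a) returning (S', R, done)), the feature function feature_vec, and the hyperparameters are illustrative assumptions, not the book's code.

import numpy as np

def epsilon_greedy(theta, s, actions, feature_vec, eps=0.1):
    # With probability eps take a random action, otherwise the action with highest Q(s,a;θ).
    if np.random.rand() < eps:
        return actions[np.random.randint(len(actions))]
    q = [theta @ feature_vec(s, a) for a in actions]
    return actions[int(np.argmax(q))]

def semi_gradient_sarsa(env, feature_vec, n_features, actions,
                        episodes=500, alpha=0.1, gamma=1.0, eps=0.1):
    theta = np.zeros(n_features)                        # Initialize θ = 0
    for _ in range(episodes):
        s = env.reset()
        a = epsilon_greedy(theta, s, actions, feature_vec, eps)
        done = False
        while not done:
            s2, r, done = env.step(a)
            x = feature_vec(s, a)                       # for linear Q, ∇_θ Q(s,a;θ) = x(s,a)
            q = theta @ x
            if done:
                theta += alpha * (r - q) * x            # terminal update (no bootstrap)
                break
            a2 = epsilon_greedy(theta, s2, actions, feature_vec, eps)
            q_next = theta @ feature_vec(s2, a2)
            theta += alpha * (r + gamma * q_next - q) * x   # SARSA update
            s, a = s2, a2
    return theta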
Mountain Car Control Example (S&B 10.1)
Setup: same mountain car from Ch.9. Now learning the policy, not just prediction.
Q(s, a; θ) with tile coding:
- State: (position, velocity) — 2D continuous
- Action: {push left, coast, push right} — 3 discrete
- Features: tile coding over (position, velocity), with a separate set of tiles (weights) for each action
Semi-gradient SARSA learning:
Episode 1: random flailing, ~2000 steps to goal
Episode 100: improved, ~400 steps
Episode 1000: near-optimal, ~120 steps
Cost-to-go function V̂(s) = -max_a Q(s,a;θ) shows intuitive structure:
- High cost near top-left (car needs many oscillations to escape)
- Low cost near the right wall (close to goal)
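A sketch of tile-coded features for this task, plus the cost-to-go readout. This is a simplified tile coder (8 tilings of an 8×8 grid, all offset diagonally by equal fractions of a tile), not the book's tile-coding software, which uses asymmetric offsets and hashing; the names and layout are mine.

import numpy as np

N_TILINGS, N_TILES = 8, 8                    # 8 tilings of an 8x8 grid over (position, velocity)
POS_LO, POS_HI = -1.2, 0.6                   # Mountain Car position range
VEL_LO, VEL_HI = -0.07, 0.07                 # Mountain Car velocity range
N_ACTIONS = 3                                # push left, coast, push right
N_FEATURES = N_ACTIONS * N_TILINGS * N_TILES * N_TILES

def mc_features(state, action):
    # Binary feature vector; each action owns its own block of tiles.
    pos, vel = state
    x = np.zeros(N_FEATURES)
    p = (pos - POS_LO) / (POS_HI - POS_LO) * N_TILES   # scale position into [0, N_TILES)
    v = (vel - VEL_LO) / (VEL_HI - VEL_LO) * N_TILES   # scale velocity into [0, N_TILES)
    for t in range(N_TILINGS):
        shift = t / N_TILINGS                          # each tiling shifted by a fraction of a tile
        i = min(int(p + shift), N_TILES - 1)
        j = min(int(v + shift), N_TILES - 1)
        x[((action * N_TILINGS + t) * N_TILES + i) * N_TILES + j] = 1.0
    return x

def cost_to_go(state, theta):
    # V̂(s) = -max_a Q(s,a;θ), the surface plotted in the book's mountain car figures
    return -max(theta @ mc_features(state, a) for a in range(N_ACTIONS))

These features plug straight into the semi_gradient_sarsa sketch above (feature_vec=mc_features, n_features=N_FEATURES, actions=list(range(3))).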
Average Reward Formulation for Continuing Tasks
For continuing tasks (no terminal state), the standard γ-discounted return is problematic: with γ < 1 the effective horizon ~1/(1-γ) limits long-term planning, with γ = 1 returns may be infinite, and (S&B §10.4) under function approximation the discounted objective averaged over the on-policy distribution turns out to be proportional to the average reward, so discounting adds nothing.
Alternative: maximize average reward rate r̄:
r̄(π) = lim_{T→∞} (1/T) E_π[Σ_{t=0}^{T-1} R_t]
= Σ_s μ(s) Σ_a π(a|s) r(s,a)
Differential return (centered on average):
G_t = Σ_{k=0}^{∞} [R_{t+k+1} - r̄] ← sum of deviations from average
Differential TD error:
δ_t = R_{t+1} - r̄ + V(S_{t+1}) - V(S_t)
Differential TD update (also update r̄ estimate R̄):
δ_t ← R_{t+1} - R̄ + V(S_{t+1}; θ) - V(S_t; θ)
R̄ ← R̄ + β δ_t ← update average reward estimate
θ ← θ + α δ_t ∇_θ V(S_t; θ) ← update value function
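One step of the rule above as a sketch, with linear V(s;θ) = θᵀx(s); feature_vec and the step sizes α, β are placeholders. The same pattern with Q(s,a;θ) and ε-greedy action selection gives differential semi-gradient SARSA, used in the queuing example below.

import numpy as np

def differential_td_step(theta, r_bar, s, r, s2, feature_vec, alpha=0.01, beta=0.001):
    # One differential semi-gradient TD(0) update; returns the new (theta, r_bar).
    x, x2 = feature_vec(s), feature_vec(s2)
    delta = r - r_bar + theta @ x2 - theta @ x   # differential TD error
    r_bar = r_bar + beta * delta                 # track the average-reward estimate R̄
    theta = theta + alpha * delta * x            # semi-gradient value update
    return theta, r_bar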
When to use:
- Continuing tasks with no natural episode boundary
- Process control (chemical plants, robot locomotion)
- DynamICCL: if job-level episodic structure isn’t natural
n-Step SARSA with Function Approximation
n-step return for Q:
G_t^{(n)} = R_{t+1} + γR_{t+2} + ... + γ^{n-1}R_{t+n} + γ^n Q(S_{t+n}, A_{t+n}; θ)
n-step SARSA update:
θ ← θ + α [G_t^{(n)} - Q(S_t, A_t; θ)] ∇_θ Q(S_t, A_t; θ)
Benefits over 1-step SARSA: faster credit assignment, especially for sparse rewards.
Mountain car with n-step SARSA (S&B Figure 10.4):
- n=1: ~400 steps to near-optimal
- n=8: ~100 steps to near-optimal (4× faster)
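A quick sketch of assembling the n-step target from a reward buffer; names are illustrative and the bootstrap value Q(S_{t+n}, A_{t+n}; θ) is passed in precomputed.

def n_step_target(rewards, q_bootstrap, gamma=1.0):
    # rewards = [R_{t+1}, ..., R_{t+n}]; fold them in from the back around the bootstrap value.
    g = q_bootstrap
    for r in reversed(rewards):
        g = r + gamma * g
    return g                                     # = G_t^{(n)}

# The update then mirrors 1-step SARSA with the longer target:
#   theta += alpha * (n_step_target(rewards, q_bootstrap, gamma) - theta @ x_t) * x_t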
Access Control Queuing Task (S&B 10.2)
Setup (differential SARSA):
- Server with k=10 slots; customers of priority 1, 2, 4, or 8 arrive
- Action: accept (reward = priority) or reject (reward = 0)
- State: (free slots, current customer priority)
- Continuing task, no discounting → average reward formulation
Differential SARSA with linear FA converges to a near-optimal policy:
- Accept high-priority customers more aggressively when many slots are free
- Reject low-priority customers when the server is near capacity
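A sketch of the task dynamics only (not the learning code), following the book's description: priorities drawn uniformly from {1, 2, 4, 8}, and each busy slot frees up independently with probability 0.06 per step; the class and method names here are my own.

import random

class AccessControlQueue:
    # Continuing task: decide whether to admit the customer at the head of the queue.
    PRIORITIES = (1, 2, 4, 8)

    def __init__(self, n_slots=10, p_free=0.06):
        self.n_slots = n_slots
        self.p_free = p_free
        self.free = n_slots
        self.priority = random.choice(self.PRIORITIES)

    def state(self):
        return (self.free, self.priority)        # (free slots, current customer priority)

    def step(self, accept):
        # Accepting occupies a slot and pays the customer's priority; rejecting pays 0.
        if accept and self.free > 0:
            reward = self.priority
            self.free -= 1
        else:
            reward = 0
        # Each busy slot finishes independently with probability p_free.
        busy = self.n_slots - self.free
        self.free += sum(random.random() < self.p_free for _ in range(busy))
        self.priority = random.choice(self.PRIORITIES)   # next customer arrives
        return self.state(), reward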
Convergence Properties
On-policy semi-gradient SARSA with linear FA:
- Converges to (or chatters near) a solution close to the best representable Q, analogous to the TD fixed point for prediction
- Error bounded analogously to the prediction case, VE(θ_TD) ≤ (1/(1-γ)) min_θ VE(θ); the control guarantees are weaker (bounded-region convergence, possible chattering)
- Constant α: converges to a region around the solution rather than to a point
On-policy control with neural networks:
- No convergence guarantee
- Empirically works with careful engineering (e.g., PPO; DQN succeeds too, though it is off-policy)
The deadly triad (full discussion in Ch.11): ON-POLICY + function approximation + bootstrapping is generally safe. OFF-POLICY + function approximation + bootstrapping can diverge.
Chapter 10 Summary
- Semi-gradient SARSA extends tabular SARSA to continuous state spaces
- GPI with FA: evaluation → ε-greedy improvement; same structure, approximate values
- Average reward formulation handles continuing tasks without discounting
- n-step SARSA speeds up learning by using longer return estimates
- Mountain car: prototypical continuous-state control problem; tile coding + SARSA works
Connection to DynamICCL
DynamICCL’s policy optimization is on-policy control with function approximation:
- Semi-gradient: PPO doesn’t differentiate through the value target
- Q approximation: PPO uses both π(a|s; θ_π) and V(s; θ_V) — actor-critic
- n-step: GAE with λ=0.95 behaves roughly like a ~20-step return (see the sketch below)
- Continuous state: network bandwidth, topology features → neural network Q/V
- ε-exploration: PPO uses an entropy bonus instead of ε-greedy; it plays the same exploratory role
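A hedged sketch of GAE to make the n-step connection concrete; the function name and the γ, λ defaults are illustrative, not DynamICCL's actual code.

import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    # Generalized Advantage Estimation over one trajectory segment.
    # values has length len(rewards)+1 (it includes the bootstrap V(s_T)).
    # A_t = sum_k (gamma*lam)^k * delta_{t+k}, with delta_t = r_t + gamma*V(s_{t+1}) - V(s_t),
    # i.e. an exponentially weighted mix of n-step advantage estimates.
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

The effective horizon 1/(1-λ) = 20 steps is the sense in which λ=0.95 is comparable to a ~20-step return.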