On-Policy Control with Function Approximation
Chapter 10 — On-Policy Control with Approximation
Book: Reinforcement Learning: An Introduction (Sutton & Barto, 2nd ed) Pages: 243–264
From Prediction to Control with Approximation
Prediction (Ch.9): estimate V^π with V(s; θ) for a fixed policy π. Control: find an optimal policy π*, using function approximation for the action values Q(s, a; θ) ≈ Q^π(s, a).
GPI with function approximation:
Policy Evaluation: update Q(s,a; θ) toward Q^π (using TD, MC, n-step)
Policy Improvement: π(s) = argmax_a Q(s,a; θ), softened to ε-greedy for exploration
The GPI machinery is the same as tabular; only the value function representation changes.
Episodic Semi-Gradient SARSA
Q-function approximation: Q(s, a; θ) ≈ Q^π(s, a)
SARSA update with function approximation:
θ ← θ + α [R_{t+1} + γQ(S_{t+1}, A_{t+1}; θ) - Q(S_t, A_t; θ)] ∇_θ Q(S_t, A_t; θ)
Algorithm: Episodic Semi-Gradient SARSA
Initialize θ = 0
Loop (episodes):
    S ← initial state
    A ← ε-greedy(Q(S, ·; θ))
    Loop (steps):
        Execute A; observe R, S'
        If S' is terminal:
            θ ← θ + α[R - Q(S,A;θ)] ∇_θ Q(S,A;θ)    ← terminal update
            break
        A' ← ε-greedy(Q(S', ·; θ))
        θ ← θ + α[R + γQ(S',A';θ) - Q(S,A;θ)] ∇_θ Q(S,A;θ)    ← SARSA update
        S ← S'; A ← A'
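A minimal Python sketch of this loop with linear Q(s,a;θ) = θᵀx(s,a). The environment interface (env.reset(), env.step(a) returning (S', R, done)), the feature function feature_vec, and the hyperparameters are illustrative assumptions, not the book's code.

import numpy as np

def epsilon_greedy(theta, s, actions, feature_vec, eps=0.1):
    # With probability eps take a random action, otherwise the action with highest Q(s,a;θ).
    if np.random.rand() < eps:
        return actions[np.random.randint(len(actions))]
    q = [theta @ feature_vec(s, a) for a in actions]
    return actions[int(np.argmax(q))]

def semi_gradient_sarsa(env, feature_vec, n_features, actions,
                        episodes=500, alpha=0.1, gamma=1.0, eps=0.1):
    theta = np.zeros(n_features)                        # Initialize θ = 0
    for _ in range(episodes):
        s = env.reset()
        a = epsilon_greedy(theta, s, actions, feature_vec, eps)
        done = False
        while not done:
            s2, r, done = env.step(a)
            x = feature_vec(s, a)                       # for linear Q, ∇_θ Q(s,a;θ) = x(s,a)
            q = theta @ x
            if done:
                theta += alpha * (r - q) * x            # terminal update (no bootstrap)
                break
            a2 = epsilon_greedy(theta, s2, actions, feature_vec, eps)
            q_next = theta @ feature_vec(s2, a2)
            theta += alpha * (r + gamma * q_next - q) * x   # SARSA update
            s, a = s2, a2
    return theta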
Mountain Car Control Example (S&B 10.1)
Setup: same mountain car from Ch.9. Now learning the policy, not just prediction.
Q(s, a; θ) with tile coding:
- State: (position, velocity) — 2D continuous
- Action: {push left, coast, push right} — 3 discrete
- Features: tile coding over (position, velocity), with a separate set of tiles (weights) for each action
Semi-gradient SARSA learning:
Episode 1: random flailing, ~2000 steps to goal
Episode 100: improved, ~400 steps
Episode 1000: near-optimal, ~120 steps
Cost-to-go function V̂(s) = -max_a Q(s,a;θ) shows intuitive structure:
- High cost near top-left (car needs many oscillations to escape)
- Low cost near the right wall (close to goal)
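A sketch of tile-coded features for this task, plus the cost-to-go readout. This is a simplified tile coder (8 tilings of an 8×8 grid, all offset diagonally by equal fractions of a tile), not the book's tile-coding software, which uses asymmetric offsets and hashing; the names and layout are mine.

import numpy as np

N_TILINGS, N_TILES = 8, 8                    # 8 tilings of an 8x8 grid over (position, velocity)
POS_LO, POS_HI = -1.2, 0.6                   # Mountain Car position range
VEL_LO, VEL_HI = -0.07, 0.07                 # Mountain Car velocity range
N_ACTIONS = 3                                # push left, coast, push right
N_FEATURES = N_ACTIONS * N_TILINGS * N_TILES * N_TILES

def mc_features(state, action):
    # Binary feature vector; each action owns its own block of tiles.
    pos, vel = state
    x = np.zeros(N_FEATURES)
    p = (pos - POS_LO) / (POS_HI - POS_LO) * N_TILES   # scale position into [0, N_TILES)
    v = (vel - VEL_LO) / (VEL_HI - VEL_LO) * N_TILES   # scale velocity into [0, N_TILES)
    for t in range(N_TILINGS):
        shift = t / N_TILINGS                          # each tiling shifted by a fraction of a tile
        i = min(int(p + shift), N_TILES - 1)
        j = min(int(v + shift), N_TILES - 1)
        x[((action * N_TILINGS + t) * N_TILES + i) * N_TILES + j] = 1.0
    return x

def cost_to_go(state, theta):
    # V̂(s) = -max_a Q(s,a;θ), the surface plotted in the book's mountain car figures
    return -max(theta @ mc_features(state, a) for a in range(N_ACTIONS))

These features plug straight into the semi_gradient_sarsa sketch above (feature_vec=mc_features, n_features=N_FEATURES, actions=list(range(3))).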
Average Reward Formulation for Continuing Tasks
For continuing tasks (no terminal state), the standard γ-discounted return is problematic: with γ < 1 the effective horizon ~1/(1-γ) limits long-term planning, with γ = 1 returns may be infinite, and (S&B §10.4) under function approximation the discounted objective averaged over the on-policy distribution turns out to be proportional to the average reward, so discounting adds nothing.
Alternative: maximize average reward rate r̄:
r̄(π) = lim_{T→∞} (1/T) E_π[Σ_{t=0}^{T-1} R_t]
= Σ_s μ(s) Σ_a π(a|s) r(s,a)
Differential return (centered on average):
G_t = Σ_{k=0}^{∞} [R_{t+k+1} - r̄] ← sum of deviations from average
Differential TD error:
δ_t = R_{t+1} - r̄ + V(S_{t+1}) - V(S_t)
Differential TD update (also update r̄ estimate R̄):
δ_t ← R_{t+1} - R̄ + V(S_{t+1}; θ) - V(S_t; θ)
R̄ ← R̄ + β δ_t ← update average reward estimate
θ ← θ + α δ_t ∇_θ V(S_t; θ) ← update value function
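One step of the rule above as a sketch, with linear V(s;θ) = θᵀx(s); feature_vec and the step sizes α, β are placeholders. The same pattern with Q(s,a;θ) and ε-greedy action selection gives differential semi-gradient SARSA, used in the queuing example below.

import numpy as np

def differential_td_step(theta, r_bar, s, r, s2, feature_vec, alpha=0.01, beta=0.001):
    # One differential semi-gradient TD(0) update; returns the new (theta, r_bar).
    x, x2 = feature_vec(s), feature_vec(s2)
    delta = r - r_bar + theta @ x2 - theta @ x   # differential TD error
    r_bar = r_bar + beta * delta                 # track the average-reward estimate R̄
    theta = theta + alpha * delta * x            # semi-gradient value update
    return theta, r_bar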
When to use:
- Continuing tasks with no natural episode boundary
- Process control (chemical plants, robot locomotion)
- DynamICCL: if job-level episodic structure isn’t natural
n-Step SARSA with Function Approximation
n-step return for Q:
G_t^{(n)} = R_{t+1} + γR_{t+2} + ... + γ^{n-1}R_{t+n} + γ^n Q(S_{t+n}, A_{t+n}; θ)
n-step SARSA update:
θ ← θ + α [G_t^{(n)} - Q(S_t, A_t; θ)] ∇_θ Q(S_t, A_t; θ)
Benefits over 1-step SARSA: faster credit assignment, especially for sparse rewards.
Mountain car with n-step SARSA (S&B Figure 10.4):
- n=1: ~400 steps to near-optimal
- n=8: ~100 steps to near-optimal (4× faster)
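A quick sketch of assembling the n-step target from a reward buffer; names are illustrative and the bootstrap value Q(S_{t+n}, A_{t+n}; θ) is passed in precomputed.

def n_step_target(rewards, q_bootstrap, gamma=1.0):
    # rewards = [R_{t+1}, ..., R_{t+n}]; fold them in from the back around the bootstrap value.
    g = q_bootstrap
    for r in reversed(rewards):
        g = r + gamma * g
    return g                                     # = G_t^{(n)}

# The update then mirrors 1-step SARSA with the longer target:
#   theta += alpha * (n_step_target(rewards, q_bootstrap, gamma) - theta @ x_t) * x_t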
Access Control Queuing Task (S&B 10.2)
Setup (differential SARSA):
- Server with k=10 slots; customers of priority 1, 2, 4, or 8 arrive
- Action: accept (reward = priority) or reject (reward = 0)
- State: (free slots, current customer priority)
- Continuing task, no discounting → average reward formulation
Differential SARSA with linear FA converges to a near-optimal policy:
- Accept high-priority customers more aggressively when many slots are free
- Reject low-priority customers when the server is near capacity
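A sketch of the task dynamics only (not the learning code), following the book's description: priorities drawn uniformly from {1, 2, 4, 8}, and each busy slot frees up independently with probability 0.06 per step; the class and method names here are my own.

import random

class AccessControlQueue:
    # Continuing task: decide whether to admit the customer at the head of the queue.
    PRIORITIES = (1, 2, 4, 8)

    def __init__(self, n_slots=10, p_free=0.06):
        self.n_slots = n_slots
        self.p_free = p_free
        self.free = n_slots
        self.priority = random.choice(self.PRIORITIES)

    def state(self):
        return (self.free, self.priority)        # (free slots, current customer priority)

    def step(self, accept):
        # Accepting occupies a slot and pays the customer's priority; rejecting pays 0.
        if accept and self.free > 0:
            reward = self.priority
            self.free -= 1
        else:
            reward = 0
        # Each busy slot finishes independently with probability p_free.
        busy = self.n_slots - self.free
        self.free += sum(random.random() < self.p_free for _ in range(busy))
        self.priority = random.choice(self.PRIORITIES)   # next customer arrives
        return self.state(), reward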
Convergence Properties
On-policy semi-gradient SARSA with linear FA:
- Converges to (or chatters near) a solution close to the best representable Q, analogous to the TD fixed point for prediction
- Error bounded analogously to the prediction case, VE(θ_TD) ≤ (1/(1-γ)) min_θ VE(θ); the control guarantees are weaker (bounded-region convergence, possible chattering)
- Constant α: converges to a region around the solution rather than to a point
On-policy control with neural networks:
- No convergence guarantee
- Empirically works with careful engineering (e.g., PPO; DQN succeeds too, though it is off-policy)
The deadly triad (full discussion in Ch.11): ON-POLICY + function approximation + bootstrapping is generally safe. OFF-POLICY + function approximation + bootstrapping can diverge.
Chapter 10 Summary
- Semi-gradient SARSA extends tabular SARSA to continuous state spaces
- GPI with FA: evaluation → ε-greedy improvement; same structure, approximate values
- Average reward formulation handles continuing tasks without discounting
- n-step SARSA speeds up learning by using longer return estimates
- Mountain car: prototypical continuous-state control problem; tile coding + SARSA works
Connection to DynamICCL
DynamICCL’s policy optimization is on-policy control with function approximation:
- Semi-gradient: PPO doesn’t differentiate through the value target
- Q approximation: PPO uses both π(a|s; θ_π) and V(s; θ_V) — actor-critic
- n-step: GAE with λ=0.95 behaves roughly like a ~20-step return (see the sketch below)
- Continuous state: network bandwidth, topology features → neural network Q/V
- ε-exploration: PPO uses an entropy bonus instead of ε-greedy; it plays the same exploratory role
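A hedged sketch of GAE to make the n-step connection concrete; the function name and the γ, λ defaults are illustrative, not DynamICCL's actual code.

import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    # Generalized Advantage Estimation over one trajectory segment.
    # values has length len(rewards)+1 (it includes the bootstrap V(s_T)).
    # A_t = sum_k (gamma*lam)^k * delta_{t+k}, with delta_t = r_t + gamma*V(s_{t+1}) - V(s_t),
    # i.e. an exponentially weighted mix of n-step advantage estimates.
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

The effective horizon 1/(1-λ) = 20 steps is the sense in which λ=0.95 is comparable to a ~20-step return.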