SARSA: On-Policy TD Control

Chapter 6 — Temporal-Difference Learning

Book: Reinforcement Learning: An Introduction (Sutton & Barto, 2nd ed) Pages: 129–133


From Prediction to Control

TD(0) predicts V^π for a fixed policy. For control, we need to estimate Q^π(s,a) and improve the policy.

Why Q instead of V? To improve π from V^π we would need the transition model p(s'|s,a); to improve from Q^π we just take argmax_a Q^π(s,a), so no model is needed.

TD control: estimate Q(s,a) online using TD updates, improve policy simultaneously.
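
A minimal sketch of the argmax point: with a tabular Q, the improved (greedy) policy is a one-line argmax, whereas improving from V would require the transition model. The array names and sizes below are illustrative only.

import numpy as np

n_states, n_actions = 5, 3
Q = np.zeros((n_states, n_actions))   # tabular action-value estimates

# Greedy improvement from Q: no transition model p(s'|s,a) needed.
greedy_policy = Q.argmax(axis=1)      # one action index per state

# Improving from V instead would require p(s'|s,a) to evaluate
# sum over s' of p(s'|s,a) * [r + gamma * V(s')] for every action.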


The SARSA Update

Name: SARSA comes from the quintuple (S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}) used in each update.

Q(S_t, A_t) ← Q(S_t, A_t) + α [R_{t+1} + γQ(S_{t+1}, A_{t+1}) - Q(S_t, A_t)]

TD error for SARSA:

δ_t = R_{t+1} + γQ(S_{t+1}, A_{t+1}) - Q(S_t, A_t)

Key feature: uses A_{t+1} — the next action actually taken. This means SARSA uses the same policy to both generate experience AND estimate future value → on-policy.
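
A direct transcription of the update into code (a sketch; Q is assumed to be a 2-D array indexed by integer state and action):

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    # One SARSA step: move Q(s,a) toward R + gamma * Q(s',a').
    td_error = r + gamma * Q[s_next, a_next] - Q[s, a]
    Q[s, a] += alpha * td_error
    return td_error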


Algorithm: SARSA

Input: stepsize α ∈ (0,1], ε ∈ (0,1)

Initialize: Q(s,a) arbitrarily (e.g., 0) for all s ∈ S, a ∈ A; Q(terminal, ·) = 0

Loop (episodes):
  Initialize S
  A ← ε-greedy(Q, S)    ← choose initial action

  Loop (steps in episode):
    Take action A; observe R, S'
    A' ← ε-greedy(Q, S')             ← choose next action from S'
    Q(S, A) ← Q(S, A) + α[R + γQ(S', A') - Q(S, A)]   ← SARSA update
    S ← S'; A ← A'
  Until S is terminal

ε-greedy:

A = argmax_a Q(S, a)   with probability 1-ε
A = random action      with probability ε
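
The boxed algorithm in code, as a sketch under an assumed minimal environment interface: env.reset() returns an integer state and env.step(a) returns (next_state, reward, done). None of these names come from the book's pseudocode; the default parameters (α = 0.5, ε = 0.1, γ = 1, Q initialized to 0) are the values the book uses for the Windy Gridworld example below.

import numpy as np

def epsilon_greedy(Q, s, epsilon, rng):
    # Greedy action with probability 1 - epsilon, uniformly random otherwise.
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))

def sarsa(env, n_states, n_actions, episodes=500, alpha=0.5, gamma=1.0, epsilon=0.1):
    rng = np.random.default_rng(0)
    Q = np.zeros((n_states, n_actions))            # Q(terminal, .) stays 0: never updated
    for _ in range(episodes):
        s = env.reset()
        a = epsilon_greedy(Q, s, epsilon, rng)     # choose initial action
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = epsilon_greedy(Q, s_next, epsilon, rng)   # next action from S'
            # SARSA update: bootstrap from the action actually taken next (on-policy)
            Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])
            s, a = s_next, a_next
    return Q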

Why On-Policy Matters

On-policy: A_{t+1} is chosen by the same ε-greedy policy that chose A_t. The bootstrap target Q(S_{t+1}, A_{t+1}) uses the actual next action, accounting for future exploration.

Consequence: SARSA learns Q^{π_ε} — the value of the ε-greedy policy. It’s cautious: it “knows” that future actions will be ε-exploratory, so it doesn’t assume it will always take the greedy action.

This matters in dangerous environments:
- If falling off a cliff gives reward = -100, SARSA avoids the cliff edge because Q accounts for the ε chance of taking a random action (and possibly falling).
- Q-learning (off-policy) would walk right along the cliff edge, since its greedy target ignores the risk of exploratory actions.


Convergence of SARSA

Theorem: SARSA converges to Q* (and hence π*) with probability 1 if:
1. All (s,a) pairs are visited infinitely often (ensured by ε-greedy with ε → 0 slowly enough)
2. The policy converges in the limit to the greedy policy: ε_t → 0 (e.g., ε_t = 1/t)
3. Step sizes satisfy the Robbins-Monro conditions: Σ α_t = ∞, Σ α_t² < ∞

Practical use: with a fixed ε (e.g., 0.1), SARSA converges to Q^{π_ε}, not Q*. Decrease ε over time to approach Q*.
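
For instance, schedules satisfying conditions 2 and 3 (a sketch of one standard choice, not the only valid one):

def epsilon_schedule(t):
    return 1.0 / (t + 1)     # epsilon_t -> 0, slowly enough that exploration continues

def alpha_schedule(t):
    return 1.0 / (t + 1)     # Robbins-Monro: sum alpha_t = inf, sum alpha_t^2 < inf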


Worked Example: Windy Gridworld (S&B Example 6.5)

Setup:

7 × 10 grid; agent starts at (3,0); goal at (3,7)
Reward: -1 per step
γ = 1

Wind: each column has a crosswind (0–2 cells per step):
Column:   0  1  2  3  4  5  6  7  8  9
Wind:     0  0  0  1  1  1  2  2  1  0

When moving in column c, agent is pushed upward by wind[c] extra cells.
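
A sketch of these dynamics as an environment compatible with the sarsa() sketch above. Assumptions made here, not stated in the example: row 0 is the top of the grid (so the upward wind decreases the row index), moves that would leave the grid are clipped, and states are flattened to single integers.

WIND = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]            # upward push per column
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]     # up, down, left, right as (row, col) deltas
ROWS, COLS = 7, 10
START, GOAL = (3, 0), (3, 7)

class WindyGridworld:
    def reset(self):
        self.pos = START
        return self._index()

    def step(self, a):
        row, col = self.pos
        drow, dcol = ACTIONS[a]
        row = row + drow - WIND[col]             # wind of the current column pushes up
        col = col + dcol
        row = min(max(row, 0), ROWS - 1)         # clip to the grid
        col = min(max(col, 0), COLS - 1)
        self.pos = (row, col)
        return self._index(), -1, self.pos == GOAL   # -1 reward per step, done at goal

    def _index(self):
        return self.pos[0] * COLS + self.pos[1]  # flatten (row, col) to a single integer

A call like sarsa(WindyGridworld(), n_states=70, n_actions=4) with the earlier sketch should then reproduce the behavior described below.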

SARSA learns:
- The policy converges to a near-optimal route through the windy columns.
- With reward -1 per step and γ = 1, the learned values approximate the negative of the expected number of steps remaining to the goal.
- The optimal path takes 15 steps, longer than the 7-step Manhattan distance, because the wind near the goal blows the agent past it and forces a detour.

Learning curve: plotting cumulative episodes completed against time steps, the slope keeps increasing as the agent learns to cope with the wind; by roughly 8,000 time steps the greedy policy is already optimal.


SARSA with Eligibility Traces: SARSA(λ)

Standard SARSA uses 1-step return. SARSA(λ) uses λ-weighted multi-step returns (preview of Ch.12):

For all (s, a):   Q(s, a) ← Q(s, a) + α δ_t e_t(s, a)

Eligibility trace update (accumulating traces):
  e_t(s, a) = γλ e_{t-1}(s, a)                    ← decay all (s,a)
  e_t(S_t, A_t) += 1                              ← bump the current pair

TD error:
  δ_t = R_{t+1} + γ Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)

Effect: credit is spread backward through recent (s,a) pairs via decaying traces. λ=0: SARSA(0) = standard SARSA; λ=1: SARSA(1) ≈ MC control.
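
A sketch of one SARSA(λ) episode with accumulating traces, keeping the trace as an array shaped like Q and reusing the epsilon_greedy helper and environment interface assumed earlier:

import numpy as np

def sarsa_lambda_episode(env, Q, alpha, gamma, lam, epsilon, rng):
    E = np.zeros_like(Q)                         # eligibility trace e(s,a)
    s = env.reset()
    a = epsilon_greedy(Q, s, epsilon, rng)
    done = False
    while not done:
        s_next, r, done = env.step(a)
        a_next = epsilon_greedy(Q, s_next, epsilon, rng)
        delta = r + gamma * Q[s_next, a_next] - Q[s, a]   # SARSA TD error
        E[s, a] += 1                                      # bump trace of current pair
        Q += alpha * delta * E                            # credit ALL recent (s,a) pairs
        E *= gamma * lam                                  # decay every trace
        s, a = s_next, a_next
    return Q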


Comparison: SARSA vs Other On-Policy Methods

                 MC Control          TD(0) Prediction        SARSA
Updates Q?       Yes                 No (updates V)          Yes
Online?          No (episode end)    Yes                     Yes
Bootstrap?       No                  Yes                     Yes
Policy?          ε-greedy            fixed π                 ε-greedy (on-policy)
Target           full return G_t     R_{t+1} + γV(S_{t+1})   R_{t+1} + γQ(S_{t+1}, A_{t+1})

Connection to DynamICCL

SARSA is the on-policy analog of Q-learning; the connection to DynamICCL:
- PPO's policy gradient is on-policy (it uses the current policy's rollouts).
- The one-step advantage estimate A(s_t, a_t) = R_{t+1} + γV(S_{t+1}) - V(S_t) has exactly the form of the SARSA TD error, applied to the value network instead of Q.
- SARSA's caution about exploration (accounting for ε in future Q estimates) mirrors PPO's conservative updates (small KL divergence from the old policy).

If DynamICCL used a tabular setting, SARSA would be the natural on-policy control algorithm. In the continuous state space, SARSA generalizes to semi-gradient SARSA with a neural-network Q-function updated online, which is closer to on-policy actor-critic methods than to DQN (DQN is an off-policy Q-learning method).
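
For the function-approximation case just mentioned, a minimal sketch of semi-gradient SARSA with a linear Q-function (the feature map phi and its dimensionality are placeholders; this is generic semi-gradient SARSA, not DynamICCL's or PPO's actual update):

import numpy as np

def semi_gradient_sarsa_step(w, phi, s, a, r, s_next, a_next, alpha, gamma):
    # Linear approximation Q(s, a) = w . phi(s, a); gradient w.r.t. w is phi(s, a).
    x, x_next = phi(s, a), phi(s_next, a_next)
    td_error = r + gamma * np.dot(w, x_next) - np.dot(w, x)
    w = w + alpha * td_error * x
    return w, td_error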