SARSA: On-Policy TD Control
Chapter 6 — Temporal-Difference Learning
Book: Reinforcement Learning: An Introduction (Sutton & Barto, 2nd ed) Pages: 129–133
From Prediction to Control
TD(0) predicts V^π for a fixed policy. For control, we need to estimate Q^π(s,a) and improve the policy.
Why Q instead of V? To improve π from V^π we need the transition model p(s'|s,a) for a one-step lookahead. To improve from Q^π we just take argmax_a Q^π(s,a); no model is needed.
TD control: estimate Q(s,a) online using TD updates, improve policy simultaneously.
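A tiny sketch of this difference; the tables and model below are placeholder arrays for illustration, not from the book:

```python
import numpy as np

# Placeholder tables for a small MDP (illustrative shapes and values only).
n_states, n_actions, gamma = 3, 2, 0.9
V = np.zeros(n_states)                       # V^pi(s)
Q = np.zeros((n_states, n_actions))          # Q^pi(s, a)
R = np.zeros((n_states, n_actions))          # expected reward r(s, a)
P = np.full((n_states, n_actions, n_states), 1.0 / n_states)   # model p(s'|s, a)

# Improving from V^pi requires the model P for a one-step lookahead.
greedy_from_V = np.argmax(R + gamma * (P @ V), axis=1)

# Improving from Q^pi is a pure table lookup; no model needed.
greedy_from_Q = np.argmax(Q, axis=1)
```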
The SARSA Update
Name: SARSA comes from the quintuple (S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}) used in each update.
Q(S_t, A_t) ← Q(S_t, A_t) + α [R_{t+1} + γQ(S_{t+1}, A_{t+1}) - Q(S_t, A_t)]
TD error for SARSA:
δ_t = R_{t+1} + γQ(S_{t+1}, A_{t+1}) - Q(S_t, A_t)
Key feature: uses A_{t+1} — the next action actually taken. This means SARSA uses the same policy to both generate experience AND estimate future value → on-policy.
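A single worked update makes the formula concrete; all numbers here are invented for illustration (α = 0.5, γ = 0.9):

```python
# One hypothetical SARSA update; the states, actions, and values are made up.
alpha, gamma = 0.5, 0.9
Q = {("s1", "right"): 2.0, ("s2", "up"): 3.0}    # current Q(S_t, A_t) and Q(S_{t+1}, A_{t+1})

R = 1.0                                           # observed reward R_{t+1}
td_target = R + gamma * Q[("s2", "up")]           # 1.0 + 0.9 * 3.0 = 3.7
td_error = td_target - Q[("s1", "right")]         # delta_t = 3.7 - 2.0 = 1.7
Q[("s1", "right")] += alpha * td_error            # 2.0 + 0.5 * 1.7 = 2.85
```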
Algorithm: SARSA
Input: stepsize α ∈ (0,1], ε ∈ (0,1)
Initialize: Q(s,a) = 0 for all s ∈ S, a ∈ A; Q(terminal, ·) = 0
Loop (episodes):
    Initialize S
    A ← ε-greedy(Q, S)                                   (choose initial action)
    Loop (steps in episode):
        Take action A; observe R, S'
        A' ← ε-greedy(Q, S')                             (choose next action from S')
        Q(S, A) ← Q(S, A) + α[R + γQ(S', A') - Q(S, A)]  (SARSA update)
        S ← S'; A ← A'
    Until S is terminal
ε-greedy:
A = argmax_a Q(S, a) with probability 1-ε
A = random action with probability ε
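The pseudocode translates almost line for line into a tabular implementation. A minimal sketch, assuming a gym-style environment where `env.reset()` returns an integer state and `env.step(a)` returns `(next_state, reward, done)`; that interface is an assumption of this sketch, not part of the book's pseudocode:

```python
import numpy as np

def epsilon_greedy(Q, s, eps, rng):
    """Greedy action with probability 1 - eps (ties broken at random), else a random action."""
    if rng.random() < eps:
        return int(rng.integers(Q.shape[1]))
    return int(rng.choice(np.flatnonzero(Q[s] == Q[s].max())))

def sarsa(env, n_states, n_actions, episodes=500, alpha=0.5, gamma=1.0, eps=0.1, seed=0):
    """Tabular SARSA; terminal values are never bootstrapped from, thanks to the done flag."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        a = epsilon_greedy(Q, s, eps, rng)                 # choose initial action
        done = False
        while not done:
            s_next, r, done = env.step(a)                  # take action A; observe R, S'
            a_next = epsilon_greedy(Q, s_next, eps, rng)   # choose next action from S'
            target = r if done else r + gamma * Q[s_next, a_next]
            Q[s, a] += alpha * (target - Q[s, a])          # SARSA update
            s, a = s_next, a_next
    return Q
```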
Why On-Policy Matters
On-policy: A_{t+1} is chosen by the same ε-greedy policy that chose A_t. The bootstrap target Q(S_{t+1}, A_{t+1}) uses the actual next action, accounting for future exploration.
Consequence: SARSA learns Q^{π_ε} — the value of the ε-greedy policy. It’s cautious: it “knows” that future actions will be ε-exploratory, so it doesn’t assume it will always take the greedy action.
This matters in dangerous environments:
- If falling off a cliff gives reward -100, SARSA avoids the cliff edge, because its Q-values account for the ε chance of taking a random action (and possibly falling).
- Q-learning (off-policy) would walk right along the cliff edge, because its target assumes the greedy (safe) action is always taken next.
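The cliff intuition shows up directly in the bootstrap term of the two targets; the Q-values below are invented for illustration (action 0 steps away from the edge, action 1 hugs it):

```python
import numpy as np

rng = np.random.default_rng(0)
Q_next = np.array([-15.0, -60.0])     # illustrative Q(S_{t+1}, .) near the cliff edge
r, gamma, eps = -1.0, 1.0, 0.1

# SARSA (on-policy): bootstrap from the action A_{t+1} the eps-greedy policy actually samples,
# so with probability eps the target reflects a random, possibly disastrous, move.
a_next = int(rng.integers(len(Q_next))) if rng.random() < eps else int(Q_next.argmax())
sarsa_target = r + gamma * Q_next[a_next]

# Q-learning (off-policy): bootstrap from the greedy action, ignoring exploration risk.
qlearning_target = r + gamma * Q_next.max()
```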
Convergence of SARSA
Theorem: SARSA converges to Q* (and hence to π*) with probability 1 if:
1. All (s, a) pairs are visited infinitely often (ensured by ε-greedy exploration with ε → 0 slowly enough).
2. The policy converges in the limit to the greedy policy: ε_t → 0 (e.g., ε_t = 1/t).
3. Step sizes satisfy the Robbins-Monro conditions: Σ α_t = ∞, Σ α_t² < ∞.
Practical use: a fixed ε (e.g., 0.1) converges to Q^{π_ε}, not Q*. Decrease ε over time to approach Q*.
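Schedules of roughly this shape satisfy conditions 2 and 3; the exact constants here are arbitrary choices, not prescribed by the book:

```python
def epsilon_schedule(t):
    """Condition 2: eps_t -> 0, slowly enough that all (s, a) keep being visited."""
    return 1.0 / (t + 1)

def alpha_schedule(n_visits):
    """Condition 3 (Robbins-Monro), applied per (s, a): sum 1/n diverges, sum 1/n^2 converges."""
    return 1.0 / (n_visits + 1)
```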
Worked Example: Windy Gridworld (S&B Example 6.5)
Setup:
7 × 10 grid; agent starts at (3,0); goal at (3,7)
Reward: -1 per step
γ = 1
Wind: each column has a crosswind (0–2 cells per step):
| Column | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| Wind (cells pushed up) | 0 | 0 | 0 | 1 | 1 | 1 | 2 | 2 | 1 | 0 |
When the agent moves from a cell in column c, it is pushed upward by wind[c] extra cells.
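A sketch of these dynamics under the specification above; the class layout, method names, and integer state encoding are my own choices, not the book's:

```python
class WindyGridworld:
    """7 rows x 10 columns; start (3, 0), goal (3, 7); reward -1 per step; gamma = 1."""
    WIND = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]                      # upward push per column
    MOVES = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}     # up, down, left, right

    def reset(self):
        self.pos = (3, 0)
        return self._state()

    def step(self, action):
        row, col = self.pos
        d_row, d_col = self.MOVES[action]
        row = min(max(row + d_row - self.WIND[col], 0), 6)     # wind of the origin column pushes up
        col = min(max(col + d_col, 0), 9)
        self.pos = (row, col)
        return self._state(), -1.0, self.pos == (3, 7)

    def _state(self):
        return self.pos[0] * 10 + self.pos[1]                  # integer id, usable with a Q table
```

Plugged into the tabular SARSA sketch above, `sarsa(WindyGridworld(), n_states=70, n_actions=4)` would run this Example 6.5 setup.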
SARSA learns:
- The policy converges to a near-optimal route through the windy columns.
- The value function is roughly V ≈ -(remaining steps to the goal), i.e., the Manhattan distance adjusted for the wind.
- The optimal path takes ~15 steps, more than the bare Manhattan distance of 7, because the wind blows the agent off the direct route and it must detour.
Learning curve: plotting completed episodes against cumulative time steps, the slope increases sharply (around step ~10,000 in a typical run) as the agent discovers a reliable route through the wind and finishes episodes faster and faster.
SARSA with Eligibility Traces: SARSA(λ)
Standard SARSA uses 1-step return. SARSA(λ) uses λ-weighted multi-step returns (preview of Ch.12):
Q(s, a) ← Q(s, a) + α δ_t e_t(s, a)    for all (s, a)
Eligibility trace update:
e_t(s, a) = γλ e_{t-1}(s, a)           (decay all traces)
e_t(S_t, A_t) ← e_t(S_t, A_t) + 1      (then increment the trace of the current pair)
TD error:
δ_t = R_{t+1} + γ Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)
Effect: credit is spread backward through recent (s,a) pairs via decaying traces. λ=0: SARSA(0) = standard SARSA; λ=1: SARSA(1) ≈ MC control.
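One step of tabular SARSA(λ) with accumulating traces, as a sketch; the array names mirror the equations above, and the episode loop around it is omitted:

```python
import numpy as np

def sarsa_lambda_step(Q, E, s, a, r, s_next, a_next, done, alpha=0.5, gamma=1.0, lam=0.9):
    """Q and E are (n_states, n_actions) arrays; E holds the eligibility traces."""
    target = r if done else r + gamma * Q[s_next, a_next]
    delta = target - Q[s, a]          # TD error delta_t
    E *= gamma * lam                  # decay every trace: e_t = gamma * lambda * e_{t-1}
    E[s, a] += 1.0                    # accumulate the trace of the current pair
    Q += alpha * delta * E            # update ALL (s, a), weighted by their traces
    return Q, E
```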
Comparison: SARSA vs Other On-Policy Methods
| | MC Control | TD(0) Prediction | SARSA |
|---|---|---|---|
| Updates Q? | Yes | No (updates V) | Yes |
| Online? | No (episode end) | Yes | Yes |
| Bootstrap? | No | Yes | Yes |
| Policy? | ε-greedy | Fixed π | ε-greedy (on-policy) |
| Target includes? | Full return G_t | γV(S_{t+1}) | γQ(S_{t+1},A_{t+1}) |
Connection to DynamICCL
SARSA is the on-policy analog of Q-learning, and the same ideas appear in DynamICCL:
- PPO's policy gradient is on-policy (it uses rollouts from the current policy).
- The one-step advantage A(s_t, a_t) = R_{t+1} + γV(S_{t+1}) - V(S_t) is exactly the TD error applied to the value network (the state-value counterpart of the SARSA error).
- SARSA's caution about exploration (accounting for ε in future Q estimates) mirrors PPO's conservative update (small KL divergence from the old policy).
If DynamICCL used a tabular setting, SARSA would be the natural on-policy control algorithm. In the continuous state space, SARSA generalizes to neural-network Q-functions with online gradient updates (semi-gradient SARSA, the on-policy counterpart of DQN-style methods and a building block of online actor-critic).
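As a sketch of that correspondence, the one-step advantage used by such methods is literally the TD error of the value function; the `v_net` callable below is a placeholder, not DynamICCL's actual interface:

```python
def one_step_advantage(v_net, s, r, s_next, done, gamma=0.99):
    """A(s, a) estimate = delta_t = R_{t+1} + gamma * V(S_{t+1}) - V(S_t)."""
    bootstrap = 0.0 if done else gamma * v_net(s_next)
    return r + bootstrap - v_net(s)
```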