Q-Learning: Off-Policy TD Control
Chapter 6 — Temporal-Difference Learning
Book: Reinforcement Learning: An Introduction (Sutton & Barto, 2nd ed) Pages: 131–138
The Q-Learning Algorithm
Q-learning (Watkins, 1989) is one of the most influential RL algorithms: model-free, off-policy, and provably convergent to Q* in the tabular case.
The key modification relative to SARSA: instead of using Q(S_{t+1}, A_{t+1}) (the action actually taken), use max_{a'} Q(S_{t+1}, a'), the value of the greedy action.
Q-learning update:
Q(S_t, A_t) ← Q(S_t, A_t) + α [R_{t+1} + γ max_a Q(S_{t+1}, a) - Q(S_t, A_t)]
TD error for Q-learning:
δ_t = R_{t+1} + γ max_{a'} Q(S_{t+1}, a') - Q(S_t, A_t)
The target is independent of the behavior policy. Q-learning updates toward Q* regardless of which actions are actually taken — this is the off-policy nature.
Algorithm: Q-Learning
Input: step size α ∈ (0,1], ε ∈ (0,1)
Initialize: Q(s,a) = 0 for all s ∈ S, a ∈ A; Q(terminal, ·) = 0
Loop (episodes):
Initialize S
Loop (steps in episode):
A ← ε-greedy(Q, S) ← behavior policy (exploratory)
Take action A; observe R, S'
Q(S, A) ← Q(S, A) + α[R + γ max_{a'} Q(S', a') - Q(S, A)] ← Q-learning
S ← S'
Until S is terminal
Note: A_{t+1} is never used in the Q-learning update. The max over a' doesn't require knowing which action will be taken next; it computes the target from the greedy action's value.
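A minimal tabular implementation of the pseudocode above (a sketch: the gym-style interface n_states / n_actions / reset() / step(a) → (s', r, done) is an assumed convention, not the book's code):

```python
import numpy as np

def q_learning(env, n_episodes=500, alpha=0.5, gamma=1.0, eps=0.1, seed=0):
    """Tabular Q-learning with an eps-greedy behavior policy."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((env.n_states, env.n_actions))   # Q(terminal, .) stays 0
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            # behavior policy: eps-greedy w.r.t. the current Q
            if rng.random() < eps:
                a = int(rng.integers(env.n_actions))
            else:
                a = int(np.argmax(Q[s]))
            s2, r, done = env.step(a)
            # greedy (max) target: independent of the next action actually taken
            target = r + (0.0 if done else gamma * np.max(Q[s2]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s2
    return Q
```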
On-Policy (SARSA) vs Off-Policy (Q-Learning): The Key Distinction
SARSA (On-Policy)
δ_t = R_{t+1} + γ Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)
- A_{t+1} is the actual next action (ε-greedy)
- Learns Q^{π_ε}: value of the exploratory policy
- Cautious: future value includes exploration cost
Q-Learning (Off-Policy)
δ_t = R_{t+1} + γ max_{a'} Q(S_{t+1}, a') - Q(S_t, A_t)
- max over all a': uses the greedy target
- Learns Q*: value of the optimal greedy policy
- Optimistic: assumes greedy action taken in future (ignores exploration cost)
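In code the two targets differ by a single line; a side-by-side sketch (function names are illustrative):

```python
import numpy as np

def sarsa_target(Q, r, s2, a2, gamma):
    # on-policy: bootstrap from the action A' actually taken next
    return r + gamma * Q[s2, a2]

def q_learning_target(Q, r, s2, gamma):
    # off-policy: bootstrap from the greedy action, whatever is taken next
    return r + gamma * np.max(Q[s2])
```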
The Cliff Walking Example (S&B Example 6.6)
Setup:
4 × 12 grid
Start: bottom-left; Goal: bottom-right
Cliff: row of states along bottom edge (reward = -100, returns to start)
All other steps: reward = -1
γ = 1
Grid:
. . . . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . .
S C C C C C C C C C C G   ← bottom row: S = start, C = cliff, G = goal
Q-learning behavior: walks along cliff edge (shortest path = 11 steps = reward -11), but occasionally falls off due to ε-exploration → high variance, average reward ≈ -25.
SARSA behavior: takes a safer path farther from the cliff (one row up: 13 steps, reward -13), because its targets account for the risk of falling during ε-exploration → more stable, average reward ≈ -17 (better than Q-learning's average!).
Key insight:
- Q-learning finds the optimal path (along the cliff), but executing it with ε-exploration is risky
- SARSA finds the optimal policy given exploration: the safe path

Q-learning's policy is better in the limit (ε → 0), but SARSA performs better during online training with ε > 0.
This demonstrates: off-policy (Q-learning) converges to Q* but may underperform on-policy methods during training in risky environments.
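A minimal sketch of this environment, matching the interface assumed by the q_learning sketch above (class and attribute names are hypothetical, not the book's code):

```python
class CliffWalk:
    """4x12 cliff-walking grid (S&B Example 6.6). State index = row*12 + col;
    start (3,0), goal (3,11); cliff cells (3,1)..(3,10) give -100 and
    teleport back to the start. All other transitions cost -1."""
    n_states, n_actions = 4 * 12, 4              # actions: up, down, left, right
    _moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]

    def reset(self):
        self.pos = (3, 0)
        return 3 * 12 + 0

    def step(self, a):
        dr, dc = self._moves[a]
        r0, c0 = self.pos
        r1 = min(max(r0 + dr, 0), 3)             # clamp to the grid
        c1 = min(max(c0 + dc, 0), 11)
        if r1 == 3 and 1 <= c1 <= 10:            # stepped off the cliff
            self.pos = (3, 0)
            return 3 * 12 + 0, -100, False
        self.pos = (r1, c1)
        return r1 * 12 + c1, -1, (r1, c1) == (3, 11)
```

Running q_learning(CliffWalk()) and decoding the greedy policy from the returned Q should reproduce the cliff-hugging path described above.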
Convergence of Q-Learning
Theorem (Watkins & Dayan, 1992): Q-learning converges to Q* with probability 1 if:
1. All (s,a) pairs are visited infinitely often (any sufficiently exploratory behavior policy)
2. Robbins-Monro step sizes: Σ_t α_t(s,a) = ∞ and Σ_t α_t²(s,a) < ∞
Proof sketch: the Bellman optimality operator T* is a γ-contraction in the sup-norm, so Q* is its unique fixed point. The Q-learning update applies T* in expectation plus zero-mean sampling noise; standard stochastic-approximation results (under the step-size conditions above) then give convergence to Q*.
Off-policy advantage: the behavior policy b can be anything (random, ε-greedy, human) as long as all (s,a) pairs are covered. Q-learning extracts Q* regardless of how data was collected.
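A common schedule satisfying condition 2 is α_t(s,a) = 1/n(s,a), where n(s,a) counts visits to the pair; a sketch (the 48×4 shape just matches the cliff grid above):

```python
import numpy as np

counts = np.zeros((48, 4))            # one visit counter per (s, a) pair

def robbins_monro_alpha(s, a):
    # alpha_t(s,a) = 1/n(s,a): the harmonic series diverges (sum = inf)
    # while the sum of 1/n^2 converges (pi^2/6), so both conditions hold
    counts[s, a] += 1
    return 1.0 / counts[s, a]
```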
Expected SARSA
A natural generalization: instead of max over Q, take expectation under the target policy:
Q(S_t, A_t) ← Q(S_t, A_t) + α [R_{t+1} + γ Σ_{a'} π(a'|S_{t+1}) Q(S_{t+1}, a') - Q(S_t, A_t)]
For ε-greedy target policy π:
Σ_{a'} π(a'|S_{t+1}) Q(S_{t+1}, a') = (1-ε) max_{a'} Q(S_{t+1}, a') + (ε/|A|) Σ_{a'} Q(S_{t+1}, a')
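In code the ε-greedy expectation is one line (a sketch; names are illustrative). Note that eps=0 recovers the Q-learning target exactly:

```python
import numpy as np

def expected_sarsa_target(Q, r, s2, gamma, eps):
    # E_pi[Q(s', .)] under an eps-greedy target policy pi:
    # (1 - eps) * max_a Q(s', a) + (eps / |A|) * sum_a Q(s', a)
    n_actions = Q.shape[1]
    exp_q = (1 - eps) * np.max(Q[s2]) + (eps / n_actions) * np.sum(Q[s2])
    return r + gamma * exp_q
```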
Properties:
- Lower variance than SARSA (replaces the sampled A_{t+1} with its expectation under π)
- If the target policy is greedy: identical to Q-learning (the expectation under a deterministic greedy policy reduces to the max)
- Can be off-policy whenever target ≠ behavior
Performance: Expected SARSA typically outperforms both SARSA and Q-learning in practice (lower variance, similar bias).
Double Q-Learning: Fixing Maximization Bias
Problem with Q-learning: max_{a'} Q(S_{t+1}, a') uses the same noisy estimates both to select and to evaluate the action, making it a positively biased estimate of the true maximum max_{a'} Q*(S_{t+1}, a').
If the estimates Q̂(s,a) carry zero-mean noise η(s,a), then by Jensen's inequality (max is convex):
E[max_a Q̂(s,a)] ≥ max_a E[Q̂(s,a)] = max_a Q*(s,a)
with strict inequality whenever the noise is non-degenerate.
This maximization bias causes Q-learning to overestimate action values.
Example: |A| = 10, all true Q*(s,a) = 0, estimates Q̂(s,a) ~ N(0, σ²) i.i.d. Then E[max_a Q̂(s,a)] = σ · E[max of 10 standard normals] ≈ 1.54σ > 0, so the maximization bias is ≈ 1.54σ even though every true value is 0.
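A quick Monte Carlo check of that constant (≈1.54 is the expected maximum of 10 i.i.d. standard normals):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, n_actions, n_trials = 1.0, 10, 100_000

# rows = noisy estimates Q_hat(s, a) for 10 actions whose true values
# are all 0; the max over each row is systematically positive
samples = rng.normal(0.0, sigma, size=(n_trials, n_actions))
print(samples.max(axis=1).mean())     # ~1.54 * sigma, not 0
```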
Fix: Double Q-Learning (van Hasselt, 2010):
Maintain two independent Q-functions Q^A and Q^B:
With probability 0.5, update Q^A:
Q^A(S_t,A_t) ← Q^A(S_t,A_t) + α[R_{t+1} + γQ^B(S_{t+1}, argmax_{a'} Q^A(S_{t+1},a')) - Q^A(S_t,A_t)]
Otherwise, update Q^B:
Q^B(S_t,A_t) ← Q^B(S_t,A_t) + α[R_{t+1} + γQ^A(S_{t+1}, argmax_{a'} Q^B(S_{t+1},a')) - Q^B(S_t,A_t)]
Key: use one Q-function to select the action (argmax), the other to evaluate it. Decorrelates selection from evaluation → removes maximization bias.
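A tabular sketch of the coin-flip update (QA/QB and the function name are illustrative; the behavior policy would typically act ε-greedily on QA + QB):

```python
import numpy as np

def double_q_update(QA, QB, s, a, r, s2, done, alpha, gamma, rng):
    # flip a coin: one table selects the argmax, the other evaluates it
    sel, ev = (QA, QB) if rng.random() < 0.5 else (QB, QA)
    a_star = int(np.argmax(sel[s2]))                        # selection
    target = r + (0.0 if done else gamma * ev[s2, a_star])  # evaluation
    sel[s, a] += alpha * (target - sel[s, a])
```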
Double DQN (van Hasselt et al., 2016): applies same idea to DQN:
Target = R_{t+1} + γ Q(S_{t+1}, argmax_{a'} Q(S_{t+1}, a'; θ); θ^-)
where θ = online network (selects action), θ^- = target network (evaluates action).
Double DQN significantly outperforms DQN on Atari games, especially where Q-value overestimation causes instability.
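A batched sketch of this target (q_online / q_target stand in for the two networks' forward passes; these names and array shapes are assumptions, not a specific DQN library's API):

```python
import numpy as np

def double_dqn_targets(q_online, q_target, r, s2, done, gamma):
    """r, done: (batch,) arrays; q_online(s2)/q_target(s2): (batch, n_actions)."""
    a_star = np.argmax(q_online(s2), axis=1)                # select with theta
    q_eval = q_target(s2)[np.arange(len(a_star)), a_star]   # evaluate with theta^-
    return r + gamma * (1.0 - done) * q_eval
```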
Relationship Summary: SARSA vs Q-Learning vs Expected SARSA
| | SARSA | Q-Learning | Expected SARSA |
|---|---|---|---|
| Target | Q(S', A') | max_{a'} Q(S', a') | E_π[Q(S', a')] |
| On/off-policy | On-policy | Off-policy | Either |
| Converges to | Q^{π_ε} | Q* | Q* (if target = greedy) |
| Variance | Medium | Medium | Lower |
| Bias | Bootstrapping only | Bootstrapping + maximization bias | Bootstrapping only |
| Cliff risk | Safe (accounts for ε) | Risky (ignores ε) | Middle ground |
Connection to DynamICCL
Q-learning is the theoretical foundation for DQN and the deep RL algorithms used in DynamICCL:
- DQN = Q-learning + neural network Q-function + experience replay + target network
- DynamICCL target: same as Q-learning's: y_t = R_{t+1} + γ max_{a'} Q(S_{t+1}, a'; θ^-)
- Double DQN fix is important: NCCL configuration values are noisy (network conditions vary), so maximization bias is a real problem
- Q* = optimal NCCL policy: Q(s, NCCL_config) → choose config with highest Q given current network state
The shift from tabular Q-learning (Ch.6) to deep Q-learning (DQN, Ch.9+) is purely about the representation of Q(s,a) — the update rule is identical.