Q-Learning: Off-Policy TD Control

Chapter 6 — Temporal-Difference Learning

Book: Reinforcement Learning: An Introduction (Sutton & Barto, 2nd ed) Pages: 131–138


The Q-Learning Algorithm

Q-learning (Watkins, 1989) is among the best-known RL algorithms: model-free, off-policy, and provably convergent to Q* in the tabular case.

The key modification relative to SARSA: instead of bootstrapping from Q(S_{t+1}, A_{t+1}) (the action actually taken), bootstrap from max_{a'} Q(S_{t+1}, a'), the value of the greedy action.

Q-learning update:

Q(S_t, A_t) ← Q(S_t, A_t) + α [R_{t+1} + γ max_a Q(S_{t+1}, a) - Q(S_t, A_t)]

TD error for Q-learning:

δ_t = R_{t+1} + γ max_{a'} Q(S_{t+1}, a') - Q(S_t, A_t)

The target is independent of the behavior policy. Q-learning updates toward Q* regardless of which actions are actually taken — this is the off-policy nature.


Algorithm: Q-Learning

Input: stepsize α ∈ (0,1], ε ∈ (0,1)

Initialize: Q(s,a) = 0 for all s ∈ S, a ∈ A; Q(terminal, ·) = 0

Loop (episodes):
  Initialize S

  Loop (steps in episode):
    A ← ε-greedy(Q, S)             ← behavior policy (exploratory)
    Take action A; observe R, S'
    Q(S, A) ← Q(S, A) + α[R + γ max_{a'} Q(S', a') - Q(S, A)]  ← Q-learning
    S ← S'
  Until S is terminal

Note: A_{t+1} is never used in the Q-learning update. The max over a' does not require knowing which action will actually be taken next; the target is formed from the value of the greedy action.
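As a concrete sketch, the pseudocode above can be written as tabular Q-learning on a toy problem. The 5-state deterministic chain environment and function names below are illustrative assumptions (not from the book): actions move left/right, every step costs -1, and the episode ends at the rightmost state.

```python
import random

def step(s, a):
    """Toy chain MDP: states 0..4, action 1 = right, 0 = left; -1 per step."""
    s2 = min(s + 1, 4) if a == 1 else max(s - 1, 0)
    return -1.0, s2, s2 == 4  # reward, next state, terminal?

def q_learning(episodes=500, alpha=0.5, gamma=1.0, eps=0.1, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(5) for a in (0, 1)}
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # epsilon-greedy behavior policy
            if rng.random() < eps:
                a = rng.choice((0, 1))
            else:
                a = max((0, 1), key=lambda a_: Q[(s, a_)])
            r, s2, done = step(s, a)
            # Q-learning target: bootstrap from the greedy action at s2
            target = r if done else r + gamma * max(Q[(s2, a_)] for a_ in (0, 1))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q
```

With γ = 1 the optimal values on this chain are Q*(s, right) = -(4 - s), e.g. Q*(0, right) = -4, which the learned table recovers despite the exploratory behavior policy.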


On-Policy (SARSA) vs Off-Policy (Q-Learning): The Key Distinction

SARSA (On-Policy)

δ_t = R_{t+1} + γ Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)

Q-Learning (Off-Policy)

δ_t = R_{t+1} + γ max_{a'} Q(S_{t+1}, a') - Q(S_t, A_t)

The Cliff Walking Example (S&B Example 6.6)

Setup:

4 × 12 grid
Start: bottom-left; Goal: bottom-right
Cliff: row of states along bottom edge (reward = -100, returns to start)
All other steps: reward = -1
γ = 1
Grid:
  .  .  .  .  .  .  .  .  .  .  .  .
  .  .  .  .  .  .  .  .  .  .  .  .
  .  .  .  .  .  .  .  .  .  .  .  .
  S  C  C  C  C  C  C  C  C  C  C  G
     ←────── CLIFF ───────────────→

Q-learning behavior: learns the optimal path along the cliff edge (13 steps, return -13), but ε-exploration occasionally pushes it off the cliff → high variance, online average return per episode roughly -50.

SARSA behavior: learns the longer safe path along the top row (17 steps, return -17) because its targets account for the risk of exploratory falls → more stable, online average return roughly -25 (better than Q-learning's online average!).

Key insight:
- Q-learning finds the optimal path (along the cliff), but executing it with ε-exploration is risky
- SARSA finds the optimal policy given exploration: the safe path

Q-learning’s policy is better in the limit (ε → 0), but SARSA performs better during online training with ε > 0.

This demonstrates: off-policy (Q-learning) converges to Q* but may underperform on-policy methods during training in risky environments.
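For reference, the environment above can be sketched in a few lines; the coordinate convention and function names here are illustrative assumptions, not from the book.

```python
# Cliff walking: 4 rows x 12 columns, row 3 is the bottom row.
# Stepping into a cliff cell costs -100 and teleports back to start;
# every other step costs -1. Moves off the grid leave position clamped.
START, GOAL = (3, 0), (3, 11)
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    r, c = state
    dr, dc = MOVES[action]
    r2 = min(max(r + dr, 0), 3)
    c2 = min(max(c + dc, 0), 11)
    if r2 == 3 and 1 <= c2 <= 10:        # fell off the cliff
        return -100.0, START, False
    return -1.0, (r2, c2), (r2, c2) == GOAL   # reward, state, terminal?
```

The cliff-edge path (up, 11 rights, down) takes 13 steps for a return of -13, while a single step right from the start falls off the cliff for -100 and a reset.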


Convergence of Q-Learning

Theorem (Watkins & Dayan, 1992): Q-learning converges to Q* with probability 1 if:
  1. All (s,a) pairs are visited infinitely often (any sufficiently exploratory behavior policy)
  2. The step sizes satisfy the Robbins-Monro conditions: Σ_t α_t(s,a) = ∞, Σ_t α_t²(s,a) < ∞

Proof sketch: the Bellman optimality operator T* is a γ-contraction in the sup-norm, and the Q-learning update applies a noisy, sampled version of it; standard stochastic-approximation results then give convergence to its fixed point Q*.

Off-policy advantage: the behavior policy b can be anything (random, ε-greedy, human) as long as all (s,a) pairs are covered. Q-learning extracts Q* regardless of how data was collected.


Expected SARSA

A natural generalization: instead of max over Q, take expectation under the target policy:

Q(S_t, A_t) ← Q(S_t, A_t) + α [R_{t+1} + γ Σ_{a'} π(a'|S_{t+1}) Q(S_{t+1}, a') - Q(S_t, A_t)]

For ε-greedy target policy π:

E_π[Q(S_{t+1}, a')] = (1-ε) max_{a'} Q(S_{t+1}, a') + ε/|A| Σ_{a'} Q(S_{t+1}, a')

Properties:
- Lower variance than SARSA (replaces the sampled A_{t+1} with its expectation)
- If the target policy is greedy, Expected SARSA is exactly Q-learning (the expectation under a deterministic greedy policy reduces to the max)
- Can be off-policy whenever the target policy differs from the behavior policy
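The ε-greedy expectation above is a one-liner; the helper name is an assumption for illustration.

```python
def expected_q(q_values, eps):
    """E_pi[Q(s', A')] under an epsilon-greedy policy over q_values."""
    n = len(q_values)
    return (1 - eps) * max(q_values) + (eps / n) * sum(q_values)
```

Setting eps = 0 recovers the Q-learning max target, making the "Expected SARSA generalizes Q-learning" claim concrete.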

Performance: Expected SARSA typically outperforms both SARSA and Q-learning in practice (lower variance, similar bias).


Double Q-Learning: Fixing Maximization Bias

Problem with Q-learning: max_{a'} Q(S_{t+1}, a') uses the same noisy estimates both to select and to evaluate the maximizing action, making it a positively biased estimate of the true maximum max_{a'} q*(S_{t+1}, a').

If Q(s,a) has estimation noise η(s,a), then:

E[max_a Q(s,a)] > max_a E[Q(s,a)] = max_a Q*(s,a)

This maximization bias causes Q-learning to overestimate action values.

Example: |A| = 10, all true Q(s,a) = 0, estimates Q̂(s,a) ~ N(0, σ²). Then E[max_a Q̂(s,a)] = σ · E[max of 10 standard normals] ≈ 1.54σ > 0. Maximization bias ≈ 1.54σ.
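A quick Monte Carlo check of this number (the function name and sample counts are illustrative): averaging the max of 10 draws from N(0, 1) comes out near 1.54, even though every true value is 0.

```python
import random
import statistics

def avg_max_estimate(n_actions=10, trials=100_000, seed=0):
    """Average of max over n_actions i.i.d. N(0,1) 'value estimates'."""
    rng = random.Random(seed)
    return statistics.fmean(
        max(rng.gauss(0.0, 1.0) for _ in range(n_actions))
        for _ in range(trials)
    )
```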

Fix: Double Q-Learning (van Hasselt, 2010):

Maintain two independent Q-functions Q^A and Q^B:

With probability 0.5, update Q^A:
  Q^A(S_t,A_t) ← Q^A(S_t,A_t) + α[R_{t+1} + γQ^B(S_{t+1}, argmax_{a'} Q^A(S_{t+1},a')) - Q^A(S_t,A_t)]

Otherwise, update Q^B:
  Q^B(S_t,A_t) ← Q^B(S_t,A_t) + α[R_{t+1} + γQ^A(S_{t+1}, argmax_{a'} Q^B(S_{t+1},a')) - Q^B(S_t,A_t)]

Key: use one Q-function to select the action (argmax), the other to evaluate it. Decorrelates selection from evaluation → removes maximization bias.
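A single update step of the scheme above might look like this (a tabular sketch; the dict-based tables and function name are assumptions):

```python
import random

def double_q_update(qa, qb, s, a, r, s2, actions, alpha, gamma, rng):
    """One double Q-learning step on tables qa, qb: dicts (state, action) -> value."""
    if rng.random() < 0.5:
        qa, qb = qb, qa  # with prob 0.5, update the other table (local rebind)
    # select the action with one table, evaluate it with the other
    a_star = max(actions, key=lambda a2: qa[(s2, a2)])
    target = r + gamma * qb[(s2, a_star)]
    qa[(s, a)] += alpha * (target - qa[(s, a)])
```

Because selection (argmax under qa) and evaluation (qb's value) use independent noise, the positive maximization bias cancels in expectation.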

Double DQN (van Hasselt et al., 2016): applies same idea to DQN:

Target = R_{t+1} + γ Q(S_{t+1}, argmax_{a'} Q(S_{t+1}, a'; θ); θ^-)

where θ = online network (selects action), θ^- = target network (evaluates action).
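In batched form the target computation is a couple of lines; this NumPy sketch assumes precomputed next-state Q-values from each network (array shapes and names are illustrative):

```python
import numpy as np

def double_dqn_targets(rewards, q_online_next, q_target_next, gamma, dones):
    """Double DQN targets for a batch.

    q_online_next, q_target_next: (batch, n_actions) Q-values at S_{t+1}
    from the online network (theta) and target network (theta^-).
    """
    a_star = np.argmax(q_online_next, axis=1)                # select with theta
    q_eval = q_target_next[np.arange(len(a_star)), a_star]   # evaluate with theta^-
    return rewards + gamma * (1.0 - dones) * q_eval          # zero bootstrap at terminals
```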

Double DQN significantly outperforms DQN on Atari games, especially where Q-value overestimation causes instability.


Relationship Summary: SARSA vs Q-Learning vs Expected SARSA

               SARSA                   Q-Learning                Expected SARSA
Target         Q(S', A')               max_{a'} Q(S', a')        E_π[Q(S', a')]
On/Off-policy  On-policy               Off-policy                Either
Converges to   Q^{π_ε}                 Q*                        Q* (if target = greedy)
Variance       Medium                  Medium                    Lower
Bias           Same                    Same (+ max bias)         Same
Cliff risk     Safe (accounts for ε)   Risky (ignores ε)         Middle ground

Connection to DynamICCL

Q-learning is the theoretical foundation for DQN and the deep RL algorithms used in DynamICCL:

  1. DQN = Q-learning + neural network Q-function + experience replay + target network
  2. DynamICCL target: the same as Q-learning's, y_t = R_{t+1} + γ max_{a'} Q(S_{t+1}, a'; θ^-)
  3. Double DQN fix is important: NCCL configuration values are noisy (network conditions vary), so maximization bias is a real problem
  4. Q* = optimal NCCL policy: Q(s, NCCL_config) → choose config with highest Q given current network state

The shift from tabular Q-learning (Ch.6) to deep Q-learning (DQN, Ch.9+) is purely about the representation of Q(s,a) — the update rule is identical.