TD Analysis, Optimality, and Advanced Topics
Chapter 6 — Temporal-Difference Learning
Book: Reinforcement Learning: An Introduction (Sutton & Barto, 2nd ed) Pages: 132–143
TD(0) Convergence Analysis: Stochastic Approximation
TD(0) is an instance of stochastic approximation (Robbins-Monro, 1951).
The general form:
θ_{t+1} = θ_t + α_t [f(θ_t) + ε_t]
where f(θ) is the desired update direction and ε_t is zero-mean noise.
For TD(0): θ = V, f(V)(s) = E_π[R_{t+1} + γV(S_{t+1}) - V(S_t) | S_t = s] = T^π V(s) - V(s) where T^π is the Bellman evaluation operator.
The fixed point of f(V) = 0 is V^π. Since T^π is a γ-contraction:
- f(V)(s) = 0 iff V = V^π
- The noise ε_t = (R_{t+1} + γV(S_{t+1})) - (T^π V)(S_t) has zero conditional mean
Robbins-Monro guarantee: with Σ α_t = ∞ and Σ α_t² < ∞, θ_t → θ* (the solution of f(θ*) = 0) with probability 1.
Rate: with constant α, convergence to a neighborhood of V^π; with appropriately decaying α, exact convergence, with mean-squared error shrinking at roughly O(1/t) after a burn-in period.
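To make the stochastic-approximation view concrete, here is a minimal sketch (on a made-up 3-state Markov reward process, not an example from the book) of tabular TD(0) with per-state 1/n step sizes converging to V^π, which is available in closed form from the Bellman equation:

```python
# TD(0) as stochastic approximation on a hypothetical 3-state MRP.
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9
P = np.array([[0.0, 0.8, 0.2],        # state-to-state transitions under pi
              [0.1, 0.0, 0.9],
              [0.5, 0.5, 0.0]])
r = np.array([1.0, 0.0, 2.0])         # expected reward on leaving each state

# Bellman fixed point: V^pi = (I - gamma P)^{-1} r
V_true = np.linalg.solve(np.eye(3) - gamma * P, r)

V = np.zeros(3)
visits = np.zeros(3)
s = 0
for _ in range(200_000):
    s_next = rng.choice(3, p=P[s])
    visits[s] += 1
    alpha = 1.0 / visits[s]           # per state: sum alpha = inf, sum alpha^2 < inf
    V[s] += alpha * (r[s] + gamma * V[s_next] - V[s])   # TD(0) update
    s = s_next

print(np.abs(V - V_true).max())       # shrinks toward 0 as t grows
```

The count-based step size satisfies both Robbins-Monro conditions in every state, which is exactly what the exact-convergence claim above requires.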
Why TD(0) Beats MC on Batch Data
Batch setting: fixed dataset D = {episode_1, …, episode_k}. Run updates repeatedly until convergence.
Theorem (Sutton): batch TD(0) and batch MC converge to different solutions:
- MC: minimizes the mean squared error on the observed returns (each V(s) becomes the average of the returns seen from s)
- TD(0): finds the V that is exactly correct for the maximum likelihood model M̂ derived from D
Maximum likelihood model M̂ from data D:
p̂(s'|s,a) = (count(s,a,s') in D) / (count(s,a) in D)
r̂(s,a,s') = mean reward observed for (s,a,s') transitions
Batch TD(0) converges to V^π for the MDP (S, A, p̂, r̂, γ), the best-fit MDP given D. This is known as the certainty-equivalence estimate.
Why TD is better: TD exploits the MDP structure. It “knows” that V(s) should be consistent with Bellman equations — so it propagates information through the MDP transitions, reducing estimation error for states with few direct observations.
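To see operationally what "consistent with the maximum likelihood model" means, the sketch below (my own illustration, with the policy folded into state-to-state transitions) fits p̂ and r̂ by counting and then solves the fitted model exactly. Batch TD(0) converges to this same solution.

```python
# Certainty-equivalence sketch: fit the ML model from a batch of
# (s, r, s_next) transitions, then solve that model exactly.
from collections import Counter, defaultdict
import numpy as np

def certainty_equivalence_V(transitions, states, gamma):
    """transitions: list of (s, r, s_next); s_next is None if terminal."""
    idx = {s: i for i, s in enumerate(states)}
    n_from = Counter()                    # count(s)
    n_pair = Counter()                    # count(s, s_next)
    rewards = defaultdict(list)           # rewards observed on (s, s_next)
    for s, rew, s2 in transitions:
        n_from[s] += 1
        n_pair[(s, s2)] += 1
        rewards[(s, s2)].append(rew)
    P = np.zeros((len(states), len(states)))
    R = np.zeros(len(states))
    for (s, s2), c in n_pair.items():
        p_hat = c / n_from[s]                           # ML transition probability
        R[idx[s]] += p_hat * np.mean(rewards[(s, s2)])  # ML expected reward
        if s2 is not None:                              # terminals add no bootstrap
            P[idx[s], idx[s2]] += p_hat
    return np.linalg.solve(np.eye(len(states)) - gamma * P, R)
```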
The AB Example: Batch TD vs MC
Data (S&B Example 6.4):
8 episodes observed:
- 1 episode: A, 0, B, 0 (A to B with reward 0, then B terminates with reward 0)
- 6 episodes: B, 1 (B terminates with reward 1)
- 1 episode: B, 0 (B terminates with reward 0)
MC estimate:
- V_MC(B) = 6/8 = 0.75 (B is visited 8 times; the return is 1 six times and 0 twice, so the average is (6 × 1 + 2 × 0) / 8)
- V_MC(A) = 0 (only one observation of A, and its return was 0)
TD estimate:
- V_TD(B) = 0.75 (same as MC; there are enough direct observations of B)
- V_TD(A) = V_TD(B) = 0.75 (every observed A transitions to B with reward 0, so V(A) = 0 + V(B))
Which is correct, given only this data?
- If A always leads to B, then V(A) should equal V(B) = 0.75, so TD is right
- MC's V(A) = 0 ignores the MDP structure (the observed A → B transition)
TD is the better estimator whenever the Markov assumption holds, which is exactly the premise of the MDP formulation. The sketch below reproduces both estimates.
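Reusing certainty_equivalence_V from the sketch above (the episode encoding here is mine), both estimates fall out in a few lines:

```python
import numpy as np

# One episode A,0,B,0; six episodes B,1; one episode B,0 (gamma = 1)
episodes = [[("A", 0, "B"), ("B", 0, None)]] \
         + [[("B", 1, None)]] * 6 + [[("B", 0, None)]]
transitions = [t for ep in episodes for t in ep]

V_td = certainty_equivalence_V(transitions, ["A", "B"], gamma=1.0)
print(dict(zip(["A", "B"], V_td)))            # {'A': 0.75, 'B': 0.75}

# Batch MC: average the return observed from each first visit
returns = {"A": [], "B": []}
for ep in episodes:
    G, g = {}, 0.0
    for s, rew, _ in reversed(ep):
        g = rew + g                            # gamma = 1 in this example
        G[s] = g                               # backward pass keeps the first visit
    for s, g_s in G.items():
        returns[s].append(g_s)
print({s: np.mean(gs) for s, gs in returns.items()})  # {'A': 0.0, 'B': 0.75}
```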
Unified View: DP vs MC vs TD
| Method | Uses model? | Bootstraps? | Learns online? |
|---|---|---|---|
| DP | Yes | Yes | No (sweeps a known model) |
| MC | No | No | No (waits for episode end) |
| TD | No | Yes | Yes |
“If you have the model, use DP. If you don’t, use TD.” — Sutton
The convergence hierarchy:
Exact DP → Asymptotic TD(0) → Asymptotic MC
(DP: no noise; TD: bootstrap noise; MC: full return noise)
TD occupies the sweet spot: no model required (vs DP), but lower variance than MC.
Backup Depth and Width
| Backup depth | Full (expected) backup | Sample backup |
|---|---|---|
| 1 step | DP policy evaluation | TD(0) / SARSA / Q-learning |
| n steps | n-step DP | n-step TD (Ch. 7) |
| Full episode (T) | Full DP | Monte Carlo |
The dimensions:
- Width: DP averages over all possible (s', r) pairs (a "wide" backup); TD samples a single (s', r) (a "narrow" backup)
- Depth: DP and TD(0) back up 1 step; MC backs up the full episode of T steps
TD(λ) (Ch. 12) varies the depth; function approximation (Ch. 9) varies the representation. These are two independent dimensions of RL method design. The helper below illustrates the depth axis.
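One helper makes the depth axis concrete: varying n in the target below sweeps from TD(0) (n = 1) through n-step TD to the full Monte Carlo return (n ≥ T - t). Conventions here are mine; n-step methods are Ch. 7 material.

```python
def n_step_target(rewards, states, V, t, n, gamma):
    """Target G_{t:t+n}. rewards[k] is R_{k+1}; states[k] is S_k; V maps states to estimates."""
    T = len(rewards)                       # episode length
    end = min(t + n, T)
    G = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    if end < T:                            # bootstrap unless the episode ended
        G += gamma ** (end - t) * V[states[end]]
    return G
```

With n = 1 this is exactly the TD(0) target R_{t+1} + γV(S_{t+1}); once n reaches the end of the episode the bootstrap term disappears and the target is the full return.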
Practical Considerations: Choosing α
- Too large α: updates oscillate or diverge
- Too small α: convergence is slow
For tabular TD(0), a common starting point is α = 0.1; decay it over time (so that the Robbins-Monro conditions hold) for exact convergence.
For neural network approximation: use standard neural network learning rates (1e-3 to 1e-4 with Adam optimizer). The Robbins-Monro conditions are violated by constant α, but in practice, the approximate convergence is sufficient.
Stepsize matrix (more advanced): instead of a scalar α, precondition updates with an inverse curvature matrix H^{-1} (Hessian or Fisher information), which yields natural-gradient methods. This idea underlies natural policy gradient and TRPO; PPO approximates the same trust-region effect with a clipped objective instead.
TD and the Deadly Triad (Preview of Ch.11)
The deadly triad: three ingredients that together can cause divergence:
1. Function approximation (neural nets, linear)
2. Bootstrapping (TD methods)
3. Off-policy learning
All three are present in DQN and most deep RL algorithms. Why doesn't DQN (usually) diverge?
- Target network (frozen weights in the TD target) adds stability
- Experience replay breaks correlations between consecutive samples
- Gradient clipping / Huber loss limits update magnitudes
But divergence is still possible! Many DQN variants fail in edge cases. This is an active research area.
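As a concrete illustration of the target-network ingredient, here is a minimal sketch of one semi-gradient TD(0) step on a linear approximator with frozen target weights (all names are illustrative; this is not DQN's actual code):

```python
import numpy as np

def semi_gradient_td_step(w, w_target, phi_s, rew, phi_s2, alpha, gamma):
    """One semi-gradient TD(0) step on linear V(s) = w . phi(s)."""
    target = rew + gamma * (w_target @ phi_s2)   # bootstrap from frozen weights
    delta = target - (w @ phi_s)                 # TD error
    return w + alpha * delta * phi_s             # semi-gradient: target not differentiated

# Training-loop skeleton: sync the frozen copy only occasionally.
# for t, (phi_s, rew, phi_s2) in enumerate(samples):
#     w = semi_gradient_td_step(w, w_target, phi_s, rew, phi_s2, 0.01, 0.99)
#     if t % sync_every == 0:
#         w_target = w.copy()
```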
Chapter 6 Summary
| Algorithm | Policy | Convergence | Target |
|---|---|---|---|
| TD(0) prediction | Fixed π | V^π (with probability 1) | R + γV(S') |
| SARSA | On-policy (ε-greedy) | Q^{π_ε} | R + γQ(S', A') |
| Q-learning | Off-policy | Q* | R + γ max_{a'} Q(S', a') |
| Expected SARSA | Either | Q^π of the target policy (Q* if the target is greedy) | R + γ E_π[Q(S', a')] |
| Double Q-learning | Off-policy | Q* (reduced maximization bias) | R + γ Q^B(S', argmax_{a'} Q^A(S', a')) |
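Written as code, the bootstrap targets in the table differ only in how they summarize Q at the next state (a sketch with hypothetical names; Q is a dict keyed by (state, action)):

```python
def sarsa_target(rew, gamma, Q, s2, a2):
    return rew + gamma * Q[(s2, a2)]                       # action actually taken

def q_learning_target(rew, gamma, Q, s2, actions):
    return rew + gamma * max(Q[(s2, a)] for a in actions)  # greedy bootstrap

def expected_sarsa_target(rew, gamma, Q, s2, actions, pi):
    # pi(a, s2): probability that the target policy takes a in state s2
    return rew + gamma * sum(pi(a, s2) * Q[(s2, a)] for a in actions)

def double_q_target(rew, gamma, QA, QB, s2, actions):
    a_star = max(actions, key=lambda a: QA[(s2, a)])       # select with Q^A
    return rew + gamma * QB[(s2, a_star)]                  # evaluate with Q^B
```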
Critical insights:
1. TD = MC + bootstrapping: lower variance, slight bias
2. Q-learning is a model-free approximation to value iteration
3. SARSA is the on-policy counterpart of Q-learning, and is safer in environments where exploration is risky
4. Double Q-learning removes maximization bias (important for deep RL)
5. Batch TD converges to MDP-consistent values, making better use of the data than MC
Connection to DynamICCL
DynamICCL's Q-value learning is built directly on Q-learning:
NCCL Config Selection (Q-learning):
- State: current network state s_t
- Action: NCCL config a_t (algorithm, chunk size, pipeline)
- Reward: r_t = throughput improvement
- Update: Q(s_t, a_t) ← Q(s_t, a_t) + α[r_t + γ max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)]
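A hypothetical sketch of the tabular core of that loop; the hooks measure_state, apply_config, and throughput_improvement are placeholders, not a real NCCL or DynamICCL API:

```python
from collections import defaultdict

Q = defaultdict(float)              # Q[(state, config)] -> estimated value
alpha, gamma = 0.1, 0.9

def q_update(s, a, rew, s2, configs):
    """One Q-learning update after trying config a in network state s."""
    best_next = max(Q[(s2, a2)] for a2 in configs)
    Q[(s, a)] += alpha * (rew + gamma * best_next - Q[(s, a)])

# Loop skeleton (placeholders):
# s = measure_state()
# a = epsilon_greedy(Q, s, configs)
# apply_config(a)
# q_update(s, a, throughput_improvement(), measure_state(), configs)
```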
Extensions used in DynamICCL:
- Neural network Q-function: Q(s, a; θ) replaces the tabular Q
- Double Q / target network: reduces maximization bias and stabilizes training
- PPO (Ch. 13): a policy-gradient method that still relies on learned value estimates through its advantage function