TD Analysis, Optimality, and Advanced Topics

Chapter 6 — Temporal-Difference Learning

Book: Reinforcement Learning: An Introduction (Sutton & Barto, 2nd ed) Pages: 132–143


TD(0) Convergence Analysis: Stochastic Approximation

TD(0) is an instance of stochastic approximation (Robbins-Monro, 1951).

The general form:

θ_{t+1} = θ_t + α_t [f(θ_t) + ε_t]

where f(θ) is the desired update direction and ε_t is zero-mean noise.

For TD(0): θ = V, f(V)(s) = E_π[R_{t+1} + γV(S_{t+1}) - V(S_t) | S_t = s] = T^π V(s) - V(s) where T^π is the Bellman evaluation operator.

The fixed point of f(V) = 0 is V^π. Since T^π is a γ-contraction:
- f(V)(s) = 0 iff V = V^π
- The noise ε_t = (R_{t+1} + γV(S_{t+1})) - (T^π V)(S_t) has zero conditional mean

Robbins-Monro guarantee: with Σ α_t = ∞ and Σ α_t² < ∞, θ_t → θ* where f(θ*) = 0.

Rate: for constant α, convergence to a neighborhood of V^π; for decaying α, exact convergence at rate O(1/t) after appropriate burn-in.
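A minimal runnable sketch of TD(0) as stochastic approximation, on a toy two-state chain invented here for illustration (state 0 moves to state 1 with reward 0; state 1 terminates with a noisy reward of 0 or 2, mean 1; γ = 1, so V(0) = V(1) = 1). The per-state stepsize 1/N(s) satisfies the Robbins-Monro conditions:

```python
import random

# Sketch: tabular TD(0) as Robbins-Monro stochastic approximation.
# Chain, rewards, and stepsize schedule are invented for illustration.
GAMMA = 1.0
V = {0: 0.0, 1: 0.0}
visits = {0: 0, 1: 0}

random.seed(0)
for _ in range(5000):
    s = 0
    while s is not None:
        if s == 0:
            r, s_next = 0.0, 1
        else:
            r, s_next = random.choice([0.0, 2.0]), None  # zero-mean noise around 1
        visits[s] += 1
        alpha = 1.0 / visits[s]  # sum(alpha) = inf, sum(alpha^2) < inf
        target = r + (GAMMA * V[s_next] if s_next is not None else 0.0)
        V[s] += alpha * (target - V[s])
        s = s_next

print(V)  # both entries approach the true value 1.0
```

The noisy terminal reward averages out exactly because the stepsizes decay: with a constant α the estimate would keep fluctuating in a neighborhood of 1.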


Why TD(0) Beats MC on Batch Data

Batch setting: fixed dataset D = {episode_1, …, episode_k}. Run updates repeatedly until convergence.

Theorem (Sutton): batch TD(0) and batch MC converge to different solutions:
- MC: minimizes the mean squared error against the observed returns, Σ_t [G_t - V(S_t)]²
- TD(0): finds the V that is exactly correct for the maximum-likelihood model M̂ derived from D

Maximum likelihood model M̂ from data D:

p̂(s'|s,a) = (count(s,a,s') in D) / (count(s,a) in D)
r̂(s,a,s') = mean reward observed for (s,a,s') transitions

Batch TD(0) converges to the V^π for the MDP (S, A, p̂, r̂, γ) — the best-fit MDP given D.
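This certainty-equivalence computation can be sketched directly: fit (p̂, r̂) from a batch of transitions, then iterate the Bellman equation of the fitted model. The transition data and state names below are invented for illustration; each tuple is (state, reward, next_state), with None marking termination:

```python
from collections import defaultdict

# Sketch: certainty-equivalence value estimation from batch data.
# Data and state names are invented for illustration.
data = [
    ("s1", 1.0, "s2"), ("s2", 0.0, None),
    ("s1", 1.0, "s2"), ("s2", 2.0, None),
]
GAMMA = 0.9

counts = defaultdict(lambda: defaultdict(int))  # transition counts[s][s']
rewards = defaultdict(list)                     # rewards observed leaving s
for s, r, s_next in data:
    counts[s][s_next] += 1
    rewards[s].append(r)

V = defaultdict(float)
for _ in range(200):  # value iteration on the fitted model
    for s in counts:
        n = sum(counts[s].values())
        r_hat = sum(rewards[s]) / n  # ML mean reward from s
        V[s] = r_hat + GAMMA * sum(
            (c / n) * V[s2] for s2, c in counts[s].items() if s2 is not None
        )

print(dict(V))  # V(s2) = 1.0, V(s1) = 1.0 + 0.9 * 1.0 = 1.9
```

Batch TD(0) reaches this same fixed point without ever building p̂ and r̂ explicitly.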

Why TD is better: TD exploits the MDP structure. It “knows” that V(s) should be consistent with Bellman equations — so it propagates information through the MDP transitions, reducing estimation error for states with few direct observations.


The AB Example: Batch TD vs MC

Data (S&B Example 6.4, "You are the Predictor"):

8 episodes observed:
  A, 0, B, 0   (1 episode): A gives reward 0 and moves to B; B then terminates with reward 0
  B, 1         (6 episodes)
  B, 0         (1 episode)

B is visited 8 times in total: 6 visits yield return 1, and 2 yield return 0.

MC estimate:
- V_MC(B) = (6 × 1 + 2 × 0) / 8 = 0.75 (average of the 8 observed returns from B)
- V_MC(A) = 0 (only one observation from A, and its return was 0)

TD estimate (γ = 1):
- V_TD(B) = 0.75 (same as MC, since every visit to B terminates immediately)
- V_TD(A) = 0 + V_TD(B) = 0.75 (A always transitions to B with reward 0, so V(A) = V(B))

Which is correct? Given only this data:
- If A always leads to B, then V(A) should equal V(B) = 0.75 → TD is right
- MC's V(A) = 0 ignores the MDP structure (the A → B transition)

TD is the better estimator when the Markov property holds, which it does by assumption in an MDP (and at least approximately in most applications).
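The whole comparison can be checked numerically. A sketch of batch MC versus batch TD(0) on the eight episodes above (γ = 1; episodes are encoded here as lists of (state, reward, next_state) tuples):

```python
# Sketch: batch MC vs batch TD(0) on the AB data (gamma = 1).
episodes = [[("A", 0.0, "B"), ("B", 0.0, None)]] \
         + [[("B", 1.0, None)]] * 6 \
         + [[("B", 0.0, None)]] * 1

# Batch MC: average the observed returns from each state.
returns = {"A": [], "B": []}
for ep in episodes:
    g = 0.0
    for s, r, _ in reversed(ep):
        g += r  # gamma = 1
        returns[s].append(g)
V_mc = {s: sum(gs) / len(gs) for s, gs in returns.items()}

# Batch TD(0): sweep the same transitions repeatedly with a small stepsize.
V_td = {"A": 0.0, "B": 0.0}
for _ in range(2000):
    for ep in episodes:
        for s, r, s_next in ep:
            target = r + (V_td[s_next] if s_next is not None else 0.0)
            V_td[s] += 0.01 * (target - V_td[s])

print(V_mc)  # {'A': 0.0, 'B': 0.75}
print(V_td)  # approximately {'A': 0.75, 'B': 0.75}
```

The two methods agree on B but split on A, exactly as the example describes.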


Unified View: DP vs MC vs TD

            Updates use model?  Bootstraps?  Learns online from experience?
DP              Yes               Yes         No (sweeps a model, not experience)
MC              No                No          No (must wait for episode end)
TD              No                Yes         Yes

Rule of thumb (after Sutton): if you have the model, use DP; if you don't, use TD.

The convergence hierarchy:

Exact DP → Asymptotic TD(0) → Asymptotic MC

(DP: exact expected updates, no sampling noise; TD: one-step bootstrap-sample noise; MC: full-return noise)

TD occupies the sweet spot: no model required (vs DP), but lower variance than MC.


Backup Depth and Width

           Width (breadth of backup)
           Full (expected)      Sample
Depth 1:   DP policy eval       TD(0) / SARSA / Q-learning
(1-step)

Depth n:   n-step DP            n-step TD (Ch.7)
(n-step)

Depth T:   Exhaustive search    Monte Carlo
(episode)

The dimensions:
- Width: DP averages over all possible (s', r), a "wide" backup; TD samples a single (s', r), a "narrow" backup
- Depth: DP and TD(0) back up 1 step; MC backs up the full episode of T steps

TD(λ) (Ch.12) varies depth; function approximation (Ch.9) varies the representation — these are the two independent dimensions of RL method design.
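The depth dimension can be sketched as a single function: the n-step return (a preview of Ch. 7). The list layout and names below are illustrative; rewards[t] plays R_{t+1} and values[t] plays V(S_t):

```python
# Sketch: the n-step return interpolates between TD(0) (n=1) and MC (n=T).
def n_step_return(rewards, values, t, n, gamma):
    """G_{t:t+n} = R_{t+1} + ... + gamma^{n-1} R_{t+n} + gamma^n V(S_{t+n})."""
    T = len(rewards)  # episode length
    g = 0.0
    for k in range(min(n, T - t)):
        g += (gamma ** k) * rewards[t + k]
    if t + n < T:  # bootstrap from V unless the episode has ended
        g += (gamma ** n) * values[t + n]
    return g

rewards = [0.0, 0.0, 1.0]  # a three-step episode
values = [0.5, 0.5, 0.5]
print(n_step_return(rewards, values, 0, 1, 1.0))  # depth 1, TD(0) target: 0.5
print(n_step_return(rewards, values, 0, 3, 1.0))  # full depth, MC return: 1.0
```

Setting n = 1 recovers the TD(0) target, n = T the Monte Carlo return; intermediate n trades bias against variance.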


Practical Considerations: Choosing α

Too large α: updates oscillate or diverge.
Too small α: convergence is slow.

For tabular TD(0): recommended starting α = 0.1. Decrease over time for exact convergence.
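The tabular advice can be sketched as stepsize schedules (names here are illustrative, not from the book):

```python
# Sketch: two per-state stepsize schedules for tabular TD(0).
# alpha_t = 1/N(s) satisfies the Robbins-Monro conditions
# (sum diverges, sum of squares converges); a constant alpha does not,
# but it tracks nonstationary targets better.
def make_stepsize(mode, alpha0=0.1):
    counts = {}
    def alpha(s):
        counts[s] = counts.get(s, 0) + 1
        if mode == "constant":
            return alpha0  # converges only to a neighborhood of V^pi
        return 1.0 / counts[s]  # exact convergence in the tabular case
    return alpha

alpha = make_stepsize("decay")
print([round(alpha("s"), 3) for _ in range(4)])  # [1.0, 0.5, 0.333, 0.25]
```

A common compromise starts near the constant value and decays it only after learning has roughly stabilized.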

For neural network approximation: use standard neural network learning rates (1e-3 to 1e-4 with Adam optimizer). The Robbins-Monro conditions are violated by constant α, but in practice, the approximate convergence is sufficient.

Stepsize matrix (more advanced): instead of a scalar α, precondition the update with an inverse-curvature matrix (e.g., the inverse Hessian or Fisher information matrix) to obtain the natural gradient. This is the basis of natural-gradient methods such as TRPO; PPO instead approximates the same trust-region idea with a clipped objective.


TD and the Deadly Triad (Preview of Ch.11)

The deadly triad: three ingredients that together can cause divergence:
1. Function approximation (neural nets, linear)
2. Bootstrapping (TD methods)
3. Off-policy learning

All three are present in DQN and most deep RL algorithms. Why doesn't DQN (usually) diverge?
- Target network (frozen weights for the TD target) adds stability
- Experience replay breaks the correlation between consecutive samples
- Gradient clipping / Huber loss limits the update magnitude

But divergence is still possible! Many DQN variants fail in edge cases. This is an active research area.


Chapter 6 Summary

Algorithm          Policy                 Converges to                   TD target
TD(0) prediction   Fixed π                V^π (with probability 1)       R + γ V(S')
SARSA              On-policy (ε-greedy)   Q^{π_ε}                        R + γ Q(S',A')
Q-learning         Off-policy             Q*                             R + γ max_{a'} Q(S',a')
Expected SARSA     On- or off-policy      Q^π (Q* if target is greedy)   R + γ E_π[Q(S',a')]
Double Q-learning  Off-policy             Q* (reduced bias)              R + γ Q^B(S', argmax_{a'} Q^A(S',a'))

Critical insights:
1. TD = MC + bootstrapping: lower variance, a small bias
2. Q-learning = a model-free approximation to value iteration
3. SARSA = the on-policy counterpart of Q-learning; safer in environments where exploration is risky
4. Double Q-learning removes maximization bias (important for deep RL)
5. Batch TD converges to MDP-consistent values (better than MC when the Markov assumption holds)
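The one-step targets summarized in the table can be written out as small functions over a dict-of-dicts Q-table (a sketch; the table layout and names are illustrative, and terminal-state handling is omitted):

```python
# Sketch: the TD targets from the Chapter 6 summary table.
def sarsa_target(r, q, s_next, a_next, gamma):
    return r + gamma * q[s_next][a_next]  # uses the action actually taken

def q_learning_target(r, q, s_next, gamma):
    return r + gamma * max(q[s_next].values())  # max over next actions

def expected_sarsa_target(r, q, s_next, pi, gamma):
    return r + gamma * sum(pi[s_next][a] * q[s_next][a] for a in q[s_next])

def double_q_target(r, qa, qb, s_next, gamma):
    a_star = max(qa[s_next], key=qa[s_next].get)  # argmax under Q^A
    return r + gamma * qb[s_next][a_star]         # evaluated under Q^B

q = {"s'": {"left": 0.0, "right": 1.0}}
pi = {"s'": {"left": 0.5, "right": 0.5}}
print(q_learning_target(1.0, q, "s'", 0.9))          # 1.0 + 0.9 * 1.0 = 1.9
print(expected_sarsa_target(1.0, q, "s'", pi, 0.9))  # 1.0 + 0.9 * 0.5 = 1.45
```

Note how the double Q target decouples action selection (Q^A) from evaluation (Q^B), which is what removes the maximization bias.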


Connection to DynamICCL

DynamICCL's Q-value learning is built directly on Q-learning:

NCCL Config Selection (Q-learning):
  State: current network state s_t
  Action: NCCL config a_t (algorithm, chunk size, pipeline)
  Reward: r_t = throughput improvement
  Update: Q(s_t, a_t) ← Q(s_t, a_t) + α[r_t + γ max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)]

Extensions used in DynamICCL:
- Neural network Q-function: Q(s, a; θ) replaces the tabular Q
- Double Q / target network: reduces maximization bias and stabilizes training
- PPO (Ch.13): a policy-gradient alternative that learns from advantage estimates rather than Q-values
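The update rule above can be sketched in tabular form. All names here (state labels, config names, reward values) are placeholders; the real DynamICCL system uses a neural Q-function rather than this table:

```python
from collections import defaultdict

# Sketch: the tabular Q-learning update applied to NCCL config selection.
# State and config encodings are hypothetical placeholders.
ALPHA, GAMMA = 0.1, 0.9
Q = defaultdict(lambda: defaultdict(float))  # Q[state][config]

def update(s, config, reward, s_next, configs):
    """One Q-learning step; returns the TD error for monitoring."""
    best_next = max(Q[s_next][c] for c in configs) if configs else 0.0
    td_error = reward + GAMMA * best_next - Q[s][config]
    Q[s][config] += ALPHA * td_error
    return td_error

configs = ["ring", "tree"]  # hypothetical candidate configs
err = update("busy", "ring", 2.0, "busy", configs)
print(round(err, 3))  # first update: 2.0 + 0.9 * 0.0 - 0.0 = 2.0
```

Logging the TD error (as the sketch returns it) is a cheap way to monitor whether the learned Q-values have stabilized.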