Temporal-Difference Prediction: TD(0)

Chapter 6 — Temporal-Difference Learning

Book: Reinforcement Learning: An Introduction (Sutton & Barto, 2nd ed.), pp. 119–132


The Central Idea

Temporal-Difference (TD) learning combines the best of DP and Monte Carlo:

- Like MC: model-free — learns from experience, no dynamics model
- Like DP: bootstrapping — updates estimates using other estimates, not waiting for episode end

The fundamental TD idea: update V(S_t) after each step using the observed R_{t+1} and the current estimate V(S_{t+1}), not the complete return G_t.


TD(0) Update Rule

MC update (must wait until episode end):

V(S_t) ← V(S_t) + α [G_t - V(S_t)]

TD(0) update (can update immediately after each step):

V(S_t) ← V(S_t) + α [R_{t+1} + γV(S_{t+1}) - V(S_t)]

TD error (δ_t):

δ_t = R_{t+1} + γV(S_{t+1}) - V(S_t)

Terminology:

- R_{t+1} + γV(S_{t+1}) = TD target (bootstrap estimate of the return)
- δ_t = TD error = prediction error = temporal credit signal


Algorithm: TD(0) Policy Evaluation

Input: policy π to be evaluated; step size α ∈ (0,1]; discount γ

Initialize: V(s) arbitrarily for all s ∈ S (e.g., V(s) = 0), with V(terminal) = 0

Loop (episodes):
  Initialize S_0
  For each step t = 0, 1, 2, ...:
    A_t ← π(S_t)              ← follow policy π
    Observe R_{t+1}, S_{t+1}  ← from environment
    V(S_t) ← V(S_t) + α [R_{t+1} + γV(S_{t+1}) - V(S_t)]  ← TD update
    S_t ← S_{t+1}
  Until S_t is terminal

Key feature: the update happens online — one step at a time, no need to wait for episode end. Works for continuing tasks (γ < 1) — unlike MC.
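
A minimal tabular implementation of this loop (a sketch, not code from the book: it assumes a Gym-style environment whose reset() returns a state and whose step(action) returns (next_state, reward, done), and a policy callable mapping states to actions):

from collections import defaultdict

def td0_policy_evaluation(env, policy, num_episodes, alpha=0.1, gamma=1.0):
    """Tabular TD(0) prediction: estimate V^pi from on-policy experience."""
    V = defaultdict(float)                 # V(s) starts at 0; V(terminal) stays 0
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)                        # follow pi
            next_state, reward, done = env.step(action)   # one environment step
            # Bootstrap off V(next_state); terminal states contribute 0
            target = reward + gamma * (0.0 if done else V[next_state])
            V[state] += alpha * (target - V[state])       # TD(0) update
            state = next_state
    return V

Each transition triggers exactly one update, so the estimate improves during the episode rather than after it.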


Why TD is Better than MC: The Prediction Error Argument

MC target: G_t = R_{t+1} + γR_{t+2} + … + γ^{T-t-1}R_T

This is an unbiased sample of the true expected return, but high variance: it sums T−t random rewards.

TD target: R_{t+1} + γV(S_{t+1})

This is biased (V(S_{t+1}) ≠ V^π(S_{t+1}) in general) but much lower variance: only one random transition, (R_{t+1}, S_{t+1}), rather than the rest of the episode.

The variance-bias tradeoff:

Method     | Bias            | Variance  | Online? | Model-free?
-----------|-----------------|-----------|---------|------------
DP         | 0 (exact)       | 0 (exact) | Yes     | No
MC         | 0 (unbiased)    | O(T)      | No      | Yes
TD(0)      | Yes (bootstrap) | O(1)      | Yes     | Yes
n-step TD  | Yes (less)      | O(n)      | Yes     | Yes
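
The variance gap is easy to check numerically. A quick sketch under toy assumptions (a length-T chain with i.i.d. unit-variance rewards and γ = 1, so the true value is 0 everywhere and both targets are unbiased):

import numpy as np

rng = np.random.default_rng(0)
T, trials = 20, 100_000

rewards = rng.normal(0.0, 1.0, size=(trials, T))   # i.i.d. N(0,1) rewards
mc_targets = rewards.sum(axis=1)    # G_0 = R_1 + ... + R_T      (gamma = 1)
td_targets = rewards[:, 0]          # R_1 + V(S_1), with true V = 0 everywhere

print(f"Var[MC target] = {mc_targets.var():.2f}   (theory: T = {T})")
print(f"Var[TD target] = {td_targets.var():.2f}   (theory: 1)")

With a learned (inaccurate) V the TD target picks up bias, but its variance stays O(1).
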

Convergence of TD(0)

Theorem: for any fixed policy π and step sizes satisfying the Robbins–Monro conditions (Σ α_t = ∞, Σ α_t² < ∞), tabular TD(0) converges to V^π with probability 1. Sutton (1988) proved convergence in the mean; convergence with probability 1 follows from stochastic-approximation arguments (Dayan & Sejnowski 1994; Jaakkola, Jordan & Singh 1994).

Proof sketch: in the tabular case, TD(0) is a stochastic-approximation iteration for the Bellman evaluation equation, whose expected update is a γ-contraction with fixed point V^π. With linear function approximation V(s; θ) = θ^T φ(s), TD(0) performs semi-gradient updates (it is not gradient descent on any fixed objective such as the mean squared Bellman error); on-policy convergence in this case is due to Tsitsiklis & Van Roy (1997).

Rate of convergence: with appropriately decaying step sizes, the expected squared error decays on the order of O(1/t); with constant α, V converges geometrically in expectation to a neighborhood of V^π whose radius scales with α.

Comparison to MC: TD(0) often converges faster than MC in practice despite being biased — because lower variance updates are more informative per step.


Worked Example: Random Walk (S&B Example 6.2)

Setup:

States: A B C D E  (plus terminal states at each end)
         1 2 3 4 5

Agent starts at C (state 3)
Moves left or right with probability 0.5 each
Left terminal: reward = 0
Right terminal: reward = 1
All other transitions: reward = 0
γ = 1

True values: V^π(A)=1/6, V^π(B)=2/6, V^π(C)=3/6, V^π(D)=4/6, V^π(E)=5/6 (probability of eventually reaching right terminal under random walk)

TD(0) convergence (α = 0.05):

- After 1 episode: small updates near visited states
- After 10 episodes: rough approximation
- After 100 episodes: close to true values
- After 300 episodes: essentially exact

MC convergence (same α):

- TD(0) converges faster to the true values (lower-variance per-step updates)
- After 100 episodes: MC is still noisier than TD

Root-mean-squared error (averaged over states, plotted against episodes):

- TD(0) reaches lower RMS error after ~100 episodes than MC
- Demonstrates TD's practical advantage for prediction (see the sketch below)
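
A runnable sketch of this experiment (following S&B's initialization of all non-terminal values to 0.5, which the book's figure uses, rather than the V = 0 initialization in the algorithm box above):

import numpy as np

# Random walk of S&B Example 6.2: indices 0..4 = states A..E, start at C.
# Stepping right from E terminates with reward 1; left from A, with reward 0.
TRUE_V = np.arange(1, 6) / 6.0

def run_td0(num_episodes, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    V = np.full(5, 0.5)                       # initial values, as in the book
    rms = []                                  # RMS error after each episode
    for _ in range(num_episodes):
        s = 2                                 # index of C
        while True:
            s2 = s + (1 if rng.random() < 0.5 else -1)
            if s2 == 5:                       # right terminal
                r, v2, done = 1.0, 0.0, True
            elif s2 == -1:                    # left terminal
                r, v2, done = 0.0, 0.0, True
            else:
                r, v2, done = 0.0, V[s2], False
            V[s] += alpha * (r + v2 - V[s])   # TD(0) update, gamma = 1
            if done:
                break
            s = s2
        rms.append(np.sqrt(np.mean((V - TRUE_V) ** 2)))
    return V, rms

V, rms = run_td0(300)
print("V after 300 episodes:", np.round(V, 3))   # near [.167 .333 .5 .667 .833]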


Temporal-Difference as Prediction Error Signal

The TD error δ_t has a profound interpretation:

- In psychology: resembles the Rescorla–Wagner rule for classical conditioning
- In neuroscience: resembles dopaminergic signals in the basal ganglia (Ch. 15)
- In economics: resembles price changes in rational-expectations models

Dopamine connection (Montague et al., 1996): dopamine neurons fire not when reward arrives, but when reward is better than predicted (δ_t > 0). If reward is fully predicted, dopamine response = 0. If predicted reward is omitted, dopamine drops (δ_t < 0). This is exactly the TD error!


TD Error and Monte Carlo Error Relationship

Key identity (S&B Eq. 6.6; cf. Exercise 6.1):

G_t - V(S_t) = δ_t + γδ_{t+1} + γ²δ_{t+2} + ... + γ^{T-t-1}δ_{T-1}
             = Σ_{k=t}^{T-1} γ^{k-t} δ_k
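
This follows by telescoping, using G_t = R_{t+1} + γG_{t+1} and assuming V is held fixed during the episode (Exercise 6.1 asks how the identity changes when V is updated along the way):

G_t - V(S_t) = R_{t+1} + γG_{t+1} - V(S_t)
             = δ_t + γ(G_{t+1} - V(S_{t+1}))
             = δ_t + γδ_{t+1} + γ²(G_{t+2} - V(S_{t+2}))
             = ...   (the residual vanishes at T, since G_T = V(S_T) = 0)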

The MC error (G_t - V(S_t)) is the discounted sum of all subsequent TD errors!

Implication: TD errors form a “decomposition” of the MC error across time steps. MC assigns the full error to a single time step; TD distributes it across all steps. This is the essence of eligibility traces (Ch.12).


TD as Sample-Based DP

Compare:

- DP: V(s) ← Σ_a π(a|s) Σ_{s',r} p(s',r|s,a) [r + γV(s')] — exact expectation
- TD: V(S_t) ← V(S_t) + α [R_{t+1} + γV(S_{t+1}) - V(S_t)] — sample from the distribution

TD replaces the exact expectation (the sum over all (s', r)) with a single sample (R_{t+1}, S_{t+1}).

This is the fundamental insight:

    TD = DP with samples instead of model averages

DP requires a model to compute the expectation. TD gets the expectation implicitly by averaging over many samples via the iterative update rule.
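
The contrast in code, as a sketch with hypothetical data structures (model[s][a] is a list of (prob, next_state, reward) tuples and pi[s] maps actions to probabilities; neither is an API from the book):

def dp_backup(V, model, pi, s, gamma=1.0):
    """DP: exact expected update; requires the model p(s', r | s, a)."""
    return sum(pi[s][a] * prob * (r + gamma * V[s2])
               for a in pi[s]
               for prob, s2, r in model[s][a])

def td_backup(V, s, r, s2, alpha=0.1, gamma=1.0):
    """TD(0): the same target in expectation, from one sampled transition."""
    return V[s] + alpha * (r + gamma * V[s2] - V[s])

Averaging td_backup over many sampled transitions recovers what dp_backup computes in one exact sweep.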


Batch TD and MC: Different Convergence Points

Batch training: fix a finite dataset D = {episode_1, episode_2, …, episode_k}. Run TD(0) or MC updates on D repeatedly until convergence.

Surprising result: TD(0) and MC converge to different solutions:

- MC converges to the values that minimize mean squared error on the observed returns (the sample averages of those returns)
- TD(0) converges to the values that are exactly correct for the maximum-likelihood MDP, the model that best explains the data (the "certainty-equivalence estimate")

Example (S&B Example 6.4, the AB example):

- 8 episodes observed; B is always the last state before termination, ending with reward 1 in 6 episodes and reward 0 in 2
- A is visited once, transitioning to B with reward 0
- MC: V(B) = 3/4 (8 visits to B: 6 with return 1, 2 with return 0); V(A) = 0 (the single episode from A returned 0)
- TD: V(B) = 3/4; V(A) = 3/4 (A always leads to B with reward 0, so V(A) = V(B))

Which is right? TD's answer is right if the process is Markov and stationary: A always transitions to B, so the best prediction from A is whatever B is worth. MC treats A's single observed return (0) as definitive, a high-variance estimate built from one sample.

TD’s “certainty equivalence” (best model fit) is generally better for prediction.
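
The whole example can be checked in a few lines (a sketch; the dataset is the 8 episodes of Example 6.4 written as (state, reward, next_state) steps, with "T" for terminal):

# 1 episode A -> B -> terminal, 6 episodes B -> terminal with reward 1, 1 with 0
episodes = ([[("A", 0.0, "B"), ("B", 0.0, "T")]]
            + [[("B", 1.0, "T")]] * 6
            + [[("B", 0.0, "T")]])

# Batch MC: V(s) = average return observed after visiting s (gamma = 1)
returns = {"A": [], "B": []}
for ep in episodes:
    g = 0.0
    for s, r, _ in reversed(ep):
        g += r
        returns[s].append(g)
mc_V = {s: sum(gs) / len(gs) for s, gs in returns.items()}

# Batch TD(0): sweep the same transitions repeatedly until V stops moving
td_V = {"A": 0.0, "B": 0.0, "T": 0.0}
for _ in range(20_000):
    for ep in episodes:
        for s, r, s2 in ep:
            td_V[s] += 0.01 * (r + td_V[s2] - td_V[s])

print("batch MC:", mc_V)                                    # A: 0.0, B: 0.75
print("batch TD:", {s: round(v, 2) for s, v in td_V.items()
                    if s != "T"})                           # A and B both ~0.75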


Connection to DynamICCL

TD(0) prediction is the core of DynamICCL's value-function learning:

1. V(s_t) estimate: the PPO value network V(s; θ) is updated using TD targets
2. TD error δ_t: used as the advantage signal, A(s_t, a_t) ≈ δ_t (see the sketch below)
3. Online updates: DynamICCL updates V(s; θ) after each NCCL decision step — not waiting for job completion
4. Dopamine analog: when NCCL throughput exceeds the predicted value (δ_t > 0), the policy reinforces the chosen configuration
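
A sketch of points 1 and 2 (hypothetical names throughout: value_net is assumed to be a callable returning a scalar value estimate; DynamICCL's actual interfaces are not shown here):

def td_advantage(value_net, s, r, s2, done, gamma=0.99):
    """One-step advantage: A(s, a) ~ delta = r + gamma * V(s2) - V(s)."""
    v = value_net(s)
    v2 = 0.0 if done else value_net(s2)
    delta = r + gamma * v2 - v       # TD error, used as the advantage signal
    td_target = r + gamma * v2       # regression target for the value network
    return delta, td_target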

The TD update rule’s ability to work online without complete episodes makes it essential for DynamICCL, where “episodes” (training jobs) can run for thousands of steps.