Eligibility Traces and TD(λ)
Chapter 12 — Eligibility Traces
Book: Reinforcement Learning: An Introduction (Sutton & Barto, 2nd ed) Pages: 289–318
Motivation: Unifying TD and MC
TD(0) uses 1-step returns; MC uses full episode returns. n-step TD interpolates discretely. Eligibility traces provide a continuous, online interpolation via the λ-return.
Two views of TD(λ):
1. Forward view (theoretical): define G_t^λ = (1-λ) Σ_{n=1}^∞ λ^{n-1} G_t^{(n)} and update toward it
2. Backward view (computational): maintain a trace e_t(s) that records which states are “eligible” for update, and update all states proportional to their trace
The backward view is the algorithm; the forward view is the mathematical justification.
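A minimal sketch of computing the forward-view λ-return from n-step returns on a finite episode (plain Python/NumPy; the function name and array conventions are illustrative, not from the book):

import numpy as np

def lambda_return(rewards, values, gamma, lam):
    # Forward-view lambda-return G_t^lambda for every t of one finite episode.
    # rewards[t] = R_{t+1}; values[t] = V(S_t), with values[T] = 0 for the terminal state.
    T = len(rewards)
    G_lam = np.zeros(T)
    for t in range(T):
        # n-step returns G_t^{(n)} for n = 1 .. T-t
        G_n = np.zeros(T - t)
        ret = 0.0
        for n in range(1, T - t + 1):
            ret += gamma ** (n - 1) * rewards[t + n - 1]
            G_n[n - 1] = ret + gamma ** n * values[t + n]
        # weights (1-lam) * lam^{n-1}; the full return absorbs the remaining mass lam^{T-t-1}
        w = (1 - lam) * lam ** np.arange(T - t)
        w[-1] = lam ** (T - t - 1)
        G_lam[t] = w @ G_n
    return G_lam

With lam=0 this reduces to the 1-step TD target; with lam=1 it reduces to the full Monte Carlo return.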
The Eligibility Trace
Accumulating trace (classic):
e_0(s) = 0 for all s ← initialize at the start of each episode
e_t(s) ← γλ e_{t-1}(s) + 1[S_t = s] ← update at each step
- Decays: each step, e(s) multiplied by γλ < 1 → exponential decay
- Accumulates: each time S_t = s, e(s) jumps up by 1
Replacing trace:
e_t(s) = γλ e_{t-1}(s) for s ≠ S_t ← decay as usual
e_t(S_t) = 1 ← reset to 1 on a visit, no accumulation
Replacing traces prevent large spikes in frequently-visited states.
Dutch trace (linear FA only; used by true online TD(λ), see below):
e_t ← γλ e_{t-1} + (1 - α γλ e_{t-1}^T φ_t) φ_t
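One-step updates for the three trace types, as a sketch (NumPy; function names are illustrative):

import numpy as np

def accumulating_trace(e, s_t, gamma, lam):
    # e_t(s) = gamma*lam*e_{t-1}(s) + 1[S_t = s]
    e = gamma * lam * e
    e[s_t] += 1.0
    return e

def replacing_trace(e, s_t, gamma, lam):
    # decay everywhere, but reset the visited state's trace to 1 instead of adding
    e = gamma * lam * e
    e[s_t] = 1.0
    return e

def dutch_trace(e, phi_t, alpha, gamma, lam):
    # linear FA only: e_t = gamma*lam*e_{t-1} + (1 - alpha*gamma*lam*e_{t-1}^T phi_t) phi_t
    return gamma * lam * e + (1.0 - alpha * gamma * lam * (e @ phi_t)) * phi_t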
TD(λ) Algorithm (Backward View)
Initialize: V(s) = 0 for all s; e(s) = 0 for all s
Loop (episodes):
Initialize S_0; e(s) = 0 for all s ← reset traces each episode
For each step t:
A_t ← π(S_t)
Observe R_{t+1}, S_{t+1}
δ_t ← R_{t+1} + γV(S_{t+1}) - V(S_t) ← TD error
e(S_t) ← e(S_t) + 1 ← update trace for visited state
For each s ∈ S:
V(s) ← V(s) + α δ_t e(s) ← update ALL states proportional to trace
e(s) ← γλ e(s) ← decay all traces
Computational cost: O(|S|) per step — every state’s value updated at every step!
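A tabular TD(λ) sketch of the loop above (NumPy; the env.reset() / env.step(a) -> (s_next, r, done) interface, integer states, and function name are assumptions):

import numpy as np

def td_lambda(env, policy, n_states, alpha=0.1, gamma=1.0, lam=0.8, episodes=100):
    # Tabular TD(lambda), backward view, accumulating traces.
    V = np.zeros(n_states)
    for _ in range(episodes):
        e = np.zeros(n_states)                 # reset traces each episode
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            v_next = 0.0 if done else V[s_next]
            delta = r + gamma * v_next - V[s]  # TD error
            e[s] += 1.0                        # accumulating trace for the visited state
            V += alpha * delta * e             # update all states at once (O(|S|))
            e *= gamma * lam                   # decay all traces
            s = s_next
    return V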
Linear FA optimization (only update θ, not all V(s)):
e_t ← γλ e_{t-1} + ∇_θ V(S_t; θ) = γλ e_{t-1} + φ(S_t) ← trace vector
θ ← θ + α δ_t e_t ← single vector update (O(d))
Much more efficient: O(d) per step instead of O(|S|).
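The same algorithm with linear function approximation, as a sketch (the phi(s) feature function and env/policy interface are assumed, as above):

import numpy as np

def td_lambda_linear(env, policy, phi, d, alpha=0.01, gamma=1.0, lam=0.8, episodes=100):
    # Semi-gradient TD(lambda) with linear V(s) = theta @ phi(s).
    theta = np.zeros(d)
    for _ in range(episodes):
        e = np.zeros(d)                        # trace is now a d-vector, not |S|
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            x, x_next = phi(s), phi(s_next)
            v_next = 0.0 if done else theta @ x_next
            delta = r + gamma * v_next - theta @ x
            e = gamma * lam * e + x            # trace accumulates the gradient phi(S_t)
            theta += alpha * delta * e         # single O(d) vector update
            s = s_next
    return theta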
The Forward-Backward Equivalence
Theorem: For offline updates (apply all updates at episode end), the forward view (update toward G_t^λ) and backward view (TD(λ) with traces) give identical parameter updates.
Proof sketch:
G_t^λ - V(S_t) = Σ_{k=t}^{T-1} (γλ)^{k-t} δ_k
(This is the identity from Ch.6: MC error = sum of TD errors; generalized to λ-weighted sum)
Backward view: the total update applied to V(s) over an episode is
Σ_t α δ_t e_t(s) = Σ_t α δ_t Σ_{k≤t} (γλ)^{t-k} 1[S_k=s]
= Σ_k α 1[S_k=s] Σ_{t≥k} (γλ)^{t-k} δ_t
= Σ_k α 1[S_k=s] (G_k^λ - V(S_k)) ← forward view update
The two are equivalent! ∎
Practical implication: backward view (traces) is the online-efficient algorithm; forward view explains why it works.
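A quick numerical check of the offline equivalence on one synthetic episode (V is held fixed during the episode, as the theorem requires; the data here is purely illustrative):

import numpy as np

rng = np.random.default_rng(0)
n_states, T, gamma, lam, alpha = 5, 12, 0.95, 0.7, 0.1
states = rng.integers(0, n_states, size=T + 1)   # S_0 .. S_T (S_T treated as terminal)
rewards = rng.normal(size=T)                     # R_1 .. R_T
V = rng.normal(size=n_states)                    # fixed value estimates

delta = np.zeros(T)                              # TD errors, with V(terminal) = 0
for t in range(T):
    v_next = V[states[t + 1]] if t + 1 < T else 0.0
    delta[t] = rewards[t] + gamma * v_next - V[states[t]]

# Backward view: accumulate alpha * delta_t * e_t(s) over the episode
backward, e = np.zeros(n_states), np.zeros(n_states)
for t in range(T):
    e *= gamma * lam
    e[states[t]] += 1.0
    backward += alpha * delta[t] * e

# Forward view: alpha * (G_k^lambda - V(S_k)) = alpha * sum_{t >= k} (gamma*lam)^{t-k} * delta_t
forward = np.zeros(n_states)
for k in range(T):
    forward[states[k]] += alpha * sum((gamma * lam) ** (t - k) * delta[t] for t in range(k, T))

print(np.allclose(backward, forward))            # True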
TD(λ) for Control: SARSA(λ)
Apply traces to SARSA for on-policy control:
e(s,a) = 0 for all s,a ← action-value trace
For each step:
A_t ← ε-greedy(Q, S_t)
Observe R_{t+1}, S_{t+1}
A_{t+1} ← ε-greedy(Q, S_{t+1})
δ_t = R_{t+1} + γQ(S_{t+1},A_{t+1}) - Q(S_t,A_t)
e(S_t,A_t) ← e(S_t,A_t) + 1 ← or replacing trace: set to 1
For all s, a:
Q(s,a) ← Q(s,a) + α δ_t e(s,a)
e(s,a) ← γλ e(s,a)
SARSA(λ) convergence: same as SARSA but faster credit assignment. With λ→1: approaches MC control; with λ=0: exact SARSA.
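A tabular SARSA(λ) sketch (same assumed env interface; the replacing flag switches between replacing and accumulating traces):

import numpy as np

def sarsa_lambda(env, n_states, n_actions, alpha=0.1, gamma=1.0, lam=0.9,
                 epsilon=0.1, episodes=500, replacing=True):
    # Tabular SARSA(lambda) with epsilon-greedy exploration.
    Q = np.zeros((n_states, n_actions))
    rng = np.random.default_rng()

    def eps_greedy(s):
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))
        return int(np.argmax(Q[s]))

    for _ in range(episodes):
        e = np.zeros_like(Q)
        s = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = eps_greedy(s_next)
            q_next = 0.0 if done else Q[s_next, a_next]
            delta = r + gamma * q_next - Q[s, a]
            if replacing:
                e[s, a] = 1.0          # replacing trace
            else:
                e[s, a] += 1.0         # accumulating trace
            Q += alpha * delta * e
            e *= gamma * lam
            s, a = s_next, a_next
    return Q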
Watkins Q(λ): Off-Policy with Traces
Problem: off-policy TD(λ) with Q-learning is tricky — traces must be cut when the behavior policy takes a non-greedy action.
Watkins’s Q(λ): cut the traces whenever a non-greedy (exploratory) action is taken, since the chain of greedy backups is broken from that point on (one-step sketch below):
δ_t = R_{t+1} + γ max_a Q(S_{t+1}, a) - Q(S_t, A_t) ← off-policy (greedy) target
If the action taken is greedy: e(s,a) ← γλ e(s,a) for all s,a ← keep and decay traces
Else: e(s,a) ← 0 for all s,a ← cut all traces
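One step of Watkins’s Q(λ) as a sketch (NumPy; the function and argument names are illustrative):

import numpy as np

def watkins_q_lambda_step(Q, e, s, a, r, s_next, a_next, done,
                          alpha=0.1, gamma=1.0, lam=0.9):
    # One update of Watkins's Q(lambda). Q, e: (n_states, n_actions) arrays.
    q_next = 0.0 if done else np.max(Q[s_next])   # off-policy (greedy) bootstrap
    delta = r + gamma * q_next - Q[s, a]
    e[s, a] += 1.0
    Q += alpha * delta * e
    greedy = (not done) and a_next == int(np.argmax(Q[s_next]))
    if greedy:
        e *= gamma * lam                          # keep and decay traces
    else:
        e[:] = 0.0                                # exploratory action: cut all traces
    return Q, e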
Alternatively: weight each step of the trace by the importance-sampling ratio (per-decision importance sampling; Precup, Sutton & Singh, 2000).
True Online TD(λ): Exact Forward-Backward Equivalence Online
Standard TD(λ) is equivalent to forward view offline (at episode end). True online TD(λ) achieves the exact forward-backward equivalence online (at each step):
With V_t = θ_t^T φ_t and V'_t = θ_t^T φ_{t+1}:
δ_t = R_{t+1} + γ V'_t - V_t
e_t ← γλ e_{t-1} + (1 - α γλ e_{t-1}^T φ_t) φ_t ← Dutch trace
θ_{t+1} ← θ_t + α (δ_t + V_t - V_old) e_t - α (V_t - V_old) φ_t ← corrected update
V_old ← V'_t ← carry the previous value estimate to the next step (V_old = 0 at episode start)
True Online TD(λ) outperforms standard TD(λ) by maintaining the exact equivalence at every step.
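A linear-FA sketch of true online TD(λ) following the equations above (env/policy/phi interface assumed, as in the earlier sketches):

import numpy as np

def true_online_td_lambda(env, policy, phi, d, alpha=0.01, gamma=1.0, lam=0.8, episodes=100):
    # True online TD(lambda) with linear V(s) = theta @ phi(s) and Dutch traces.
    theta = np.zeros(d)
    for _ in range(episodes):
        s = env.reset()
        x = phi(s)
        e = np.zeros(d)
        v_old = 0.0
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            x_next = np.zeros(d) if done else phi(s_next)
            v = theta @ x
            v_next = theta @ x_next
            delta = r + gamma * v_next - v
            e = gamma * lam * e + (1.0 - alpha * gamma * lam * (e @ x)) * x   # Dutch trace
            theta += alpha * (delta + v - v_old) * e - alpha * (v - v_old) * x
            v_old = v_next
            s, x = s_next, x_next
    return theta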
Performance of TD(λ) vs TD(0) and MC
19-state random walk (S&B Figure 12.3):
RMS Error over 10 episodes (averaged):
α\λ 0.0 0.3 0.5 0.7 0.9 1.0
0.01 0.19 0.18 0.17 0.16 0.16 0.17
0.1 0.17 0.15 0.13 0.12 0.12 0.16
0.5 0.19 0.17 0.15 0.14 0.17 0.21
Best performance: around λ=0.7-0.9 — confirming intermediate λ is best. λ=0: TD(0) — good but not best; λ=1: MC — often worse due to high variance.
Eligibility Traces as Memory
Biological interpretation: eligibility traces resemble synaptic eligibility in models of neural plasticity (Hebbian learning, STDP). A synapse becomes “eligible” when recently active, and a later reward or neuromodulatory signal determines how much of that eligibility is converted into a weight change. This closely parallels the γλ-decaying trace mechanism of TD(λ).
Chapter 12 Summary
- Eligibility traces are decaying memory of recently-visited states: e_t(s) = γλ e_{t-1}(s) + 1[S_t=s]
- TD(λ) = backward view: update all states proportional to trace and TD error
- Forward-backward equivalence: TD(λ) computes the same as updating toward G_t^λ
- SARSA(λ): on-policy control with traces; faster credit assignment than SARSA
- Optimal λ ≈ 0.7–0.9 in practice; λ=0 → TD(0); λ=1 → MC
- Linear FA: trace = feature vector; O(d) per step → efficient
- True Online TD(λ): exact forward-backward equivalence at every step
Connection to DynamICCL
GAE in PPO is the forward-view (λ-return) counterpart of eligibility traces, applied to advantage estimation:
GAE Â_t = Σ_{l=0}^∞ (γλ)^l δ_{t+l} ← forward view of TD(λ)
≡ eligibility trace accumulation ← backward view
PPO implements this forward view explicitly (backward pass over trajectory).
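A sketch of GAE via the standard backward recursion Â_t = δ_t + γλ Â_{t+1}, which equals the λ-weighted sum of TD errors above (array conventions are illustrative):

import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    # rewards[t] = R_{t+1}; values has length T+1, with values[T] the bootstrap (0 if terminal).
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running   # A_t = delta_t + gamma*lam*A_{t+1}
        adv[t] = running
    return adv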
For DynamICCL with linear state features, eligibility traces could replace GAE for more online (step-by-step) updates rather than batch rollout updates — trading latency for data efficiency.