Eligibility Traces and TD(λ)

Chapter 12 — Eligibility Traces

Book: Reinforcement Learning: An Introduction (Sutton & Barto, 2nd ed) Pages: 289–318


Motivation: Unifying TD and MC

TD(0) uses 1-step returns; MC uses full episode returns. n-step TD interpolates discretely. Eligibility traces provide a continuous, online interpolation via the λ-return.

Two views of TD(λ):
  1. Forward view (theoretical): define G_t^λ = (1-λ) Σ_{n=1}^∞ λ^{n-1} G_t^{(n)} and update toward it
  2. Backward view (computational): maintain a trace e_t(s) that records which states are “eligible” for update, and update all states proportional to their trace

The backward view is the algorithm; the forward view is the mathematical justification.
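
A minimal Python sketch of the forward view (the function name and argument layout are illustrative, not from the book): it computes G_0^λ for one recorded episode by mixing the n-step returns with weights (1-λ)λ^{n-1} and giving the remaining weight λ^{T-1} to the full return.

def lambda_return(rewards, values, gamma, lam):
    # Forward-view lambda-return G_0^lambda for one recorded episode.
    # rewards[k] = R_{k+1}; values[k] = V(S_{k+1}), with values[-1] = 0 at termination.
    T = len(rewards)
    g_lambda = 0.0
    discounted_reward_sum = 0.0                  # sum_{j<n} gamma^j R_{j+1}
    for n in range(1, T + 1):
        discounted_reward_sum += gamma ** (n - 1) * rewards[n - 1]
        g_n = discounted_reward_sum + gamma ** n * values[n - 1]   # n-step return
        if n < T:
            g_lambda += (1 - lam) * lam ** (n - 1) * g_n
        else:
            g_lambda += lam ** (T - 1) * g_n     # full return carries the leftover weight
    return g_lambda

With lam=0 this reduces to the TD(0) target R_1 + γV(S_1); with lam=1 only the full return survives, recovering MC.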


The Eligibility Trace

Accumulating trace (classic):

e_t(s) = 0                           ← initialize each episode
e_t(s) ← γλ e_{t-1}(s) + 1[S_t = s]  ← update at each step

Replacing trace:

e_t(s) = γλ e_{t-1}(s)    for s ≠ S_t
e_t(s) = 1                for s = S_t   ← reset to 1, no accumulation

Replacing traces keep the trace of a frequently-revisited state from growing above 1, preventing large spikes.

Dutch trace (linear FA only):

e_t ← γλ e_{t-1} + (1 - α γλ e_{t-1}^T φ_t) φ_t
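
The three trace updates side by side, as a short Python sketch (array-based; s indexes the visited state and phi is its feature vector, both illustrative names):

import numpy as np

def accumulating_trace(e, s, gamma, lam):
    e = gamma * lam * e
    e[s] += 1.0                      # visited state accumulates
    return e

def replacing_trace(e, s, gamma, lam):
    e = gamma * lam * e
    e[s] = 1.0                       # visited state reset to 1, no accumulation
    return e

def dutch_trace(e, phi, alpha, gamma, lam):
    # linear function approximation only
    return gamma * lam * e + (1.0 - alpha * gamma * lam * (e @ phi)) * phi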

TD(λ) Algorithm (Backward View)

Initialize: V(s) = 0 for all s; e(s) = 0 for all s

Loop (episodes):
  Initialize S_0; e(s) = 0 for all s    ← reset traces each episode

  For each step t:
    A_t ← π(S_t)
    Observe R_{t+1}, S_{t+1}
    δ_t ← R_{t+1} + γV(S_{t+1}) - V(S_t)    ← TD error
    e(S_t) ← e(S_t) + 1                       ← update trace for visited state

    For each s ∈ S:
      V(s) ← V(s) + α δ_t e(s)               ← update ALL states proportional to trace
      e(s) ← γλ e(s)                          ← decay all traces

Computational cost: O(|S|) per step — every state’s value updated at every step!
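
A runnable sketch of the tabular backward-view algorithm above, assuming a Gym-style env with integer states and a policy(s) -> action callable (both are placeholders for illustration):

import numpy as np

def td_lambda_episode(env, policy, V, alpha, gamma, lam):
    e = np.zeros_like(V)                     # traces reset each episode
    s, done = env.reset(), False
    while not done:
        a = policy(s)
        s_next, r, done, _ = env.step(a)
        v_next = 0.0 if done else V[s_next]
        delta = r + gamma * v_next - V[s]    # TD error
        e[s] += 1.0                          # accumulating trace for visited state
        V += alpha * delta * e               # update ALL states proportional to trace
        e *= gamma * lam                     # decay all traces
        s = s_next
    return V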

Linear FA optimization (only update θ, not all V(s)):

e_t ← γλ e_{t-1} + ∇_θ V(S_t; θ)  = γλ e_{t-1} + φ(S_t)  ← trace vector
θ ← θ + α δ_t e_t                  ← single vector update (O(d))

Much more efficient: O(d) per step instead of O(|S|).
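
The corresponding per-step update with linear features, as a sketch (phi and phi_next stand for φ(S_t) and φ(S_{t+1}); names are illustrative):

import numpy as np

def linear_td_lambda_step(theta, e, phi, phi_next, r, done, alpha, gamma, lam):
    v = theta @ phi
    v_next = 0.0 if done else theta @ phi_next
    delta = r + gamma * v_next - v           # TD error
    e = gamma * lam * e + phi                # trace vector: decay plus gradient (= features)
    theta = theta + alpha * delta * e        # single O(d) vector update
    return theta, e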


The Forward-Backward Equivalence

Theorem: For offline updates (apply all updates at episode end), the forward view (update toward G_t^λ) and backward view (TD(λ) with traces) give identical parameter updates.

Proof sketch:

G_t^λ - V(S_t) = Σ_{k=t}^{T-1} (γλ)^{k-t} δ_k

(This is the identity from Ch.6: MC error = sum of TD errors; generalized to λ-weighted sum)

Backward view: total update received by state s over an episode:

  Σ_t α δ_t e_t(s)
    = Σ_t α δ_t Σ_{k≤t} (γλ)^{t-k} 1[S_k=s]
    = Σ_k α 1[S_k=s] Σ_{t≥k} (γλ)^{t-k} δ_t
    = Σ_k α 1[S_k=s] (G_k^λ - V(S_k))   ← forward view update

The two are equivalent! ∎

Practical implication: backward view (traces) is the online-efficient algorithm; forward view explains why it works.
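
The key identity G_t^λ - V(S_t) = Σ_k (γλ)^{k-t} δ_k can be checked numerically, since it holds exactly when V is held fixed during the episode; a small sketch with made-up numbers:

import numpy as np

rng = np.random.default_rng(0)
T, gamma, lam = 6, 0.9, 0.7
R = rng.normal(size=T)                 # R_1 .. R_T
V = rng.normal(size=T + 1)             # V(S_0) .. V(S_T)
V[-1] = 0.0                            # terminal state has value 0

# Right-hand side: lambda-weighted sum of TD errors from t = 0
delta = R + gamma * V[1:] - V[:-1]
rhs = sum((gamma * lam) ** k * delta[k] for k in range(T))

# Left-hand side: lambda-return minus V(S_0)
g_n = [sum(gamma ** j * R[j] for j in range(n)) + gamma ** n * V[n]
       for n in range(1, T + 1)]       # n-step returns from t = 0
g_lam = (1 - lam) * sum(lam ** (n - 1) * g_n[n - 1] for n in range(1, T)) \
        + lam ** (T - 1) * g_n[-1]
lhs = g_lam - V[0]

assert np.isclose(lhs, rhs)            # the identity holds exactly for fixed V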


TD(λ) for Control: SARSA(λ)

Apply traces to SARSA for on-policy control:

e(s,a) = 0 for all s,a    ← action-value trace

For each step:
  A_t ← ε-greedy(Q, S_t)
  Observe R_{t+1}, S_{t+1}
  A_{t+1} ← ε-greedy(Q, S_{t+1})
  δ_t = R_{t+1} + γQ(S_{t+1},A_{t+1}) - Q(S_t,A_t)
  e(S_t,A_t) ← e(S_t,A_t) + 1                    ← or replacing trace: set to 1

  For all s, a:
    Q(s,a) ← Q(s,a) + α δ_t e(s,a)
    e(s,a) ← γλ e(s,a)

SARSA(λ) convergence: same conditions as SARSA, but with faster credit assignment along the trajectory. λ→1 approaches MC control; λ=0 reduces to one-step SARSA.
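
A sketch of one SARSA(λ) episode in Python with replacing traces (the Gym-style env and ε-greedy helper are placeholders; swap the marked line to get accumulating traces):

import numpy as np

def sarsa_lambda_episode(env, Q, alpha, gamma, lam, eps, rng):
    def eps_greedy(s):
        if rng.random() < eps:
            return int(rng.integers(Q.shape[1]))
        return int(np.argmax(Q[s]))

    e = np.zeros_like(Q)
    s, done = env.reset(), False
    a = eps_greedy(s)
    while not done:
        s_next, r, done, _ = env.step(a)
        a_next = None if done else eps_greedy(s_next)
        q_next = 0.0 if done else Q[s_next, a_next]
        delta = r + gamma * q_next - Q[s, a]
        e[s, a] = 1.0                    # replacing trace (use += 1.0 for accumulating)
        Q += alpha * delta * e
        e *= gamma * lam
        s, a = s_next, a_next
    return Q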


Watkins Q(λ): Off-Policy with Traces

Problem: off-policy TD(λ) with Q-learning is tricky — traces must be cut when the behavior policy takes a non-greedy action.

Watkins Q(λ): cut trace when non-greedy action taken:

δ_t = R_{t+1} + γ max_a Q(S_{t+1},a) - Q(S_t,A_t)   ← bootstrap on the greedy action
If A_{t+1} ≠ argmax_a Q(S_{t+1}, a):
  e(s,a) ← 0 for all s,a       ← cut all traces: credit must not flow back past the exploratory action
Else:
  e(s,a) ← γλ e(s,a) for all s,a

Alternatively: weight each step of the trace by per-step importance-sampling ratios (Precup, Sutton & Singh, 2000, “Eligibility Traces for Off-Policy Policy Evaluation”).
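
A per-step sketch of Watkins Q(λ) (the function and argument names are illustrative; a_next is the action actually taken in S_{t+1}):

import numpy as np

def watkins_q_lambda_step(Q, e, s, a, r, s_next, a_next, done, alpha, gamma, lam):
    q_max = 0.0 if done else np.max(Q[s_next])
    delta = r + gamma * q_max - Q[s, a]      # bootstrap on the greedy action
    e[s, a] += 1.0
    Q += alpha * delta * e
    if done or Q[s_next, a_next] < np.max(Q[s_next]):
        e[:] = 0.0                           # exploratory action: cut all traces
    else:
        e *= gamma * lam                     # greedy action: traces persist
    return Q, e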


True Online TD(λ): Exact Forward-Backward Equivalence Online

Standard TD(λ) is equivalent to forward view offline (at episode end). True online TD(λ) achieves the exact forward-backward equivalence online (at each step):

V ← θ_t^T φ_t ;   V' ← θ_t^T φ_{t+1}
δ_t = R_{t+1} + γV' - V
e_t ← γλ e_{t-1} + (1 - α γλ e_{t-1}^T φ_t) φ_t               ← Dutch trace
θ_{t+1} ← θ_t + α(δ_t + V - V_old) e_t - α(V - V_old) φ_t     ← corrected update
V_old ← V'

True online TD(λ) maintains the exact equivalence at every step and typically performs as well as or better than standard TD(λ).
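
A sketch of true online TD(λ) over one recorded trajectory, following the update above (phis and rewards are illustrative inputs; phis[t] = φ(S_t), with φ of the terminal state set to the zero vector):

import numpy as np

def true_online_td_lambda(phis, rewards, theta, alpha, gamma, lam):
    e = np.zeros_like(theta)
    v_old = 0.0
    for t in range(len(rewards)):
        phi, phi_next = phis[t], phis[t + 1]
        v = theta @ phi
        v_next = theta @ phi_next
        delta = rewards[t] + gamma * v_next - v
        e = gamma * lam * e + (1.0 - alpha * gamma * lam * (e @ phi)) * phi   # Dutch trace
        theta = theta + alpha * (delta + v - v_old) * e - alpha * (v - v_old) * phi
        v_old = v_next
    return theta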


Performance of TD(λ) vs TD(0) and MC

19-state random walk (S&B Figure 12.3):

RMS Error over 10 episodes (averaged):
  α\λ   0.0    0.3    0.5    0.7    0.9    1.0
  0.01  0.19   0.18   0.17   0.16   0.16   0.17
  0.1   0.17   0.15   0.13   0.12   0.12   0.16
  0.5   0.19   0.17   0.15   0.14   0.17   0.21

Best performance: around λ=0.7-0.9 — confirming intermediate λ is best. λ=0: TD(0) — good but not best; λ=1: MC — often worse due to high variance.


Eligibility Traces as Memory

Biological interpretation: eligibility traces resemble synaptic eligibility in neural plasticity (Hebbian learning, STDP). A synapse becomes “eligible” when activated; reward signal modulates how much this eligibility is converted to weight change. This matches the theoretical RL trace mechanism almost exactly.


Chapter 12 Summary

  1. Eligibility traces are decaying memory of recently-visited states: e_t(s) = γλ e_{t-1}(s) + 1[S_t=s]
  2. TD(λ) = backward view: update all states proportional to trace and TD error
  3. Forward-backward equivalence: for offline updates, TD(λ) produces the same total update as updating toward G_t^λ
  4. SARSA(λ): on-policy control with traces; faster credit assignment than SARSA
  5. Optimal λ ≈ 0.7–0.9 in practice; λ=0 → TD(0); λ=1 → MC
  6. Linear FA: trace = feature vector; O(d) per step → efficient
  7. True Online TD(λ): exact forward-backward equivalence at every step

Connection to DynamICCL

GAE in PPO is the λ-return (forward-view) counterpart of eligibility traces, applied to advantage estimation:

GAE Â_t = Σ_{l=0}^∞ (γλ)^l δ_{t+l}   ← forward view of TD(λ)
         ≡ eligibility trace accumulation   ← backward view

PPO implements this forward view explicitly (backward pass over trajectory).
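
A minimal GAE sketch showing that backward pass, using the recursion Â_t = δ_t + γλ Â_{t+1} (array names are illustrative; values must include V of the final state, zero if terminal):

import numpy as np

def gae_advantages(rewards, values, gamma, lam):
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD error
        running = delta + gamma * lam * running                  # backward accumulation
        adv[t] = running
    return adv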

For DynamICCL with linear state features, eligibility traces could replace GAE for more online (step-by-step) updates rather than batch rollout updates — trading latency for data efficiency.