Asynchronous DP, GPI, and Chapter 4 Summary

Chapter 4 — Dynamic Programming

Book: Reinforcement Learning: An Introduction (Sutton & Barto, 2nd ed) Pages: 88–98


Asynchronous Dynamic Programming

Problem with synchronous DP: each sweep updates all states — expensive for large |S|.

Asynchronous DP: update states in any order, using the most recent available values. Guarantees still hold as long as all states are updated infinitely often.

Three Asynchronous Strategies

1. In-Place DP (Asynchronous Sweeps)

Instead of maintaining V_old and V_new, update V(s) immediately:

V(s) ← max_a Σ_{s',r} p(s',r|s,a) [r + γ V(s')]   ← uses latest V, not V_k

Effects:
- Converges faster than synchronous DP (downstream states use already-updated upstream values)
- Uses less memory (one value array instead of two)
- The order of state updates affects convergence speed, but not correctness

Good ordering heuristic: update states in order of how much they are expected to change, i.e., largest Bellman error |v - V(s)| first, where v is the newly backed-up value.
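
A minimal Python sketch of in-place value iteration, assuming a tabular model stored as a dict `P[(s, a)] = [(prob, next_state, reward), ...]` (the function and variable names here are illustrative, not from the book):

```python
def in_place_value_iteration(states, actions, P, gamma=0.9, theta=1e-6):
    """In-place (asynchronous-sweep) value iteration.

    Assumed model format: P[(s, a)] -> list of (prob, next_state, reward).
    """
    V = {s: 0.0 for s in states}          # single table, no V_old / V_new copy
    while True:
        delta = 0.0
        for s in states:                  # sweep order is free to choose
            v_old = V[s]
            # Bellman optimality backup using the *latest* V values
            V[s] = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in P[(s, a)])
                for a in actions
            )
            delta = max(delta, abs(v_old - V[s]))
        if delta < theta:                 # stop when the largest change is tiny
            return V
```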

2. Prioritized Sweeping

Only update states where the Bellman error is large:

Priority queue: PQ ordered by |V(s) - max_a Σ p(s',r|s,a)[r+γV(s')]|

Loop:
  s ← PQ.pop_max()     ← state with largest Bellman error
  V(s) ← max_a ...     ← do the backup
  For each s_prev that leads to s (predecessors):
    recompute priority of s_prev, update PQ

Effect: focuses computation on states where the value function is most inaccurate — much faster convergence than uniform sweeping.
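
A rough Python sketch of this loop, assuming the same hypothetical `P[(s, a)]` model format as above plus a precomputed `predecessors[s]` mapping. Stale priority-queue entries are simply tolerated (lazy deletion) rather than removed, which keeps the sketch short:

```python
import heapq
import itertools

def greedy_backup(s, V, actions, P, gamma):
    """Max-over-actions Bellman backup value for state s."""
    return max(
        sum(p * (r + gamma * V[s2]) for p, s2, r in P[(s, a)])
        for a in actions
    )

def prioritized_sweeping(states, actions, P, predecessors,
                         gamma=0.9, n_updates=10_000, theta=1e-5):
    V = {s: 0.0 for s in states}
    tie = itertools.count()                   # tie-breaker so states never compare
    pq = []                                   # min-heap, so priorities are negated

    def push(s):
        err = abs(greedy_backup(s, V, actions, P, gamma) - V[s])
        if err > theta:
            heapq.heappush(pq, (-err, next(tie), s))

    for s in states:
        push(s)
    for _ in range(n_updates):
        if not pq:
            break
        _, _, s = heapq.heappop(pq)           # state with the largest Bellman error
        V[s] = greedy_backup(s, V, actions, P, gamma)
        for sp in predecessors[s]:            # re-prioritize states leading into s
            push(sp)
    return V
```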

Connection to TD learning: prioritized experience replay in DQN uses the same principle — sample transitions with large Bellman error for training.

3. Real-Time DP (RTDP)

Only update states that the agent actually visits during interaction:

While agent is interacting:
  Observe S_t
  V(S_t) ← max_a Σ p(s',r|S_t,a)[r+γV(s')]    ← update current state only
  A_t ← argmax_a ... (greedy w.r.t. V)
  Execute A_t, observe S_{t+1}, R_{t+1}

Advantage: naturally focuses on the relevant part of the state space — states that matter for the agent’s actual behavior. Unreachable states never updated.

Convergence: under certain conditions (all relevant states visited infinitely often), RTDP converges to V* restricted to the relevant state space.
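
A sketch of one RTDP episode, again using the assumed `P[(s, a)]` model dict; here the model also stands in for the environment by sampling next states, and `terminal` is an assumed set of terminal states (all names are illustrative):

```python
import random

def rtdp_episode(s0, terminal, actions, P, V, gamma=0.9, max_steps=1_000):
    """One real-time DP episode: back up only the states actually visited."""
    s = s0
    for _ in range(max_steps):
        if s in terminal:
            break
        # Full expected backup at the current state only
        q = {a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[(s, a)])
             for a in actions}
        V[s] = max(q.values())
        a = max(q, key=q.get)                 # act greedily w.r.t. the updated V
        # Sample S_{t+1} from the model (stands in for real interaction)
        probs, next_states, _ = zip(*P[(s, a)])
        s = random.choices(next_states, weights=probs)[0]
    return V
```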

This is the prototype for TD learning (Ch.6) — the key step is removing the model p(s’|s,a) and replacing the expected update with a single sample.
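
To make that step concrete, here is the sample-based state-value update (TD(0)) developed in Chapter 6, sketched with an assumed step size `alpha`: the expectation over the model is replaced by one observed transition (S_t, R_{t+1}, S_{t+1}).

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    # Same backup target as DP, but built from a single sampled transition
    # instead of an expectation over the model p(s', r | s, a).
    V[s] += alpha * (r + gamma * V[s_next] - V[s])
```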


Efficiency Considerations

Computational Complexity

| Algorithm | Per-sweep cost | Sweeps to ε-accuracy | Total |
|---|---|---|---|
| Synchronous policy evaluation | O(n²m) | O(log(1/ε) / log(1/γ)) | O(n²m · log(1/ε) / log(1/γ)) |
| Synchronous value iteration | O(n²m) | O(log(1/ε) / log(1/γ)) | Same |
| Asynchronous prioritized sweeping | O(n²m) per full sweep | Far fewer updates needed in practice | Better in practice |
| RTDP | O(mb) per step | Until convergence on the relevant states | Depends on the problem |

(Notation: n = |S|, m = |A|, b = successor states per action.)

For most real RL problems: |S| is too large for exact DP. Approximate methods (Part II) handle this.

State Space Tricks

State aggregation: group similar states and treat them as one:

V̂(g(s)) ≈ V*(s)  where g: S → S_agg (state aggregation function)

Reduces |S| but introduces approximation error. Special case of linear function approximation (Ch.9).
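
A toy sketch of state aggregation, assuming 1-D integer states bucketed into coarse bins (the grouping function and bin width are arbitrary choices for illustration):

```python
def g(s, bin_width=10):
    """State aggregation function: map many states to one group index."""
    return s // bin_width

V_agg = {}                     # one value per group, so |S_agg| << |S|

def v_hat(s):
    """Approximate value: every state in a bin shares its group's value."""
    return V_agg.get(g(s), 0.0)
```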

Tile coding, basis functions: compact representations for continuous state spaces (Ch.9).


Generalized Policy Iteration (GPI): The Unified View

GPI is the central organizing concept of all RL:

   V^π ──────────── Policy Evaluation ────────────┐
    ▲                                             │
    │                                             ▼
    │                                       π = greedy(V)
    │                                             │
    └──────────── Policy Improvement ─────────────┘

  Fixed point: V = V^π AND π = greedy(V)  ↔  V = V*

GPI encompasses all RL algorithms:

| Algorithm | Evaluation | Improvement |
|---|---|---|
| Policy Iteration | Full (to convergence) | Full greedy |
| Value Iteration | 1 Bellman backup | Implicit greedy |
| Modified PI | k Bellman backups | Full greedy |
| Monte Carlo (Ch.5) | Episode-length rollout | ε-greedy |
| SARSA (Ch.6) | 1-step TD | ε-greedy (on-policy) |
| Q-learning (Ch.6) | 1-step TD | Full greedy (off-policy) |
| Actor-Critic (Ch.13) | TD(λ) | Policy gradient step |

The GPI convergence argument: any algorithm that interleaves evaluation and improvement converges to V* and π*, provided evaluation is sufficiently accurate and every state continues to be updated; the process can only stabilize when the policy is greedy with respect to its own value function, which is exactly the Bellman optimality condition.
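
A generic GPI sketch under the same assumed `P[(s, a)]` model format: k partial evaluation sweeps followed by a full greedy improvement. With k = 1 this is essentially value iteration, with large k it approaches policy iteration, and intermediate k is modified policy iteration:

```python
def gpi(states, actions, P, gamma=0.9, k=5, n_iterations=100):
    V = {s: 0.0 for s in states}
    pi = {s: actions[0] for s in states}      # arbitrary initial policy

    def q(s, a):
        return sum(p * (r + gamma * V[s2]) for p, s2, r in P[(s, a)])

    for _ in range(n_iterations):
        for _ in range(k):                    # partial policy evaluation
            for s in states:
                V[s] = q(s, pi[s])
        for s in states:                      # greedy policy improvement
            pi[s] = max(actions, key=lambda a: q(s, a))
    return V, pi
```

The single knob k makes the evaluation/improvement trade-off of the table above explicit: more evaluation per cycle means fewer, more accurate improvement steps.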


Two Forces in GPI

The evaluation and improvement forces “oppose” each other but converge:

                    π = greedy(V)
                   /
Policy space ──── π* ──────── V*
                              \
                         V = V^π  (evaluation)

These two forces don’t point in the same direction generally, but the only stable fixed point is V* and π* — where both are consistent.


Why is DP better than searching for the optimal policy directly?

Exhaustive search: try all |A|^|S| deterministic policies, evaluate each → infeasible.

DP's key insight is Bellman's principle of optimality (1957):

> An optimal policy has the property that, whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.

In other words: optimal substructure. V(s) can be computed from V(s’) — no need to re-solve sub-problems.

This reduces from exponential (exhaustive) to polynomial (DP) complexity.
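
To make the gap concrete (numbers chosen purely for illustration): with |S| = 20 states and |A| = 4 actions, exhaustive search would have to evaluate |A|^|S| = 4^20 ≈ 1.1 × 10^12 deterministic policies, while a single value-iteration sweep costs on the order of |S|²|A| = 1600 backups and roughly log(1/ε)/log(1/γ) sweeps suffice.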


Chapter 4 Summary

| Concept | Key Points |
|---|---|
| Policy evaluation | Iterative Bellman backup: V^π is the unique fixed point of T^π |
| Policy improvement | Greedy w.r.t. V^π always improves the policy, or it is already optimal |
| Policy iteration | Alternates evaluation + improvement; converges in finitely many steps |
| Value iteration | 1-step backup + implicit improvement; γ-contraction → convergence |
| Async DP | Update states selectively; RTDP focuses on visited states |
| GPI | Unifying framework; all RL = evaluation + improvement interleaved |

Limitations of DP (motivation for the rest of the book):
1. Requires a known model p(s',r|s,a)
2. Sweeps over all states, which is expensive for large |S|
3. Uses the model directly rather than learning from actual experience

→ Monte Carlo (Ch.5): model-free, uses experience, episode-based
→ TD (Ch.6): model-free, uses experience, step-based (sample-based RTDP)


Connection to DynamICCL

| DP Concept | DynamICCL Analog |
|---|---|
| Policy evaluation | PPO rollout + GAE advantage estimation |
| Policy improvement | PPO gradient step (gradient of the clipped surrogate) |
| GPI | PPO training loop: collect rollout → update policy → repeat |
| Async DP / RTDP | Online rollouts: only the states DynamICCL actually visits get updated |
| Prioritized sweeping | PER (Prioritized Experience Replay) in off-policy variants |
| Bellman optimality eq. | Q-value target: y = r + γ max_{a'} Q(s', a'; θ_old) |