Asynchronous DP, GPI, and Chapter 4 Summary
Chapter 4 — Dynamic Programming
Book: Reinforcement Learning: An Introduction (Sutton & Barto, 2nd ed) Pages: 88–98
Asynchronous Dynamic Programming
Problem with synchronous DP: each sweep updates all states — expensive for large |S|.
Asynchronous DP: update states in any order, using the most recent available values. Guarantees still hold as long as all states are updated infinitely often.
Three Asynchronous Strategies
1. In-Place DP (Asynchronous Sweeps)
Instead of maintaining V_old and V_new, update V(s) immediately:
V(s) ← max_a Σ_{s',r} p(s',r|s,a) [r + γ V(s')] ← uses latest V, not V_k
Effects:
- Converges faster than synchronous (downstream states use updated upstream values)
- Less memory (one array instead of two)
- Order of state updates affects convergence speed, but not correctness
Good ordering heuristic: update states in order of how much they will change, i.e., largest Bellman error |v_new − V(s)| first, where v_new is the freshly computed backup.
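The in-place update above can be sketched on a toy MDP. This is a hypothetical 4-state chain (not from the book): action 0 stays put, action 1 moves right, and a reward of 1 is earned on reaching the last state. A single value array is used, so later states in the sweep see values already updated earlier in the same sweep:

```python
# In-place value iteration on a toy 4-state chain MDP (hypothetical example).
# P[s][a] is a list of (prob, next_state, reward) triples.
GAMMA = 0.9
N_STATES, N_ACTIONS = 4, 2
# Action 0: stay (reward 0); action 1: move right (reward 1 on reaching the end).
P = {
    s: {
        0: [(1.0, s, 0.0)],
        1: [(1.0, min(s + 1, N_STATES - 1), 1.0 if s == N_STATES - 2 else 0.0)],
    }
    for s in range(N_STATES)
}

V = [0.0] * N_STATES              # single array: updates are used immediately
while True:
    delta = 0.0
    for s in range(N_STATES):     # sweep order affects speed, not correctness
        v_old = V[s]
        V[s] = max(
            sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])
            for a in range(N_ACTIONS)
        )
        delta = max(delta, abs(V[s] - v_old))
    if delta < 1e-8:
        break
```

On this chain the values converge to V = [0.81, 0.9, 1.0, 0.0]: the reward of 1 at state 2 is discounted by γ = 0.9 once per step of distance.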
2. Prioritized Sweeping
Only update states where the Bellman error is large:
Priority queue: PQ ordered by |V(s) - max_a Σ p(s',r|s,a)[r+γV(s')]|
Loop:
s ← PQ.pop_max() ← state with largest Bellman error
V(s) ← max_a ... ← do the backup
For each s_prev that leads to s (predecessors):
recompute priority of s_prev, update PQ
Effect: focuses computation on states where the value function is most inaccurate — much faster convergence than uniform sweeping.
Connection to TD learning: prioritized experience replay in DQN uses the same principle — sample transitions with large Bellman error for training.
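The pseudocode above can be sketched concretely. This is a hypothetical toy setup (a 4-state chain, not from the book); Python's heapq is a min-heap, so Bellman errors are stored negated to pop the largest first, and stale queue entries are handled lazily:

```python
# Prioritized sweeping on a toy 4-state chain MDP (hypothetical example).
import heapq

GAMMA = 0.9
N_STATES, N_ACTIONS = 4, 2
# Action 0: stay (reward 0); action 1: move right (reward 1 on reaching the end).
P = {
    s: {
        0: [(1.0, s, 0.0)],
        1: [(1.0, min(s + 1, N_STATES - 1), 1.0 if s == N_STATES - 2 else 0.0)],
    }
    for s in range(N_STATES)
}
# Predecessor map: states with some action leading into s.
PRED = {s: set() for s in range(N_STATES)}
for s in P:
    for a in P[s]:
        for _, s2, _ in P[s][a]:
            PRED[s2].add(s)

def backup(V, s):
    # One Bellman optimality backup for state s.
    return max(sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])
               for a in range(N_ACTIONS))

V = [0.0] * N_STATES
pq = [(-abs(backup(V, s) - V[s]), s) for s in range(N_STATES)]
heapq.heapify(pq)
while pq:
    neg_err, s = heapq.heappop(pq)
    if -neg_err < 1e-8:            # largest remaining error is tiny: done
        break
    V[s] = backup(V, s)            # back up the worst state
    for sp in PRED[s]:             # re-prioritize its predecessors
        err = abs(backup(V, sp) - V[sp])
        if err > 1e-8:             # stale duplicates simply pop as small
            heapq.heappush(pq, (-err, sp))
```

On this chain only three backups do real work: the error appears first at state 2 (where the reward is), then propagates backwards through the predecessor map, which is exactly the point of the method.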
3. Real-Time DP (RTDP)
Only update states that the agent actually visits during interaction:
While agent is interacting:
Observe S_t
V(S_t) ← max_a Σ p(s',r|S_t,a)[r+γV(s')] ← update current state only
A_t ← argmax_a ... (greedy w.r.t. V)
Execute A_t, observe S_{t+1}, R_{t+1}
Advantage: naturally focuses on the relevant part of the state space — states that matter for the agent’s actual behavior. Unreachable states never updated.
Convergence: under certain conditions (all relevant states visited infinitely often), RTDP converges to V* restricted to the relevant state space.
This is the prototype for TD learning (Ch.6): the key step is removing the model p(s',r|s,a) and replacing the expected update with a single sample.
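The RTDP loop above can be sketched on a toy 4-state chain (a hypothetical example, not from the book): only the state the agent currently occupies is backed up, and the agent then acts greedily on the new values. Ties are broken toward moving right here so the sketch stays deterministic; a real RTDP agent would need exploration or optimistic initial values to guarantee coverage:

```python
# Real-time DP on a toy 4-state chain MDP (hypothetical example).
GAMMA = 0.9
N_STATES, N_ACTIONS = 4, 2
# Action 0: stay (reward 0); action 1: move right (reward 1 on reaching the end).
P = {
    s: {
        0: [(1.0, s, 0.0)],
        1: [(1.0, min(s + 1, N_STATES - 1), 1.0 if s == N_STATES - 2 else 0.0)],
    }
    for s in range(N_STATES)
}

def q(V, s, a):
    # Expected one-step return of action a in state s (uses the model).
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])

V = [0.0] * N_STATES
for episode in range(5):
    s = 0
    for t in range(10):
        V[s] = max(q(V, s, a) for a in range(N_ACTIONS))          # back up current state only
        a = max(range(N_ACTIONS), key=lambda a: (q(V, s, a), a))  # greedy; ties favor moving right
        s = P[s][a][0][1]   # transitions in this toy chain are deterministic
```

Unlike TD, this is still model-based: the backup takes an expectation under p. Only the *choice of which state to back up* is driven by experience. After a few episodes the visited states carry V = [0.81, 0.9, 1.0], matching value iteration on the relevant part of the state space.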
Efficiency Considerations
Computational Complexity
| Algorithm | Cost | Iterations to ε | Total |
|---|---|---|---|
| Synchronous policy eval | O(|S|²|A|) per sweep | O(log(1/ε)/log(1/γ)) sweeps | O(|S|²|A| · log(1/ε)/log(1/γ)) |
| Synchronous value iter | O(|S|²|A|) per sweep | O(log(1/ε)/log(1/γ)) sweeps | Same |
| Async prioritized | O(|S||A| + log|S|) per backup | Fewer backups in practice | Better in practice |
| RTDP | O(|S||A|) per step (O(|A|) if transitions are deterministic) | Until convergence on relevant states | Problem-dependent |
For most real RL problems: |S| is too large for exact DP. Approximate methods (Part II) handle this.
State Space Tricks
State aggregation: group similar states and treat them as one:
V̂(g(s)) ≈ V*(s) where g: S → S_agg (state aggregation function)
Reduces |S| but introduces approximation error. Special case of linear function approximation (Ch.9).
Tile coding, basis functions: compact representations for continuous state spaces (Ch.9).
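State aggregation as defined above can be sketched in a few lines. This is a hypothetical example (names like `g` and `v_hat` follow the notation above, not any library): a 100-state space is grouped into 10 blocks, so the value table shrinks from 100 entries to 10 and every state in a block shares one value.

```python
# State aggregation sketch (hypothetical example).
N_STATES, N_GROUPS = 100, 10

def g(s):
    # Aggregation map g: S -> S_agg; block of 10 consecutive states per group.
    return s * N_GROUPS // N_STATES

V_agg = [0.0] * N_GROUPS     # one value per block, not per state

def v_hat(s):
    # Approximate value: all states in a block read (and write) one entry.
    return V_agg[g(s)]

# States 0..9 share group 0, 10..19 share group 1, ..., 90..99 group 9:
assert g(0) == g(9) == 0 and g(10) == 1 and g(99) == 9
```

The approximation error comes from forcing V̂ to be constant within each block; this is the "special case of linear function approximation" mentioned above, with one indicator feature per group.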
Generalized Policy Iteration (GPI): The Unified View
GPI is the central organizing concept of all RL:
V^π ──────────── Policy Evaluation ────────────┐
 ▲                                             │
 │                                             ▼
 │                                      π = greedy(V)
 │                                             │
 └──────────── Policy Improvement ─────────────┘

Fixed point: V = V^π AND π = greedy(V)  ⟺  V = V*
GPI encompasses all RL algorithms:
| Algorithm | Evaluation | Improvement |
|---|---|---|
| Policy Iteration | Full (to convergence) | Full greedy |
| Value Iteration | 1 Bellman backup | Implicit greedy |
| Modified PI | k Bellman backups | Full greedy |
| Monte Carlo (Ch.5) | Episode-length rollout | ε-greedy |
| SARSA (Ch.6) | 1-step TD | ε-greedy (on-policy) |
| Q-learning (Ch.6) | 1-step TD | Full greedy (off-policy) |
| Actor-Critic (Ch.13) | TD(λ) | Policy gradient step |
The GPI convergence argument: any algorithm that interleaves evaluation and improvement (with enough evaluation to make progress at each step) can only stabilize where V = V^π and π = greedy(V) hold simultaneously, i.e., at V* and π*.
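The table's middle ground, modified policy iteration, shows the interleaving directly. This is a minimal sketch on a hypothetical toy 4-state chain (not from the book): `K_EVAL` partial-evaluation backups per greedy improvement, where K_EVAL = 1 recovers value iteration and K_EVAL → ∞ recovers policy iteration.

```python
# GPI as modified policy iteration on a toy 4-state chain MDP (hypothetical).
GAMMA, K_EVAL = 0.9, 3
N_STATES, N_ACTIONS = 4, 2
# Action 0: stay (reward 0); action 1: move right (reward 1 on reaching the end).
P = {
    s: {
        0: [(1.0, s, 0.0)],
        1: [(1.0, min(s + 1, N_STATES - 1), 1.0 if s == N_STATES - 2 else 0.0)],
    }
    for s in range(N_STATES)
}

def q(V, s, a):
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])

V = [0.0] * N_STATES
pi = [0] * N_STATES
for _ in range(30):
    for _ in range(K_EVAL):                          # evaluation: push V toward V^pi
        V = [q(V, s, pi[s]) for s in range(N_STATES)]
    pi = [max(range(N_ACTIONS), key=lambda a: q(V, s, a))
          for s in range(N_STATES)]                  # improvement: greedy in V
```

Neither process runs to completion before the other starts, yet the loop settles at the unique pair where both are satisfied: π = move right on states 0..2, V = [0.81, 0.9, 1.0, 0.0].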
Two Forces in GPI
The evaluation and improvement forces “oppose” each other but converge:
improvement goal:  π = greedy(V)
────────────────────────────────────
      ↘    ↗    ↘    ↗
────────────────────────────────────
evaluation goal:   V = V^π

The process zigzags between the two goals and can only stop where both hold at once: at π* and V*.
- Evaluation (V → V^π): makes V consistent with π (if you follow π, this is what you get)
- Improvement (π → greedy(V)): makes π optimal with respect to current V
These two forces don’t point in the same direction generally, but the only stable fixed point is V* and π* — where both are consistent.
DP vs Exhaustive Search
Why is DP better than searching for the optimal policy directly?
Exhaustive search: try all |A|^|S| deterministic policies, evaluate each → infeasible.
DP's key insight is Bellman's principle of optimality (Bellman, 1957):

> An optimal policy has the property that, whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.
In other words: optimal substructure. V*(s) can be computed from the values V*(s') of successor states, so each sub-problem is solved once and reused rather than re-solved.
This reduces from exponential (exhaustive) to polynomial (DP) complexity.
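A back-of-envelope comparison makes the gap concrete. The numbers below are hypothetical illustration only (|S| = 20, |A| = 4, and the γ-contraction sweep bound from the efficiency table above):

```python
# Exhaustive search vs value iteration: rough operation counts (hypothetical numbers).
import math

S, A = 20, 4
GAMMA, EPS = 0.9, 1e-3

n_policies = A ** S                  # exhaustive: one evaluation per deterministic policy
sweeps = math.ceil(math.log(1 / EPS) / math.log(1 / GAMMA))  # gamma-contraction bound
n_backups = sweeps * S * A           # value iteration: |S||A| backups per sweep

print(n_policies)   # 4^20 = 1,099,511,627,776 candidate policies (~1.1e12)
print(n_backups)    # 66 sweeps x 80 backups = 5,280 backups
```

Even for this tiny MDP, exhaustive search would have to evaluate about 10⁹ times more policies than value iteration needs backups, and the gap grows exponentially in |S|.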
Chapter 4 Summary
| Concept | Key Points |
|---|---|
| Policy evaluation | Iterative Bellman backup: V^π = unique fixed point of T^π |
| Policy improvement | Greedy w.r.t. V^π always improves or is already optimal |
| Policy iteration | Alternates evaluation + improvement; converges in finite steps |
| Value iteration | 1-step backup + implicit improvement; γ-contraction → convergence |
| Async DP | Update states selectively; RTDP focuses on visited states |
| GPI | Unifying framework; all RL = evaluation + improvement interleaved |
Limitations of DP (motivation for the rest of the book):
1. Requires a known model p(s',r|s,a)
2. Sweeps over all states, which is expensive for large |S|
3. Uses the model directly rather than actual experience; it cannot learn from interaction alone
→ Monte Carlo (Ch.5): model-free, uses experience, episode-based
→ TD (Ch.6): model-free, uses experience, step-based (a sample-based analog of RTDP)
Connection to DynamICCL
| DP Concept | DynamICCL Analog |
|---|---|
| Policy evaluation | PPO rollout + GAE advantage estimation |
| Policy improvement | PPO gradient step (gradient of clipped surrogate) |
| GPI | PPO training loop: collect rollout → update policy → repeat |
| Async DP / RTDP | Online rollouts — only updates states DynamICCL actually visits |
| Prioritized sweeping | PER (Prioritized Experience Replay) in off-policy variants |
| Bellman optimality eq | Q-value target: y = r + γ max_{a'} Q(s', a'; θ_old) |