Value Iteration
Chapter 4 — Dynamic Programming
Book: Reinforcement Learning: An Introduction (Sutton & Barto, 2nd ed) Pages: 84–88
The Key Insight
Policy iteration's problem: each evaluation phase requires many sweeps before V^π converges, which is expensive.
Key observation: we don’t need V^π to converge before improving π. We can interleave a single Bellman backup with policy improvement and still converge to V*.
Value iteration: combine evaluation (1 step) and improvement into a single update:
V_{k+1}(s) = max_a Σ_{s',r} p(s',r|s,a) [r + γ V_k(s')]
This is the Bellman optimality operator T* applied once:
V_{k+1} = T* V_k
Algorithm: Value Iteration
Input: MDP (S, A, P, R, γ), threshold θ > 0
Initialize: V(s) arbitrarily for all s ∈ S (e.g., V(s) = 0), with V(terminal) = 0
Loop:
Δ ← 0
For each s ∈ S:
v ← V(s)
V(s) ← max_a Σ_{s',r} p(s',r|s,a) [r + γV(s')] ← Bellman optimality backup
Δ ← max(Δ, |v - V(s)|)
Until Δ < θ
Output: Deterministic policy π ≈ π*:
π(s) ← argmax_a Σ_{s',r} p(s',r|s,a) [r + γV(s')]
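A minimal NumPy sketch of the loop above, assuming the model is given as arrays P[s, a, s'] (transition probabilities) and R[s, a, s'] (expected rewards), so the sum over (s', r) in the text collapses to an expectation over s':

```python
import numpy as np

def value_iteration(P, R, gamma, theta=1e-8):
    """Value iteration for a tabular MDP.

    P[s, a, s'] : transition probabilities, R[s, a, s'] : expected rewards.
    Returns the optimal value function V and a greedy (deterministic) policy.
    """
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Bellman optimality backup: Q[s,a] = sum_s' P[s,a,s'] (R[s,a,s'] + gamma V[s'])
        Q = np.einsum("san,san->sa", P, R + gamma * V[None, None, :])
        V_new = Q.max(axis=1)
        delta = np.abs(V_new - V).max()
        V = V_new
        if delta < theta:
            break
    # Extract the greedy policy with respect to the final V
    Q = np.einsum("san,san->sa", P, R + gamma * V[None, None, :])
    policy = Q.argmax(axis=1)
    return V, policy
```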
Key difference from policy evaluation:
max_a instead of Σ_a π(a|s).
Convergence Proof
Theorem: V_k → V* as k → ∞.
Proof: T* is a γ-contraction:
||T*V - T*U||_∞ = max_s |max_a Σ_{s',r} p(s',r|s,a)[r + γV(s')] - max_a Σ_{s',r} p(s',r|s,a)[r + γU(s')]|
≤ max_s max_a |Σ_{s',r} p(s',r|s,a) γ[V(s') - U(s')]|   (since |max_a f(a) - max_a g(a)| ≤ max_a |f(a) - g(a)|)
≤ γ max_s max_a Σ_{s'} p(s'|s,a) |V(s') - U(s')|
≤ γ max_{s'} |V(s') - U(s')|
= γ ||V - U||_∞
Therefore: ||V_k - V*||_∞ ≤ γ^k ||V_0 - V*||_∞ → 0 as k → ∞. ∎
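Not part of the proof, but a quick numerical illustration of the contraction on a small random MDP (the state and action counts here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 6, 3, 0.9

# Random transition model and rewards
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a, :] sums to 1
R = rng.normal(size=(nS, nA, nS))

def bellman_opt(V):
    """Apply the Bellman optimality operator T* once."""
    Q = np.einsum("san,san->sa", P, R + gamma * V[None, None, :])
    return Q.max(axis=1)

V, U = rng.normal(size=nS), rng.normal(size=nS)
lhs = np.abs(bellman_opt(V) - bellman_opt(U)).max()   # ||T*V - T*U||_inf
rhs = gamma * np.abs(V - U).max()                     # gamma ||V - U||_inf
assert lhs <= rhs + 1e-12
print(lhs, "<=", rhs)
```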
Error bound for extracting policy: if ||V_k - V*||_∞ ≤ ε, then the greedy policy w.r.t. V_k has value:
||V^{π_k} - V*||_∞ ≤ 2γε / (1-γ)
So, to guarantee the extracted greedy policy is within some target δ of optimal, stop value iteration once ||V_k - V*||_∞ ≤ ε = δ(1-γ)/(2γ).
Convergence Rate Comparison
| Algorithm | Per-iteration cost | Iterations to ε | Total cost |
|---|---|---|---|
| Policy Iteration | O(\|S\|²\|A\|) per evaluation sweep, plus O(\|S\|³) for an exact evaluation solve | few (typically 3–10 improvements) | a handful of expensive iterations |
| Value Iteration | O(\|S\|²\|A\|) per sweep | O(log(1/ε) / (1−γ)) | many cheap sweeps |
For γ = 0.99 and a target policy error of δ = 0.001, the bound above requires ||V_k - V*||_∞ ≲ 5×10⁻⁶, i.e. roughly log(2×10⁵)/log(1/0.99) ≈ 1200 sweeps if the initial error is of order 1. But each sweep is simpler (no nested evaluation), so in practice both methods are comparable.
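The same arithmetic as a few lines of Python (a minimal sketch; the sweep count assumes the initial error ||V_0 - V*||_∞ is of order 1):

```python
import math

gamma = 0.99        # discount factor
target = 1e-3       # desired suboptimality of the extracted greedy policy

# Required accuracy in V, from ||V^pi_k - V*|| <= 2*gamma*eps / (1 - gamma)
eps = target * (1 - gamma) / (2 * gamma)         # ~5.1e-6

# Sweeps needed for gamma^k to shrink an O(1) initial error below eps
k = math.ceil(math.log(eps) / math.log(gamma))
print(f"eps = {eps:.2e}, sweeps ~ {k}")          # eps = 5.05e-06, sweeps ~ 1214
```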
Worked Example: Gambler’s Problem (S&B Example 4.3)
Setup:
- Gambler bets on coin flips (p_h = probability of heads = 0.4)
- State: capital s ∈ {1, 2, …, 99} (amount of money)
- Actions: stake a ∈ {1, …, min(s, 100−s)} (can't bet more than needed to reach $100)
- Reward: +1 on reaching $100; 0 otherwise (including reaching $0)
- Terminal states: 0 (lose), 100 (win)
- γ = 1
Value iteration on this problem:
V_0(s) = 0 for all s
V_{k+1}(s) = max_a [p_h · V_k(s + a) + (1-p_h) · V_k(s - a)]
Boundary conditions:
- V(0) = 0 (lose)
- V(100) = 1 (win)
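A compact sketch of value iteration for this setup (a fixed number of in-place sweeps rather than a θ test, for brevity; the +1 reward for winning is folded into the boundary value V(100) = 1):

```python
import numpy as np

p_h, n_sweeps = 0.4, 1000
V = np.zeros(101)
V[100] = 1.0                      # win; V[0] stays 0 (lose)

for _ in range(n_sweeps):
    for s in range(1, 100):       # in-place (Gauss-Seidel style) sweep
        stakes = range(1, min(s, 100 - s) + 1)
        V[s] = max(p_h * V[s + a] + (1 - p_h) * V[s - a] for a in stakes)

# Greedy policy: for each capital level, the stake maximizing expected value
policy = [max(range(1, min(s, 100 - s) + 1),
              key=lambda a: p_h * V[s + a] + (1 - p_h) * V[s - a])
          for s in range(1, 100)]
```

Note that many stakes are exactly tied in value, so the extracted policy depends on tie-breaking, which is why plots of the optimal policy can look different across implementations.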
Convergence (approximate values after many iterations):
V*(s) ≈ s/100 if p_h = 0.5 (fair coin)
V*(s) for p_h ≠ 0.5: a complicated piecewise curve (still monotonically increasing in s); it is the optimal policy, not V*, that looks jagged
Optimal policy (for p_h = 0.4, unfair coin): bet big when you can! The optimal strategy bets maximally in certain states (reaching 50, 25, 75 as “safe” targets), creating a non-obvious staircase pattern.
s=1: bet 1 (if heads → 2, if tails → 0)
s=25: bet 25 (reach 50 or 0)
s=50: bet 50 (reach 100 or 0)
s=75: bet 25 (reach 100 or 50)
s=26: bet 24 (reach 50 or 2)
This counter-intuitive optimal policy (big bets for unfair coin) shows why value iteration is powerful — it discovers policies that are not obvious from intuition.
Value Iteration vs Policy Iteration: When to Use Which
Policy Iteration advantages:
- Converges in very few policy iterations (3–10 for typical MDPs)
- Each evaluation gives exact V^π, which can be reused
- Better when |S| is small and the linear-system solve is fast
Value Iteration advantages:
- Each sweep is simpler (no full evaluation)
- Handles episodic tasks naturally (backward induction is value iteration from the terminal states)
- More natural connection to sample-based methods (TD learning is sample-based value iteration)
Modified Policy Iteration (k evaluation sweeps before improvement):
- Unifies both: k = 1 → value iteration; k = ∞ → policy iteration
- k = 10 often achieves policy-iteration quality at near-value-iteration per-sweep cost (see the sketch below)
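A sketch of the k-sweep variant under the same P/R array encoding as above; with k = 1 the loop reduces to the value-iteration backup, while large k approaches full policy evaluation:

```python
import numpy as np

def modified_policy_iteration(P, R, gamma, k=10, n_outer=1000, theta=1e-8):
    """Alternate a greedy improvement with k evaluation sweeps under that policy."""
    nS, nA, _ = P.shape
    V = np.zeros(nS)
    for _ in range(n_outer):
        # Greedy improvement w.r.t. the current V; the max is one optimality backup
        Q = np.einsum("san,san->sa", P, R + gamma * V[None, None, :])
        policy, V_new = Q.argmax(axis=1), Q.max(axis=1)
        # Partial evaluation: k - 1 further expectation backups under the greedy policy
        for _ in range(k - 1):
            V_new = np.einsum("sn,sn->s",
                              P[np.arange(nS), policy],
                              R[np.arange(nS), policy] + gamma * V_new[None, :])
        if np.abs(V_new - V).max() < theta:
            V = V_new
            break
        V = V_new
    return V, policy
```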
Relationship to Q-Value Iteration
Instead of iterating V, iterate Q:
Q_{k+1}(s, a) = Σ_{s',r} p(s',r|s,a) [r + γ max_{a'} Q_k(s', a')]
This is identical to value iteration but expressed in terms of Q. Advantage: policy extraction is trivial (argmax_a Q*(s,a)) and needs no model. The backup itself still uses p(s',r|s,a); Q-learning (Ch. 6) is the sample-based, model-free approximation of this update.
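The same fixed point computed over Q instead of V, again assuming the P/R arrays from the earlier sketch (note the backup still needs the model; only the final argmax does not):

```python
import numpy as np

def q_value_iteration(P, R, gamma, theta=1e-8):
    """Q-value iteration: iterate the Bellman optimality backup on Q(s, a)."""
    nS, nA, _ = P.shape
    Q = np.zeros((nS, nA))
    while True:
        # Q_{k+1}(s,a) = sum_s' P[s,a,s'] (R[s,a,s'] + gamma * max_a' Q_k(s',a'))
        target = R + gamma * Q.max(axis=1)[None, None, :]
        Q_new = np.einsum("san,san->sa", P, target)
        if np.abs(Q_new - Q).max() < theta:
            return Q_new
        Q = Q_new

# Model-free policy extraction once Q* is known: policy = Q.argmax(axis=1)
```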
Backward Induction for Episodic MDPs
For episodic MDPs with known finite horizon T, value iteration becomes exact backward induction:
V_T(s) = 0 ← value at the horizon (no reward after time T)
V_{T-1}(s) = max_a Σ p(s',r|s,a)[r + γ V_T(s')]
V_{T-2}(s) = max_a Σ p(s',r|s,a)[r + γ V_{T-1}(s')]
...
V_0 = V*
Terminates in exactly T iterations. Used in board games with known game lengths.
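A sketch of backward induction with a time-indexed value function, assuming a stationary model in the same P/R array form (a time-dependent model would simply be indexed by t as well):

```python
import numpy as np

def backward_induction(P, R, gamma, T):
    """Finite-horizon value iteration: exactly T backups from the horizon."""
    nS, nA, _ = P.shape
    V = np.zeros((T + 1, nS))            # V[T] = 0 at the horizon
    policy = np.zeros((T, nS), dtype=int)
    for t in range(T - 1, -1, -1):
        Q = np.einsum("san,san->sa", P, R + gamma * V[t + 1][None, None, :])
        V[t] = Q.max(axis=1)
        policy[t] = Q.argmax(axis=1)     # optimal action depends on time-to-go
    return V, policy
```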
Connection to DynamICCL
Value iteration ((T*)^k V_0 → V*) is the theoretical ideal that DynamICCL’s Q-network is learning. Each PPO rollout + update is one noisy step toward Q*:
- Rollout: sample (s_t, a_t, r_t, s_{t+1}) from current policy
- Bellman target: y = r + γ max_{a'} Q(s', a'; θ_old) — the value iteration update
- Gradient step: move Q(s, a; θ) toward y — stochastic gradient approximation of T*
The key insight: PPO/DQN trade exact Bellman backups (requiring p(s'|s,a)) for sample-based approximations. DynamICCL thus performs approximate value iteration through experience, without needing the NCCL dynamics model.
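For concreteness, the tabular analogue of that sample-based step is the Q-learning update sketched below; this is an illustration of the idea, not DynamICCL's actual training code, and the function name and array layout are hypothetical:

```python
import numpy as np

def sampled_bellman_step(Q, s, a, r, s_next, gamma=0.99, alpha=0.1):
    """One sample-based approximation of the T* backup (tabular Q-learning).

    With a neural Q(s, a; theta), the same target y is regressed by a gradient
    step instead of this direct tabular move.
    """
    y = r + gamma * Q[s_next].max()        # sampled Bellman target
    Q[s, a] += alpha * (y - Q[s, a])       # move Q(s, a) toward y
    return Q
```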