Value Iteration

Chapter 4 — Dynamic Programming

Book: Reinforcement Learning: An Introduction (Sutton & Barto, 2nd ed.), pages 84–88


The Key Insight

Policy iteration problem: each evaluation phase requires many sweeps until V converges, which is expensive.

Key observation: we don’t need V^π to converge before improving π. We can interleave a single Bellman backup with policy improvement and still converge to V*.

Value iteration: combine evaluation (1 step) and improvement into a single update:

V_{k+1}(s) = max_a Σ_{s',r} p(s',r|s,a) [r + γ V_k(s')]

This is the Bellman optimality operator T* applied once:

V_{k+1} = T* V_k

Algorithm: Value Iteration

Input: MDP (S, A, P, R, γ), threshold θ > 0

Initialize: V(s) arbitrarily (e.g., V(s) = 0) for all s ∈ S; V(terminal) = 0

Loop:
  Δ ← 0
  For each s ∈ S:
    v ← V(s)
    V(s) ← max_a Σ_{s',r} p(s',r|s,a) [r + γ V(s')]   ← Bellman optimality backup
    Δ ← max(Δ, |v - V(s)|)
Until Δ < θ

Output: Deterministic policy π ≈ π*:
  π(s) ← argmax_a Σ_{s',r} p(s',r|s,a) [r + γV(s')]

Key difference from policy evaluation: max_a instead of Σ_a π(a|s).
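The loop above can be written out directly. A minimal tabular sketch (the MDP representation — a dict `p[(s, a)]` of `(prob, next_state, reward)` triples — is my own choice, not from the book):

```python
def value_iteration(states, actions, p, gamma, theta=1e-9):
    """Tabular value iteration.

    p[(s, a)] is a list of (prob, next_state, reward) triples;
    terminal states have no entry in `actions` and keep value 0.
    """
    V = {s: 0.0 for s in states}

    def backup(s, a):
        # Bellman optimality backup: sum_{s',r} p(s',r|s,a) [r + gamma V(s')]
        return sum(prob * (r + gamma * V[s2]) for prob, s2, r in p[(s, a)])

    while True:
        delta = 0.0
        for s in states:
            if not actions.get(s):          # terminal state: value stays 0
                continue
            v = V[s]
            V[s] = max(backup(s, a) for a in actions[s])
            delta = max(delta, abs(v - V[s]))
        if delta < theta:
            break

    # Greedy policy extraction: pi(s) = argmax_a backup(s, a)
    pi = {s: max(actions[s], key=lambda a: backup(s, a))
          for s in states if actions.get(s)}
    return V, pi

# Toy MDP: 'stay' earns +1 and loops; 'quit' ends the episode.
states = ['s0', 'end']
actions = {'s0': ['stay', 'quit']}
p = {('s0', 'stay'): [(1.0, 's0', 1.0)],
     ('s0', 'quit'): [(1.0, 'end', 0.0)]}
V, pi = value_iteration(states, actions, p, gamma=0.9)
# V['s0'] converges to 1/(1 - 0.9) = 10
```

Note the in-place sweep: later states in the same sweep already see updated values, which is the asynchronous variant and still converges.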


Convergence Proof

Theorem: V_k → V* as k → ∞.

Proof: T* is a γ-contraction:

||T*V - T*U||_∞ = max_s |max_a Σ_{s',r} p(s',r|s,a)[r+γV(s')] - max_a Σ_{s',r} p(s',r|s,a)[r+γU(s')]|
               ≤ max_s max_a |Σ_{s'} p(s'|s,a) γ[V(s') - U(s')]|     (|max_a f(a) - max_a g(a)| ≤ max_a |f(a) - g(a)|; the r terms cancel)
               ≤ γ max_{s'} |V(s') - U(s')|                          (Σ_{s'} p(s'|s,a) = 1)
               = γ ||V - U||_∞

Therefore: ||V_k - V*||_∞ ≤ γ^k ||V_0 - V*||_∞ → 0 as k → ∞. ∎
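The contraction inequality can be checked numerically on a random MDP (the dense-matrix representation here is my own, chosen for brevity):

```python
import random

random.seed(0)
n_s, n_a, gamma = 5, 3, 0.9

# Random transition distributions P[s][a] over next states, random rewards R[s][a]
P = [[[random.random() for _ in range(n_s)] for _ in range(n_a)] for _ in range(n_s)]
for s in range(n_s):
    for a in range(n_a):
        z = sum(P[s][a])
        P[s][a] = [pr / z for pr in P[s][a]]   # normalize to a distribution
R = [[random.random() for _ in range(n_a)] for _ in range(n_s)]

def T(V):
    # Bellman optimality operator: (T*V)(s) = max_a [R(s,a) + gamma * sum_s' P(s'|s,a) V(s')]
    return [max(R[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in range(n_s))
                for a in range(n_a))
            for s in range(n_s)]

def sup_norm(V, U):
    return max(abs(v - u) for v, u in zip(V, U))

# Two arbitrary value functions: applying T* must shrink their distance by >= gamma
V = [random.random() * 10 for _ in range(n_s)]
U = [random.random() * 10 for _ in range(n_s)]
lhs = sup_norm(T(V), T(U))
rhs = gamma * sup_norm(V, U)
```

For any such draw, `lhs <= rhs` must hold; a violation would falsify the contraction property.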

Error bound for extracting policy: if ||V_k - V*||_∞ ≤ ε, then the greedy policy w.r.t. V_k has value:

||V^{π_k} - V*||_∞ ≤ 2γε / (1-γ)

So stop value iteration at ε = θ(1-γ)/(2γ) to guarantee policy value within θ of optimal.


Convergence Rate Comparison

Algorithm          Per-iteration cost      Iterations to ε             Total cost
Policy Iteration   O(|S|²|A| + |S|³)       few in practice (~3–10)     ~10 · O(|S|²|A| + |S|³)
Value Iteration    O(|S|²|A|)              O(log(1/ε) / log(1/γ))      O(|S|²|A| · log(1/ε)/log(1/γ))

For γ = 0.99 and ε = 0.001 (with ||V_0 − V*||_∞ on the order of 10³, so the factor γ^k must fall below 10^-6), value iteration needs ~log(10^6)/log(1/0.99) ≈ 1375 sweeps to ensure error ≤ 0.001. But each sweep is simpler (no nested evaluation), so in practice both methods are comparable.
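The sweep count comes from solving γ^k ≤ target for k. A quick check (the 10^-6 target assumes ||V_0 − V*||_∞ ~ 10³ and ε = 10^-3, as in the example above):

```python
import math

gamma = 0.99
target = 1e-6                      # drive gamma^k below this factor
k = math.log(1 / target) / math.log(1 / gamma)
# k is on the order of 1.4e3 sweeps for gamma = 0.99
```

The 1/(1-γ) dependence hidden in log(1/γ) ≈ 1-γ is why high discount factors make value iteration slow.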


Worked Example: Gambler’s Problem (S&B Example 4.3)

Setup: - Gambler bets on coin flips (p_h = probability of heads = 0.4) - State: capital s ∈ {1, 2, …, 99} (amount of money) - Actions: stake a ∈ {1, …, min(s, 100−s)} (a winning bet can never take capital past 100) - Reward: +1 on reaching $100; 0 on every other transition - Terminal states: 0 (lose), 100 (win) - γ = 1

Value iteration on this problem:

V_0(s) = 0 for all s
V_{k+1}(s) = max_a [p_h · V_k(s + a) + (1-p_h) · V_k(s - a)]

Boundary conditions: - V(0) = 0 (lose) - V(100) = 1 (win)

Convergence (approximate values after many iterations):

V*(s) = s/100 if p_h = 0.5 (fair coin)
V*(s) is monotone but highly irregular (staircase-like) for p_h < 0.5; it is the optimal policy that is non-monotone

Optimal policy (for p_h = 0.4, unfair coin): bet big when you can! The optimal strategy bets maximally in certain states (reaching 50, 25, 75 as “safe” targets), creating a non-obvious staircase pattern.

s=1:   bet 1 (if heads → 2, if tails → 0)
s=25:  bet 25 (reach 50 or 0)
s=50:  bet 50 (reach 100 or 0)
s=75:  bet 25 (reach 100 or 50)
s=26:  bet 24 (reach 50 or 2)

This counter-intuitive optimal policy (big bets for unfair coin) shows why value iteration is powerful — it discovers policies that are not obvious from intuition.
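The update above runs in a few lines of Python (thresholds and the tie-breaking rule are my choices; the book's figure uses a similar argmax):

```python
p_h = 0.4                      # probability of heads
V = [0.0] * 101
V[100] = 1.0                   # boundary condition: reaching $100 is worth 1

# In-place value iteration sweeps; gamma = 1, the +1 reward is folded into V(100)
while True:
    delta = 0.0
    for s in range(1, 100):
        old = V[s]
        V[s] = max(p_h * V[s + a] + (1 - p_h) * V[s - a]
                   for a in range(1, min(s, 100 - s) + 1))
        delta = max(delta, abs(old - V[s]))
    if delta < 1e-10:
        break

# Greedy stakes; many actions tie at the optimum, and this argmax
# picks the smallest stake among (near-)ties, so the plotted policy
# depends on the tie-breaking rule.
policy = {s: max(range(1, min(s, 100 - s) + 1),
                 key=lambda a: p_h * V[s + a] + (1 - p_h) * V[s - a])
          for s in range(1, 100)}
```

The values at the "spike" states match the bold-play analysis: V*(50) = p_h = 0.4, V*(25) = p_h² = 0.16, V*(75) = p_h + (1−p_h)·p_h = 0.64.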


Value Iteration vs Policy Iteration: When to Use Which

Policy Iteration advantages: - Converges in very few policy iterations (3–10 for typical MDPs) - Each evaluation gives exact V^π, which can be reused - Better when |S| is small and linear system solve is fast

Value Iteration advantages: - Each sweep is simpler (no full evaluation) - Handles episodic tasks naturally (backward induction is value iteration from terminal) - More natural connection to sample-based methods (TD learning is sample-based value iteration)

Modified Policy Iteration (k evaluation sweeps before improvement): - Unifies both: k=1 → value iteration; k=∞ → policy iteration - k=10 often achieves policy iteration quality with near-value-iteration per-sweep cost
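Modified policy iteration interpolates between the two extremes. A sketch using the same `(prob, next_state, reward)` MDP representation as above (my own encoding, not from the book):

```python
def modified_policy_iteration(states, actions, p, gamma, k=10, theta=1e-9):
    """k partial-evaluation sweeps per improvement step.

    k=1 behaves like value iteration; k -> infinity recovers policy iteration.
    """
    V = {s: 0.0 for s in states}
    pi = {s: acts[0] for s, acts in actions.items() if acts}   # arbitrary start

    def backup(s, a):
        return sum(prob * (r + gamma * V[s2]) for prob, s2, r in p[(s, a)])

    while True:
        for _ in range(k):                       # partial evaluation of pi
            for s in pi:
                V[s] = backup(s, pi[s])
        stable = True
        for s in pi:                             # greedy improvement
            best = max(actions[s], key=lambda a: backup(s, a))
            if best != pi[s]:
                pi[s], stable = best, False
        if stable:
            # policy unchanged: keep sweeping until values settle too
            delta = 0.0
            for s in pi:
                v = V[s]
                V[s] = backup(s, pi[s])
                delta = max(delta, abs(v - V[s]))
            if delta < theta:
                return V, pi

# Same toy MDP as before: 'stay' earns +1 and loops, 'quit' ends.
states = ['s0', 'end']
actions = {'s0': ['stay', 'quit']}
p = {('s0', 'stay'): [(1.0, 's0', 1.0)],
     ('s0', 'quit'): [(1.0, 'end', 0.0)]}
V, pi = modified_policy_iteration(states, actions, p, gamma=0.9, k=10)
```

With k=10, each outer loop shrinks the evaluation error by γ^10, which is why modest k already behaves much like exact policy evaluation.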


Relationship to Q-Value Iteration

Instead of iterating V, iterate Q:

Q_{k+1}(s, a) = Σ_{s',r} p(s',r|s,a) [r + γ max_{a'} Q_k(s', a')]

This is identical to value iteration but expressed in terms of Q. Advantage: policy extraction is trivial (argmax_a Q*(s,a) — no model needed). This is the model-free version; Q-learning (Ch.6) is the sample-based approximation.


Backward Induction for Episodic MDPs

For episodic MDPs with known finite horizon T, value iteration becomes exact backward induction:

V_T(s) = 0                    ← terminal state
V_{T-1}(s) = max_a Σ p(s',r|s,a)[r + γ V_T(s')]
V_{T-2}(s) = max_a Σ p(s',r|s,a)[r + γ V_{T-1}(s')]
...
V_0 = V*

Terminates in exactly T iterations. Used in board games with known game lengths.
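A sketch of the finite-horizon recursion, with a toy one-state example of my own (two actions with rewards 1 and 2, horizon T = 3):

```python
def backward_induction(states, actions, p, gamma, T):
    """Finite-horizon value iteration: exactly T backups, no convergence test.

    p[(s, a)] lists (prob, next_state, reward) triples.
    Returns V_0 and a time-indexed policy: policy[t][s] = action at time t.
    """
    V = {s: 0.0 for s in states}          # V_T = 0 at the horizon
    policy = []
    for t in reversed(range(T)):
        def backup(s, a):
            return sum(prob * (r + gamma * V[s2]) for prob, s2, r in p[(s, a)])
        pi_t = {s: max(actions[s], key=lambda a: backup(s, a)) for s in states}
        # V_{t} built from V_{t+1}; the comprehension reads the old V throughout
        V = {s: backup(s, pi_t[s]) for s in states}
        policy.insert(0, pi_t)
    return V, policy

states = ['w']
actions = {'w': ['small', 'big']}
p = {('w', 'small'): [(1.0, 'w', 1.0)],
     ('w', 'big'):   [(1.0, 'w', 2.0)]}
V, policy = backward_induction(states, actions, p, gamma=1.0, T=3)
# V['w'] = 2 + 2 + 2 = 6: three 'big' bets, one per remaining step
```

Unlike the infinite-horizon case, the optimal policy here is time-dependent in general, which is why `policy` is a list of per-step maps.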


Connection to DynamICCL

Value iteration (T^k V_0 → V) is the theoretical ideal that DynamICCL’s Q-network is learning. Each PPO rollout + update is one noisy step toward Q*:

  1. Rollout: sample (s_t, a_t, r_t, s_{t+1}) from current policy
  2. Bellman target: y = r + γ max_{a’} Q(s’, a’; θ_old) — the value iteration update
  3. Gradient step: move Q(s, a; θ) toward y — stochastic gradient approximation of T*

The key insight: PPO/DQN trade exact Bellman backups (requiring p(s’|s,a)) for sample-based approximations. DynamICCL thus performs approximate value iteration through experience, without needing the NCCL dynamics model.
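The three steps above can be illustrated in tabular form as a Q-learning-style update (a stand-in for the neural-network version; the toy environment, step size, and uniform exploration are all my own assumptions, not DynamICCL specifics):

```python
import random

random.seed(1)
gamma, alpha = 0.9, 0.5
Q = {('s0', 'stay'): 0.0, ('s0', 'quit'): 0.0}

def step(s, a):
    # Hypothetical one-state environment: 'stay' earns +1 and loops, 'quit' ends.
    return ('s0', 1.0) if a == 'stay' else ('end', 0.0)

for _ in range(2000):
    a = random.choice(['stay', 'quit'])            # exploratory rollout (step 1)
    s2, r = step('s0', a)
    # Bellman target (step 2): y = r + gamma * max_a' Q(s', a'); unseen states value 0
    y = r + gamma * max(Q.get((s2, b), 0.0) for b in ['stay', 'quit'])
    # Stochastic approximation of T* (step 3): move Q(s, a) toward y
    Q[('s0', a)] += alpha * (y - Q[('s0', a)])
```

Each update is one sampled, step-size-damped Bellman backup; averaging over many samples replaces the expectation Σ p(s'|s,a) that exact value iteration computes from the model.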