Policy Improvement and Policy Iteration
Chapter 4 — Dynamic Programming
Book: Reinforcement Learning: An Introduction (Sutton & Barto, 2nd ed) Pages: 78–84
The Policy Improvement Theorem
Given a policy π and its value function V^π, consider a new policy π’ defined by:
π'(s) = argmax_a Q^π(s, a)
= argmax_a Σ_{s',r} p(s',r|s,a) [r + γ V^π(s')]
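A minimal sketch of this greedy construction for a tabular model, assuming the dynamics are stored as numpy arrays P (S × A × S transition probabilities) and R (S × A × S expected rewards); the array names are illustrative, not from the book.

```python
import numpy as np

def greedy_policy(P, R, V, gamma):
    """One-step lookahead: pi'(s) = argmax_a sum_{s'} P[s,a,s'] * (R[s,a,s'] + gamma * V[s']).

    P, R: (S, A, S) model arrays; V: (S,) value estimates for the current policy pi.
    """
    q = np.einsum("sax,sax->sa", P, R + gamma * V[None, None, :])  # Q^pi(s, a) for every (s, a)
    return q.argmax(axis=1)  # deterministic greedy policy: one action index per state
```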
Policy Improvement Theorem: for all s ∈ S:
V^{π'}(s) ≥ V^π(s)
Proof:
V^π(s) ≤ Q^π(s, π'(s)) ← π' is greedy (≥ current policy)
= Σ_{s',r} p(s',r|s,π'(s)) [r + γV^π(s')]
≤ Σ_{s',r} p(s',r|s,π'(s)) [r + γQ^π(s', π'(s'))] ← apply same argument at s'
= ...
≤ V^{π'}(s) ∎
Strict improvement: if Q^π(s, π'(s)) > V^π(s) at even one state s, then V^{π'}(s) > V^π(s) at that state, so π' strictly improves on π.
Termination condition: if π’ = π (no state has a strictly better action), then V^π = V* and π is already optimal.
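The theorem can also be checked numerically on a small random MDP by evaluating both policies exactly (solving the linear Bellman system). A minimal sketch, assuming the same P/R array layout as above; it only verifies V^{π'} ≥ V^π element-wise, it is not a substitute for the proof.

```python
import numpy as np

def evaluate_exact(P, R, pi, gamma):
    """Solve (I - gamma * P_pi) V = r_pi for a deterministic policy pi (array of action indices)."""
    S = P.shape[0]
    P_pi = P[np.arange(S), pi]                                   # (S, S) transitions under pi
    r_pi = np.einsum("sx,sx->s", P_pi, R[np.arange(S), pi])      # expected one-step reward under pi
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)

rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9
P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)     # random transition probabilities
R = rng.normal(size=(S, A, S))                                   # random rewards

pi = np.zeros(S, dtype=int)                                      # arbitrary starting policy
V = evaluate_exact(P, R, pi, gamma)
q = np.einsum("sax,sax->sa", P, R + gamma * V[None, None, :])    # one-step lookahead Q^pi
pi_new = q.argmax(axis=1)                                        # greedy improvement
V_new = evaluate_exact(P, R, pi_new, gamma)
assert np.all(V_new >= V - 1e-10)                                # V^{pi'}(s) >= V^pi(s) for all s
```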
Policy Iteration Algorithm
Idea: alternate between evaluation and improvement until convergence.
Algorithm: Policy Iteration
1. Initialize: π(s) = arbitrary; V(s) = 0 for all s
2. Policy Evaluation:
Repeat until Δ < θ:
Δ ← 0
For each s ∈ S:
v ← V(s)
V(s) ← Σ_{s',r} p(s',r|s,π(s)) [r + γV(s')]
Δ ← max(Δ, |v - V(s)|)
3. Policy Improvement:
policy_stable ← True
For each s ∈ S:
old_action ← π(s)
π(s) ← argmax_a Σ_{s',r} p(s',r|s,a) [r + γV(s')]
If old_action ≠ π(s): policy_stable ← False
4. If policy_stable: return V, π ← optimal!
Else: go to step 2
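A direct translation of the pseudocode above into runnable Python, assuming the same tabular P/R arrays as in the earlier sketches; names and the θ default are illustrative.

```python
import numpy as np

def policy_iteration(P, R, gamma, theta=1e-8):
    """Tabular policy iteration. P, R: (S, A, S) model arrays; returns (V, pi)."""
    S, A = P.shape[0], P.shape[1]
    V = np.zeros(S)
    pi = np.zeros(S, dtype=int)                      # step 1: arbitrary initial policy, V = 0
    while True:
        # step 2: policy evaluation (in-place sweeps until the largest change is below theta)
        while True:
            delta = 0.0
            for s in range(S):
                v = V[s]
                V[s] = np.dot(P[s, pi[s]], R[s, pi[s]] + gamma * V)
                delta = max(delta, abs(v - V[s]))
            if delta < theta:
                break
        # step 3: policy improvement (greedy one-step lookahead w.r.t. V)
        q = np.einsum("sax,sax->sa", P, R + gamma * V[None, None, :])
        new_pi = q.argmax(axis=1)
        # step 4: stop when the greedy policy no longer changes
        if np.array_equal(new_pi, pi):
            return V, pi
        pi = new_pi
```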
Convergence: each iteration improves the policy (V^{π_{k+1}}(s) ≥ V^{π_k}(s) for all s, with strict improvement at some state unless π_k is already optimal), and the set of deterministic policies is finite (|A|^|S| of them). Therefore the algorithm terminates in finitely many iterations.
In practice: converges in very few iterations (often 3–10), even for large state spaces.
Worked Example: Jack’s Car Rental (S&B Example 4.2)
Setup:
- Two rental locations; each holds at most 20 cars
- Each day: rental requests ~ Poisson(λ_1 = 3) at location 1 and Poisson(λ_2 = 4) at location 2
- Returns ~ Poisson(μ_1 = 3) and Poisson(μ_2 = 2)
- Revenue: $10 per rented car; cost: $2 per car moved overnight (at most 5 cars)
- State: (n_1, n_2) = cars at each location (21 × 21 = 441 states)
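A minimal sketch of the bookkeeping this setup implies (truncated Poisson probabilities, state and action spaces). Constants and helper names are illustrative, and building the full 441-state transition model is omitted.

```python
from math import exp, factorial

MAX_CARS, MAX_MOVE = 20, 5
RENT_REWARD, MOVE_COST, GAMMA = 10.0, 2.0, 0.9

def poisson_pmf(n, lam):
    """P(N = n) for N ~ Poisson(lam)."""
    return exp(-lam) * lam ** n / factorial(n)

# States: (n1, n2) cars at each location at the end of the day -> 21 * 21 = 441 states.
states = [(n1, n2) for n1 in range(MAX_CARS + 1) for n2 in range(MAX_CARS + 1)]

def legal_actions(n1, n2):
    """Net cars moved overnight from location 1 to 2 (negative = from 2 to 1),
    at most 5 and never more than the source location holds."""
    return range(-min(MAX_MOVE, n2), min(MAX_MOVE, n1) + 1)

def expected_rentals(n_cars, lam):
    """Expected cars rented at a location holding n_cars with Poisson(lam) requests
    (requests beyond the cars on hand are lost); the tail sum is truncated for brevity."""
    return sum(min(k, n_cars) * poisson_pmf(k, lam) for k in range(n_cars + 15))
```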
Policy iteration progression:
π_0: Move no cars (do nothing)
π_1: Move some cars from location 1 to location 2 (location 2 has higher demand)
π_2, π_3: Progressively refined redistribution
π_4: Optimal (the book's Figure 4.2 shows convergence after four improvement steps)
Optimal policy: move up to 5 cars from location 1 to 2 when location 1 has excess, up to 3 cars from 2 to 1 when location 2 has excess. The asymmetry reflects different demand/return rates.
Value function: smooth surface over (n_1, n_2) — more cars at high-demand location 2 → higher value.
Why Policy Iteration Converges Fast
Policy iteration is fast because:
1. Each improvement step is greedy and is guaranteed to strictly improve the policy unless it is already optimal.
2. The number of deterministic policies is |A|^|S|, but typically only a small fraction are visited before convergence.
3. Each evaluation computes V^π exactly (or nearly so); this is the rare case where full evaluation is worth the cost.
Compare to value iteration: value iteration truncates evaluation to a single sweep of max-backups, costing O(|S|²|A|) per sweep, but it usually needs far more sweeps than policy iteration needs improvement steps. For small |S|, policy iteration often wins on total computation time.
Modified Policy Iteration
Full policy evaluation to convergence before each improvement step is expensive (many sweeps).
Modified policy iteration: do only k sweeps of policy evaluation before each improvement step:
- k = 1: value iteration (see next file)
- k = ∞: classical policy iteration
- k = 10 or 100: typically about as good as k = ∞ but much cheaper
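A sketch of the truncated evaluation step, reusing the tabular P/R layout from the earlier sketches; swapping it in for the full evaluation loop in policy_iteration gives modified policy iteration with parameter k.

```python
import numpy as np

def truncated_evaluation(P, R, V, pi, gamma, k):
    """Run at most k in-place sweeps of the Bellman expectation backup for policy pi."""
    S = P.shape[0]
    for _ in range(k):
        for s in range(S):
            V[s] = np.dot(P[s, pi[s]], R[s, pi[s]] + gamma * V)
    return V
```

With k = 1 and a greedy improvement after every sweep this behaves like value iteration; with large k it approaches classical policy iteration.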
Generalized Policy Iteration (GPI):
- Evaluation: any number of Bellman backups (1 to ∞)
- Improvement: greedy or ε-greedy w.r.t. current V
- Under mild conditions (every state keeps being updated), such combinations converge to V* and π*
This flexibility is what enables the wide variety of RL algorithms — they all implement GPI with different tradeoffs between evaluation depth and improvement frequency.
Greedy Improvement as One-Step Lookahead
Policy improvement = one-step lookahead greedy action selection:
π'(s) = argmax_a [r(s,a) + γ Σ_{s'} p(s'|s,a) V^π(s')]
The agent acts as if it will follow π forever from s', but takes the best immediate action. By the policy improvement theorem, the new policy π' is at least as good as π in every state.
For stochastic policies: the improvement theorem also holds, within the class of ε-soft policies, for a "soft" greedy policy:
π'(a|s) = (1-ε)·1[a = argmax_b Q^π(s,b)] + ε/|A|
This is ε-greedy improvement — used in on-policy approximate control (SARSA, Ch.6).
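A sketch of the ε-greedy construction above for a tabular action-value function, assuming Q is stored as an (S, A) numpy array.

```python
import numpy as np

def epsilon_greedy(Q, epsilon):
    """pi'(a|s) = (1 - eps) * 1[a = argmax_b Q(s, b)] + eps / |A|."""
    S, A = Q.shape
    pi = np.full((S, A), epsilon / A)                     # every action gets eps / |A| probability
    pi[np.arange(S), Q.argmax(axis=1)] += 1.0 - epsilon   # greedy action gets the remaining mass
    return pi                                             # each row sums to 1
```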
Value Function Geometry
Policy iteration can be visualized geometrically:
Policy space: finite set of points
Value space: ℝ^|S|
Evaluation: given π → move to V^π (unique point in value space)
Improvement: given V^π → move to π' = greedy(V^π) in policy space
GPI path: π_0 → V^{π_0} → π_1 → V^{π_1} → π_2 → ... → π* → V*
The two "forces" (evaluation pulls V toward V^π; improvement pulls π toward greedy(V)) drive the process to the unique joint fixed point (V*, π*).
Connection to DynamICCL
Policy iteration is the conceptual prototype for DynamICCL's PPO updates:
1. Evaluation phase: collect rollouts with the current policy π; compute advantages A(s,a) ≈ Q^π(s,a) − V^π(s) using GAE.
2. Improvement phase: take gradient steps on the PPO objective to increase the probability of high-advantage actions (greedy improvement in function space).
The key difference: PPO doesn't wait for full evaluation convergence; it takes a few gradient steps per minibatch (modified policy iteration with small k). This is exactly GPI with truncated evaluation, and it works well in practice.
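For reference, a minimal sketch of the GAE computation mentioned in the evaluation phase; this is not DynamICCL's actual code, the array names are illustrative, and episode boundaries are ignored for brevity.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """GAE: A_t = sum_l (gamma*lam)^l * delta_{t+l}, with delta_t = r_t + gamma*V(s_{t+1}) - V(s_t).

    rewards: (T,) rewards from one rollout; values: (T+1,) value estimates,
    the last entry being the bootstrap value for the state after the rollout.
    """
    T = len(rewards)
    adv = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        adv[t] = gae
    return adv                                   # adv + values[:-1] gives return targets for V
```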