Policy Improvement and Policy Iteration

Chapter 4 — Dynamic Programming

Book: Reinforcement Learning: An Introduction (Sutton & Barto, 2nd ed) Pages: 78–84


The Policy Improvement Theorem

Given a policy π and its value function V^π, consider a new policy π' defined by:

π'(s) = argmax_a Q^π(s, a)
       = argmax_a Σ_{s',r} p(s',r|s,a) [r + γ V^π(s')]

Policy Improvement Theorem: for all s ∈ S:

V^{π'}(s) ≥ V^π(s)

Proof:

V^π(s) ≤ Q^π(s, π'(s))                              ← π'(s) maximizes Q^π(s,·), so Q^π(s,π'(s)) ≥ Q^π(s,π(s)) = V^π(s)
        = Σ_{s',r} p(s',r|s,π'(s)) [r + γV^π(s')]
        ≤ Σ_{s',r} p(s',r|s,π'(s)) [r + γQ^π(s', π'(s'))]   ← apply same argument at s'
        = ...
        ≤ V^{π'}(s)                                   ∎

Strict improvement: if Q^π(s, π'(s)) > V^π(s) at even one state s, then V^{π'}(s) > V^π(s) at that state, so π' is strictly better than π.

Termination condition: if π' = π (no state has a strictly better action), then V^π satisfies the Bellman optimality equation, so V^π = V* and π is already optimal.
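The theorem can be checked numerically. A minimal sketch on a hypothetical two-state, two-action MDP (the P[a, s, s'] / R[a, s] arrays below are illustrative, not from the book): evaluate π exactly via a linear solve, improve greedily, and confirm V^{π'} ≥ V^π elementwise.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (illustration only, not from the book).
# P[a, s, s'] = transition probability; R[a, s] = expected immediate reward.
gamma = 0.9
P = np.array([[[0.8, 0.2], [0.2, 0.8]],   # action 0
              [[0.5, 0.5], [0.9, 0.1]]])  # action 1
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])

def evaluate(pi):
    """Solve V^pi = R_pi + gamma * P_pi V^pi exactly as a linear system."""
    P_pi = P[pi, np.arange(2)]            # row s follows action pi[s]
    R_pi = R[pi, np.arange(2)]
    return np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)

pi = np.array([0, 0])                     # start with "always action 0"
V = evaluate(pi)
Q = R + gamma * P @ V                     # Q[a, s] = one-step lookahead
pi_new = Q.argmax(axis=0)                 # greedy improvement
V_new = evaluate(pi_new)
assert np.all(V_new >= V - 1e-12)         # policy improvement theorem holds
```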


Policy Iteration Algorithm

Idea: alternate between evaluation and improvement until convergence.

Algorithm: Policy Iteration

1. Initialize: π(s) = arbitrary; V(s) = 0 for all s

2. Policy Evaluation:
   Repeat until Δ < θ:
     Δ ← 0
     For each s ∈ S:
       v ← V(s)
       V(s) ← Σ_{s',r} p(s',r|s,π(s)) [r + γV(s')]
       Δ ← max(Δ, |v - V(s)|)

3. Policy Improvement:
   policy_stable ← True
   For each s ∈ S:
     old_action ← π(s)
     π(s) ← argmax_a Σ_{s',r} p(s',r|s,a) [r + γV(s')]
     If old_action ≠ π(s): policy_stable ← False

4. If policy_stable: return V, π   ← optimal!
   Else: go to step 2
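The pseudocode above translates directly to NumPy. A sketch assuming a matrix MDP layout P[a, s, s'] (transition probabilities) and R[a, s] (expected rewards), which folds the book's p(s', r | s, a) into expected-reward form; the layout and function name are my own:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, theta=1e-8):
    """Tabular policy iteration. P[a, s, s'] = transition prob,
    R[a, s] = expected reward (illustrative layout; the book indexes
    by p(s', r | s, a) instead)."""
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    pi = np.zeros(n_states, dtype=int)
    while True:
        # Step 2 — policy evaluation: in-place Bellman expectation sweeps
        while True:
            delta = 0.0
            for s in range(n_states):
                v = V[s]
                a = pi[s]
                V[s] = R[a, s] + gamma * P[a, s] @ V
                delta = max(delta, abs(v - V[s]))
            if delta < theta:
                break
        # Step 3 — policy improvement: greedy one-step lookahead
        Q = R + gamma * P @ V             # Q[a, s]
        pi_new = Q.argmax(axis=0)
        # Step 4 — stop when the policy is stable
        if np.array_equal(pi_new, pi):
            return V, pi
        pi = pi_new
```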

Convergence: the sequence of policies π_0, π_1, π_2, … is monotonically improving (V^{π_{k+1}} ≥ V^{π_k}, with strict improvement unless π_k is already optimal) and the set of deterministic policies is finite (|A|^|S| of them). Therefore the algorithm terminates in finitely many steps.

In practice: converges in very few iterations (often 3–10), even for large state spaces.


Worked Example: Jack’s Car Rental (S&B Example 4.2)

Setup:
  - Two rental locations; each can hold at most 20 cars
  - Each day: rental requests ~ Poisson(λ_1 = 3) at location 1, Poisson(λ_2 = 4) at location 2
  - Returns ~ Poisson(μ_1 = 3), Poisson(μ_2 = 2)
  - Revenue: $10 per rented car; cost: $2 per car moved overnight (at most 5 cars per night)
  - State: (n_1, n_2) = cars at each location (21 × 21 = 441 states)

Policy iteration progression:

π_0: Move 0 cars (do nothing)
π_1: Move some cars east (location 2 has higher demand)
π_2: More aggressive redistribution
π_3: Optimal (converged after ~4 iterations)

Optimal policy: move up to 5 cars from location 1 to 2 when location 1 has excess, up to 3 cars from 2 to 1 when location 2 has excess. The asymmetry reflects different demand/return rates.

Value function: smooth surface over (n_1, n_2) — more cars at high-demand location 2 → higher value.


Why Policy Iteration Converges Fast

Policy iteration is fast because:
  1. Each policy improvement step is greedy — guaranteed to strictly improve unless the policy is already optimal
  2. The number of deterministic policies is |A|^|S| — but typically only a small fraction are visited before convergence
  3. Each evaluation computes V^π exactly (or approximately) — a rare case where full evaluation is worth the cost

Compare to value iteration: value iteration truncates evaluation to a single sweep of Bellman optimality backups, costing O(|S|²|A|) per sweep, but it typically needs many more iterations than policy iteration. For small |S|, policy iteration often wins on total computation time.


Modified Policy Iteration

Full policy evaluation to convergence before each improvement step is expensive (many sweeps).

Modified policy iteration: do only k sweeps of policy evaluation before each improvement:
  - k = 1: value iteration (see next file)
  - k = ∞: classical policy iteration
  - k = 10 or 100: typically as good as k = ∞ but faster
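The truncated-evaluation loop can be sketched as follows, assuming a hypothetical matrix MDP layout P[a, s, s'] (transition probabilities) and R[a, s] (expected rewards); the names are my own, not the book's:

```python
import numpy as np

def modified_policy_iteration(P, R, gamma=0.9, k=10, n_iters=200):
    """Modified policy iteration: k evaluation sweeps per greedy improvement.
    k = 1 behaves like value iteration; large k approaches classical
    policy iteration. Assumes P[a, s, s'] and R[a, s] (illustrative layout)."""
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    for _ in range(n_iters):
        pi = (R + gamma * P @ V).argmax(axis=0)   # greedy improvement
        P_pi = P[pi, np.arange(n_states)]         # transitions under pi
        R_pi = R[pi, np.arange(n_states)]
        for _ in range(k):                        # truncated evaluation
            V = R_pi + gamma * P_pi @ V
    return V, (R + gamma * P @ V).argmax(axis=0)
```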

Generalized Policy Iteration (GPI):
  - Evaluation: any number of Bellman backups (1 to ∞)
  - Improvement: greedy or ε-greedy w.r.t. current V
  - Any combination converges to V* and π*

This flexibility is what enables the wide variety of RL algorithms — they all implement GPI with different tradeoffs between evaluation depth and improvement frequency.


Greedy Improvement as One-Step Lookahead

Policy improvement = one-step lookahead greedy action selection:

π'(s) = argmax_a [r(s,a) + γ Σ_{s'} p(s'|s,a) V^π(s')]

The agent acts as if it will follow π forever from s' but takes the best immediate action. By the policy improvement theorem, the new policy π' is at least as good in every state, and strictly better in some state unless π was already optimal.

For stochastic policies: the improvement theorem also holds for a “soft” greedy:

π'(a|s) = (1-ε)·1[a = argmax_b Q^π(s,b)] + ε/|A|

This is ε-greedy improvement — used in on-policy approximate control (SARSA, Ch.6).
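The ε-greedy construction above is straightforward to implement. A minimal sketch, assuming a tabular Q[s, a] array (function name is my own):

```python
import numpy as np

def epsilon_greedy_policy(Q, epsilon=0.1):
    """Build a stochastic policy pi[s, a] from action values Q[s, a]:
    probability (1 - eps) + eps/|A| on the greedy action, eps/|A| elsewhere."""
    n_states, n_actions = Q.shape
    pi = np.full((n_states, n_actions), epsilon / n_actions)
    pi[np.arange(n_states), Q.argmax(axis=1)] += 1.0 - epsilon
    return pi
```

Each row sums to (1 - ε) + |A|·(ε/|A|) = 1, so the result is a valid distribution over actions for every state.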


Value Function Geometry

Policy iteration can be visualized geometrically:

Policy space: finite set of points
Value space: ℝ^|S|

Evaluation:  given π → move to V^π (unique point in value space)
Improvement: given V^π → move to π' = greedy(V^π) in policy space

GPI path: π_0 → V^{π_0} → π_1 → V^{π_1} → π_2 → ... → π* → V*

The two “forces” (evaluation pulls V toward V^π; improvement pulls π toward greedy(V)) converge to the unique joint fixed point (V*, π*).


Connection to DynamICCL

Policy iteration is the conceptual prototype for DynamICCL's PPO updates:
  1. Evaluation phase: collect rollouts with the current policy π; compute advantages A(s,a) ≈ Q^π(s,a) - V^π(s) using GAE
  2. Improvement phase: take gradient steps on the PPO objective to increase the probability of high-advantage actions (greedy improvement in function space)

The key difference: PPO doesn't wait for evaluation to converge — it takes a few gradient steps on each minibatch (modified policy iteration with small k). This is exactly GPI with truncated evaluation, proven effective in practice.
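The GAE advantage computation in the evaluation phase can be sketched as a backward recursion over a single rollout (a standard formulation; the function name and array layout here are my own):

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout.
    rewards has length T; values has length T+1 (bootstrap value appended).
    lam=0 gives one-step TD errors; lam=1 gives Monte Carlo returns minus V."""
    T = len(rewards)
    adv = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        gae = delta + gamma * lam * gae                         # discounted sum of deltas
        adv[t] = gae
    return adv
```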