Policy Evaluation (Prediction)
Chapter 4 — Dynamic Programming
Book: Reinforcement Learning: An Introduction (Sutton & Barto, 2nd ed) Pages: 73–78
Overview of Dynamic Programming
Dynamic Programming (DP) refers to a collection of algorithms that compute optimal policies for MDPs, assuming a perfect model of the environment (i.e., p(s', r | s, a) is known).
Why study DP?
1. Theoretical foundation: DP gives exact solutions; most other methods can be viewed as approximations of DP
2. Conceptual clarity: DP makes generalized policy iteration (GPI) explicit
3. Small-scale tool: can solve small MDPs exactly
4. Basis for approximate methods: temporal-difference (TD) methods replace exact backups with sample-based approximations
Limitations of DP:
- Requires complete knowledge of the MDP dynamics p
- Sweeps over all states: O(|S|² · |A|) per sweep
- Not scalable to large state spaces without modifications (asynchronous DP, sampling)
The Policy Evaluation Problem
Problem: given a policy π and MDP (S, A, P, R, γ), compute V^π: S → ℝ.
Bellman equation for V^π (derived in Ch.3):
V^π(s) = Σ_a π(a|s) Σ_{s',r} p(s',r|s,a) [r + γ V^π(s')]
This is a linear system in |S| unknowns. Can solve directly: O(|S|³). But iterative approach scales better and generalizes to approximate settings.
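A minimal sketch of the direct solution in matrix form, assuming the policy has already been folded into the dynamics as P_π (the |S|×|S| state-transition matrix under π) and r_π (the expected one-step reward vector); the function and argument names below are illustrative, not from the book. For γ = 1 (episodic tasks), restrict the system to the non-terminal states so the matrix is invertible.

```python
import numpy as np

def solve_v_pi_directly(P_pi: np.ndarray, r_pi: np.ndarray, gamma: float) -> np.ndarray:
    """Direct O(|S|^3) solution of the Bellman equation V = r_pi + gamma * P_pi @ V.

    Assumed precomputed from the model and the policy:
      P_pi[s, s'] = sum_a pi(a|s) * p(s'|s, a)
      r_pi[s]     = sum_a pi(a|s) * sum_{s',r} p(s',r|s,a) * r
    """
    n = P_pi.shape[0]
    # Rearranged Bellman equation: (I - gamma * P_pi) V = r_pi
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
```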
Iterative Policy Evaluation
Idea: define the Bellman operator T^π:
(T^π V)(s) = Σ_a π(a|s) Σ_{s',r} p(s',r|s,a) [r + γ V(s')]
Start with an arbitrary V_0 (typically V_0(s) = 0 for all s). Iterate:
V_{k+1} = T^π V_k
Convergence theorem: T^π is a γ-contraction in sup-norm:
||T^π V - T^π U||_∞ ≤ γ ||V - U||_∞
By Banach’s fixed-point theorem, V_k → V^π as k → ∞.
Convergence rate: geometric convergence with factor γ:
||V_k - V^π||_∞ ≤ γ^k ||V_0 - V^π||_∞
To achieve ε accuracy: need k ≥ log(ε/||V_0 - V^π||_∞) / log(γ) iterations.
For γ = 0.99, ε = 0.001, ||V_0 - V^π||_∞ ≈ 10: k ≥ log(0.0001)/log(0.99) ≈ 917 iterations.
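A throwaway snippet to check the arithmetic (numbers taken from the example above):

```python
import math

gamma, eps, initial_error = 0.99, 0.001, 10.0
# Smallest k with gamma**k * initial_error <= eps, from the geometric bound above.
k = math.ceil(math.log(eps / initial_error) / math.log(gamma))
print(k)  # 917
```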
Algorithm: Iterative Policy Evaluation
Input: MDP (S, A, P, R, γ), policy π, threshold θ > 0
Initialize: V(s) = 0 for all s ∈ S (in particular, V(terminal) = 0)
Loop:
    Δ ← 0
    For each s ∈ S:
        v ← V(s)
        V(s) ← Σ_a π(a|s) Σ_{s',r} p(s',r|s,a) [r + γ V(s')]   (Bellman backup)
        Δ ← max(Δ, |v - V(s)|)
until Δ < θ
Output: V ≈ V^π
Note on implementation: the algorithm above uses in-place updates; V(s) is overwritten immediately, and the updated value is used for the remaining states of the same sweep. This still converges to V^π and usually does so in fewer sweeps in practice.
Alternatively: maintain two arrays V_old and V_new (synchronous sweeps). Slightly slower but can be parallelized.
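A minimal in-place implementation sketch (not the book's code). It assumes the dynamics are given as a dictionary mapping (state, action) to a list of (probability, next_state, reward) triples and the policy as a dictionary mapping each state to its action probabilities; all names are illustrative.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Tuple

State = Hashable
Action = Hashable
# dynamics[(s, a)] = [(probability, next_state, reward), ...]
Dynamics = Dict[Tuple[State, Action], List[Tuple[float, State, float]]]
# policy[s] = {action: probability}
Policy = Dict[State, Dict[Action, float]]

def iterative_policy_evaluation(nonterminal_states: List[State],
                                dynamics: Dynamics,
                                policy: Policy,
                                gamma: float,
                                theta: float = 1e-3) -> Dict[State, float]:
    """In-place iterative policy evaluation (illustrative sketch)."""
    V = defaultdict(float)                # V(s) = 0 for all s, incl. terminal states
    while True:
        delta = 0.0
        for s in nonterminal_states:      # one sweep; terminal states stay at 0
            v = V[s]
            # Bellman backup: expected one-step reward plus discounted value of s'
            V[s] = sum(pi_a * sum(p * (r + gamma * V[s2])
                                  for p, s2, r in dynamics[(s, a)])
                       for a, pi_a in policy[s].items())
            delta = max(delta, abs(v - V[s]))
        if delta < theta:                 # stop when the largest update is below theta
            return dict(V)
```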
Worked Example: 4×4 Gridworld Policy Evaluation
Setup: 14 non-terminal states (numbered 1–14), 2 terminal states (the shaded corners), uniform random policy (each of the four directions with probability 0.25), reward -1 on every transition, γ = 1. Actions that would move the agent off the grid leave the state unchanged.
Initial: V_0(s) = 0 for all s.
First iteration (k=1): For state 1 (adjacent to terminal 0):
V_1(1) = 0.25[(-1 + V_0(0)) + (-1 + V_0(5)) + (-1 + V_0(2)) + (-1 + V_0(1))]
= 0.25[(-1+0) + (-1+0) + (-1+0) + (-1+0)]
= -1.0
For state 5 (not adjacent to terminal):
V_1(5) = 0.25[(-1 + V_0(1)) + (-1 + V_0(9)) + (-1 + V_0(6)) + (-1 + V_0(4))]
= -1.0
All non-terminal states: V_1(s) = -1 for all s (all receive one step of -1 reward).
Second iteration (k=2): For state 1:
V_2(1) = 0.25[(-1 + V_1(0)) + (-1 + V_1(5)) + (-1 + V_1(2)) + (-1 + V_1(1))]
= 0.25[(-1+0) + (-1-1) + (-1-1) + (-1-1)]
= 0.25[-1 - 2 - 2 - 2] = -1.75
As k → ∞, V_k converges to V^π:
V^π ≈ [0, -14, -20, -22,
-14, -18, -20, -20,
-20, -20, -18, -14,
-22, -20, -14, 0]
In this example, the policy that is greedy with respect to V^π is already optimal: it takes shortest paths to the terminal corners.
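A self-contained sketch that reproduces the gridworld numbers above, under the example's conventions: states 0..15 numbered row-major, 0 and 15 terminal, and moves that would leave the grid keep the agent in place. Everything here is illustrative code, not the book's.

```python
import numpy as np

# 4x4 gridworld: states 0..15 row-major, 0 and 15 terminal,
# reward -1 on every transition, gamma = 1, equiprobable random policy.
N = 4
TERMINAL = {0, 15}
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(s: int, move) -> int:
    """Deterministic next state; off-grid moves leave the state unchanged."""
    row, col = divmod(s, N)
    r2, c2 = row + move[0], col + move[1]
    if not (0 <= r2 < N and 0 <= c2 < N):
        return s
    return r2 * N + c2

V = np.zeros(N * N)
theta = 1e-4
while True:
    delta = 0.0
    for s in range(N * N):
        if s in TERMINAL:
            continue                       # terminal values stay at 0
        v = V[s]
        # Bellman backup under the uniform random policy (updates V in place)
        V[s] = sum(0.25 * (-1 + V[step(s, m)]) for m in MOVES)
        delta = max(delta, abs(v - V[s]))
    if delta < theta:
        break

print(np.round(V).reshape(N, N))
# Expected (up to rounding):
# [[  0. -14. -20. -22.]
#  [-14. -18. -20. -20.]
#  [-20. -20. -18. -14.]
#  [-22. -20. -14.   0.]]
```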
Convergence and Stopping Criteria
Exact convergence: V_k → V^π only asymptotically. In practice, stop when:
max_s |V_{k+1}(s) - V_k(s)| < θ (e.g., θ = 0.001)
For policy iteration: we don’t need V^π exactly — just need to accurately identify which actions are better. Can stop early (θ larger) and still get correct policy improvement.
Efficient implementation for sparse MDPs: only sweep over states with non-trivial transitions. Skip absorbing states, unreachable states.
Two-Array vs In-Place Updates
| In-place (single array) | Two-array (synchronous) |
|---|---|
| V(s) updated immediately | V_new(s) uses V_old throughout |
| Neighbors “see” updated values | All neighbors see same-iteration values |
| Converges faster in practice | Easier to analyze theoretically |
| Default in S&B algorithms | Needed for parallel implementations |
For large problems: in-place is standard.
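For contrast, a sketch of the two-array (synchronous) variant in matrix form, reusing the same assumed P_π / r_π representation as in the direct-solve sketch above; names are illustrative.

```python
import numpy as np

def synchronous_sweep(V_old: np.ndarray, P_pi: np.ndarray, r_pi: np.ndarray,
                      gamma: float) -> np.ndarray:
    """One synchronous sweep: every state is backed up from the frozen V_old."""
    # V_new = r_pi + gamma * P_pi @ V_old; each row of P_pi can be processed
    # independently, which is what makes this variant easy to parallelize.
    return r_pi + gamma * P_pi @ V_old

def evaluate_policy_synchronously(P_pi: np.ndarray, r_pi: np.ndarray,
                                  gamma: float, theta: float = 1e-3) -> np.ndarray:
    V = np.zeros_like(r_pi, dtype=float)
    while True:
        V_new = synchronous_sweep(V, P_pi, r_pi, gamma)
        if np.max(np.abs(V_new - V)) < theta:   # same stopping rule as the in-place version
            return V_new
        V = V_new
```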
Connection to DynamICCL
Policy evaluation is not directly applied in DynamICCL (which uses model-free PPO), but the conceptual foundation is identical: given a fixed policy (e.g., “always use ring algorithm with chunk=1MB”), estimate V^π by iterating the Bellman backup. In DynamICCL, this evaluation step is replaced by a neural network value function V(s; θ) trained on rollout data — the sample-based approximation of exact DP evaluation.