Policy Evaluation (Prediction)
Chapter 4 — Dynamic Programming
Book: Reinforcement Learning: An Introduction (Sutton & Barto, 2nd ed) Pages: 73–78
Overview of Dynamic Programming
Dynamic Programming (DP) refers to a collection of algorithms that compute optimal policies for MDPs, assuming a perfect model of the environment (i.e., p(s', r | s, a) is known).
Why study DP?
1. Theoretical foundation: DP gives exact solutions; most other methods can be viewed as approximations of DP
2. Conceptual clarity: DP makes generalized policy iteration (GPI) explicit
3. Small-scale tool: can solve small MDPs exactly
4. Basis for approximate methods: temporal-difference (TD) methods replace exact backups with sample-based approximations
Limitations of DP:
- Requires complete knowledge of the MDP dynamics p
- Sweeps over all states: O(|S|² · |A|) per sweep
- Not scalable to large state spaces without modifications (asynchronous DP, sampling)
The Policy Evaluation Problem
Problem: given a policy π and MDP (S, A, P, R, γ), compute V^π: S → ℝ.
Bellman equation for V^π (derived in Ch.3):
V^π(s) = Σ_a π(a|s) Σ_{s',r} p(s',r|s,a) [r + γ V^π(s')]
This is a linear system in |S| unknowns. Can solve directly: O(|S|³). But iterative approach scales better and generalizes to approximate settings.
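A minimal sketch of the direct solution in matrix form, assuming the policy has already been folded into the dynamics as P_π (the |S|×|S| state-transition matrix under π) and r_π (the expected one-step reward vector); the function and argument names below are illustrative, not from the book. For γ = 1 (episodic tasks), restrict the system to the non-terminal states so the matrix is invertible.

```python
import numpy as np

def solve_v_pi_directly(P_pi: np.ndarray, r_pi: np.ndarray, gamma: float) -> np.ndarray:
    """Direct O(|S|^3) solution of the Bellman equation V = r_pi + gamma * P_pi @ V.

    Assumed precomputed from the model and the policy:
      P_pi[s, s'] = sum_a pi(a|s) * p(s'|s, a)
      r_pi[s]     = sum_a pi(a|s) * sum_{s',r} p(s',r|s,a) * r
    """
    n = P_pi.shape[0]
    # Rearranged Bellman equation: (I - gamma * P_pi) V = r_pi
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
```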
Iterative Policy Evaluation
Idea: define the Bellman operator T^π:
(T^π V)(s) = Σ_a π(a|s) Σ_{s',r} p(s',r|s,a) [r + γ V(s')]
Start with an arbitrary V_0 (typically V_0(s) = 0 for all s). Iterate:
V_{k+1} = T^π V_k
Convergence theorem: T^π is a γ-contraction in sup-norm:
||T^π V - T^π U||_∞ ≤ γ ||V - U||_∞
By Banach’s fixed-point theorem, V_k → V^π as k → ∞.
Convergence rate: geometric convergence with factor γ:
||V_k - V^π||_∞ ≤ γ^k ||V_0 - V^π||_∞
To achieve ε accuracy: need k ≥ log(ε/||V_0 - V^π||_∞) / log(γ) iterations.
For γ = 0.99, ε = 0.001, ||V_0 - V^π||_∞ ≈ 10: k ≥ log(0.0001)/log(0.99) ≈ 917 iterations.
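A throwaway snippet to check the arithmetic (numbers taken from the example above):

```python
import math

gamma, eps, initial_error = 0.99, 0.001, 10.0
# Smallest k with gamma**k * initial_error <= eps, from the geometric bound above.
k = math.ceil(math.log(eps / initial_error) / math.log(gamma))
print(k)  # 917
```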
Algorithm: Iterative Policy Evaluation
Input: MDP (S, A, P, R, γ), policy π, threshold θ > 0
Initialize: V(s) = 0 for all s ∈ S (in particular, V(terminal) = 0)
Loop:
    Δ ← 0
    For each s ∈ S:
        v ← V(s)
        V(s) ← Σ_a π(a|s) Σ_{s',r} p(s',r|s,a) [r + γ V(s')]   (Bellman backup)
        Δ ← max(Δ, |v - V(s)|)
until Δ < θ
Output: V ≈ V^π
Note on implementation: the algorithm above uses in-place updates; V(s) is overwritten immediately, and the updated value is used for the remaining states of the same sweep. This still converges to V^π and usually does so in fewer sweeps in practice.
Alternatively: maintain two arrays V_old and V_new (synchronous sweeps). Slightly slower but can be parallelized.
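A minimal in-place implementation sketch (not the book's code). It assumes the dynamics are given as a dictionary mapping (state, action) to a list of (probability, next_state, reward) triples and the policy as a dictionary mapping each state to its action probabilities; all names are illustrative.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Tuple

State = Hashable
Action = Hashable
# dynamics[(s, a)] = [(probability, next_state, reward), ...]
Dynamics = Dict[Tuple[State, Action], List[Tuple[float, State, float]]]
# policy[s] = {action: probability}
Policy = Dict[State, Dict[Action, float]]

def iterative_policy_evaluation(nonterminal_states: List[State],
                                dynamics: Dynamics,
                                policy: Policy,
                                gamma: float,
                                theta: float = 1e-3) -> Dict[State, float]:
    """In-place iterative policy evaluation (illustrative sketch)."""
    V = defaultdict(float)                # V(s) = 0 for all s, incl. terminal states
    while True:
        delta = 0.0
        for s in nonterminal_states:      # one sweep; terminal states stay at 0
            v = V[s]
            # Bellman backup: expected one-step reward plus discounted value of s'
            V[s] = sum(pi_a * sum(p * (r + gamma * V[s2])
                                  for p, s2, r in dynamics[(s, a)])
                       for a, pi_a in policy[s].items())
            delta = max(delta, abs(v - V[s]))
        if delta < theta:                 # stop when the largest update is below theta
            return dict(V)
```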
Worked Example: 4×4 Gridworld Policy Evaluation
Setup: 14 non-terminal states (numbered 1–14), 2 terminal states (the shaded corners), uniform random policy (each of the four directions with probability 0.25), reward -1 on every transition, γ = 1. Actions that would move the agent off the grid leave the state unchanged.
Initial: V_0(s) = 0 for all s.
First iteration (k=1): For state 1 (adjacent to terminal 0):
V_1(1) = 0.25[(-1 + V_0(0)) + (-1 + V_0(5)) + (-1 + V_0(2)) + (-1 + V_0(1))]
= 0.25[(-1+0) + (-1+0) + (-1+0) + (-1+0)]
= -1.0
For state 5 (not adjacent to terminal):
V_1(5) = 0.25[(-1 + V_0(1)) + (-1 + V_0(9)) + (-1 + V_0(6)) + (-1 + V_0(4))]
= -1.0
All non-terminal states: V_1(s) = -1 for all s (all receive one step of -1 reward).
Second iteration (k=2): For state 1:
V_2(1) = 0.25[(-1 + V_1(0)) + (-1 + V_1(5)) + (-1 + V_1(2)) + (-1 + V_1(1))]
= 0.25[(-1+0) + (-1-1) + (-1-1) + (-1-1)]
= 0.25[-1 - 2 - 2 - 2] = -1.75
As k → ∞, V_k converges to V^π:
V^π ≈ [0, -14, -20, -22,
-14, -18, -20, -20,
-20, -20, -18, -14,
-22, -20, -14, 0]
In this example, the policy that is greedy with respect to V^π is already optimal: it takes shortest paths to the terminal corners.
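A self-contained sketch that reproduces the gridworld numbers above, under the example's conventions: states 0..15 numbered row-major, 0 and 15 terminal, and moves that would leave the grid keep the agent in place. Everything here is illustrative code, not the book's.

```python
import numpy as np

# 4x4 gridworld: states 0..15 row-major, 0 and 15 terminal,
# reward -1 on every transition, gamma = 1, equiprobable random policy.
N = 4
TERMINAL = {0, 15}
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(s: int, move) -> int:
    """Deterministic next state; off-grid moves leave the state unchanged."""
    row, col = divmod(s, N)
    r2, c2 = row + move[0], col + move[1]
    if not (0 <= r2 < N and 0 <= c2 < N):
        return s
    return r2 * N + c2

V = np.zeros(N * N)
theta = 1e-4
while True:
    delta = 0.0
    for s in range(N * N):
        if s in TERMINAL:
            continue                       # terminal values stay at 0
        v = V[s]
        # Bellman backup under the uniform random policy (updates V in place)
        V[s] = sum(0.25 * (-1 + V[step(s, m)]) for m in MOVES)
        delta = max(delta, abs(v - V[s]))
    if delta < theta:
        break

print(np.round(V).reshape(N, N))
# Expected (up to rounding):
# [[  0. -14. -20. -22.]
#  [-14. -18. -20. -20.]
#  [-20. -20. -18. -14.]
#  [-22. -20. -14.   0.]]
```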
Convergence and Stopping Criteria
Exact convergence: V_k → V^π only asymptotically. In practice, stop when:
max_s |V_{k+1}(s) - V_k(s)| < θ (e.g., θ = 0.001)
For policy iteration: we don’t need V^π exactly — just need to accurately identify which actions are better. Can stop early (θ larger) and still get correct policy improvement.
Efficient implementation for sparse MDPs: only sweep over states with non-trivial transitions. Skip absorbing states, unreachable states.
Two-Array vs In-Place Updates
| In-place (single array) | Two-array (synchronous) |
|---|---|
| V(s) updated immediately | V_new(s) uses V_old throughout |
| Neighbors “see” updated values | All neighbors see same-iteration values |
| Converges faster in practice | Easier to analyze theoretically |
| Default in S&B algorithms | Needed for parallel implementations |
For large problems: in-place is standard.
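For contrast, a sketch of the two-array (synchronous) variant in matrix form, reusing the same assumed P_π / r_π representation as in the direct-solve sketch above; names are illustrative.

```python
import numpy as np

def synchronous_sweep(V_old: np.ndarray, P_pi: np.ndarray, r_pi: np.ndarray,
                      gamma: float) -> np.ndarray:
    """One synchronous sweep: every state is backed up from the frozen V_old."""
    # V_new = r_pi + gamma * P_pi @ V_old; each row of P_pi can be processed
    # independently, which is what makes this variant easy to parallelize.
    return r_pi + gamma * P_pi @ V_old

def evaluate_policy_synchronously(P_pi: np.ndarray, r_pi: np.ndarray,
                                  gamma: float, theta: float = 1e-3) -> np.ndarray:
    V = np.zeros_like(r_pi, dtype=float)
    while True:
        V_new = synchronous_sweep(V, P_pi, r_pi, gamma)
        if np.max(np.abs(V_new - V)) < theta:   # same stopping rule as the in-place version
            return V_new
        V = V_new
```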
Connection to DynamICCL
Policy evaluation is not directly applied in DynamICCL (which uses model-free PPO), but the conceptual foundation is identical: given a fixed policy (e.g., “always use ring algorithm with chunk=1MB”), estimate V^π by iterating the Bellman backup. In DynamICCL, this evaluation step is replaced by a neural network value function V(s; θ) trained on rollout data — the sample-based approximation of exact DP evaluation.