Markov Decision Processes and Value Iteration

Chapter 17 — Making Complex Decisions
Book: Artificial Intelligence: A Modern Approach (Russell & Norvig, 4th ed.)
Pages: 562–609


Sequential Decision Problems

When decisions must be made over time and each decision affects future options, we need to plan a policy — a mapping from states to actions.

A Markov Decision Process (MDP) formalizes this:
- States S, actions A
- Transition model P(s' | s, a): probability of reaching s' from s via action a
- Reward function R(s, a, s') or R(s): immediate reward
- Discount factor γ ∈ [0, 1]: present vs. future reward tradeoff
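As a concrete (and purely illustrative) way to hold these pieces in code, a finite MDP can be written as plain Python dictionaries; the toy two-state MDP below is made up for this sketch and is not from the book.

# P[s][a] is a list of (probability, next_state) pairs; R[(s, a, s_next)] is the
# immediate reward (unspecified triples default to 0). Toy example for illustration only.
P = {
    "s0": {"stay": [(1.0, "s0")], "go": [(0.8, "s1"), (0.2, "s0")]},
    "s1": {"stay": [(1.0, "s1")], "go": [(1.0, "s0")]},
}
R = {("s0", "go", "s1"): 1.0}
gamma = 0.9   # discount factor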


The Utility of State Sequences

Agent receives reward r_t at each time step t. Total utility:

Additive: Σ r_t (can be infinite if horizon is infinite)

Discounted: Σ γᵗ · r_t — converges for γ < 1 with bounded rewards (worked example below)

Average reward: lim_{T→∞} (1/T) Σ r_t — for infinite-horizon without discounting
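As a quick worked check on the discounted case: with a constant reward r at every step and γ = 0.9, Σ γᵗ · r is a geometric series equal to r/(1−γ) = 10r, so the total stays finite even over an infinite horizon.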

The discount factor γ controls the horizon:
- γ = 0: myopic (only the immediate reward matters)
- γ → 1: far-sighted (future rewards weighted nearly as much as immediate ones)


Optimal Policy and the Bellman Equations

Policy π: S → A. Value of policy π from state s:

V^π(s) = E[Σ γᵗ · R(Sₜ, π(Sₜ), Sₜ₊₁) | S₀=s, following π]

Optimal value function V*(s) = max_π V^π(s).

Bellman optimality equations (V* and Q* are their unique solutions, and acting greedily with respect to them yields an optimal policy):

V*(s) = max_a Σ_{s'} P(s'|s,a) · [R(s,a,s') + γ · V*(s')]
Q*(s,a) = Σ_{s'} P(s'|s,a) · [R(s,a,s') + γ · max_{a'} Q*(s',a')]

Optimal policy: π*(s) = argmax_a Q*(s,a)
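A minimal sketch of this extraction step, assuming a Q table keyed by (state, action) pairs (the helper name and representation are assumptions for illustration, not from the book):

def greedy_policy(Q, states, actions):
    # π*(s) = argmax_a Q*(s, a): pick the highest-valued action in each state.
    return {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}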


Value Iteration

Solve the Bellman equations by repeatedly applying the Bellman update, which is a contraction mapping:

function VALUE-ITERATION(mdp, ε) returns utility function U
    U ← U' ← zero vector
    repeat
        U ← U'
        δ ← 0
        for each state s in S do
            U'[s] ← max_a Σ_{s'} P(s'|s,a) · [R(s,a,s') + γ · U[s']]
            δ ← max(δ, |U'[s] - U[s]|)
    until δ < ε(1-γ)/γ    -- convergence criterion
    return U
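A runnable Python sketch of the same loop, reusing the dict-based P/R encoding assumed earlier (the encoding and names are assumptions, not the book's):

def value_iteration(P, R, gamma, epsilon=1e-6):
    # Repeatedly apply the Bellman update U'[s] = max_a Σ_{s'} P(s'|s,a)·[R(s,a,s') + γ·U[s']]
    # until the largest per-state change falls below ε(1-γ)/γ.
    U = {s: 0.0 for s in P}
    while True:
        U_new, delta = {}, 0.0
        for s in P:
            U_new[s] = max(
                sum(p * (R.get((s, a, s2), 0.0) + gamma * U[s2]) for p, s2 in outcomes)
                for a, outcomes in P[s].items()
            )
            delta = max(delta, abs(U_new[s] - U[s]))
        U = U_new
        if delta < epsilon * (1 - gamma) / gamma:   # convergence criterion
            return U

Applied to the toy two-state MDP sketched earlier, this converges to the unique fixed point discussed below.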

Convergence

Contraction property: the Bellman update is a contraction mapping (factor γ):

||BU - BU'||_∞ ≤ γ · ||U - U'||_∞

→ Converges to unique fixed point V* regardless of initial values.

Error bound: if δ < ε(1-γ)/γ, then ||U - V*||_∞ < ε.

Complexity per iteration: O(|S|² · |A|) for dense transition model.
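A rough estimate of how many sweeps are needed, under the assumption that rewards are bounded by ±R_max: utilities are then bounded by ±R_max/(1−γ), and since each sweep shrinks the error by a factor γ, about N ≈ ⌈log(2·R_max / (ε·(1−γ))) / log(1/γ)⌉ iterations suffice for error ε; this grows sharply as γ → 1.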


Policy Iteration

Alternative: work directly on the policy.

function POLICY-ITERATION(mdp) returns policy
    π ← arbitrary initial policy
    loop do
        -- Policy evaluation: compute V^π
        V ← POLICY-EVALUATION(π, mdp)
        -- Policy improvement: greedy step
        for each state s in S do
            π'(s) ← argmax_a Σ_{s'} P(s'|s,a) · [R(s,a,s') + γ · V[s']]
        if π' = π then return π
        π ← π'

Policy evaluation: because π is fixed there is no max, so the Bellman equations become a linear system (|S| equations in |S| unknowns):

V^π(s) = Σ_{s'} P(s'|s,π(s)) · [R(s,π(s),s') + γ · V^π(s')]
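A minimal NumPy sketch of the exact solve, assuming states are indexed 0..|S|−1 and that the matrix P_pi and vector R_pi have already been assembled for the fixed policy (both names are assumptions for this sketch):

import numpy as np

def policy_evaluation(P_pi, R_pi, gamma):
    # Solve (I - γ·P_π) V = R_π, i.e. the linear system above.
    # P_pi[s, s2] = P(s2 | s, π(s)); R_pi[s] = Σ_{s2} P(s2|s,π(s)) · R(s, π(s), s2)
    # is the expected immediate reward under π in state s.
    n = P_pi.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)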

Policy Iteration vs. Value Iteration

Property               Value Iteration              Policy Iteration
Number of iterations   Many (until convergence)     Few (often < 10)
Cost per iteration     O(|S|²·|A|)                  O(|S|³ + |S|²·|A|) with exact evaluation
Practical use          Good for large |S|           Good when the O(|S|³) solve is affordable

Policy iteration converges in a finite number of iterations (bounded by |A|^|S|). Value iteration converges asymptotically.


Modified Policy Iteration

Hybrid: do k steps of value iteration for policy evaluation (instead of full solve).
- k = 1: value iteration
- k = ∞: policy iteration
- k ≈ 20: usually best in practice
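A minimal sketch of the truncated evaluation step, reusing the dict-based MDP encoding assumed in the earlier sketches (names are illustrative):

def approximate_policy_evaluation(P, R, pi, U, gamma, k=20):
    # Modified policy iteration: apply k simplified Bellman backups under the
    # fixed policy π instead of solving the linear system exactly.
    for _ in range(k):
        U = {s: sum(p * (R.get((s, pi[s], s2), 0.0) + gamma * U[s2])
                    for p, s2 in P[s][pi[s]])
             for s in P}
    return U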


Example: 4×3 Grid MDP

Classic AIMA example:
- 4×3 grid; (1,1) = start; (4,3) = +1 reward; (4,2) = -1 reward; wall at (2,2)
- Stochastic actions: 80% intended, 10% each perpendicular
- γ = 0.9

Value iteration converges to a policy that avoids (4,2) and navigates to (4,3).
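One possible encoding of the grid's stochastic transition model, compatible with the (probability, next_state) convention of the earlier value-iteration sketch (all names are illustrative assumptions, not from the book):

WALL = {(2, 2)}
TERMINAL_REWARD = {(4, 3): +1.0, (4, 2): -1.0}   # rewards at the terminal cells, per the notes
STATES = [(x, y) for x in range(1, 5) for y in range(1, 4) if (x, y) not in WALL]
MOVE = {"Up": (0, 1), "Down": (0, -1), "Left": (-1, 0), "Right": (1, 0)}
PERP = {"Up": ("Left", "Right"), "Down": ("Left", "Right"),
        "Left": ("Up", "Down"), "Right": ("Up", "Down")}

def slide(s, direction):
    # Attempt to move one cell; bumping into the wall or the grid edge leaves the agent in place.
    nxt = (s[0] + MOVE[direction][0], s[1] + MOVE[direction][1])
    return nxt if nxt in STATES else s

def transitions(s, a):
    # 80% intended direction, 10% each perpendicular; terminal states are absorbing.
    if s in TERMINAL_REWARD:
        return [(1.0, s)]
    left, right = PERP[a]
    return [(0.8, slide(s, a)), (0.1, slide(s, left)), (0.1, slide(s, right))]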


Connection to RL

Value iteration = dynamic programming (requires known P and R).

RL replaces P and R with experience samples → Q-learning, SARSA, actor-critic.

Bellman equations are the foundation of all value-based RL:

Q-learning: Q(s,a) ← Q(s,a) + α[r + γ·max_{a'} Q(s',a') - Q(s,a)]
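A minimal tabular Q-learning sketch of this update, assuming a Gym-style environment with reset() → s and step(a) → (s', r, done); that interface and all names are assumptions for illustration:

import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.9, eps=0.1):
    # Learn Q(s, a) from sampled transitions; no transition model P or reward model R needed.
    Q = defaultdict(float)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # ε-greedy action selection
            a = random.choice(actions) if random.random() < eps \
                else max(actions, key=lambda a_: Q[(s, a_)])
            s2, r, done = env.step(a)
            target = r + gamma * max(Q[(s2, a_)] for a_ in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])   # the update rule above
            s = s2
    return Q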

DynamICCL: the RL policy is itself a solution to an MDP in which states are NCCL+network configurations, actions are parameter choices, and rewards are throughput.