Markov Decision Processes and Value Iteration
Chapter 17: Making Complex Decisions. Book: Artificial Intelligence: A Modern Approach (Russell & Norvig, 4th ed.), pages 562–609.
Sequential Decision Problems
When decisions must be made over time and each decision affects future options, we need to plan a policy — a mapping from states to actions.
A Markov Decision Process (MDP) formalizes this:
- States S, actions A
- Transition model P(s' | s, a): probability of reaching s' from s via action a
- Reward function R(s, a, s') or R(s): immediate reward
- Discount factor γ ∈ [0, 1]: tradeoff between present and future reward
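One minimal way to hold these ingredients in code, assuming a plain dictionary layout (the names `S`, `A`, `P`, `R`, `gamma` and the two-state example are illustrative choices, not from the text):

```python
# Illustrative encoding of a tiny two-state MDP: P[s][a] lists
# (probability, next_state) pairs and R[s] is the state-based reward form.
S = ["cool", "hot"]
A = {"cool": ["work", "rest"], "hot": ["rest"]}
P = {
    "cool": {"work": [(0.8, "cool"), (0.2, "hot")],
             "rest": [(1.0, "cool")]},
    "hot":  {"rest": [(0.6, "cool"), (0.4, "hot")]},
}
R = {"cool": 1.0, "hot": -1.0}   # immediate reward for occupying each state
gamma = 0.9                      # discount factor
```

The later sketches in this section reuse this layout.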
The Utility of State Sequences
Agent receives reward r_t at each time step t. Total utility:
- Additive: Σ r_t (can diverge when the horizon is infinite)
- Discounted: Σ γᵗ · r_t (converges for γ < 1 when rewards are bounded)
- Average reward: lim_{T→∞} (1/T) Σ r_t (for infinite horizons without discounting)
Discounting γ controls how far-sighted the agent is (illustrated below):
- γ = 0: myopic (only the immediate reward matters)
- γ → 1: far-sighted (future rewards are weighted nearly as heavily as immediate ones)
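A few lines make the effect of γ concrete; the reward sequence here is made up purely for illustration:

```python
def discounted_return(rewards, gamma):
    """Discounted utility of a reward sequence: sum of gamma**t * r_t."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

rewards = [0, 0, 0, 10]                  # a payoff that arrives three steps late
print(discounted_return(rewards, 0.0))   # 0.0   -> myopic: the delayed +10 is invisible
print(discounted_return(rewards, 0.5))   # 1.25  -> heavily discounted
print(discounted_return(rewards, 0.99))  # ~9.70 -> almost as valuable as an immediate reward
```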
Optimal Policy and the Bellman Equations
Policy π: S → A. Value of policy π from state s:
V^π(s) = E[Σ γᵗ · R(Sₜ, π(Sₜ), Sₜ₊₁) | S₀=s, following π]
Optimal value function V*(s) = max_π V^π(s).
Bellman optimality equations (V* is their unique solution, and acting greedily with respect to it is optimal):
V*(s) = max_a Σ_{s'} P(s'|s,a) · [R(s,a,s') + γ · V*(s')]
Q*(s,a) = Σ_{s'} P(s'|s,a) · [R(s,a,s') + γ · max_{a'} Q*(s',a')]
Optimal policy: π*(s) = argmax_a Q*(s,a)
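A sketch of the one-step lookahead and the greedy-policy extraction, assuming the dictionary encoding from earlier and the state-based reward form R(s) (the function names are this sketch's choice):

```python
def q_value(s, a, V, P, R, gamma):
    """One-step lookahead: Q(s,a) = R(s) + gamma * sum_s' P(s'|s,a) * V(s')."""
    return R[s] + gamma * sum(p * V[s2] for p, s2 in P[s][a])

def greedy_policy(V, A, P, R, gamma):
    """Extract pi(s) = argmax_a Q(s,a) from a value estimate V."""
    return {s: max(A[s], key=lambda a: q_value(s, a, V, P, R, gamma)) for s in A}
```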
Value Iteration
Solve Bellman equations by iterative application (contraction mapping):
```
function VALUE-ITERATION(mdp, ε) returns a utility function U
    U ← U' ← zero vector
    repeat
        U ← U'; δ ← 0
        for each state s in S do
            U'[s] ← R(s) + γ · max_a Σ_{s'} P(s'|s,a) · U[s']
            δ ← max(δ, |U'[s] − U[s]|)
    until δ < ε(1−γ)/γ          -- convergence criterion
    return U
```
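A direct Python transcription of the pseudocode, under the same assumptions as the earlier sketches (dictionary transition model, state rewards R[s], γ < 1); this is an illustrative sketch, not the book's reference implementation:

```python
def value_iteration(S, A, P, R, gamma, eps=1e-4):
    """Repeat Bellman updates until the largest change delta satisfies the
    stopping criterion delta < eps*(1-gamma)/gamma (assumes gamma < 1)."""
    U1 = {s: 0.0 for s in S}          # U' in the pseudocode
    while True:
        U = U1.copy()                 # U <- U'
        delta = 0.0
        for s in S:
            U1[s] = R[s] + gamma * max(
                sum(p * U[s2] for p, s2 in P[s][a]) for a in A[s])
            delta = max(delta, abs(U1[s] - U[s]))
        if delta < eps * (1 - gamma) / gamma:
            return U1                 # utilities within eps of V* in max norm

# Example with the two-state MDP sketched earlier:
# U = value_iteration(S, A, P, R, gamma)
```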
Convergence
Contraction property: the Bellman update is a contraction mapping (factor γ):
||BU - BU'||_∞ ≤ γ · ||U - U'||_∞
→ Converges to unique fixed point V* regardless of initial values.
Error bound: if δ < ε(1-γ)/γ, then ||U - V*||_∞ < ε.
Complexity per iteration: O(|S|² · |A|) for dense transition model.
Policy Iteration
Alternative: search in policy space directly, alternating policy evaluation and policy improvement.
```
function POLICY-ITERATION(mdp) returns a policy
    π ← an arbitrary initial policy
    loop do
        -- Policy evaluation: compute V^π
        V ← POLICY-EVALUATION(π, mdp)
        -- Policy improvement: greedy step
        for each state s in S do
            π'[s] ← argmax_a Σ_{s'} P(s'|s,a) · [R(s,a,s') + γ · V[s']]
        if π' = π then return π
        π ← π'
```
Policy evaluation: solve the linear system (|S| equations in |S| unknowns):
V^π(s) = R(s) + γ · Σ_{s'} P(s'|s,π(s)) · V^π(s')
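A sketch of policy iteration that performs the evaluation step by solving this linear system with NumPy, using the same illustrative dictionary encoding as before:

```python
import numpy as np

def policy_evaluation(pi, S, P, R, gamma):
    """Exact evaluation: solve (I - gamma * P_pi) V = R for the fixed policy pi."""
    idx = {s: i for i, s in enumerate(S)}
    P_pi = np.zeros((len(S), len(S)))
    for s in S:
        for p, s2 in P[s][pi[s]]:
            P_pi[idx[s], idx[s2]] += p
    r = np.array([R[s] for s in S], dtype=float)
    v = np.linalg.solve(np.eye(len(S)) - gamma * P_pi, r)
    return {s: v[idx[s]] for s in S}

def policy_iteration(S, A, P, R, gamma):
    """Alternate exact evaluation and greedy improvement until the policy is stable."""
    pi = {s: A[s][0] for s in S}      # arbitrary initial policy
    while True:
        V = policy_evaluation(pi, S, P, R, gamma)
        # R(s) is constant in a, so the argmax only needs the expected next-state value.
        pi_new = {s: max(A[s], key=lambda a: sum(p * V[s2] for p, s2 in P[s][a]))
                  for s in S}
        if pi_new == pi:
            return pi
        pi = pi_new
```

With γ < 1 the matrix I − γP_π is always invertible, so the exact solve never fails.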
Policy Iteration vs. Value Iteration
| Property | Value Iteration | Policy Iteration |
|---|---|---|
| Number of iterations | Many (until convergence) | Few (often < 10) |
| Cost per iteration | O(\|S\|² · \|A\|) | O(\|S\|³ + \|S\|² · \|A\|) with exact evaluation |
| Practical use | Good for large \|S\| (no linear solve per sweep) | Good when \|S\| is small enough to solve the linear system |
Policy iteration converges in a finite number of iterations (bounded by |A|^|S|). Value iteration converges asymptotically.
Modified Policy Iteration
Hybrid: run k sweeps of value iteration for policy evaluation instead of the full linear solve (see the sketch after this list).
- k = 1: value iteration
- k = ∞: policy iteration
- k ≈ 20: usually best in practice
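The only change relative to the exact solve is swapping in k sweeps of the fixed-policy Bellman update; a sketch under the same assumptions as before:

```python
def approximate_policy_evaluation(pi, S, P, R, gamma, V=None, k=20):
    """k sweeps of the fixed-policy Bellman update instead of an exact linear solve."""
    V = dict(V) if V is not None else {s: 0.0 for s in S}
    for _ in range(k):
        V = {s: R[s] + gamma * sum(p * V[s2] for p, s2 in P[s][pi[s]]) for s in S}
    return V
```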
Example: 4×3 Grid MDP
Classic AIMA example (constructed in code below):
- 4×3 grid; (1,1) = start; (4,3) = +1 reward; (4,2) = −1 reward; wall at (2,2)
- Stochastic actions: 80% intended direction, 10% each perpendicular direction
- γ = 0.9
Value iteration converges to a policy that avoids (4,2) and navigates to (4,3).
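A self-contained sketch of the grid and value iteration on it; the encoding and the terminal-state handling (terminal utilities pinned to their rewards) are this sketch's choices, not the book's code:

```python
# 4x3 grid: columns x = 1..4, rows y = 1..3, wall at (2,2), terminals at (4,3)
# and (4,2); moves succeed with prob. 0.8 and slip perpendicular with prob. 0.1 each.
GAMMA, EPS = 0.9, 1e-4
WALL = {(2, 2)}
TERMINALS = {(4, 3): +1.0, (4, 2): -1.0}
STATES = [(x, y) for x in range(1, 5) for y in range(1, 4) if (x, y) not in WALL]
MOVES = {"U": (0, 1), "D": (0, -1), "L": (-1, 0), "R": (1, 0)}
PERP = {"U": ("L", "R"), "D": ("L", "R"), "L": ("U", "D"), "R": ("U", "D")}

def step(s, move):
    """Deterministic effect of one move: stay put if it would leave the grid or hit the wall."""
    nxt = (s[0] + MOVES[move][0], s[1] + MOVES[move][1])
    return nxt if nxt in STATES else s

def transitions(s, a):
    """Outcome distribution of action a in state s: 0.8 intended, 0.1 each perpendicular slip."""
    left, right = PERP[a]
    return [(0.8, step(s, a)), (0.1, step(s, left)), (0.1, step(s, right))]

# Value iteration on the grid; terminal utilities are pinned to their rewards.
U = {s: 0.0 for s in STATES}
while True:
    U_new, delta = {}, 0.0
    for s in STATES:
        if s in TERMINALS:
            U_new[s] = TERMINALS[s]
        else:
            U_new[s] = GAMMA * max(sum(p * U[s2] for p, s2 in transitions(s, a))
                                   for a in MOVES)
        delta = max(delta, abs(U_new[s] - U[s]))
    U = U_new
    if delta < EPS * (1 - GAMMA) / GAMMA:
        break

# Greedy policy from the converged utilities.
policy = {s: max(MOVES, key=lambda a: sum(p * U[s2] for p, s2 in transitions(s, a)))
          for s in STATES if s not in TERMINALS}
print(policy[(1, 1)])   # an action that heads toward the +1 terminal at (4,3)
```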
Connection to RL
Value iteration = dynamic programming (requires known P and R).
RL replaces P and R with experience samples → Q-learning, SARSA, actor-critic.
Bellman equations are the foundation of all value-based RL:
Q-learning: Q(s,a) ← Q(s,a) + α[r + γ·max_{a'} Q(s',a') - Q(s,a)]
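In tabular form this update needs only a single sampled transition (s, a, r, s'), not P or R; a minimal sketch with illustrative names and constants:

```python
from collections import defaultdict

Q = defaultdict(float)          # Q[(s, a)] -> current action-value estimate
ALPHA, GAMMA = 0.1, 0.9         # learning rate and discount (illustrative values)

def q_learning_update(s, a, r, s_next, actions):
    """One tabular Q-learning step from a sampled transition (s, a, r, s')."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
```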
DynamICCL: the RL policy IS a solution to an MDP where states = NCCL+network configurations, actions = parameter choices, rewards = throughput.