Value Functions and Bellman Equations
Motivation
The agent needs to evaluate states: “How good is it to be in state s?”
Good = expected future reward. But future depends on policy — so value is always defined with respect to a policy π.
State-Value Function V^π
V^π(s) = E_π[G_t | S_t = s]
= E_π[Σ_{k=0}^{∞} γᵏ R_{t+k+1} | S_t = s]
The expected return when starting in s and following policy π thereafter.
Terminal states: V^π(s_terminal) = 0 by convention.
For a given π and MDP, V^π is unique: it is the unique solution of the Bellman equation derived below.
Action-Value Function Q^π
Q^π(s, a) = E_π[G_t | S_t = s, A_t = a]
= E_π[Σ_{k=0}^{∞} γᵏ R_{t+k+1} | S_t = s, A_t = a]
Expected return starting from s, taking action a, then following π.
Relationship: V^π(s) = Σ_a π(a|s) · Q^π(s, a)
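A one-function sketch of this relationship, assuming the policy and action values are stored as numpy arrays of shape (S, A); the names `pi` and `Q` are illustrative, not fixed notation:

```python
import numpy as np

def v_from_q(pi, Q):
    """V^pi(s) = sum_a pi(a|s) Q^pi(s, a).

    pi : (S, A) array, pi[s, a] = probability of taking action a in state s
    Q  : (S, A) array of action values under pi
    """
    # probability-weighted average of action values in each state
    return np.einsum('sa,sa->s', pi, Q)
```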
Bellman Equation for V^π — Full Derivation
Starting from the definition:
V^π(s) = E_π[G_t | S_t = s]
= E_π[R_{t+1} + γG_{t+1} | S_t = s] -- recursive form
= Σ_a π(a|s) Σ_{s',r} p(s',r | s,a) [r + γE_π[G_{t+1}|S_{t+1}=s']] -- expand over a ~ π(·|s) and (s',r) ~ p(·|s,a); Markov property
= Σ_a π(a|s) Σ_{s',r} p(s',r | s,a) [r + γV^π(s')]
The Bellman equation for V^π:
V^π(s) = Σ_a π(a|s) Σ_{s',r} p(s',r | s,a) [r + γV^π(s')]
This is a self-consistency condition: V^π satisfies this system of |S| equations in |S| unknowns. For finite MDPs, this linear system has a unique solution.
Matrix Form
V^π = R^π + γ P^π V^π
where:
R^π(s) = Σ_a π(a|s) Σ_{s',r} p(s',r|s,a) · r -- expected immediate reward
P^π(s,s') = Σ_a π(a|s) Σ_r p(s',r|s,a) -- transition matrix under π
Solving: V^π = (I - γP^π)^{-1} R^π (O(|S|³), direct but expensive)
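A minimal sketch of the direct solve, assuming the model under π has already been reduced to arrays `P_pi` (the |S|×|S| transition matrix) and `R_pi` (the expected immediate reward vector); the names and signature are illustrative, not a fixed API:

```python
import numpy as np

def policy_evaluation_direct(P_pi, R_pi, gamma):
    """Solve V^pi = R^pi + gamma P^pi V^pi as a linear system.

    P_pi : (S, S) array, transition probabilities under pi
    R_pi : (S,)  array, expected immediate reward under pi
    """
    S = len(R_pi)
    # solve (I - gamma P^pi) V = R^pi directly, avoiding an explicit inverse
    return np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)
```

Using np.linalg.solve rather than forming (I - γP^π)^{-1} is standard numerical practice; the cost is still O(|S|³).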
Backup Diagram for V^π
● ← state s
/|\
/ | \ ← policy π selects actions (open circles)
○ ○ ○
| | | ← transition p(s',r|s,a)
● ● ● ← next states s'
The Bellman equation is exactly the one-step backup: V^π(s) is computed from the values of successor states.
Bellman Equation for Q^π
Q^π(s,a) = Σ_{s',r} p(s',r|s,a) [r + γ Σ_{a'} π(a'|s') Q^π(s',a')]
= Σ_{s',r} p(s',r|s,a) [r + γV^π(s')]
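A sketch of the second form, assuming an explicit model given as illustrative arrays `P` (S×A×S transition probabilities) and `R` (S×A expected rewards):

```python
import numpy as np

def q_from_v(V, P, R, gamma):
    """Q^pi(s,a) = r(s,a) + gamma * sum_{s'} p(s'|s,a) V^pi(s').

    P : (S, A, S) array of transition probabilities p(s'|s,a)
    R : (S, A)    array of expected immediate rewards r(s,a)
    V : (S,)      state values under pi
    """
    # expected immediate reward plus discounted expected next-state value
    return R + gamma * np.einsum('sat,t->sa', P, V)
```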
Optimal Value Functions
V*(s) — Optimal State Value
V*(s) = max_π V^π(s) = max_a Σ_{s',r} p(s',r|s,a) [r + γV*(s')]
Q*(s,a) — Optimal Action Value
Q*(s,a) = Σ_{s',r} p(s',r|s,a) [r + γ max_{a'} Q*(s',a')]
= Σ_{s',r} p(s',r|s,a) [r + γV*(s')]
Relationship
V*(s) = max_a Q*(s,a)
Q*(s,a) = Σ_{s',r} p(s',r|s,a) [r + γV*(s')]
Optimal Policy from V* or Q*
π*(s) = argmax_a Q*(s,a) -- from Q*: trivial (no model needed!)
π*(s) = argmax_a Σ_{s',r} p(s',r|s,a)[r+γV*(s')] -- from V*: needs model
Key insight: Q* is more directly useful than V* for model-free methods: the greedy policy can be read off from Q* without knowing the dynamics p.
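A sketch of both extraction rules, reusing the same illustrative `P` (S×A×S transitions) and `R` (S×A expected rewards) arrays as above; note that only the V* version needs them:

```python
import numpy as np

def greedy_policy_from_q(Q):
    """pi*(s) = argmax_a Q*(s, a): no model needed, just the Q-table."""
    return np.argmax(Q, axis=1)

def greedy_policy_from_v(V, P, R, gamma):
    """pi*(s) = argmax_a [r(s,a) + gamma * sum_{s'} p(s'|s,a) V*(s')]: needs the model."""
    return np.argmax(R + gamma * np.einsum('sat,t->sa', P, V), axis=1)
```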
Backup Diagram for V*
● ← state s
|
max ← agent picks best action (filled arc = max)
/|\
○ ○ ○
| | |
● ● ● ← next states (environment random)
Compare to V^π: the max replaces the weighted sum over π.
Worked Example: 4×4 Gridworld
States: 1–14 (non-terminal), 0 and 15 (terminal). Policy π = uniform random (each of the four directions with prob 0.25). Actions that would move off the grid leave the agent in the same cell. Reward = -1 on every transition until a terminal state is reached; γ = 1 (undiscounted episodic task).
Bellman equation for state 1 (adjacent to terminal 0):
V^π(1) = 0.25·[(-1 + V^π(1)) ← go up → off grid, stay at 1
        +(-1 + V^π(5)) ← go down
        +(-1 + V^π(0)) ← go left → terminal
        +(-1 + V^π(2))] ← go right
      = -1 + 0.25·[V^π(1) + V^π(5) + V^π(0) + V^π(2)]
      = -1 + 0.25·[V^π(1) + V^π(5) + V^π(2)] -- since V^π(0) = 0
Solving the full 14-equation linear system gives V^π(1) = -14, the negative of the expected number of steps to reach a terminal square under the random policy. The sketch below computes the whole value table.
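To double-check the worked example, here is a small iterative policy evaluation sketch for this gridworld (in-place sweeps instead of the direct linear solve; the function name and tolerance are illustrative):

```python
import numpy as np

def gridworld_random_policy_values(theta=1e-8):
    """Iterative policy evaluation for the 4x4 gridworld (gamma = 1, reward -1 per step)."""
    terminal = {0, 15}
    V = np.zeros(16)
    while True:
        delta = 0.0
        for s in range(16):
            if s in terminal:
                continue
            row, col = divmod(s, 4)  # states numbered row-major, 0 at top-left
            v_new = 0.0
            for dr, dc in [(-1, 0), (1, 0), (0, -1), (0, 1)]:  # up, down, left, right
                nr, nc = row + dr, col + dc
                # moves off the grid leave the state unchanged
                if not (0 <= nr < 4 and 0 <= nc < 4):
                    nr, nc = row, col
                v_new += 0.25 * (-1 + V[nr * 4 + nc])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V.reshape(4, 4)

grid = gridworld_random_policy_values()
# grid[0, 1] is state 1: it converges to -14, matching the hand calculation above
```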
Properties of Bellman Equations
- Existence and uniqueness: for finite MDPs with γ < 1 (or episodic tasks that terminate with probability 1), both V^π and V* exist and are unique
- Contraction: the Bellman optimality operator T is a γ-contraction in the sup norm
- Fixed point: V* = T(V*); iterating T from any initial V converges to V* (see the value iteration sketch after this list)
- Linear system for V^π: (I - γP^π)V^π = R^π; invertible since eigenvalues of γP^π have magnitude ≤ γ < 1
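A sketch of iterating the Bellman optimality operator T (value iteration), using the same illustrative `P` (S×A×S transitions) and `R` (S×A expected rewards) model arrays as in the earlier sketches; the stopping tolerance is arbitrary:

```python
import numpy as np

def value_iteration(P, R, gamma, theta=1e-8):
    """Iterate the Bellman optimality operator T toward its fixed point V*.

    Because T is a gamma-contraction in the sup norm,
    ||V_k - V*||_inf <= gamma^k ||V_0 - V*||_inf, so the sweep converges
    from any starting point.
    """
    S = P.shape[0]
    V = np.zeros(S)
    while True:
        # (T V)(s) = max_a [ r(s,a) + gamma * sum_{s'} p(s'|s,a) V(s') ]
        V_new = np.max(R + gamma * np.einsum('sat,t->sa', P, V), axis=1)
        if np.max(np.abs(V_new - V)) < theta:
            return V_new
        V = V_new
```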