Value Functions and Bellman Equations

Motivation

The agent needs to evaluate states: “How good is it to be in state s?”

Good = expected future reward. But future depends on policy — so value is always defined with respect to a policy π.


State-Value Function V^π

V^π(s) = E_π[G_t | S_t = s]
        = E_π[Σ_{k=0}^{∞} γᵏ R_{t+k+1} | S_t = s]

The expected return when starting in s and following policy π thereafter.

Terminal states: V^π(s_terminal) = 0 by convention.

V^π is unique for a given π and MDP; it is the unique solution of the Bellman equation derived below.
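
A minimal sketch of what this expectation means operationally: estimate V^π(s) by averaging sampled discounted returns that start in s. The env.reset_to(s)/env.step(a) interface and the policy(state) function are hypothetical stand-ins for whatever environment is at hand.

  def mc_estimate_v(env, policy, s, gamma=0.9, n_episodes=1000, max_steps=200):
      """Estimate V^pi(s) as the sample average of discounted returns from s."""
      total = 0.0
      for _ in range(n_episodes):
          state = env.reset_to(s)           # hypothetical: start an episode in state s
          g, discount = 0.0, 1.0
          for _ in range(max_steps):
              a = policy(state)             # sample a ~ pi(.|state)
              state, r, done = env.step(a)  # hypothetical env interface
              g += discount * r             # accumulate gamma^k * R_{t+k+1}
              discount *= gamma
              if done:
                  break
          total += g
      return total / n_episodes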


Action-Value Function Q^π

Q^π(s, a) = E_π[G_t | S_t = s, A_t = a]
           = E_π[Σ_{k=0}^{∞} γᵏ R_{t+k+1} | S_t = s, A_t = a]

Expected return starting from s, taking action a, then following π.

Relationship: V^π(s) = Σ_a π(a|s) · Q^π(s, a)
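
In tabular form, with Q^π stored as an |S|×|A| array and π as an |S|×|A| array of probabilities, this relationship is one weighted sum per state (a minimal numpy sketch):

  import numpy as np

  def v_from_q(Q, pi):
      """V^pi(s) = sum_a pi(a|s) * Q^pi(s,a); Q and pi are (|S|, |A|) arrays."""
      return (pi * Q).sum(axis=1)           # shape (|S|,)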


Bellman Equation for V^π — Full Derivation

Starting from the definition:

V^π(s) = E_π[G_t | S_t = s]
        = E_π[R_{t+1} + γG_{t+1} | S_t = s]         -- recursive form
        = Σ_a π(a|s) Σ_{s',r} p(s',r | s,a) [r + γE_π[G_{t+1}|S_{t+1}=s']]
        = Σ_a π(a|s) Σ_{s',r} p(s',r | s,a) [r + γV^π(s')]

The Bellman equation for V^π:

V^π(s) = Σ_a π(a|s) Σ_{s',r} p(s',r | s,a) [r + γV^π(s')]

This is a self-consistency condition: V^π satisfies this system of |S| equations in |S| unknowns. For finite MDPs, this linear system has a unique solution.
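
One way to solve the system is to treat the equation as an update rule and sweep until the values stop changing (iterative policy evaluation). A minimal sketch, assuming the dynamics are summarized as arrays P[s, a, s'] = p(s'|s,a) and R[s, a] = expected immediate reward, which is one common way of storing p(s',r|s,a):

  import numpy as np

  def policy_evaluation(P, R, pi, gamma=0.9, tol=1e-8):
      """Apply the Bellman equation for V^pi repeatedly until it is self-consistent.

      P : (|S|, |A|, |S|) transition probabilities p(s'|s,a)
      R : (|S|, |A|)      expected immediate reward for (s,a)
      pi: (|S|, |A|)      policy probabilities pi(a|s)
      """
      V = np.zeros(P.shape[0])
      while True:
          # Q[s,a] = R[s,a] + gamma * sum_s' P[s,a,s'] * V[s']
          Q = R + gamma * P @ V
          V_new = (pi * Q).sum(axis=1)          # weight actions by pi(a|s)
          if np.max(np.abs(V_new - V)) < tol:
              return V_new
          V = V_new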

Matrix Form

V^π = R^π + γ P^π V^π

where:
  R^π(s) = Σ_a π(a|s) Σ_{s',r} p(s',r|s,a) · r    -- expected immediate reward
  P^π(s,s') = Σ_a π(a|s) Σ_r p(s',r|s,a)            -- transition matrix under π

Solving: V^π = (I - γP^π)^{-1} R^π   (O(|S|³), direct but expensive)
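
A direct-solve sketch under the same assumed (P, R, pi) layout as above; np.linalg.solve is preferable to forming the inverse explicitly:

  import numpy as np

  def policy_evaluation_direct(P, R, pi, gamma=0.9):
      """Solve (I - gamma * P^pi) V^pi = R^pi exactly (O(|S|^3))."""
      n_states = P.shape[0]
      R_pi = (pi * R).sum(axis=1)              # R^pi(s)
      P_pi = np.einsum('sa,sat->st', pi, P)    # P^pi(s,s') = sum_a pi(a|s) p(s'|s,a)
      return np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)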

Backup Diagram for V^π

    ●  ← state s
   /|\
  / | \            ← policy π selects actions (open circles)
 ○  ○  ○
 |  |  |           ← transition p(s',r|s,a)
 ●  ●  ●  ← next states s'

The Bellman equation is exactly the one-step backup: V^π(s) is computed from the values of successor states.


Bellman Equation for Q^π

Q^π(s,a) = Σ_{s',r} p(s',r|s,a) [r + γ Σ_{a'} π(a'|s') Q^π(s',a')]
           = Σ_{s',r} p(s',r|s,a) [r + γV^π(s')]
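
Given V^π, recovering Q^π is a single one-step lookahead. A sketch with the same assumed (P, R) layout as above:

  import numpy as np

  def q_from_v(P, R, V, gamma=0.9):
      """Q^pi(s,a) = R[s,a] + gamma * sum_s' P[s,a,s'] * V^pi(s')."""
      return R + gamma * P @ V                 # shape (|S|, |A|)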

Optimal Value Functions

V*(s) — Optimal State Value

V*(s) = max_π V^π(s) = max_a Σ_{s',r} p(s',r|s,a) [r + γV*(s')]

Q*(s,a) — Optimal Action Value

Q*(s,a) = Σ_{s',r} p(s',r|s,a) [r + γ max_{a'} Q*(s',a')]
         = Σ_{s',r} p(s',r|s,a) [r + γV*(s')]

Relationship

V*(s) = max_a Q*(s,a)
Q*(s,a) = Σ_{s',r} p(s',r|s,a) [r + γV*(s')]

Optimal Policy from V* or Q*

π*(s) = argmax_a Q*(s,a)              -- from Q*: trivial (no model needed!)
π*(s) = argmax_a Σ_{s',r} p(s',r|s,a)[r+γV*(s')]  -- from V*: needs model

Key insight: for model-free methods, Q* is more directly useful than V*: the optimal policy can be read off with an argmax per state, without knowing the dynamics p.
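
Both extraction routes in tabular form (same assumed (P, R) layout as above): the Q* route is a bare argmax, while the V* route needs the model for a one-step lookahead.

  import numpy as np

  def greedy_policy_from_q(Q_star):
      """pi*(s) = argmax_a Q*(s,a): no model required."""
      return np.argmax(Q_star, axis=1)

  def greedy_policy_from_v(P, R, V_star, gamma=0.9):
      """pi*(s) = argmax_a sum_{s',r} p(s',r|s,a) [r + gamma V*(s')]: needs P and R."""
      return np.argmax(R + gamma * P @ V_star, axis=1)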


Backup Diagram for V*

    ●  ← state s
    |
   max               ← agent picks best action (filled arc = max)
   /|\
  ○  ○  ○
  |  |  |
  ●  ●  ●  ← next states (environment random)

Compare to V^π: the max replaces the weighted sum over π.


Worked Example: 4×4 Gridworld

States: 1–14 (non-terminal); 0 and 15 (terminal). Policy π = uniform random (each of the four moves with probability 0.25). Actions that would move off the grid leave the state unchanged. Reward = -1 on every transition until a terminal state is reached; the task is episodic and undiscounted (γ = 1).

Bellman equation for state 1 (adjacent to terminal 0):

V^π(1) = 0.25·[(-1 + V^π(1))    ← go up → off grid, stay at 1
               +(-1 + V^π(5))    ← go down
               +(-1 + V^π(0))    ← go left → terminal 0
               +(-1 + V^π(2))]   ← go right
        = -1 + 0.25·[V^π(1) + V^π(5) + V^π(0) + V^π(2)]
        = -1 + 0.25·[V^π(1) + V^π(5) + V^π(2)]           since V^π(0) = 0

Solving the full 14-equation system (or iterating the Bellman backup to convergence, as in the sketch below) gives V^π(1) = -14: the negative of the expected number of steps to reach a terminal cell from state 1 under the random policy.
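
A self-contained sketch that builds this gridworld's (P, R) arrays and evaluates the uniform random policy by sweeping the Bellman backup (γ = 1 is fine here because every episode terminates); the top row of the result comes out as 0, -14, -20, -22.

  import numpy as np

  # 4x4 grid, states 0..15 row-major, terminals 0 and 15
  # actions: 0 = up, 1 = down, 2 = left, 3 = right
  n_states, n_actions, gamma = 16, 4, 1.0
  terminals = {0, 15}
  moves = {0: -4, 1: +4, 2: -1, 3: +1}

  P = np.zeros((n_states, n_actions, n_states))
  R = np.full((n_states, n_actions), -1.0)

  for s in range(n_states):
      if s in terminals:
          P[s, :, s] = 1.0                  # absorbing, no further reward
          R[s, :] = 0.0
          continue
      row, col = divmod(s, 4)
      for a, d in moves.items():
          s2 = s + d
          # moves that would leave the grid keep the state unchanged
          if (a == 0 and row == 0) or (a == 1 and row == 3) \
             or (a == 2 and col == 0) or (a == 3 and col == 3):
              s2 = s
          P[s, a, s2] = 1.0

  pi = np.full((n_states, n_actions), 0.25)  # uniform random policy

  V = np.zeros(n_states)
  delta = 1.0
  while delta > 1e-10:
      V_new = (pi * (R + gamma * P @ V)).sum(axis=1)   # one Bellman backup sweep
      delta = np.max(np.abs(V_new - V))
      V = V_new

  print(np.round(V.reshape(4, 4), 1))        # top row: 0, -14, -20, -22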


Properties of Bellman Equations

  1. Existence and uniqueness: for finite MDPs with γ < 1 (or episodic), both V^π and V* exist and are unique
  2. Contraction: the Bellman optimality operator T is a γ-contraction in the sup norm
  3. Fixed point: V* = T(V*); iterating T from any initial V converges to V* (sketch below)
  4. Linear system for V^π: (I - γP^π)V^π = R^π; invertible since the eigenvalues of γP^π have magnitude ≤ γ < 1
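
A sketch of properties 2 and 3 in action: value iteration applies the optimality operator T repeatedly until it reaches its fixed point (same assumed (P, R) layout as in the earlier sketches).

  import numpy as np

  def bellman_optimality_operator(V, P, R, gamma):
      """(T V)(s) = max_a [ R[s,a] + gamma * sum_s' P[s,a,s'] * V[s'] ]."""
      return (R + gamma * P @ V).max(axis=1)

  def value_iteration(P, R, gamma=0.9, tol=1e-10):
      """Iterate T from V = 0; by the contraction property,
      ||V_{k+1} - V_k||_inf <= gamma * ||V_k - V_{k-1}||_inf, so this converges."""
      V = np.zeros(P.shape[0])
      while True:
          V_new = bellman_optimality_operator(V, P, R, gamma)
          if np.max(np.abs(V_new - V)) < tol:
              return V_new                  # approximate fixed point: V* = T(V*)
          V = V_new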