Policy Gradient Methods and the Policy Gradient Theorem
Chapter 13 — Policy Gradient Methods
Book: Reinforcement Learning: An Introduction (Sutton & Barto, 2nd ed) Pages: 321–346
Why Policy Gradients?
Value-based methods (Q-learning, SARSA): learn Q(s,a;θ), then derive the policy as argmax_a Q(s,a;θ). Drawbacks:
- Problem 1: deterministic policy: no natural way to express stochastic policies
- Problem 2: discontinuous improvement: a small change in Q can flip the argmax action entirely
- Problem 3: large action spaces: argmax over continuous actions is hard to compute
Policy gradient methods: directly parameterize the policy π(a|s; θ) and optimize θ via gradient ascent on J(θ) = E_π[G_t].
Benefits:
1. Stochastic policies: natural for exploration, partial observability, aliased states
2. Smooth updates: the policy changes continuously with θ, giving stable learning
3. Continuous actions: a Gaussian (or other parametric) policy handles continuous action spaces directly
4. Convergence to local optima: gradient ascent has well-defined convergence conditions
Policy Parameterization
Softmax (discrete actions):
π(a|s; θ) = exp(h(s,a; θ)) / Σ_{a'} exp(h(s,a'; θ))
where h(s,a; θ) = θ^T φ(s,a) (linear) or NeuralNet(s,a;θ) (deep).
Gaussian (continuous actions):
π(a|s; θ) = N(a; μ(s; θ_μ), σ²(s; θ_σ))
μ(s; θ) = θ_μ^T φ(s) ← mean action
σ(s; θ) = exp(θ_σ^T φ(s)) ← std dev (must be positive)
Properties of good parameterizations:
- Differentiable w.r.t. θ (needed for the gradient)
- Non-degenerate (the policy never collapses to deterministic unless that is optimal)
- Expressive enough to represent the optimal policy
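A minimal sketch of the two parameterizations above, assuming tabular (one-hot) features so h(s,a;θ) reduces to θ[a] for a fixed state; for linear softmax the score has the closed form ∇_θ log π(a|s) = φ(s,a) − Σ_b π(b|s)φ(s,b), checked here by finite differences:

```python
import math

def softmax_policy(theta):
    """pi(a) = exp(h_a) / sum_b exp(h_b), with h_a = theta[a] (one-hot features)."""
    m = max(theta)                         # subtract max for numerical stability
    e = [math.exp(t - m) for t in theta]
    z = sum(e)
    return [x / z for x in e]

def grad_log_softmax(theta, a):
    """Score for linear softmax: d/dtheta_i log pi(a) = 1[i=a] - pi(i)."""
    pi = softmax_policy(theta)
    return [(1.0 if i == a else 0.0) - pi[i] for i in range(len(theta))]

def gaussian_log_prob(a, mu, sigma):
    """log N(a; mu, sigma^2); sigma = exp(...) in the text keeps it positive."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (a - mu) ** 2 / (2 * sigma ** 2)

# Finite-difference check of the analytic score:
theta, a, eps = [0.5, -0.2, 1.1], 0, 1e-6
analytic = grad_log_softmax(theta, a)
for i in range(3):
    tp = theta[:]; tp[i] += eps
    tm = theta[:]; tm[i] -= eps
    numeric = (math.log(softmax_policy(tp)[a]) - math.log(softmax_policy(tm)[a])) / (2 * eps)
    assert abs(numeric - analytic[i]) < 1e-5
```

The `exp` in the σ head matters in practice: it maps an unconstrained parameter to a strictly positive standard deviation, so gradient ascent can never produce an invalid policy.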
The Policy Gradient Theorem
Objective (episodic):
J(θ) = V^{π_θ}(s_0) = E_π[G_0]
The policy gradient:
∇_θ J(θ) = E_π[G_t · ∇_θ log π(A_t|S_t; θ)]
Proof (episodic case):
∇_θ J(θ) = ∇_θ V^π(s_0)
= ∇_θ Σ_a π(a|s_0; θ) Q^π(s_0, a)
= Σ_a [∇_θ π(a|s_0; θ) Q^π(s_0, a) + π(a|s_0; θ) ∇_θ Q^π(s_0, a)]
Applying this recursively to ∇_θ Q^π (using the Bellman equation for Q^π):
∇_θ Q^π(s,a) = Σ_{s'} p(s'|s,a) ∇_θ V^π(s')
After unrolling recursion across all states:
∇_θ J(θ) = Σ_s d^π(s) Σ_a Q^π(s,a) ∇_θ π(a|s; θ)
where d^π(s) = Σ_{t=0}^∞ γ^t P(S_t=s | π) (discounted state visitation frequency).
Key identity (score function trick):
∇_θ π(a|s; θ) = π(a|s; θ) · ∇_θ log π(a|s; θ)
Substituting:
∇_θ J(θ) = Σ_s d^π(s) Σ_a π(a|s; θ) Q^π(s,a) ∇_θ log π(a|s; θ)
= E_π[Q^π(S_t, A_t) ∇_θ log π(A_t|S_t; θ)]
∎
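The theorem can be sanity-checked numerically on a hypothetical two-step chain MDP (s0 → s1 → terminal, γ = 1, deterministic transitions, tabular softmax policy, each state visited exactly once so d^π(s0) = d^π(s1) = 1), comparing Σ_s d^π(s) Σ_a Q^π(s,a) ∇π(a|s) against finite differences of J:

```python
import math

def softmax(xs):
    m = max(xs); e = [math.exp(x - m) for x in xs]; z = sum(e)
    return [x / z for x in e]

# Hypothetical toy MDP: two states, two actions, reward r[s][a], gamma = 1.
r = [[1.0, 0.0], [0.2, 0.8]]

def J(theta):
    """Expected return from s0; both states are visited exactly once."""
    pi0, pi1 = softmax(theta[0]), softmax(theta[1])
    return (sum(pi0[a] * r[0][a] for a in range(2)) +
            sum(pi1[b] * r[1][b] for b in range(2)))

theta = [[0.3, -0.1], [0.0, 0.5]]
pis = [softmax(theta[0]), softmax(theta[1])]
V1 = sum(pis[1][b] * r[1][b] for b in range(2))
Q = [[r[0][a] + V1 for a in range(2)],   # Q(s0,a) = r + V(s1)
     [r[1][b] for b in range(2)]]        # Q(s1,b) = r (terminal next)

# Policy gradient theorem with d(s0) = d(s1) = 1, using
# grad pi(a|s) w.r.t. theta[s][i] = pi(a|s) * (1[i=a] - pi(i|s)):
grad = [[sum(Q[s][a] * pis[s][a] * ((1.0 if i == a else 0.0) - pis[s][i])
             for a in range(2)) for i in range(2)] for s in range(2)]

# Agreement with numeric gradient of J:
eps = 1e-6
for s in range(2):
    for i in range(2):
        tp = [row[:] for row in theta]; tp[s][i] += eps
        tm = [row[:] for row in theta]; tm[s][i] -= eps
        assert abs((J(tp) - J(tm)) / (2 * eps) - grad[s][i]) < 1e-6
```

Note the V(s1) term inside Q(s0,·) contributes nothing here, since Σ_a ∇π(a|s) = 0; this is the same cancellation that later justifies baselines.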
The REINFORCE Algorithm
Policy gradient update with MC estimate of Q^π:
Replace Q^π(S_t, A_t) with the sampled return G_t:
θ_{t+1} = θ_t + α G_t ∇_θ log π(A_t|S_t; θ_t)
Algorithm: REINFORCE (Williams, 1992)
Initialize θ (policy parameters)
Loop (episodes):
Generate episode: S_0, A_0, R_1, ..., S_T following π(·|·; θ)
For each step t = 0, 1, ..., T-1:
G ← Σ_{k=t+1}^{T} γ^{k-t-1} R_k ← compute return from t onward
θ ← θ + α γ^t G ∇_θ log π(A_t|S_t; θ) ← policy gradient update
Properties:
- Unbiased: G_t is an unbiased estimate of Q^π(S_t, A_t)
- High variance: G_t depends on all future rewards (Monte Carlo)
- Requires complete episodes
- Converges to a local optimum of J(θ)
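The update loop above can be sketched end-to-end on a hypothetical 2-armed bandit (one step per episode, so G_t is just the reward and γ^t = 1), with a tabular softmax policy:

```python
import math, random

random.seed(0)
REWARD = [1.0, 0.0]          # arm 0 is better (deterministic toy rewards)
theta = [0.0, 0.0]           # softmax action preferences
alpha = 0.1

def pi(theta):
    m = max(theta); e = [math.exp(t - m) for t in theta]; z = sum(e)
    return [x / z for x in e]

for episode in range(2000):
    p = pi(theta)
    a = 0 if random.random() < p[0] else 1        # sample A_0 ~ pi(.|theta)
    G = REWARD[a]                                 # episode return
    # REINFORCE step: theta <- theta + alpha * G * grad log pi(a|theta),
    # with grad_i log pi(a) = 1[i=a] - pi(i) for tabular softmax
    for i in range(2):
        theta[i] += alpha * G * ((1.0 if i == a else 0.0) - p[i])

print(pi(theta))   # probability of the better arm approaches 1
```

Note the variance mechanism even in this tiny case: pulls of arm 1 yield G = 0 and contribute nothing, so all learning signal comes from arm-0 episodes.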
**Why ∇_θ log π?** This is the score function or log-probability gradient:
∇_θ log π(a|s; θ) = ∇_θ π(a|s; θ) / π(a|s; θ)
The update needs no model of the environment and no estimate of Q^π: only the sampled return G_t and ∇_θ log π, which depends on the policy alone.
REINFORCE with Baseline
Variance reduction: add a baseline b(S_t) that doesn’t depend on A_t:
θ ← θ + α (G_t - b(S_t)) ∇_θ log π(A_t|S_t; θ)
Baseline doesn’t introduce bias:
E_π[(G_t - b(S_t)) ∇_θ log π(A_t|S_t; θ)]
= E_π[G_t ∇_θ log π] - E_π[b(S_t) · E[∇_θ log π(A_t|S_t; θ) | S_t]]
= E_π[G_t ∇_θ log π] - 0
since, for any state s, E[∇_θ log π(A|s; θ)] = Σ_a π(a|s; θ) ∇_θ log π(a|s; θ) = Σ_a ∇_θ π(a|s; θ) = ∇_θ Σ_a π(a|s; θ) = ∇_θ 1 = 0.
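The zero-expectation identity can be verified exactly (by enumeration, no sampling) for a tabular softmax policy:

```python
import math

# Check: sum_a pi(a) * grad log pi(a) = sum_a grad pi(a) = grad(1) = 0.
theta = [0.7, -1.2, 0.4]
m = max(theta)
e = [math.exp(t - m) for t in theta]
pi = [x / sum(e) for x in e]

# grad_i log pi(a) = 1[i=a] - pi(i) for tabular softmax
expected_score = [sum(pi[a] * ((1.0 if i == a else 0.0) - pi[i])
                      for a in range(3)) for i in range(3)]
assert all(abs(g) < 1e-12 for g in expected_score)
```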
Optimal baseline (minimizes variance):
b*_i(s) = E[G_t (∇_{θ_i} log π)²] / E[(∇_{θ_i} log π)²] ← per component i of θ
Practical baseline: V^π(s), the state value. Then G_t - V^π(S_t) is a sample of
Q^π(S_t, A_t) - V^π(S_t) = A^π(S_t, A_t) ← the advantage function!
REINFORCE with V baseline:
θ ← θ + α A^π(S_t, A_t) ∇_θ log π(A_t|S_t; θ)
This requires learning V^π — motivates the actor-critic architecture.
Advantage Function
Advantage: how much better is action a than average in state s?
A^π(s, a) = Q^π(s, a) - V^π(s)
Properties:
- E_{a~π}[A^π(s, a)] = 0 (mean zero by definition, since E_{a~π}[Q^π(s, a)] = V^π(s))
- A^π(s, π(s)) = 0 for a deterministic policy π (taking the policy's own action is "average")
- A^π(s, a) > 0 means action a is better than the policy's current average behavior in s
Why use advantage over Q? Lower variance. Updating with A^π instead of Q^π:
- Does NOT change the direction of the gradient (same expectation)
- Significantly reduces variance (A^π is centered at 0; Q^π can be uniformly large)
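The variance reduction can be made concrete with exact (enumerated) moments on a hypothetical 2-armed bandit whose rewards are large but similar, comparing the estimator g = (G − b)·∇log π with b = 0 versus b = V:

```python
# Two arms, uniform tabular softmax policy, deterministic rewards ~10.
rewards = [10.0, 11.0]
pi = [0.5, 0.5]
V = sum(p * r for p, r in zip(pi, rewards))   # baseline b = V = E[G] = 10.5

def estimator_variance(baseline):
    """Exact component-wise variance of (G - b) * grad log pi over a ~ pi."""
    total = 0.0
    for i in range(2):
        gs = [(rewards[a] - baseline) * ((1.0 if i == a else 0.0) - pi[i])
              for a in range(2)]
        mean = sum(pi[a] * gs[a] for a in range(2))
        total += sum(pi[a] * (gs[a] - mean) ** 2 for a in range(2))
    return total

var_no_baseline = estimator_variance(0.0)   # large: scaled by rewards ~ 10
var_with_V = estimator_variance(V)          # same mean, far lower variance
assert var_with_V < var_no_baseline
```

With two arms the V baseline happens to remove the variance entirely; in general it only shrinks it, but the mean of the estimator is identical in both cases, which is exactly the unbiasedness argument above.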
The Policy Gradient Theorem: Continuing Case
For continuing tasks with average reward r̄:
∇_θ J(θ) = Σ_s μ^π(s) Σ_a A^π(s,a) ∇_θ π(a|s; θ)
= E_π[A^π(S_t, A_t) ∇_θ log π(A_t|S_t; θ)]
Same form as the episodic case; the discounted visitation d^π is replaced by the stationary distribution μ^π, and Q^π/A^π are the differential (average-reward) value functions.
Worked Example: Short Corridor Gridworld
Setup (S&B Example 13.1):
- 4 states (0, 1, 2, 3); state 3 is terminal
- In state 1 the actions are reversed (right moves left, left moves right): an "aliased" state
- Reward is -1 per step; goal: reach state 3 in as few steps as possible
Problem for deterministic policies: under function approximation, state 1 looks identical to state 0 (same corridor features), so the policy must act the same in both. If π(right|·) = 1, the agent bounces between states 0 and 1 forever (right in state 1 actually moves left); if π(left|·) = 1, it never leaves state 0.
Stochastic policy wins: with π(right|s; θ) = sigmoid(θ), the optimal θ gives π(right) ≈ 0.59: going right about 59% of the time trades off progress in the normal states against escaping the reversed state.
No deterministic policy over these aliased features can reach the goal at all; an ε-greedy policy reaches it but remains far from optimal, while policy gradient converges to the best stochastic policy.
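The ≈0.59 figure can be reproduced numerically, assuming the standard Example 13.1 dynamics (left at state 0 stays put, state 1 reversed, right at state 2 terminates, reward -1 per step): evaluate J(p) = v(state 0) for a fixed p = P(right) by iterating the Bellman equations, then scan p:

```python
def corridor_value(p, iters=2000):
    """Policy evaluation for the short corridor at a fixed p = P(right)."""
    v = [0.0, 0.0, 0.0]                        # v[s] for nonterminal states 0, 1, 2
    for _ in range(iters):
        v = [-1 + p * v[1] + (1 - p) * v[0],   # s0: right -> s1, left -> stay at s0
             -1 + p * v[0] + (1 - p) * v[2],   # s1: actions reversed
             -1 + p * 0.0 + (1 - p) * v[1]]    # s2: right -> terminal (value 0)
    return v[0]

# Scan p on a grid and pick the best stochastic policy:
best_p = max((p / 100 for p in range(5, 96)), key=corridor_value)
print(best_p, corridor_value(best_p))   # roughly 0.59, value about -11.7
```

The curve J(p) is sharply peaked in the interior of (0, 1) and drops toward -∞ at both endpoints, which is exactly why both deterministic policies fail here.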
Connection to DynamICCL
The policy gradient theorem is the foundation of PPO used in DynamICCL:
∇_θ J(θ) = E_π[Â_t ∇_θ log π(A_t|S_t; θ)]
DynamICCL: Â_t = GAE advantage for NCCL step t
∇_θ log π = gradient of policy network w.r.t. NCCL configuration logits
θ ← θ + α E[Â_t ∇_θ log π] (REINFORCE)
→ PPO adds clipping and value network (next file)
The stochastic policy π(NCCL_config|network_state; θ) is crucial — DynamICCL needs exploration, and a deterministic policy would get stuck in local optima.