Policy Gradient Methods and the Policy Gradient Theorem
Chapter 13 — Policy Gradient Methods
Book: Reinforcement Learning: An Introduction (Sutton & Barto, 2nd ed) Pages: 321–346
Why Policy Gradients?
Value-based methods (Q-learning, SARSA): learn Q(s,a;θ), then derive the policy as argmax_a Q(s,a;θ). Drawbacks:
- Problem 1: deterministic policy: no natural way to express stochastic policies
- Problem 2: discontinuous improvement: a small change in Q can flip the argmax action entirely
- Problem 3: large action spaces: argmax over continuous actions is hard to compute
Policy gradient methods: directly parameterize the policy π(a|s; θ) and optimize θ via gradient ascent on J(θ) = E_π[G_t].
Benefits:
1. Stochastic policies: natural for exploration, partial observability, aliased states
2. Smooth updates: the policy changes continuously with θ, giving stable learning
3. Continuous actions: a Gaussian (or other parametric) policy handles continuous action spaces directly
4. Convergence to local optima: gradient ascent has well-defined convergence conditions
Policy Parameterization
Softmax (discrete actions):
π(a|s; θ) = exp(h(s,a; θ)) / Σ_{a'} exp(h(s,a'; θ))
where h(s,a; θ) = θ^T φ(s,a) (linear) or NeuralNet(s,a;θ) (deep).
Gaussian (continuous actions):
π(a|s; θ) = N(a; μ(s; θ_μ), σ²(s; θ_σ))
μ(s; θ) = θ_μ^T φ(s) ← mean action
σ(s; θ) = exp(θ_σ^T φ(s)) ← std dev (must be positive)
Properties of good parameterizations:
- Differentiable w.r.t. θ (needed for the gradient)
- Non-degenerate (the policy never collapses to deterministic unless that is optimal)
- Expressive enough to represent the optimal policy
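A minimal sketch of the two parameterizations above, assuming tabular (one-hot) features so h(s,a;θ) reduces to θ[a] for a fixed state; for linear softmax the score has the closed form ∇_θ log π(a|s) = φ(s,a) − Σ_b π(b|s)φ(s,b), checked here by finite differences:

```python
import math

def softmax_policy(theta):
    """pi(a) = exp(h_a) / sum_b exp(h_b), with h_a = theta[a] (one-hot features)."""
    m = max(theta)                         # subtract max for numerical stability
    e = [math.exp(t - m) for t in theta]
    z = sum(e)
    return [x / z for x in e]

def grad_log_softmax(theta, a):
    """Score for linear softmax: d/dtheta_i log pi(a) = 1[i=a] - pi(i)."""
    pi = softmax_policy(theta)
    return [(1.0 if i == a else 0.0) - pi[i] for i in range(len(theta))]

def gaussian_log_prob(a, mu, sigma):
    """log N(a; mu, sigma^2); sigma = exp(...) in the text keeps it positive."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (a - mu) ** 2 / (2 * sigma ** 2)

# Finite-difference check of the analytic score:
theta, a, eps = [0.5, -0.2, 1.1], 0, 1e-6
analytic = grad_log_softmax(theta, a)
for i in range(3):
    tp = theta[:]; tp[i] += eps
    tm = theta[:]; tm[i] -= eps
    numeric = (math.log(softmax_policy(tp)[a]) - math.log(softmax_policy(tm)[a])) / (2 * eps)
    assert abs(numeric - analytic[i]) < 1e-5
```

The `exp` in the σ head matters in practice: it maps an unconstrained parameter to a strictly positive standard deviation, so gradient ascent can never produce an invalid policy.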
The Policy Gradient Theorem
Objective (episodic):
J(θ) = V^{π_θ}(s_0) = E_π[G_0]
The policy gradient:
∇_θ J(θ) = E_π[G_t · ∇_θ log π(A_t|S_t; θ)]
Proof (episodic case):
∇_θ J(θ) = ∇_θ V^π(s_0)
= ∇_θ Σ_a π(a|s_0; θ) Q^π(s_0, a)
= Σ_a [∇_θ π(a|s_0; θ) Q^π(s_0, a) + π(a|s_0; θ) ∇_θ Q^π(s_0, a)]
Applying this recursively to ∇_θ Q^π (using the Bellman equation for Q^π):
∇_θ Q^π(s,a) = Σ_{s'} p(s'|s,a) ∇_θ V^π(s')
After unrolling recursion across all states:
∇_θ J(θ) = Σ_s d^π(s) Σ_a Q^π(s,a) ∇_θ π(a|s; θ)
where d^π(s) = Σ_{t=0}^∞ γ^t P(S_t=s | π) (discounted state visitation frequency).
Key identity (score function trick):
∇_θ π(a|s; θ) = π(a|s; θ) · ∇_θ log π(a|s; θ)
Substituting:
∇_θ J(θ) = Σ_s d^π(s) Σ_a π(a|s; θ) Q^π(s,a) ∇_θ log π(a|s; θ)
= E_π[Q^π(S_t, A_t) ∇_θ log π(A_t|S_t; θ)]
∎
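The theorem can be sanity-checked numerically on a hypothetical two-step chain MDP (s0 → s1 → terminal, γ = 1, deterministic transitions, tabular softmax policy, each state visited exactly once so d^π(s0) = d^π(s1) = 1), comparing Σ_s d^π(s) Σ_a Q^π(s,a) ∇π(a|s) against finite differences of J:

```python
import math

def softmax(xs):
    m = max(xs); e = [math.exp(x - m) for x in xs]; z = sum(e)
    return [x / z for x in e]

# Hypothetical toy MDP: two states, two actions, reward r[s][a], gamma = 1.
r = [[1.0, 0.0], [0.2, 0.8]]

def J(theta):
    """Expected return from s0; both states are visited exactly once."""
    pi0, pi1 = softmax(theta[0]), softmax(theta[1])
    return (sum(pi0[a] * r[0][a] for a in range(2)) +
            sum(pi1[b] * r[1][b] for b in range(2)))

theta = [[0.3, -0.1], [0.0, 0.5]]
pis = [softmax(theta[0]), softmax(theta[1])]
V1 = sum(pis[1][b] * r[1][b] for b in range(2))
Q = [[r[0][a] + V1 for a in range(2)],   # Q(s0,a) = r + V(s1)
     [r[1][b] for b in range(2)]]        # Q(s1,b) = r (terminal next)

# Policy gradient theorem with d(s0) = d(s1) = 1, using
# grad pi(a|s) w.r.t. theta[s][i] = pi(a|s) * (1[i=a] - pi(i|s)):
grad = [[sum(Q[s][a] * pis[s][a] * ((1.0 if i == a else 0.0) - pis[s][i])
             for a in range(2)) for i in range(2)] for s in range(2)]

# Agreement with numeric gradient of J:
eps = 1e-6
for s in range(2):
    for i in range(2):
        tp = [row[:] for row in theta]; tp[s][i] += eps
        tm = [row[:] for row in theta]; tm[s][i] -= eps
        assert abs((J(tp) - J(tm)) / (2 * eps) - grad[s][i]) < 1e-6
```

Note the V(s1) term inside Q(s0,·) contributes nothing here, since Σ_a ∇π(a|s) = 0; this is the same cancellation that later justifies baselines.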
The REINFORCE Algorithm
Policy gradient update with MC estimate of Q^π:
Replace Q^π(S_t, A_t) with the sampled return G_t:
θ_{t+1} = θ_t + α G_t ∇_θ log π(A_t|S_t; θ_t)
Algorithm: REINFORCE (Williams, 1992)
Initialize θ (policy parameters)
Loop (episodes):
Generate episode: S_0, A_0, R_1, ..., S_T following π(·|·; θ)
For each step t = 0, 1, ..., T-1:
G ← Σ_{k=t+1}^{T} γ^{k-t-1} R_k ← compute return from t onward
θ ← θ + α γ^t G ∇_θ log π(A_t|S_t; θ) ← policy gradient update
Properties:
- Unbiased: G_t is an unbiased estimate of Q^π(S_t, A_t)
- High variance: G_t depends on all future rewards (Monte Carlo)
- Requires complete episodes
- Converges to a local optimum of J(θ)
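The update loop above can be sketched end-to-end on a hypothetical 2-armed bandit (one step per episode, so G_t is just the reward and γ^t = 1), with a tabular softmax policy:

```python
import math, random

random.seed(0)
REWARD = [1.0, 0.0]          # arm 0 is better (deterministic toy rewards)
theta = [0.0, 0.0]           # softmax action preferences
alpha = 0.1

def pi(theta):
    m = max(theta); e = [math.exp(t - m) for t in theta]; z = sum(e)
    return [x / z for x in e]

for episode in range(2000):
    p = pi(theta)
    a = 0 if random.random() < p[0] else 1        # sample A_0 ~ pi(.|theta)
    G = REWARD[a]                                 # episode return
    # REINFORCE step: theta <- theta + alpha * G * grad log pi(a|theta),
    # with grad_i log pi(a) = 1[i=a] - pi(i) for tabular softmax
    for i in range(2):
        theta[i] += alpha * G * ((1.0 if i == a else 0.0) - p[i])

print(pi(theta))   # probability of the better arm approaches 1
```

Note the variance mechanism even in this tiny case: pulls of arm 1 yield G = 0 and contribute nothing, so all learning signal comes from arm-0 episodes.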
**Why ∇_θ log π?** This is the score function or log-probability gradient:
∇_θ log π(a|s; θ) = ∇_θ π(a|s; θ) / π(a|s; θ)
The update needs no model of the environment and no estimate of Q^π: only the sampled return G_t and ∇_θ log π, which depends on the policy alone.
REINFORCE with Baseline
Variance reduction: add a baseline b(S_t) that doesn’t depend on A_t:
θ ← θ + α (G_t - b(S_t)) ∇_θ log π(A_t|S_t; θ)
Baseline doesn’t introduce bias:
E_π[(G_t - b(S_t)) ∇_θ log π(A_t|S_t; θ)]
= E_π[G_t ∇_θ log π] - E_π[b(S_t) · E[∇_θ log π(A_t|S_t; θ) | S_t]]
= E_π[G_t ∇_θ log π] - 0
since, for any state s, E[∇_θ log π(A|s; θ)] = Σ_a π(a|s; θ) ∇_θ log π(a|s; θ) = Σ_a ∇_θ π(a|s; θ) = ∇_θ Σ_a π(a|s; θ) = ∇_θ 1 = 0.
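The zero-expectation identity can be verified exactly (by enumeration, no sampling) for a tabular softmax policy:

```python
import math

# Check: sum_a pi(a) * grad log pi(a) = sum_a grad pi(a) = grad(1) = 0.
theta = [0.7, -1.2, 0.4]
m = max(theta)
e = [math.exp(t - m) for t in theta]
pi = [x / sum(e) for x in e]

# grad_i log pi(a) = 1[i=a] - pi(i) for tabular softmax
expected_score = [sum(pi[a] * ((1.0 if i == a else 0.0) - pi[i])
                      for a in range(3)) for i in range(3)]
assert all(abs(g) < 1e-12 for g in expected_score)
```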
Optimal baseline (minimizes variance):
b*_i(s) = E[G_t (∇_{θ_i} log π)²] / E[(∇_{θ_i} log π)²] ← per component i of θ
Practical baseline: V^π(s), the state value. Then G_t - V^π(S_t) is a sample of
Q^π(S_t, A_t) - V^π(S_t) = A^π(S_t, A_t) ← the advantage function!
REINFORCE with V baseline:
θ ← θ + α A^π(S_t, A_t) ∇_θ log π(A_t|S_t; θ)
This requires learning V^π — motivates the actor-critic architecture.
Advantage Function
Advantage: how much better is action a than average in state s?
A^π(s, a) = Q^π(s, a) - V^π(s)
Properties:
- E_{a~π}[A^π(s, a)] = 0 (mean zero by definition, since E_{a~π}[Q^π(s, a)] = V^π(s))
- A^π(s, π(s)) = 0 for a deterministic policy π (taking the policy's own action is "average")
- A^π(s, a) > 0 means action a is better than the policy's current average behavior in s
Why use advantage over Q? Lower variance. Updating with A^π instead of Q^π:
- Does NOT change the direction of the gradient (same expectation)
- Significantly reduces variance (A^π is centered at 0; Q^π can be uniformly large)
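The variance reduction can be made concrete with exact (enumerated) moments on a hypothetical 2-armed bandit whose rewards are large but similar, comparing the estimator g = (G − b)·∇log π with b = 0 versus b = V:

```python
# Two arms, uniform tabular softmax policy, deterministic rewards ~10.
rewards = [10.0, 11.0]
pi = [0.5, 0.5]
V = sum(p * r for p, r in zip(pi, rewards))   # baseline b = V = E[G] = 10.5

def estimator_variance(baseline):
    """Exact component-wise variance of (G - b) * grad log pi over a ~ pi."""
    total = 0.0
    for i in range(2):
        gs = [(rewards[a] - baseline) * ((1.0 if i == a else 0.0) - pi[i])
              for a in range(2)]
        mean = sum(pi[a] * gs[a] for a in range(2))
        total += sum(pi[a] * (gs[a] - mean) ** 2 for a in range(2))
    return total

var_no_baseline = estimator_variance(0.0)   # large: scaled by rewards ~ 10
var_with_V = estimator_variance(V)          # same mean, far lower variance
assert var_with_V < var_no_baseline
```

With two arms the V baseline happens to remove the variance entirely; in general it only shrinks it, but the mean of the estimator is identical in both cases, which is exactly the unbiasedness argument above.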
The Policy Gradient Theorem: Continuing Case
For continuing tasks with average reward r̄:
∇_θ J(θ) = Σ_s μ^π(s) Σ_a A^π(s,a) ∇_θ π(a|s; θ)
= E_π[A^π(S_t, A_t) ∇_θ log π(A_t|S_t; θ)]
Same form as the episodic case; the discounted visitation d^π is replaced by the stationary distribution μ^π, and Q^π/A^π are the differential (average-reward) value functions.
Worked Example: Short Corridor Gridworld
Setup (S&B Example 13.1):
- 4 states (0, 1, 2, 3); state 3 is terminal
- In state 1 the actions are reversed (right moves left, left moves right): an "aliased" state
- Reward is -1 per step; goal: reach state 3 in as few steps as possible
Problem for deterministic policies: under function approximation, state 1 looks identical to state 0 (same corridor features), so the policy must act the same in both. If π(right|·) = 1, the agent bounces between states 0 and 1 forever (right in state 1 actually moves left); if π(left|·) = 1, it never leaves state 0.
Stochastic policy wins: with π(right|s; θ) = sigmoid(θ), the optimal θ gives π(right) ≈ 0.59: going right about 59% of the time trades off progress in the normal states against escaping the reversed state.
No deterministic policy over these aliased features can reach the goal at all; an ε-greedy policy reaches it but remains far from optimal, while policy gradient converges to the best stochastic policy.
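The ≈0.59 figure can be reproduced numerically, assuming the standard Example 13.1 dynamics (left at state 0 stays put, state 1 reversed, right at state 2 terminates, reward -1 per step): evaluate J(p) = v(state 0) for a fixed p = P(right) by iterating the Bellman equations, then scan p:

```python
def corridor_value(p, iters=2000):
    """Policy evaluation for the short corridor at a fixed p = P(right)."""
    v = [0.0, 0.0, 0.0]                        # v[s] for nonterminal states 0, 1, 2
    for _ in range(iters):
        v = [-1 + p * v[1] + (1 - p) * v[0],   # s0: right -> s1, left -> stay at s0
             -1 + p * v[0] + (1 - p) * v[2],   # s1: actions reversed
             -1 + p * 0.0 + (1 - p) * v[1]]    # s2: right -> terminal (value 0)
    return v[0]

# Scan p on a grid and pick the best stochastic policy:
best_p = max((p / 100 for p in range(5, 96)), key=corridor_value)
print(best_p, corridor_value(best_p))   # roughly 0.59, value about -11.7
```

The curve J(p) is sharply peaked in the interior of (0, 1) and drops toward -∞ at both endpoints, which is exactly why both deterministic policies fail here.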
Connection to DynamICCL
The policy gradient theorem is the foundation of PPO used in DynamICCL:
∇_θ J(θ) = E_π[Â_t ∇_θ log π(A_t|S_t; θ)]
DynamICCL: Â_t = GAE advantage for NCCL step t
∇_θ log π = gradient of policy network w.r.t. NCCL configuration logits
θ ← θ + α E[Â_t ∇_θ log π] (REINFORCE)
→ PPO adds clipping and value network (next file)
The stochastic policy π(NCCL_config|network_state; θ) is crucial — DynamICCL needs exploration, and a deterministic policy would get stuck in local optima.