Actor-Critic Methods and PPO
Chapter 13 — Policy Gradient Methods
Book: Reinforcement Learning: An Introduction (Sutton & Barto, 2nd ed) Pages: 331–348
Actor-Critic Architecture
REINFORCE uses the full MC return G_t as the gradient signal, which has high variance. Actor-critic: replace G_t with a bootstrapped TD target from a learned critic (value function), which also serves as the baseline.
Actor: π(a|s; θ) ← policy network (what to do)
Critic: V(s; w) ← value network (how good is state s)
Update rule (one-step actor-critic):
δ_t = R_{t+1} + γV(S_{t+1}; w) - V(S_t; w) ← TD error = advantage estimate
θ ← θ + α_θ δ_t ∇_θ log π(A_t|S_t; θ) ← actor update
w ← w + α_w δ_t ∇_w V(S_t; w) ← critic update (semi-gradient TD)
Why TD error ≈ Advantage?
δ_t = R_{t+1} + γV(S_{t+1}) - V(S_t)
E[δ_t | S_t, A_t] = E[R_{t+1} + γV^π(S_{t+1}) | S_t, A_t] - V^π(S_t) ← if V = V^π
= Q^π(S_t, A_t) - V^π(S_t) ← Bellman equation for Q^π
= A^π(S_t, A_t)
So the TD error is an unbiased estimate of the advantage when V = V^π; with an approximate critic it is a (slightly biased) low-variance estimate.
Algorithm: One-Step Actor-Critic
Initialize: θ (actor), w (critic)
Loop (episodes):
S ← initial state; I ← 1
Loop (steps):
A ← sample from π(·|S; θ)
Observe R, S'
δ ← R + γV(S'; w) - V(S; w) ← TD error
w ← w + α_w δ ∇_w V(S; w) ← critic update
θ ← θ + α_θ I δ ∇_θ log π(A|S; θ) ← actor update
I ← γI ← discount for full return (γ^t factor)
S ← S'
The factor I = γ^t comes from the discounted formulation of the policy gradient theorem (gradients from later steps are down-weighted by γ^t).
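A minimal Python sketch of the inner-loop update, assuming a linear critic and a linear-softmax actor over discrete actions; the function name, feature vectors, and shapes are illustrative assumptions, not the book's pseudocode:

```python
import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def one_step_actor_critic_update(theta, w, phi_s, phi_s_next, a, r,
                                 gamma, I, alpha_theta, alpha_w, done):
    """One inner-loop step of one-step actor-critic with linear features.

    theta: (n_actions, n_features) softmax-policy weights (actor)
    w:     (n_features,) value weights (critic)
    phi_s, phi_s_next: feature vectors for S and S'
    """
    # Critic values; bootstrap target is 0 at terminal states
    v_s = w @ phi_s
    v_next = 0.0 if done else w @ phi_s_next

    # TD error = advantage estimate
    delta = r + gamma * v_next - v_s

    # Critic update: semi-gradient TD(0)
    w = w + alpha_w * delta * phi_s

    # Actor update: ∇_θ log π for a linear-softmax policy is
    # (1[b=a] - π(b|S)) φ(S) for each action row b
    probs = softmax(theta @ phi_s)           # π(·|S; θ)
    grad_log = -np.outer(probs, phi_s)
    grad_log[a] += phi_s
    theta = theta + alpha_theta * I * delta * grad_log

    return theta, w, delta
```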
A2C / A3C Architecture
A3C (Asynchronous Advantage Actor-Critic, Mnih et al., 2016):
- Multiple parallel workers, each with its own environment copy
- Each worker: run n steps, compute n-step returns, update the global network
- Asynchronous: no locking; workers occasionally overwrite each other's updates, but this acts as implicit randomization
A2C (synchronous A3C):
- All workers synchronized; one update per collected batch
- More stable training; similar performance
A2C:
For each worker i = 1..k (in parallel):
Run n steps in environment; collect (s,a,r,s')
Compute n-step returns or GAE advantages
Aggregate all workers → one gradient update:
θ ← θ + α_θ E[Â_t ∇_θ log π(A_t|S_t;θ)]
w ← w + α_w E[(G_t - V(S_t;w)) ∇_w V(S_t;w)] ← gradient step on the critic MSE loss ½(G_t - V)²
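A sketch of how a synchronous A2C step could compute n-step returns and advantages per worker before the single aggregated update (NumPy; the worker data layout and names are assumptions):

```python
import numpy as np

def n_step_returns(rewards, bootstrap_value, gamma):
    """G_t = r_t + γ r_{t+1} + ... + γ^{n-t} V(s_n) for one rollout
    (no mid-rollout terminals, for brevity)."""
    G = np.zeros(len(rewards), dtype=np.float64)
    running = bootstrap_value                      # start from V(s_n)
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

def a2c_batch(rewards_per_worker, values_per_worker, bootstrap_values, gamma=0.99):
    """Stack n-step returns and advantages Â_t = G_t - V(s_t) from k synchronous workers."""
    returns, advantages = [], []
    for r, v, v_last in zip(rewards_per_worker, values_per_worker, bootstrap_values):
        G = n_step_returns(np.asarray(r, dtype=np.float64), v_last, gamma)
        returns.append(G)
        advantages.append(G - np.asarray(v, dtype=np.float64))
    # one synchronized gradient update uses the concatenated batch
    return np.concatenate(returns), np.concatenate(advantages)
```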
Proximal Policy Optimization (PPO)
PPO (Schulman et al., 2017) — one of the most important modern RL algorithms.
Problem with vanilla policy gradients: taking large gradient steps can overshoot, causing catastrophic performance drops. Policy gradient doesn’t bound how much the policy changes.
PPO-Clip: clip the probability ratio to prevent large updates.
PPO Objective
Define the probability ratio:
r_t(θ) = π(A_t|S_t; θ) / π(A_t|S_t; θ_old)
PPO-Clip objective:
L^{CLIP}(θ) = E_t [min(r_t(θ) Â_t, clip(r_t(θ), 1-ε, 1+ε) Â_t)]
Intuition:
- If Â_t > 0 (good action): increase π(A_t|S_t; θ)
- But if r_t > 1+ε (the policy has already increased it too much): clip → stop increasing
- If Â_t < 0 (bad action): decrease π(A_t|S_t; θ)
- But if r_t < 1-ε (the policy has already decreased it too much): clip → stop decreasing
Conservative update: clipping ensures the policy doesn’t change too much from π_old in either direction.
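A minimal sketch of the clipped surrogate on a batch of transitions (NumPy; array names and shapes are illustrative assumptions):

```python
import numpy as np

def ppo_clip_objective(log_prob_new, log_prob_old, advantages, eps=0.2):
    """Batch estimate of L^CLIP(θ) (to be maximized)."""
    ratio = np.exp(log_prob_new - log_prob_old)                  # r_t(θ)
    unclipped = ratio * advantages                               # r_t(θ) Â_t
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages  # clip(r_t, 1-ε, 1+ε) Â_t
    return np.minimum(unclipped, clipped).mean()                 # E_t[min(·, ·)]
```

Note that for Â_t > 0 the objective is constant once r_t exceeds 1+ε, so its gradient is zero and that sample stops pushing the policy further.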
Full PPO Loss
L^{PPO}(θ, w) = L^{CLIP}(θ) - c_1 L^{VF}(w) + c_2 S[π(·|S_t; θ)]
where:
L^{VF}(w) = E_t [(V(S_t; w) - V_t^{target})²] ← value loss
S[π] = H[π(·|s; θ)] = -Σ_a π(a|s) log π(a|s) ← entropy bonus
c_1 = 0.5, c_2 = 0.01 (typical)
Entropy bonus: encourages exploration by penalizing overconfident policies.
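Combining the three terms into the scalar loss that is actually minimized (a sketch for a discrete-action policy; `action_probs` holding π(·|s_t; θ) per state is an assumed input, not from the text):

```python
import numpy as np

def ppo_total_loss(log_prob_new, log_prob_old, advantages,
                   values_pred, value_targets, action_probs,
                   eps=0.2, c1=0.5, c2=0.01):
    """-L^CLIP + c1 * L^VF - c2 * S[π], i.e. the negation of L^PPO above."""
    ratio = np.exp(log_prob_new - log_prob_old)
    l_clip = np.minimum(ratio * advantages,
                        np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages).mean()
    l_vf = np.mean((values_pred - value_targets) ** 2)                      # value loss
    entropy = np.mean(-(action_probs * np.log(action_probs + 1e-8)).sum(axis=-1))
    return -l_clip + c1 * l_vf - c2 * entropy
```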
PPO Algorithm
Input: initial θ, w; clip ε = 0.2; K epochs; M minibatches
Loop (iterations):
1. COLLECT: run T steps with π(·; θ_old) in each of N parallel envs
Store (s_t, a_t, r_t, s_{t+1}, V(s_t; w_old), log π(a_t|s_t; θ_old))
2. COMPUTE ADVANTAGES: GAE(λ=0.95)
Â_t = Σ_{l=0}^{T-1-t} (γλ)^l δ_{t+l}, with δ_t the TD error under V(·; w_old) ← computed in a backward pass (see the sketch after the hyperparameter list)
3. UPDATE: for K epochs over all collected data (shuffled minibatches):
a. Compute r_t(θ) = exp(log π(a_t|s_t;θ) - log π(a_t|s_t;θ_old))
b. Compute L^{CLIP} using Â_t and r_t(θ)
c. Compute L^{VF} = (V(s_t;w) - V_t^{target})², with V_t^{target} = Â_t + V(s_t;w_old)
d. Compute L^{entropy} = H[π(·|s_t;θ)]
e. Loss = -L^{CLIP} + c_1 L^{VF} - c_2 L^{entropy}
f. Gradient step: θ, w ← θ, w - Adam(∇ Loss)
4. θ_old ← θ; w_old ← w
Key hyperparameters:
- ε = 0.2: clip ratio (0.1-0.3 typical)
- γ = 0.99: discount
- λ = 0.95: GAE
- K = 4-10: epochs per batch
- T = 128-2048: steps per rollout
- lr = 3×10^{-4}: Adam learning rate
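A sketch of step 2's backward-pass GAE computation for a single environment's rollout, under the hyperparameters above (the `dones` array marking episode ends is an assumed input):

```python
import numpy as np

def gae_advantages(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Truncated GAE(λ) over a length-T rollout.

    rewards, dones: length-T arrays from the rollout (dones[t]: s_{t+1} terminal)
    values:         length-T array of V(s_t; w_old)
    last_value:     V(s_T; w_old), used to bootstrap the final step
    """
    T = len(rewards)
    adv = np.zeros(T, dtype=np.float64)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(T)):                                 # backward pass
        nonterminal = 1.0 - float(dones[t])
        delta = rewards[t] + gamma * next_value * nonterminal - values[t]  # δ_t
        gae = delta + gamma * lam * nonterminal * gae            # Â_t = δ_t + γλ Â_{t+1}
        adv[t] = gae
        next_value = values[t]
    value_targets = adv + values                                 # V_t^target = Â_t + V(s_t)
    return adv, value_targets
```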
Why PPO Works
Conservative update principle: each update stays close to the old policy (ratio ∈ [1-ε, 1+ε]). This is an approximate version of TRPO’s KL constraint:
TRPO: max L(θ) s.t. E[KL(π_old || π)] ≤ δ
PPO: max E[min(rÂ, clip(r,1-ε,1+ε)Â)] ← first-order approx to TRPO
PPO approximates TRPO’s constraint without requiring expensive second-order KL computation.
Monotonic improvement guarantee (TRPO): maximizing the KL-penalized lower bound guarantees the policy does not get worse; the practical KL constraint, and PPO's clipping, only approximate this.
Multi-epoch reuse: after collecting T steps, PPO does K gradient epochs on the same data — unlike standard policy gradient (one update per episode). The clipping prevents the policy from going too far from π_old, making multi-epoch safe.
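A small sketch of the multi-epoch reuse: K shuffled passes of minibatch indices over the same rollout buffer, one gradient step per minibatch (function and parameter names are illustrative):

```python
import numpy as np

def ppo_minibatch_indices(n_samples, n_epochs=4, minibatch_size=64, seed=0):
    """Yield K epochs of shuffled minibatch index arrays over one rollout buffer.
    Each yielded array drives one gradient step; clipping keeps π close to π_old
    even though the same data is reused K times."""
    rng = np.random.default_rng(seed)
    for _ in range(n_epochs):
        idx = rng.permutation(n_samples)
        for start in range(0, n_samples, minibatch_size):
            yield idx[start:start + minibatch_size]

# usage (illustrative): for mb in ppo_minibatch_indices(len(advantages)):
#     evaluate the PPO loss on batch[mb] and take one Adam step
```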
Trust Region Methods (TRPO)
TRPO (Schulman et al., 2015) — the theoretical precursor to PPO:
max_θ L(θ) = E_t [π(A_t|S_t;θ)/π(A_t|S_t;θ_old) Â_t]
subject to: E_t[KL(π_old(·|S_t) || π(·|S_t;θ))] ≤ δ
Monotonic improvement bound (Kakade & Langford, 2002; extended to the KL form in TRPO):
J(π_new) ≥ L_{π_old}(π_new) - C · max_s KL(π_old(·|s) || π_new(·|s))
where L_{π_old}(π_new) = J(π_old) + Σ_s d^{π_old}(s) Σ_a π_new(a|s) A^{π_old}(s,a) is the surrogate objective above,
and C is a constant depending on γ and the maximum advantage magnitude.
Since L_{π_old}(π_old) = J(π_old) and the KL penalty is zero at π_old, maximizing this lower bound guarantees J(π_new) ≥ J(π_old).
TRPO is theoretically superior but computationally expensive (requires conjugate gradient for KL-constrained optimization). PPO is simpler and nearly as good.
Actor-Critic Comparison
| Method | Gradient signal | Advantage estimate | Notes |
|---|---|---|---|
| REINFORCE | G_t (MC) | None | High variance, unbiased |
| REINFORCE + baseline | G_t - b(s) | Approximate | Lower variance |
| 1-step AC | δ_t (1-step TD) | 1-step TD error | Low variance, biased |
| A2C/A3C | n-step G_t - V | n-step advantage | Middle ground |
| PPO | GAE Â_t | GAE(λ=0.95) | Best practical |
| SAC | Soft Q-value | Entropy-regularized | Off-policy, continuous |
Soft Actor-Critic (SAC)
SAC (Haarnoja et al., 2018): off-policy actor-critic with entropy regularization.
Objective: maximize reward + temperature × entropy:
J_SAC(π) = E_π[Σ_t γ^t (R_t + α H[π(·|S_t)])]
Soft Bellman equation:
V^soft(s) = Σ_a π(a|s) [Q^soft(s,a) - α log π(a|s)]
Q^soft(s,a) = E[R + γ V^soft(S') | s, a]
SAC alternates: update soft Q-network (off-policy with experience replay), update policy (maximize soft Q + entropy).
Benefits of SAC:
- Off-policy: reuses experience from a replay buffer (sample efficient)
- Automatic entropy tuning: α adjusted to maintain a target entropy
- State-of-the-art on continuous control benchmarks
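A sketch of the discrete-action soft Bellman targets from the two equations above (array names are illustrative; continuous-action SAC replaces the exact sum with samples from the policy):

```python
import numpy as np

def soft_state_value(q_values, probs, log_probs, alpha):
    """V^soft(s) = Σ_a π(a|s) [Q^soft(s,a) - α log π(a|s)]  (one state, discrete actions)."""
    return float(np.sum(probs * (q_values - alpha * log_probs)))

def soft_q_target(reward, next_soft_value, gamma, done):
    """Soft Bellman target for Q^soft(s,a): r + γ V^soft(s'), zero bootstrap at terminals."""
    return reward + gamma * (0.0 if done else next_soft_value)
```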
Chapter 13 Summary
- Policy gradient theorem: ∇J(θ) = E_π[Q^π(s,a) ∇_θ log π(a|s;θ)]
- REINFORCE: Monte Carlo estimate of Q^π; unbiased, high variance
- Baseline: subtracting V^π(s) from Q^π gives advantage A^π — reduces variance, no bias
- Actor-critic: TD-based critic (V or Q) provides low-variance advantage estimates
- A2C/A3C: parallel workers collect data; synchronous/asynchronous updates
- PPO: clipped ratio r_t ∈ [1-ε, 1+ε] → conservative, multi-epoch reuse; practical SOTA
- SAC: off-policy, entropy-regularized; best for continuous control
Connection to DynamICCL
DynamICCL uses PPO — this entire chapter is directly applicable:
PPO for NCCL Tuning (DynamICCL):
Actor: π(NCCL_config | network_state; θ_π) — policy network
Critic: V(network_state; θ_V) — value network
GAE: Â_t = GAE(γ=0.99, λ=0.95) on NCCL rollouts
Clip: ε = 0.2
Loss: L = -L^CLIP + 0.5 L^VF - 0.01 L^entropy
Training loop:
1. Run DL training job, collecting (state, NCCL_config, throughput) at each step
2. Compute GAE advantages
3. K=4 PPO epochs over collected data
4. Update policy → next iteration uses new NCCL selection strategy
SAC is an alternative if DynamICCL needs off-policy learning (reusing experience from many past training runs).