Actor-Critic Methods and PPO
Chapter 13 — Policy Gradient Methods
Book: Reinforcement Learning: An Introduction (Sutton & Barto, 2nd ed) Pages: 331–348
Actor-Critic Architecture
REINFORCE uses the full MC return G_t as the gradient signal, which has high variance. Actor-critic: replace G_t with a bootstrapped TD target from a learned critic (value function), which also serves as the baseline.
Actor: π(a|s; θ) ← policy network (what to do)
Critic: V(s; w) ← value network (how good is state s)
Update rule (one-step actor-critic):
δ_t = R_{t+1} + γV(S_{t+1}; w) - V(S_t; w) ← TD error = advantage estimate
θ ← θ + α_θ δ_t ∇_θ log π(A_t|S_t; θ) ← actor update
w ← w + α_w δ_t ∇_w V(S_t; w) ← critic update (semi-gradient TD)
Why TD error ≈ Advantage?
δ_t = R_{t+1} + γV(S_{t+1}) - V(S_t)
E[δ_t | S_t, A_t] = E[R_{t+1} + γV^π(S_{t+1}) | S_t, A_t] - V^π(S_t) ← if V = V^π
= Q^π(S_t, A_t) - V^π(S_t) ← Bellman equation for Q^π
= A^π(S_t, A_t)
So the TD error is an unbiased estimate of the advantage when V = V^π; with an approximate critic it is a (slightly biased) low-variance estimate.
Algorithm: One-Step Actor-Critic
Initialize: θ (actor), w (critic)
Loop (episodes):
S ← initial state; I ← 1
Loop (steps):
A ← sample from π(·|S; θ)
Observe R, S'
δ ← R + γV(S'; w) - V(S; w) ← TD error
w ← w + α_w δ ∇_w V(S; w) ← critic update
θ ← θ + α_θ I δ ∇_θ log π(A|S; θ) ← actor update
I ← γI ← discount for full return (γ^t factor)
S ← S'
The factor I = γ^t comes from the discounted formulation of the policy gradient theorem (gradients from later steps are down-weighted by γ^t).
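A minimal Python sketch of the inner-loop update, assuming a linear critic and a linear-softmax actor over discrete actions; the function name, feature vectors, and shapes are illustrative assumptions, not the book's pseudocode:

```python
import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def one_step_actor_critic_update(theta, w, phi_s, phi_s_next, a, r,
                                 gamma, I, alpha_theta, alpha_w, done):
    """One inner-loop step of one-step actor-critic with linear features.

    theta: (n_actions, n_features) softmax-policy weights (actor)
    w:     (n_features,) value weights (critic)
    phi_s, phi_s_next: feature vectors for S and S'
    """
    # Critic values; bootstrap target is 0 at terminal states
    v_s = w @ phi_s
    v_next = 0.0 if done else w @ phi_s_next

    # TD error = advantage estimate
    delta = r + gamma * v_next - v_s

    # Critic update: semi-gradient TD(0)
    w = w + alpha_w * delta * phi_s

    # Actor update: ∇_θ log π for a linear-softmax policy is
    # (1[b=a] - π(b|S)) φ(S) for each action row b
    probs = softmax(theta @ phi_s)           # π(·|S; θ)
    grad_log = -np.outer(probs, phi_s)
    grad_log[a] += phi_s
    theta = theta + alpha_theta * I * delta * grad_log

    return theta, w, delta
```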
A2C / A3C Architecture
A3C (Asynchronous Advantage Actor-Critic, Mnih et al., 2016):
- Multiple parallel workers, each with its own environment copy
- Each worker: run n steps, compute n-step returns, update the global network
- Asynchronous: no locking; workers occasionally overwrite each other's updates, but this acts as implicit randomization
A2C (synchronous A3C):
- All workers synchronized; one update per collected batch
- More stable training; similar performance
A2C:
For each worker i = 1..k (in parallel):
Run n steps in environment; collect (s,a,r,s')
Compute n-step returns or GAE advantages
Aggregate all workers → one gradient update:
θ ← θ + α_θ E[Â_t ∇_θ log π(A_t|S_t;θ)]
w ← w + α_w E[(G_t - V(S_t;w)) ∇_w V(S_t;w)] ← gradient step on the critic MSE loss ½(G_t - V)²
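A sketch of how a synchronous A2C step could compute n-step returns and advantages per worker before the single aggregated update (NumPy; the worker data layout and names are assumptions):

```python
import numpy as np

def n_step_returns(rewards, bootstrap_value, gamma):
    """G_t = r_t + γ r_{t+1} + ... + γ^{n-t} V(s_n) for one rollout
    (no mid-rollout terminals, for brevity)."""
    G = np.zeros(len(rewards), dtype=np.float64)
    running = bootstrap_value                      # start from V(s_n)
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

def a2c_batch(rewards_per_worker, values_per_worker, bootstrap_values, gamma=0.99):
    """Stack n-step returns and advantages Â_t = G_t - V(s_t) from k synchronous workers."""
    returns, advantages = [], []
    for r, v, v_last in zip(rewards_per_worker, values_per_worker, bootstrap_values):
        G = n_step_returns(np.asarray(r, dtype=np.float64), v_last, gamma)
        returns.append(G)
        advantages.append(G - np.asarray(v, dtype=np.float64))
    # one synchronized gradient update uses the concatenated batch
    return np.concatenate(returns), np.concatenate(advantages)
```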
Proximal Policy Optimization (PPO)
PPO (Schulman et al., 2017) — one of the most important modern RL algorithms.
Problem with vanilla policy gradients: taking large gradient steps can overshoot, causing catastrophic performance drops. Policy gradient doesn’t bound how much the policy changes.
PPO-Clip: clip the probability ratio to prevent large updates.
PPO Objective
Define the probability ratio:
r_t(θ) = π(A_t|S_t; θ) / π(A_t|S_t; θ_old)
PPO-Clip objective:
L^{CLIP}(θ) = E_t [min(r_t(θ) Â_t, clip(r_t(θ), 1-ε, 1+ε) Â_t)]
Intuition:
- If Â_t > 0 (good action): increase π(A_t|S_t; θ)
- But if r_t > 1+ε (the policy has already increased it too much): clip → stop increasing
- If Â_t < 0 (bad action): decrease π(A_t|S_t; θ)
- But if r_t < 1-ε (the policy has already decreased it too much): clip → stop decreasing
Conservative update: clipping ensures the policy doesn’t change too much from π_old in either direction.
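A minimal sketch of the clipped surrogate on a batch of transitions (NumPy; array names and shapes are illustrative assumptions):

```python
import numpy as np

def ppo_clip_objective(log_prob_new, log_prob_old, advantages, eps=0.2):
    """Batch estimate of L^CLIP(θ) (to be maximized)."""
    ratio = np.exp(log_prob_new - log_prob_old)                  # r_t(θ)
    unclipped = ratio * advantages                               # r_t(θ) Â_t
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages  # clip(r_t, 1-ε, 1+ε) Â_t
    return np.minimum(unclipped, clipped).mean()                 # E_t[min(·, ·)]
```

Note that for Â_t > 0 the objective is constant once r_t exceeds 1+ε, so its gradient is zero and that sample stops pushing the policy further.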
Full PPO Loss
L^{PPO}(θ, w) = L^{CLIP}(θ) - c_1 L^{VF}(w) + c_2 S[π(·|S_t; θ)]
where:
L^{VF}(w) = E_t [(V(S_t; w) - V_t^{target})²] ← value loss
S[π] = H[π(·|s; θ)] = -Σ_a π(a|s) log π(a|s) ← entropy bonus
c_1 = 0.5, c_2 = 0.01 (typical)
Entropy bonus: encourages exploration by penalizing overconfident policies.
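Combining the three terms into the scalar loss that is actually minimized (a sketch for a discrete-action policy; `action_probs` holding π(·|s_t; θ) per state is an assumed input, not from the text):

```python
import numpy as np

def ppo_total_loss(log_prob_new, log_prob_old, advantages,
                   values_pred, value_targets, action_probs,
                   eps=0.2, c1=0.5, c2=0.01):
    """-L^CLIP + c1 * L^VF - c2 * S[π], i.e. the negation of L^PPO above."""
    ratio = np.exp(log_prob_new - log_prob_old)
    l_clip = np.minimum(ratio * advantages,
                        np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages).mean()
    l_vf = np.mean((values_pred - value_targets) ** 2)                      # value loss
    entropy = np.mean(-(action_probs * np.log(action_probs + 1e-8)).sum(axis=-1))
    return -l_clip + c1 * l_vf - c2 * entropy
```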
PPO Algorithm
Input: initial θ, w; clip ε = 0.2; K epochs; M minibatches
Loop (iterations):
1. COLLECT: run T steps with π(·; θ_old) in each of N parallel envs
Store (s_t, a_t, r_t, s_{t+1}, V(s_t; w_old), log π(a_t|s_t; θ_old))
2. COMPUTE ADVANTAGES: GAE(λ=0.95)
Â_t = Σ_{l=0}^{T-1-t} (γλ)^l δ_{t+l}, with δ_t the TD error under V(·; w_old) ← computed in a backward pass (see the sketch after the hyperparameter list)
3. UPDATE: for K epochs over all collected data (shuffled minibatches):
a. Compute r_t(θ) = exp(log π(a_t|s_t;θ) - log π(a_t|s_t;θ_old))
b. Compute L^{CLIP} using Â_t and r_t(θ)
c. Compute L^{VF} = (V(s_t;w) - V_t^{target})², with V_t^{target} = Â_t + V(s_t;w_old)
d. Compute L^{entropy} = H[π(·|s_t;θ)]
e. Loss = -L^{CLIP} + c_1 L^{VF} - c_2 L^{entropy}
f. Gradient step: θ, w ← θ, w - Adam(∇ Loss)
4. θ_old ← θ; w_old ← w
Key hyperparameters:
- ε = 0.2: clip ratio (0.1-0.3 typical)
- γ = 0.99: discount
- λ = 0.95: GAE
- K = 4-10: epochs per batch
- T = 128-2048: steps per rollout
- lr = 3×10^{-4}: Adam learning rate
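A sketch of step 2's backward-pass GAE computation for a single environment's rollout, under the hyperparameters above (the `dones` array marking episode ends is an assumed input):

```python
import numpy as np

def gae_advantages(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Truncated GAE(λ) over a length-T rollout.

    rewards, dones: length-T arrays from the rollout (dones[t]: s_{t+1} terminal)
    values:         length-T array of V(s_t; w_old)
    last_value:     V(s_T; w_old), used to bootstrap the final step
    """
    T = len(rewards)
    adv = np.zeros(T, dtype=np.float64)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(T)):                                 # backward pass
        nonterminal = 1.0 - float(dones[t])
        delta = rewards[t] + gamma * next_value * nonterminal - values[t]  # δ_t
        gae = delta + gamma * lam * nonterminal * gae            # Â_t = δ_t + γλ Â_{t+1}
        adv[t] = gae
        next_value = values[t]
    value_targets = adv + values                                 # V_t^target = Â_t + V(s_t)
    return adv, value_targets
```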
Why PPO Works
Conservative update principle: each update stays close to the old policy (ratio ∈ [1-ε, 1+ε]). This is an approximate version of TRPO’s KL constraint:
TRPO: max L(θ) s.t. E[KL(π_old || π)] ≤ δ
PPO: max E[min(rÂ, clip(r,1-ε,1+ε)Â)] ← first-order approx to TRPO
PPO approximates TRPO’s constraint without requiring expensive second-order KL computation.
Monotonic improvement guarantee (TRPO): maximizing the KL-penalized lower bound guarantees the policy does not get worse; the practical KL constraint, and PPO's clipping, only approximate this.
Multi-epoch reuse: after collecting T steps, PPO does K gradient epochs on the same data — unlike standard policy gradient (one update per episode). The clipping prevents the policy from going too far from π_old, making multi-epoch safe.
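A small sketch of the multi-epoch reuse: K shuffled passes of minibatch indices over the same rollout buffer, one gradient step per minibatch (function and parameter names are illustrative):

```python
import numpy as np

def ppo_minibatch_indices(n_samples, n_epochs=4, minibatch_size=64, seed=0):
    """Yield K epochs of shuffled minibatch index arrays over one rollout buffer.
    Each yielded array drives one gradient step; clipping keeps π close to π_old
    even though the same data is reused K times."""
    rng = np.random.default_rng(seed)
    for _ in range(n_epochs):
        idx = rng.permutation(n_samples)
        for start in range(0, n_samples, minibatch_size):
            yield idx[start:start + minibatch_size]

# usage (illustrative): for mb in ppo_minibatch_indices(len(advantages)):
#     evaluate the PPO loss on batch[mb] and take one Adam step
```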
Trust Region Methods (TRPO)
TRPO (Schulman et al., 2015) — the theoretical precursor to PPO:
max_θ L(θ) = E_t [π(A_t|S_t;θ)/π(A_t|S_t;θ_old) Â_t]
subject to: E_t[KL(π_old(·|S_t) || π(·|S_t;θ))] ≤ δ
Monotonic improvement bound (Kakade & Langford, 2002; extended to the KL form in TRPO):
J(π_new) ≥ L_{π_old}(π_new) - C · max_s KL(π_old(·|s) || π_new(·|s))
where L_{π_old}(π_new) = J(π_old) + Σ_s d^{π_old}(s) Σ_a π_new(a|s) A^{π_old}(s,a) is the surrogate objective above,
and C is a constant depending on γ and the maximum advantage magnitude.
Since L_{π_old}(π_old) = J(π_old) and the KL penalty is zero at π_old, maximizing this lower bound guarantees J(π_new) ≥ J(π_old).
TRPO is theoretically superior but computationally expensive (requires conjugate gradient for KL-constrained optimization). PPO is simpler and nearly as good.
Actor-Critic Comparison
| Method | Gradient signal | Advantage estimate | Notes |
|---|---|---|---|
| REINFORCE | G_t (MC) | None | High variance, unbiased |
| REINFORCE + baseline | G_t - b(s) | Approximate | Lower variance |
| 1-step AC | δ_t (1-step TD) | 1-step TD error | Low variance, biased |
| A2C/A3C | n-step G_t - V | n-step advantage | Middle ground |
| PPO | GAE Â_t | GAE(λ=0.95) | Best practical |
| SAC | Soft Q-value | Entropy-regularized | Off-policy, continuous |
Soft Actor-Critic (SAC)
SAC (Haarnoja et al., 2018): off-policy actor-critic with entropy regularization.
Objective: maximize reward + temperature × entropy:
J_SAC(π) = E_π[Σ_t γ^t (R_t + α H[π(·|S_t)])]
Soft Bellman equation:
V^soft(s) = Σ_a π(a|s) [Q^soft(s,a) - α log π(a|s)]
Q^soft(s,a) = E[R + γ V^soft(S') | s, a]
SAC alternates: update soft Q-network (off-policy with experience replay), update policy (maximize soft Q + entropy).
Benefits of SAC:
- Off-policy: reuses experience from a replay buffer (sample efficient)
- Automatic entropy tuning: α adjusted to maintain a target entropy
- State-of-the-art on continuous control benchmarks
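A sketch of the discrete-action soft Bellman targets from the two equations above (array names are illustrative; continuous-action SAC replaces the exact sum with samples from the policy):

```python
import numpy as np

def soft_state_value(q_values, probs, log_probs, alpha):
    """V^soft(s) = Σ_a π(a|s) [Q^soft(s,a) - α log π(a|s)]  (one state, discrete actions)."""
    return float(np.sum(probs * (q_values - alpha * log_probs)))

def soft_q_target(reward, next_soft_value, gamma, done):
    """Soft Bellman target for Q^soft(s,a): r + γ V^soft(s'), zero bootstrap at terminals."""
    return reward + gamma * (0.0 if done else next_soft_value)
```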
Chapter 13 Summary
- Policy gradient theorem: ∇J(θ) = E_π[Q^π(s,a) ∇_θ log π(a|s;θ)]
- REINFORCE: Monte Carlo estimate of Q^π; unbiased, high variance
- Baseline: subtracting V^π(s) from Q^π gives advantage A^π — reduces variance, no bias
- Actor-critic: TD-based critic (V or Q) provides low-variance advantage estimates
- A2C/A3C: parallel workers collect data; synchronous/asynchronous updates
- PPO: clipped ratio r_t ∈ [1-ε, 1+ε] → conservative, multi-epoch reuse; practical SOTA
- SAC: off-policy, entropy-regularized; best for continuous control
Connection to DynamICCL
DynamICCL uses PPO — this entire chapter is directly applicable:
PPO for NCCL Tuning (DynamICCL):
Actor: π(NCCL_config | network_state; θ_π) — policy network
Critic: V(network_state; θ_V) — value network
GAE: Â_t = GAE(γ=0.99, λ=0.95) on NCCL rollouts
Clip: ε = 0.2
Loss: L = -L^CLIP + 0.5 L^VF - 0.01 L^entropy
Training loop:
1. Run DL training job, collecting (state, NCCL_config, throughput) at each step
2. Compute GAE advantages
3. K=4 PPO epochs over collected data
4. Update policy → next iteration uses new NCCL selection strategy
SAC is an alternative if DynamICCL needs off-policy learning (reusing experience from many past training runs).