Monte Carlo Methods: Summary and Comparison
Chapter 5 — Monte Carlo Methods
Book: Reinforcement Learning: An Introduction (Sutton & Barto, 2nd ed) Pages: 91–130
Summary of MC Methods
| Method | Description | Key Property |
|---|---|---|
| First-visit MC prediction | Average G_t for first visit to s per episode | Unbiased, high variance |
| Every-visit MC prediction | Average G_t for all visits to s per episode | Biased, lower variance |
| MC ES (exploring starts) | On-policy control; every episode starts from a randomly chosen (s, a) pair | Finds optimal policy; requires the exploring-starts assumption |
| On-policy MC control | ε-greedy behavior = target; gradual improvement | Finds best ε-soft policy |
| Off-policy MC (OIS) | Target π, behavior b, IS correction | Unbiased but can have infinite variance |
| Off-policy MC (WIS) | Weighted IS (consistent) | Biased but finite variance; preferred |
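For reference, here is a minimal first-visit MC prediction sketch in Python (not from the book; the Gym-like `env.reset()`/`env.step()` interface and the `policy(state)` callable are assumptions for illustration):

```python
from collections import defaultdict

def first_visit_mc_prediction(env, policy, num_episodes, gamma=0.99):
    """Estimate V^pi by averaging returns observed after the first visit to each state.

    Assumes a Gym-like env (reset() -> state, step(a) -> (state, reward, done, info))
    and a policy(state) -> action callable; both interfaces are illustrative.
    """
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    V = defaultdict(float)

    for _ in range(num_episodes):
        # Generate one complete episode following pi.
        episode = []
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done, _ = env.step(action)
            episode.append((state, reward))
            state = next_state

        # Record the first time each state appears in this episode.
        first_visit_index = {}
        for t, (s, _) in enumerate(episode):
            if s not in first_visit_index:
                first_visit_index[s] = t

        # Walk backwards accumulating G_t; update only on first visits.
        G = 0.0
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            G = r + gamma * G
            if first_visit_index[s] == t:
                returns_sum[s] += G
                returns_count[s] += 1
                V[s] = returns_sum[s] / returns_count[s]
    return V
```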
MC vs DP: Key Differences
| Dimension | Dynamic Programming | Monte Carlo |
|---|---|---|
| Model required? | Yes: full p(s',r|s,a) | No: only experience |
| State coverage | All states (full sweeps) | Only visited states |
| Bootstrapping | Yes: V(s) depends on V(s’) | No: uses actual return G_t |
| Bias | None (exact expected updates) | None for first-visit MC (G_t is an unbiased sample of V^π) |
| Variance | Low (exact expectations) | High (single-episode samples) |
| Episode end required? | No (updates step-by-step) | Yes (must reach a terminal state to compute G_t) |
| Convergence rate | Fast (exact model) | Slow (high variance) |
What MC Does NOT Do: Bootstrapping
Bootstrapping: using the current estimate V̂(S_{t+1}) as a proxy for the remaining return.
- DP: V^π(s) = Σ_a π(a|s) Σ_{s',r} p(s',r|s,a)[r + γV^π(s')] — uses the current estimate of V at successor states
- MC: estimates V(s) from G_t = R_{t+1} + γR_{t+2} + … + γ^{T-t-1}R_T — no estimate of V(s') needed
Benefits of not bootstrapping:
1. No approximation error from value estimates (targets are unbiased)
2. No dependency on accurate V estimates at other states
3. More robust to errors in the value function

Costs of not bootstrapping:
1. High variance (G_t depends on many random steps)
2. Slow propagation of information (credit assignment across long episodes is noisy)
3. Must wait for the episode to end before updating
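To make the "no bootstrapping" property concrete, here is a small sketch contrasting the full MC return with a one-step bootstrapped target for the same episode (variable names and the `values` array are illustrative assumptions):

```python
import numpy as np

def mc_returns(rewards, gamma=0.99):
    """Full MC returns G_t = R_{t+1} + gamma*R_{t+2} + ... for every t of one episode.
    No value estimates appear anywhere: this is the 'no bootstrapping' property."""
    G = 0.0
    returns = np.zeros(len(rewards))
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G
        returns[t] = G
    return returns

def td_targets(rewards, values, gamma=0.99):
    """One-step bootstrapped targets R_{t+1} + gamma*V(S_{t+1}) for contrast (Ch.6 preview).
    `values[t]` is the current estimate V(S_t); values has length len(rewards)+1,
    with values[-1] = 0 for the terminal state."""
    return np.array([rewards[t] + gamma * values[t + 1] for t in range(len(rewards))])
```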
Backup Depth Comparison
DP backup (1-step): uses model p, averages over all (s', r)
●
↓ (one step, all branches)
● ● ● ← all possible next states
MC backup (full episode): uses actual trajectory
●
↓ (one trajectory, full episode)
↓
↓
↓
● ← terminal state
MC does a “deep backup” that runs all the way to the end of the episode. TD (Ch.6) does a “shallow backup”: one step deep like DP, but with a single sampled transition instead of a full expectation over the model, combining advantages of both.
The n-Step Return Preview
The spectrum between MC (depth T) and TD (depth 1) is captured by n-step returns:
n=1 (TD): G_t^{(1)} = R_{t+1} + γV(S_{t+1})
n=2: G_t^{(2)} = R_{t+1} + γR_{t+2} + γ²V(S_{t+2})
...
n=T (MC): G_t^{(T)} = R_{t+1} + γR_{t+2} + ... + γ^{T-t-1}R_T
TD(λ) (Ch.12) further interpolates with a geometric mixture:
G_t^λ = (1-λ) Σ_{n=1}^{∞} λ^{n-1} G_t^{(n)}
λ=0: pure TD; λ=1: pure MC. This is the theoretical foundation for GAE (generalized advantage estimation) as used in PPO.
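A minimal sketch of the n-step return and the episodic λ-return (assumes `values` holds current state-value estimates with `values[-1] = 0` at the terminal state; function names are illustrative, not a reference implementation of GAE):

```python
def n_step_return(rewards, values, t, n, gamma=0.99):
    """G_t^{(n)} = R_{t+1} + ... + gamma^{n-1} R_{t+n} + gamma^n V(S_{t+n}),
    truncated at the terminal step. `values` has length len(rewards)+1, values[-1]=0."""
    T = len(rewards)
    end = min(t + n, T)
    G = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    if end < T:  # bootstrap only if we stopped before the terminal state
        G += gamma ** (end - t) * values[end]
    return G

def lambda_return(rewards, values, t, lam=0.95, gamma=0.99):
    """Episodic lambda-return: (1-lambda) * sum_{n=1}^{T-t-1} lambda^{n-1} G_t^{(n)}
    plus tail weight lambda^{T-t-1} on the full MC return."""
    T = len(rewards)
    G_lam = 0.0
    for n in range(1, T - t):
        G_lam += (1 - lam) * lam ** (n - 1) * n_step_return(rewards, values, t, n, gamma)
    G_lam += lam ** (T - t - 1) * n_step_return(rewards, values, t, T - t, gamma)
    return G_lam
```

With lam=0 this collapses to the one-step TD target, with lam=1 to the full MC return, matching the λ=0 / λ=1 endpoints above.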
Convergence Conditions for MC
First-visit MC prediction: converges to V^π as the number of episodes → ∞.
- Proof sketch: each first-visit return is an i.i.d., unbiased sample of V^π(s), so by the law of large numbers the sample mean converges to the true mean.
MC control (both on-policy and off-policy): converges to Q* and π* under the following conditions:
1. All (s, a) pairs are visited infinitely often (exploration requirement)
2. π and Q are updated infinitely often
3. An appropriate stepsize schedule is used (Robbins–Monro conditions for online updates)
Practical note: MC control convergence is slow in practice — needs many complete episodes. Modern methods (TD, PPO) converge faster by bootstrapping.
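As a concrete illustration, an incremental every-visit MC update with the α_n = 1/n schedule, which satisfies the Robbins–Monro conditions (a sketch; the dict-based `V` and `counts` are assumptions for illustration):

```python
def incremental_mc_update(V, counts, state, G):
    """Every-visit incremental MC update: V(s) <- V(s) + alpha_n * (G - V(s)),
    with alpha_n = 1/n(s). This schedule satisfies the Robbins-Monro conditions
    (sum alpha_n = infinity, sum alpha_n^2 < infinity)."""
    counts[state] = counts.get(state, 0) + 1   # n(s): number of updates for this state
    alpha = 1.0 / counts[state]
    v_old = V.get(state, 0.0)
    V[state] = v_old + alpha * (G - v_old)     # move the estimate toward the sampled return G
    return V[state]
```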
Variance Reduction Techniques
1. Baseline Subtraction
Instead of G_t, use G_t - b(S_t) for some baseline b(S_t).
If b(S_t) = V^π(S_t): conditioned on (S_t, A_t), E[G_t − V^π(S_t)] = Q^π(S_t, A_t) − V^π(S_t) = A^π(S_t, A_t), the advantage; and because the baseline depends only on the state, subtracting it introduces no bias.
Variance reduction without bias:
Var[G_t − V^π(S_t)] = Var[G_t] + Var[V^π(S_t)] − 2 Cov[G_t, V^π(S_t)] ≪ Var[G_t]
when the baseline is strongly correlated with the return (i.e., when V^π is a good approximation).
This is exactly the role of the critic in actor-critic (Ch.13).
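A sketch of baseline subtraction applied to MC returns (the `value_fn` stands in for a learned critic approximating V^π; names are illustrative):

```python
import numpy as np

def mc_advantages(rewards, states, value_fn, gamma=0.99):
    """A_t ~= G_t - V(S_t): subtract a state-dependent baseline from the full MC return.
    This does not bias a policy gradient, but cuts variance when value_fn is accurate."""
    G = 0.0
    advantages = np.zeros(len(rewards))
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G            # accumulate the MC return backwards
        advantages[t] = G - value_fn(states[t])
    return advantages
```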
2. Control Variates
Use additional correlated random variables to reduce variance. Doubly Robust estimators combine MC and model-based estimates:
V̂(s) = V_model(s) + IS-corrected [V_MC - V_model]
3. Stratified Sampling / RLOO
Leave-one-out MC baselines: for k parallel trajectories, use mean of others as baseline. Used in GRPO (Group Relative Policy Optimization — DeepSeek’s training method).
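A minimal leave-one-out baseline over k parallel rollouts from the same start state (a sketch of the idea, not GRPO's exact normalization):

```python
import numpy as np

def leave_one_out_advantages(returns):
    """For k >= 2 rollouts from the same start, baseline each return by the mean of the others:
    A_i = G_i - mean_{j != i} G_j. An unbiased baseline that needs no learned critic."""
    returns = np.asarray(returns, dtype=float)
    k = len(returns)
    total = returns.sum()
    return returns - (total - returns) / (k - 1)
```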
Monte Carlo Tree Search (MCTS)
MCTS applies MC to model-based planning:
Loop (simulations):
1. SELECT: tree policy (UCB) to choose path
2. EXPAND: add new node to tree
3. SIMULATE: rollout from new node using rollout policy (fast, often random)
4. BACKPROPAGATE: update Q(s,a) along path with MC return
Used in: AlphaGo/AlphaZero, modern chess engines. MCTS combines:
- a model (the game rules) for tree search
- MC simulation (rollouts) for leaf evaluation
- UCB for tree exploration (a bandit algorithm!)
- a deep network for position evaluation (replacing the random rollout)
MCTS is not in S&B Chapter 5 (covered in Ch.8), but conceptually belongs here.
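A compressed skeleton of the four-step loop above (the `Node` fields, the `expand` and `rollout` helpers, and the `model` interface are assumptions for illustration, not AlphaZero's actual API):

```python
import math

def mcts(root, model, num_simulations=100, c_ucb=1.4):
    """Skeleton of SELECT -> EXPAND -> SIMULATE -> BACKPROPAGATE.
    `root` is a Node with fields: state, children (dict action->Node), N, W, untried_actions.
    `model` and the expand/rollout helpers are assumed, illustrative interfaces."""
    for _ in range(num_simulations):
        node, path = root, [root]

        # 1. SELECT: descend with UCB until a node with untried actions (or a leaf).
        while not node.untried_actions and node.children:
            node = max(node.children.values(),
                       key=lambda c: c.W / (c.N + 1e-8)
                       + c_ucb * math.sqrt(math.log(node.N + 1) / (c.N + 1e-8)))
            path.append(node)

        # 2. EXPAND: add one child for an untried action.
        if node.untried_actions:
            action = node.untried_actions.pop()
            node = node.expand(action, model)   # assumed helper creating the child Node
            path.append(node)

        # 3. SIMULATE: rollout from the new node to a terminal state, returning the MC return.
        ret = rollout(node.state, model)        # assumed fast/random rollout helper

        # 4. BACKPROPAGATE: update visit counts and total value along the selected path.
        for n in path:
            n.N += 1
            n.W += ret
```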
Chapter 5 Key Takeaways
- MC is model-free: learns from experience (episodes), no dynamics model needed
- MC is unbiased: G_t is the exact return — no bootstrap approximation error
- MC has high variance: full episode returns depend on many random events
- On-policy MC control converges to optimal ε-soft policy
- Off-policy MC (with importance sampling) enables learning about one policy while following another, but the variance of the product of importance ratios can grow exponentially with episode length
- WIS is preferred over OIS in practice (finite variance at the cost of bias, while remaining consistent); see the sketch after this list
- MC ↔ TD tradeoff: no bootstrap = unbiased but high variance; bootstrap = biased but low variance (Ch.6 and Ch.7)
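A sketch of ordinary vs. weighted importance sampling for off-policy MC evaluation, assuming per-episode returns and IS ratio products have already been computed:

```python
import numpy as np

def ois_and_wis(returns, rhos):
    """Given per-episode returns G_i generated under behavior b and importance ratios
    rho_i = prod_t pi(A_t|S_t)/b(A_t|S_t), compare the two off-policy estimators:
      OIS: mean(rho_i * G_i)              (unbiased, possibly enormous variance)
      WIS: sum(rho_i * G_i) / sum(rho_i)  (biased but consistent, bounded weights)"""
    returns = np.asarray(returns, dtype=float)
    rhos = np.asarray(rhos, dtype=float)
    ois = np.mean(rhos * returns)
    wis = np.sum(rhos * returns) / np.sum(rhos) if np.sum(rhos) > 0 else 0.0
    return ois, wis
```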
Connection to DynamICCL
| MC Concept | DynamICCL Analog |
|---|---|
| Episode = complete rollout | Training job segment = NCCL rollout period |
| G_t = discounted return | Cumulative throughput improvement |
| MC control → greedy improvement | PPO gradient step toward higher-advantage actions |
| Off-policy IS | PPO clipped ratio ρ ∈ [1-ε, 1+ε] |
| Variance reduction via baseline | GAE advantage = G_t - V(s_t) |
| n-step returns | GAE with λ ∈ (0,1) — interpolates TD and MC |