Monte Carlo Methods: Summary and Comparison

Chapter 5 — Monte Carlo Methods

Book: Reinforcement Learning: An Introduction (Sutton & Barto, 2nd ed) Pages: 91–130


Summary of MC Methods

| Method | Description | Key Property |
| --- | --- | --- |
| First-visit MC prediction | Average G_t over the first visit to s in each episode | Unbiased, high variance |
| Every-visit MC prediction | Average G_t over all visits to s in each episode | Biased, lower variance |
| MC ES (exploring starts) | On-policy control with random (s, a) starts | Finds optimal policy; requires exploring starts |
| On-policy MC control | ε-greedy behavior = target; gradual improvement | Finds best ε-soft policy |
| Off-policy MC (OIS) | Target π, behavior b, ordinary IS correction | Unbiased but can have infinite variance |
| Off-policy MC (WIS) | Weighted IS (consistent) | Biased but finite variance; preferred |
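
As a concrete illustration of the first two rows, here is a minimal tabular prediction sketch (plain Python, not the book's pseudocode; the episode layout is an assumption):

    from collections import defaultdict

    def mc_predict(episodes, gamma=1.0, first_visit=True):
        """Tabular MC prediction. `episodes` is a list of trajectories, each a list of
        (state, reward) pairs where reward is the R_{t+1} received after leaving state."""
        returns_sum = defaultdict(float)
        returns_cnt = defaultdict(int)
        for episode in episodes:
            states = [s for s, _ in episode]
            G = 0.0
            # Walk backward so G accumulates the discounted return from each time step.
            for t in reversed(range(len(episode))):
                s, r = episode[t]
                G = gamma * G + r
                # First-visit: only record G if s does not occur earlier in the episode.
                if first_visit and s in states[:t]:
                    continue
                returns_sum[s] += G
                returns_cnt[s] += 1
        return {s: returns_sum[s] / returns_cnt[s] for s in returns_sum}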

MC vs DP: Key Differences

| Dimension | Dynamic Programming | Monte Carlo |
| --- | --- | --- |
| Model required? | Yes: full p(s', r \| s, a) | No: only experience |
| State coverage | All states (full sweeps) | Only visited states |
| Bootstrapping | Yes: V(s) depends on V(s') | No: uses actual return G_t |
| Bias | None (exact Bellman expectations) | None (G_t is the exact sampled return) |
| Variance | Low (exact expectations) | High (single-episode samples) |
| Episode end required? | No (updates step by step) | Yes (must wait for the complete return) |
| Convergence rate | Fast (exact model) | Slow (high variance) |

What MC Does NOT Do: Bootstrapping

Bootstrapping: using the current estimate V̂(S_{t+1}) as a proxy for the remaining return.

Benefits of not bootstrapping:
  1. No approximation error from V estimates (unbiased)
  2. No dependency on accurate V estimates at other states
  3. More robust to errors in the value function

Costs of not bootstrapping:
  1. High variance (G_t depends on many random steps)
  2. Slow propagation of information (credit assignment across long episodes is noisy)
  3. Must wait for the end of the episode before updating
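
To make the contrast concrete, here is a side-by-side sketch (plain Python, hypothetical data layouts: V is a dict of current estimates, model[(s, a)] a list of (prob, next_state, reward) triples) of the DP expected backup, the MC sampled return, and the bootstrapped TD target:

    def dp_target(V, model, s, a, gamma=1.0):
        # DP: expected one-step backup over all (s', r), using the model's probabilities.
        return sum(p * (r + gamma * V[s_next]) for p, s_next, r in model[(s, a)])

    def mc_target(rewards, gamma=1.0):
        # MC: the actual sampled return G_t; no value estimate appears (no bootstrap).
        G = 0.0
        for r in reversed(rewards):      # rewards = [R_{t+1}, ..., R_T] from one real episode
            G = gamma * G + r
        return G

    def td_target(r_next, s_next, V, gamma=1.0):
        # TD (Ch.6): one sampled transition plus a bootstrap from the current estimate.
        return r_next + gamma * V[s_next]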


Backup Depth Comparison

DP backup (1-step): uses model p, averages over all (s', r)
    ●
    ↓  (one step, all branches)
    ●  ●  ●  ← all possible next states

MC backup (full episode): uses actual trajectory
    ●
    ↓  (one trajectory, full episode)
    ↓
    ↓
    ↓
    ●  ← terminal state

MC does a “deep backup” — all the way to the end of the episode. TD (Ch.6) does a “shallow backup” like DP but with a sample (1 step), combining the advantages of both.


The n-Step Return Preview

The spectrum between MC (depth T) and TD (depth 1) is captured by n-step returns:

n=1  (TD):  G_t^{(1)} = R_{t+1} + γV(S_{t+1})
n=2:        G_t^{(2)} = R_{t+1} + γR_{t+2} + γ²V(S_{t+2})
...
n≥T-t (MC): G_t^{(n)} = R_{t+1} + γR_{t+2} + ... + γ^{T-t-1}R_T = G_t

TD(λ) (Ch.12) further interpolates with a geometric mixture:

G_t^λ = (1-λ) Σ_{n=1}^{∞} λ^{n-1} G_t^{(n)}

λ=0 recovers the one-step TD return; λ=1 recovers the full MC return. This mixture is the theoretical basis for GAE in PPO.
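
A small sketch of both formulas (assumed layout: rewards[t] = R_{t+1} and values[t] = V(S_t) for t = 0..T-1; the finite-episode form of the λ-return places the leftover weight on the full MC return):

    def n_step_return(rewards, values, t, n, gamma=1.0):
        T = len(rewards)
        G, end = 0.0, min(t + n, T)
        for k in range(t, end):
            G += gamma ** (k - t) * rewards[k]
        if end < T:                                # truncated before the terminal state:
            G += gamma ** (end - t) * values[end]  # bootstrap from V(S_{t+n})
        return G

    def lambda_return(rewards, values, t, gamma=1.0, lam=0.95):
        # Geometric mixture of n-step returns; the tail weight lam^(T-t-1)
        # goes to the full Monte Carlo return (n >= T-t).
        T = len(rewards)
        G_lam, weight_sum = 0.0, 0.0
        for n in range(1, T - t):
            w = (1 - lam) * lam ** (n - 1)
            G_lam += w * n_step_return(rewards, values, t, n, gamma)
            weight_sum += w
        G_lam += (1 - weight_sum) * n_step_return(rewards, values, t, T - t, gamma)
        return G_lam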


Convergence Conditions for MC

First-visit MC prediction converges to V^π(s) for every visited state s as the number of episodes → ∞. Proof sketch: the first-visit returns from s are i.i.d. samples with mean V^π(s), so by the law of large numbers their sample average converges to V^π(s).

MC control (both on-policy and off-policy) converges to Q* and π* under the following conditions:
  1. All (s, a) pairs are visited infinitely often (exploration requirement)
  2. The policy and the Q estimates are updated infinitely often
  3. The step sizes satisfy a Robbins-Monro schedule (Σ α_n = ∞, Σ α_n² < ∞) for incremental/online updates
(With a fixed ε, on-policy control converges only to the best ε-soft policy; reaching π* requires ε → 0 in the limit.)

Practical note: MC control convergence is slow in practice — needs many complete episodes. Modern methods (TD, PPO) converge faster by bootstrapping.
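
For condition 3, the classic choice is the running sample average, which is just an incremental MC update with step size 1/n (a sketch; the dict-based tabular layout is an assumption):

    def incremental_mc_update(V, counts, s, G):
        # alpha_n = 1/n reproduces the sample average and satisfies Robbins-Monro
        # (sum of alpha_n diverges, sum of alpha_n^2 is finite); a constant alpha
        # does not, so it tracks a nonstationary target instead of converging.
        counts[s] = counts.get(s, 0) + 1
        alpha = 1.0 / counts[s]
        V[s] = V.get(s, 0.0)
        V[s] += alpha * (G - V[s])
        return V[s]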


Variance Reduction Techniques

1. Baseline Subtraction

Instead of G_t, use G_t - b(S_t) for some baseline b(S_t).

If b(S_t) = V^π(S_t): E[G_t - V^π(S_t) | S_t] = 0, and conditioned on (S_t, A_t) the centered return has expectation Q^π(S_t, A_t) - V^π(S_t) = A^π(S_t, A_t), the advantage.

Variance reduction without bias:

Var[G_t - V^π(S_t)] = Var[G_t] + Var[V^π(S_t)] - 2 Cov[G_t, V^π(S_t)]
                    << Var[G_t]  when G_t and V^π(S_t) are strongly correlated,
                                 i.e. when the baseline accounts for most of the spread in the returns

This is exactly the role of the critic in actor-critic (Ch.13).
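
A toy numerical check of the claim (synthetic data, purely illustrative): returns are generated as value-plus-noise, so subtracting the true state value removes the across-state spread and only the per-trajectory noise remains.

    import random

    random.seed(0)
    true_V = {"A": 10.0, "B": -5.0, "C": 2.0}
    samples = [(s, true_V[s] + random.gauss(0, 1.0))         # G = V(s) + noise
               for s in random.choices(list(true_V), k=10000)]

    def var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    raw = [g for _, g in samples]                            # plain returns G_t
    centered = [g - true_V[s] for s, g in samples]           # advantage-style signal
    print(var(raw), var(centered))   # large across-state variance vs. noise-only variance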

2. Control Variates

Use additional correlated random variables to reduce variance. Doubly Robust estimators combine MC and model-based estimates:

V̂(s) = V_model(s) + IS-corrected [V_MC - V_model]
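
A trajectory-level sketch of that formula (hypothetical names; practical doubly robust estimators apply the correction per decision rather than per episode). Since E_b[ρ] = 1 and E_b[ρ G] = V^π, the model term cancels in expectation, so the estimate stays unbiased while the model absorbs variance.

    def dr_estimate(v_model, trajectories):
        """trajectories: list of (rho, G) pairs collected under the behavior policy,
        where rho is the trajectory importance ratio and G its Monte Carlo return;
        v_model is the model-based estimate V_model(s) for the start state."""
        n = len(trajectories)
        correction = sum(rho * (G - v_model) for rho, G in trajectories) / n
        return v_model + correction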

3. Stratified Sampling / RLOO

Leave-one-out MC baselines: for k parallel trajectories from the same start, use the mean return of the other k-1 as the baseline for each one (sketch below). A closely related group-relative baseline is used in GRPO (Group Relative Policy Optimization, DeepSeek's training method).
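
A sketch of the leave-one-out baseline for one group of k sampled returns (plain Python; GRPO's variant additionally normalizes by the group's standard deviation):

    def loo_advantages(returns):
        # Baseline for trajectory i = mean of the other k-1 returns in the group.
        k = len(returns)
        total = sum(returns)
        return [g - (total - g) / (k - 1) for g in returns]

    print(loo_advantages([1.0, 0.0, 0.0, 2.0]))   # [0.33..., -1.0, -1.0, 1.67...]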


Monte Carlo Tree Search (MCTS)

MCTS applies MC to model-based planning:

Loop (simulations):
  1. SELECT: tree policy (UCB) to choose path
  2. EXPAND: add new node to tree
  3. SIMULATE: rollout from new node using rollout policy (fast, often random)
  4. BACKPROPAGATE: update Q(s,a) along path with MC return

Used in: AlphaGo/AlphaZero, modern chess and Go engines. MCTS combines:
  - a model (the game rules) for tree search
  - MC simulation for leaf evaluation
  - UCB for tree exploration (a bandit algorithm!)
  - a deep network for position evaluation (replacing the random rollout)

MCTS is not in S&B Chapter 5 (covered in Ch.8), but conceptually belongs here.
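
A compact single-player sketch of the simulation loop above (the game interface game.actions(s) and game.step(s, a) -> (next_state, reward, done) is hypothetical; terminal handling and intermediate rewards during selection are simplified):

    import math, random

    class Node:
        def __init__(self, state):
            self.state = state
            self.children = {}        # action -> Node
            self.visits = 0
            self.value = 0.0          # running mean of MC returns through this node

    def ucb_select(node, c=1.4):
        # UCB1 over the children (each child has been visited at least once).
        return max(node.children.values(),
                   key=lambda ch: ch.value + c * math.sqrt(math.log(node.visits) / ch.visits))

    def rollout(game, state, max_depth=100):
        # SIMULATE: fast (here random) rollout policy until termination or a depth cap.
        total = 0.0
        for _ in range(max_depth):
            actions = game.actions(state)
            if not actions:
                break
            state, r, done = game.step(state, random.choice(actions))
            total += r
            if done:
                break
        return total

    def mcts(game, root_state, n_simulations=1000):
        root = Node(root_state)
        for _ in range(n_simulations):
            node, path = root, [root]
            # 1. SELECT: follow UCB while the current node is fully expanded.
            while node.children and len(node.children) == len(game.actions(node.state)):
                node = ucb_select(node)
                path.append(node)
            # 2. EXPAND: add one untried action as a new child (if any remain).
            untried = [a for a in game.actions(node.state) if a not in node.children]
            if untried:
                a = random.choice(untried)
                s_next, _, _ = game.step(node.state, a)
                node.children[a] = Node(s_next)
                node = node.children[a]
                path.append(node)
            # 3. SIMULATE: Monte Carlo rollout from the new node's state.
            G = rollout(game, node.state)
            # 4. BACKPROPAGATE: update visit counts and mean returns along the path.
            for n in path:
                n.visits += 1
                n.value += (G - n.value) / n.visits
        # Recommend the root action with the most visits.
        return max(root.children, key=lambda a: root.children[a].visits)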


Chapter 5 Key Takeaways

  1. MC is model-free: learns from experience (episodes), no dynamics model needed
  2. MC is unbiased: G_t is the exact return — no bootstrap approximation error
  3. MC has high variance: full episode returns depend on many random events
  4. On-policy MC control converges to optimal ε-soft policy
  5. Off-policy MC (with IS) enables learning about one policy while following another — but has exponential variance for long episodes
  6. WIS is better than OIS in practice (bounded variance at the cost of bias; it remains consistent)
  7. MC ↔︎ TD tradeoff: no bootstrap = unbiased but high variance; bootstrap = biased but low variance (Ch.6 + Ch.7)

Connection to DynamICCL

| MC Concept | DynamICCL Analog |
| --- | --- |
| Episode = complete rollout | Training job segment = NCCL rollout period |
| G_t = discounted return | Cumulative throughput improvement |
| MC control → greedy improvement | PPO gradient step toward higher-advantage actions |
| Off-policy IS | PPO clipped ratio ρ ∈ [1-ε, 1+ε] |
| Variance reduction via baseline | GAE advantage = G_t - V(s_t) |
| n-step returns | GAE with λ ∈ (0,1), interpolating between TD and MC |
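
To make the last two rows concrete, a minimal GAE sketch (assumed layout: rewards[t] = r_t and values has length T+1, with values[T] the bootstrap value, 0 if the episode terminated):

    def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
        # GAE: exponentially weighted sum of one-step TD errors.
        # lam = 0 gives the one-step TD error; lam = 1 gives the MC return minus V(s_t).
        T = len(rewards)
        adv, running = [0.0] * T, 0.0
        for t in reversed(range(T)):
            delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD error delta_t
            running = delta + gamma * lam * running
            adv[t] = running
        return adv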