Monte Carlo Methods: Summary and Comparison
Chapter 5 — Monte Carlo Methods
Book: Reinforcement Learning: An Introduction (Sutton & Barto, 2nd ed) Pages: 91–130
Summary of MC Methods
| Method | Description | Key Property |
|---|---|---|
| First-visit MC prediction | Average G_t for first visit to s per episode | Unbiased, high variance |
| Every-visit MC prediction | Average G_t for all visits to s per episode | Biased, lower variance |
| MC ES (exploring starts) | On-policy control; every episode starts from a randomly chosen (s, a) pair | Finds optimal policy; requires the exploring-starts assumption |
| On-policy MC control | ε-greedy behavior = target; gradual improvement | Finds best ε-soft policy |
| Off-policy MC (OIS) | Target π, behavior b, IS correction | Unbiased but can have infinite variance |
| Off-policy MC (WIS) | Weighted IS (consistent) | Biased but finite variance; preferred |
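For reference, here is a minimal first-visit MC prediction sketch in Python (not from the book; the Gym-like `env.reset()`/`env.step()` interface and the `policy(state)` callable are assumptions for illustration):

```python
from collections import defaultdict

def first_visit_mc_prediction(env, policy, num_episodes, gamma=0.99):
    """Estimate V^pi by averaging returns observed after the first visit to each state.

    Assumes a Gym-like env (reset() -> state, step(a) -> (state, reward, done, info))
    and a policy(state) -> action callable; both interfaces are illustrative.
    """
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    V = defaultdict(float)

    for _ in range(num_episodes):
        # Generate one complete episode following pi.
        episode = []
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done, _ = env.step(action)
            episode.append((state, reward))
            state = next_state

        # Record the first time each state appears in this episode.
        first_visit_index = {}
        for t, (s, _) in enumerate(episode):
            if s not in first_visit_index:
                first_visit_index[s] = t

        # Walk backwards accumulating G_t; update only on first visits.
        G = 0.0
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            G = r + gamma * G
            if first_visit_index[s] == t:
                returns_sum[s] += G
                returns_count[s] += 1
                V[s] = returns_sum[s] / returns_count[s]
    return V
```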
MC vs DP: Key Differences
| Dimension | Dynamic Programming | Monte Carlo |
|---|---|---|
| Model required? | Yes: full p(s',r|s,a) | No: only experience |
| State coverage | All states (full sweeps) | Only visited states |
| Bootstrapping | Yes: V(s) depends on V(s’) | No: uses actual return G_t |
| Bias | None (exact expected updates) | None for first-visit MC (G_t is an unbiased sample of V^π) |
| Variance | Low (exact expectations) | High (single-episode samples) |
| Episode end required? | No (updates step-by-step) | Yes (must reach a terminal state to compute G_t) |
| Convergence rate | Fast (exact model) | Slow (high variance) |
What MC Does NOT Do: Bootstrapping
Bootstrapping: using the current estimate V̂(S_{t+1}) as a proxy for the remaining return.
- DP: V^π(s) = Σ_a π(a|s) Σ_{s',r} p(s',r|s,a)[r + γV^π(s')] — uses the current estimate of V at successor states
- MC: estimates V(s) from G_t = R_{t+1} + γR_{t+2} + … + γ^{T-t-1}R_T — no estimate of V(s') needed
Benefits of not bootstrapping:
1. No approximation error from value estimates (targets are unbiased)
2. No dependency on accurate V estimates at other states
3. More robust to errors in the value function

Costs of not bootstrapping:
1. High variance (G_t depends on many random steps)
2. Slow propagation of information (credit assignment across long episodes is noisy)
3. Must wait for the episode to end before updating
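To make the "no bootstrapping" property concrete, here is a small sketch contrasting the full MC return with a one-step bootstrapped target for the same episode (variable names and the `values` array are illustrative assumptions):

```python
import numpy as np

def mc_returns(rewards, gamma=0.99):
    """Full MC returns G_t = R_{t+1} + gamma*R_{t+2} + ... for every t of one episode.
    No value estimates appear anywhere: this is the 'no bootstrapping' property."""
    G = 0.0
    returns = np.zeros(len(rewards))
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G
        returns[t] = G
    return returns

def td_targets(rewards, values, gamma=0.99):
    """One-step bootstrapped targets R_{t+1} + gamma*V(S_{t+1}) for contrast (Ch.6 preview).
    `values[t]` is the current estimate V(S_t); values has length len(rewards)+1,
    with values[-1] = 0 for the terminal state."""
    return np.array([rewards[t] + gamma * values[t + 1] for t in range(len(rewards))])
```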
Backup Depth Comparison
DP backup (1-step): uses model p, averages over all (s', r)
●
↓ (one step, all branches)
● ● ● ← all possible next states
MC backup (full episode): uses actual trajectory
●
↓ (one trajectory, full episode)
↓
↓
↓
● ← terminal state
MC does a “deep backup” that runs all the way to the end of the episode. TD (Ch.6) does a “shallow backup”: one step deep like DP, but with a single sampled transition instead of a full expectation over the model, combining advantages of both.
The n-Step Return Preview
The spectrum between MC (depth T) and TD (depth 1) is captured by n-step returns:
n=1 (TD): G_t^{(1)} = R_{t+1} + γV(S_{t+1})
n=2: G_t^{(2)} = R_{t+1} + γR_{t+2} + γ²V(S_{t+2})
...
n=T (MC): G_t^{(T)} = R_{t+1} + γR_{t+2} + ... + γ^{T-t-1}R_T
TD(λ) (Ch.12) further interpolates with a geometric mixture:
G_t^λ = (1-λ) Σ_{n=1}^{∞} λ^{n-1} G_t^{(n)}
λ=0: pure TD; λ=1: pure MC. This is the theoretical foundation for GAE (generalized advantage estimation) as used in PPO.
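A minimal sketch of the n-step return and the episodic λ-return (assumes `values` holds current state-value estimates with `values[-1] = 0` at the terminal state; function names are illustrative, not a reference implementation of GAE):

```python
def n_step_return(rewards, values, t, n, gamma=0.99):
    """G_t^{(n)} = R_{t+1} + ... + gamma^{n-1} R_{t+n} + gamma^n V(S_{t+n}),
    truncated at the terminal step. `values` has length len(rewards)+1, values[-1]=0."""
    T = len(rewards)
    end = min(t + n, T)
    G = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    if end < T:  # bootstrap only if we stopped before the terminal state
        G += gamma ** (end - t) * values[end]
    return G

def lambda_return(rewards, values, t, lam=0.95, gamma=0.99):
    """Episodic lambda-return: (1-lambda) * sum_{n=1}^{T-t-1} lambda^{n-1} G_t^{(n)}
    plus tail weight lambda^{T-t-1} on the full MC return."""
    T = len(rewards)
    G_lam = 0.0
    for n in range(1, T - t):
        G_lam += (1 - lam) * lam ** (n - 1) * n_step_return(rewards, values, t, n, gamma)
    G_lam += lam ** (T - t - 1) * n_step_return(rewards, values, t, T - t, gamma)
    return G_lam
```

With lam=0 this collapses to the one-step TD target, with lam=1 to the full MC return, matching the λ=0 / λ=1 endpoints above.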
Convergence Conditions for MC
First-visit MC prediction: converges to V^π as the number of episodes → ∞.
- Proof sketch: each first-visit return is an i.i.d., unbiased sample of V^π(s), so by the law of large numbers the sample mean converges to the true mean.
MC control (both on-policy and off-policy): converges to Q* and π* under the following conditions:
1. All (s, a) pairs are visited infinitely often (exploration requirement)
2. π and Q are updated infinitely often
3. An appropriate stepsize schedule is used (Robbins–Monro conditions for online updates)
Practical note: MC control convergence is slow in practice — needs many complete episodes. Modern methods (TD, PPO) converge faster by bootstrapping.
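As a concrete illustration, an incremental every-visit MC update with the α_n = 1/n schedule, which satisfies the Robbins–Monro conditions (a sketch; the dict-based `V` and `counts` are assumptions for illustration):

```python
def incremental_mc_update(V, counts, state, G):
    """Every-visit incremental MC update: V(s) <- V(s) + alpha_n * (G - V(s)),
    with alpha_n = 1/n(s). This schedule satisfies the Robbins-Monro conditions
    (sum alpha_n = infinity, sum alpha_n^2 < infinity)."""
    counts[state] = counts.get(state, 0) + 1   # n(s): number of updates for this state
    alpha = 1.0 / counts[state]
    v_old = V.get(state, 0.0)
    V[state] = v_old + alpha * (G - v_old)     # move the estimate toward the sampled return G
    return V[state]
```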
Variance Reduction Techniques
1. Baseline Subtraction
Instead of G_t, use G_t - b(S_t) for some baseline b(S_t).
If b(S_t) = V^π(S_t): conditioned on (S_t, A_t), E[G_t − V^π(S_t)] = Q^π(S_t, A_t) − V^π(S_t) = A^π(S_t, A_t), the advantage; and because the baseline depends only on the state, subtracting it introduces no bias.
Variance reduction without bias:
Var[G_t − V^π(S_t)] = Var[G_t] + Var[V^π(S_t)] − 2 Cov[G_t, V^π(S_t)] ≪ Var[G_t]
when the baseline is strongly correlated with the return (i.e., when V^π is a good approximation).
This is exactly the role of the critic in actor-critic (Ch.13).
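A sketch of baseline subtraction applied to MC returns (the `value_fn` stands in for a learned critic approximating V^π; names are illustrative):

```python
import numpy as np

def mc_advantages(rewards, states, value_fn, gamma=0.99):
    """A_t ~= G_t - V(S_t): subtract a state-dependent baseline from the full MC return.
    This does not bias a policy gradient, but cuts variance when value_fn is accurate."""
    G = 0.0
    advantages = np.zeros(len(rewards))
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G            # accumulate the MC return backwards
        advantages[t] = G - value_fn(states[t])
    return advantages
```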
2. Control Variates
Use additional correlated random variables to reduce variance. Doubly Robust estimators combine MC and model-based estimates:
V̂(s) = V_model(s) + IS-corrected [V_MC - V_model]
3. Stratified Sampling / RLOO
Leave-one-out MC baselines: for k parallel trajectories, use mean of others as baseline. Used in GRPO (Group Relative Policy Optimization — DeepSeek’s training method).
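A minimal leave-one-out baseline over k parallel rollouts from the same start state (a sketch of the idea, not GRPO's exact normalization):

```python
import numpy as np

def leave_one_out_advantages(returns):
    """For k >= 2 rollouts from the same start, baseline each return by the mean of the others:
    A_i = G_i - mean_{j != i} G_j. An unbiased baseline that needs no learned critic."""
    returns = np.asarray(returns, dtype=float)
    k = len(returns)
    total = returns.sum()
    return returns - (total - returns) / (k - 1)
```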
Monte Carlo Tree Search (MCTS)
MCTS applies MC to model-based planning:
Loop (simulations):
1. SELECT: tree policy (UCB) to choose path
2. EXPAND: add new node to tree
3. SIMULATE: rollout from new node using rollout policy (fast, often random)
4. BACKPROPAGATE: update Q(s,a) along path with MC return
Used in: AlphaGo/AlphaZero, modern chess engines. MCTS combines:
- a model (the game rules) for tree search
- MC simulation (rollouts) for leaf evaluation
- UCB for tree exploration (a bandit algorithm!)
- a deep network for position evaluation (replacing the random rollout)
MCTS is not in S&B Chapter 5 (covered in Ch.8), but conceptually belongs here.
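A compressed skeleton of the four-step loop above (the `Node` fields, the `expand` and `rollout` helpers, and the `model` interface are assumptions for illustration, not AlphaZero's actual API):

```python
import math

def mcts(root, model, num_simulations=100, c_ucb=1.4):
    """Skeleton of SELECT -> EXPAND -> SIMULATE -> BACKPROPAGATE.
    `root` is a Node with fields: state, children (dict action->Node), N, W, untried_actions.
    `model` and the expand/rollout helpers are assumed, illustrative interfaces."""
    for _ in range(num_simulations):
        node, path = root, [root]

        # 1. SELECT: descend with UCB until a node with untried actions (or a leaf).
        while not node.untried_actions and node.children:
            node = max(node.children.values(),
                       key=lambda c: c.W / (c.N + 1e-8)
                       + c_ucb * math.sqrt(math.log(node.N + 1) / (c.N + 1e-8)))
            path.append(node)

        # 2. EXPAND: add one child for an untried action.
        if node.untried_actions:
            action = node.untried_actions.pop()
            node = node.expand(action, model)   # assumed helper creating the child Node
            path.append(node)

        # 3. SIMULATE: rollout from the new node to a terminal state, returning the MC return.
        ret = rollout(node.state, model)        # assumed fast/random rollout helper

        # 4. BACKPROPAGATE: update visit counts and total value along the selected path.
        for n in path:
            n.N += 1
            n.W += ret
```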
Chapter 5 Key Takeaways
- MC is model-free: learns from experience (episodes), no dynamics model needed
- MC is unbiased: G_t is the exact return — no bootstrap approximation error
- MC has high variance: full episode returns depend on many random events
- On-policy MC control converges to optimal ε-soft policy
- Off-policy MC (with importance sampling) enables learning about one policy while following another, but the variance of the product of importance ratios can grow exponentially with episode length
- WIS is preferred over OIS in practice (finite variance at the cost of bias, while remaining consistent); see the sketch after this list
- MC ↔ TD tradeoff: no bootstrap = unbiased but high variance; bootstrap = biased but low variance (Ch.6 and Ch.7)
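A sketch of ordinary vs. weighted importance sampling for off-policy MC evaluation, assuming per-episode returns and IS ratio products have already been computed:

```python
import numpy as np

def ois_and_wis(returns, rhos):
    """Given per-episode returns G_i generated under behavior b and importance ratios
    rho_i = prod_t pi(A_t|S_t)/b(A_t|S_t), compare the two off-policy estimators:
      OIS: mean(rho_i * G_i)              (unbiased, possibly enormous variance)
      WIS: sum(rho_i * G_i) / sum(rho_i)  (biased but consistent, bounded weights)"""
    returns = np.asarray(returns, dtype=float)
    rhos = np.asarray(rhos, dtype=float)
    ois = np.mean(rhos * returns)
    wis = np.sum(rhos * returns) / np.sum(rhos) if np.sum(rhos) > 0 else 0.0
    return ois, wis
```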
Connection to DynamICCL
| MC Concept | DynamICCL Analog |
|---|---|
| Episode = complete rollout | Training job segment = NCCL rollout period |
| G_t = discounted return | Cumulative throughput improvement |
| MC control → greedy improvement | PPO gradient step toward higher-advantage actions |
| Off-policy IS | PPO clipped ratio ρ ∈ [1-ε, 1+ε] |
| Variance reduction via baseline | GAE advantage = G_t - V(s_t) |
| n-step returns | GAE with λ ∈ (0,1) — interpolates TD and MC |