Monte Carlo Prediction

Chapter 5 — Monte Carlo Methods

Book: Reinforcement Learning: An Introduction (Sutton & Barto, 2nd ed.), pp. 91–103


Why Monte Carlo?

Dynamic Programming (Ch. 4) requires a complete model of the environment, p(s',r|s,a). In most real problems:
- p is unknown (NCCL throughput dynamics, game opponent strategy, stock prices)
- Even if known, |S| is too large for full sweeps

Monte Carlo (MC) methods learn directly from experience — actual or simulated sequences of states, actions, and rewards.

Key distinction from DP:
- DP: backs up V(s) from all possible successors (requires a model)
- MC: backs up V(s) from the actual return observed in an episode (no model needed)


Monte Carlo Prediction: Learning V^π

Problem: given policy π, estimate V^π(s) for all s, using experience only.

Core idea: V^π(s) = E[G_t | S_t = s]. Estimate this expectation by averaging observed returns from visits to s.

First-Visit vs Every-Visit MC

First-Visit MC Prediction:

Algorithm: First-Visit MC Prediction for V^π

Initialize:
  V(s) = 0 for all s ∈ S
  Returns(s) = empty list for all s ∈ S

Loop (episodes):
  Generate episode: S_0, A_0, R_1, S_1, A_1, R_2, ..., S_T using π

  G ← 0
  For t = T-1, T-2, ..., 0:
    G ← γG + R_{t+1}                              ← compute return backward
    If S_t not in {S_0, S_1, ..., S_{t-1}}:       ← first visit to S_t this episode
      Returns(S_t).append(G)
      V(S_t) ← average(Returns(S_t))

Every-Visit MC: same but remove the “first visit” check — update V(S_t) for every visit.
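
A minimal Python sketch of the pseudocode above (first-visit by default, every-visit via the flag); generate_episode is a hypothetical helper that rolls out one episode under π and returns (state, action, reward) triples:

from collections import defaultdict

def mc_prediction(generate_episode, num_episodes, gamma=1.0, first_visit=True):
    """Monte Carlo prediction of V^pi from sampled episodes.

    generate_episode() is assumed to return
    [(S_0, A_0, R_1), (S_1, A_1, R_2), ..., (S_{T-1}, A_{T-1}, R_T)].
    """
    returns = defaultdict(list)   # Returns(s): all sampled returns for state s
    V = defaultdict(float)        # current estimate of V(s)

    for _ in range(num_episodes):
        episode = generate_episode()
        states = [s for (s, _, _) in episode]

        G = 0.0
        for t in range(len(episode) - 1, -1, -1):   # backward pass
            s, _, r = episode[t]
            G = gamma * G + r                       # G_t = R_{t+1} + gamma * G_{t+1}
            if first_visit and s in states[:t]:     # skip if s appeared earlier
                continue
            returns[s].append(G)
            V[s] = sum(returns[s]) / len(returns[s])
    return V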

First-Visit vs Every-Visit Comparison

                      First-Visit                           Every-Visit
Unbiased?             Yes (i.i.d. estimates per episode)    No (correlated estimates from same episode)
Converges to V^π?     Yes (by LLN)                          Yes (can be slower due to correlated updates)
Standard in S&B?      Yes                                   Less common

Both converge to V^π as episodes → ∞.


Why Backward Computation of G?

Instead of computing G_t = R_{t+1} + γR_{t+2} + … directly (O(T²) total), use the recurrence:

G_{T-1} = R_T
G_{T-2} = R_{T-1} + γG_{T-1}
G_{T-3} = R_{T-2} + γG_{T-2}
...
G_0     = R_1 + γG_1

Total computation: O(T) per episode. This is the backward pass through the episode.
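
The same backward pass in a few lines of Python (assuming rewards[t] holds R_{t+1}):

def returns_backward(rewards, gamma):
    """Compute [G_0, ..., G_{T-1}] in O(T) via G_t = R_{t+1} + gamma * G_{t+1}."""
    G = 0.0
    returns = [0.0] * len(rewards)
    for t in range(len(rewards) - 1, -1, -1):
        G = rewards[t] + gamma * G
        returns[t] = G
    return returns

# Example: rewards = [0, 0, 1], gamma = 0.9  ->  [0.81, 0.9, 1.0]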


Worked Example: Blackjack

Setup (S&B Example 5.1):
- State: (player sum 12–21, dealer's showing card 1–10, player has usable ace: yes/no)
- Actions: hit (take another card) or stick (stop)
- Reward: +1 win, -1 loss, 0 draw; no intermediate rewards
- Episode terminates when the player sticks or busts (sum > 21)

Policy: stick if sum ≥ 20, hit otherwise.

State space: 200 states (10 player sums × 10 dealer cards × 2 usable-ace flags)

MC evaluation:
- Run many episodes (10,000+) under this policy
- For each episode, compute G_t (= the final win/loss, since γ = 1 and there are no intermediate rewards)
- Average the returns observed from each state → V^π

Convergence: with ~10,000 episodes the estimates are still noisy; after ~500,000 episodes, V^π is well estimated. Variance is high because each return is a single ±1 outcome, so many visits per state are needed.

Key observation: states near 21 have high values (likely to win); states near 12 have lower values (the player keeps hitting and risks busting). A usable ace increases value (the ace can be counted as 1 instead of 11 to avoid busting).
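
A sketch of this evaluation against the Gymnasium Blackjack-v1 environment (an assumption: observations are (player_sum, dealer_showing, usable_ace) and actions are 0 = stick, 1 = hit):

import gymnasium as gym
from collections import defaultdict

env = gym.make("Blackjack-v1")

def policy(obs):
    player_sum, _, _ = obs
    return 0 if player_sum >= 20 else 1   # stick on 20 or 21, otherwise hit

returns = defaultdict(list)
V = {}

for _ in range(500_000):
    obs, _ = env.reset()
    episode, done = [], False
    while not done:
        action = policy(obs)
        next_obs, reward, terminated, truncated, _ = env.step(action)
        episode.append((obs, reward))
        done = terminated or truncated
        obs = next_obs

    # gamma = 1 and only the terminal reward is nonzero, so G_t = final reward
    G = 0.0
    states = [s for (s, _) in episode]
    for t in reversed(range(len(episode))):
        s, r = episode[t]
        G += r
        if s in states[:t]:               # first-visit check
            continue
        returns[s].append(G)
        V[s] = sum(returns[s]) / len(returns[s])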


Advantages of MC over DP

  1. No model required: MC uses actual experience — just run episodes and observe returns
  2. Focuses on relevant states: only updates states that are actually visited — efficient when only a small fraction of a large state space is ever reached
  3. Independence across episodes: estimating V(s) from one episode is independent of errors in V(other states) — no bootstrapping
  4. Unbiasedness: each return G_t following a visit to s is an unbiased sample of V^π(s) (no bootstrap approximation error, only variance)

Disadvantages and Limitations

  1. Only for episodic tasks: MC requires complete episodes (need terminal state to compute G_t). Doesn’t work for continuing tasks.

  2. High variance: G_t = R_{t+1} + γR_{t+2} + … depends on many random variables. Variance grows with episode length T:

    Var[G_t] ≈ T · Var[R]  (roughly, for uncorrelated rewards)
  3. Slow learning: must wait until end of episode to update. For long episodes (T = 1000 steps), each update is delayed by 1000 steps.

  4. Non-incremental learning: naive implementation stores all returns then averages. Online/incremental version uses:

    V(S_t) ← V(S_t) + α[G_t - V(S_t)]

    But G_t requires waiting until episode end.
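
A small sketch of the incremental (constant-α) update in Python; the running-average case corresponds to α = 1/N(s):

def incremental_mc_update(V, s, G, alpha=0.05):
    """Constant-alpha MC update: V(s) <- V(s) + alpha * (G - V(s)).

    With alpha = 1/N(s) this reproduces the sample average; a fixed alpha
    instead gives a recency-weighted average (useful if the policy changes).
    """
    v = V.get(s, 0.0)
    V[s] = v + alpha * (G - v)
    return V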


Bias-Variance Tradeoff: MC vs TD

MC (n = ∞ steps):
- Bias = 0: G_t is the true sampled return (no bootstrap approximation)
- Variance = high: G_t depends on all T future rewards

TD(0) (n = 1 step, see Ch. 6):
- Bias > 0: uses V(S_{t+1}) as a proxy for the remaining return (bootstrap approximation error)
- Variance = low: depends only on R_{t+1} and V(S_{t+1})

n-step returns (Ch.7) interpolate:

G_t^{(n)} = R_{t+1} + γR_{t+2} + ... + γ^{n-1}R_{t+n} + γ^n V(S_{t+n})

This is the bias-variance tradeoff in action:
- n = 1 (TD): low variance, some bias
- n = T (MC): zero bias, high variance
- Optimal n: problem-dependent (often in the range 10–100)
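
A helper that computes G_t^{(n)} from one episode (assuming rewards[k] holds R_{k+1} and values[k] holds the current estimate V(S_k)):

def n_step_return(rewards, values, t, n, gamma):
    """G_t^{(n)} = R_{t+1} + ... + gamma^{n-1} R_{t+n} + gamma^n V(S_{t+n}).

    If t + n reaches the terminal step T, the bootstrap term is dropped,
    which recovers the full Monte Carlo return G_t.
    """
    T = len(rewards)
    G = 0.0
    for k in range(min(n, T - t)):
        G += (gamma ** k) * rewards[t + k]
    if t + n < T:
        G += (gamma ** n) * values[t + n]
    return G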


MC for V^π vs Q^π

Problem: to improve a policy from V^π alone, we need a model, since Q^π(s,a) = r(s,a) + γ Σ_{s'} p(s'|s,a) V^π(s'). But MC is model-free, so no model is available.

Solution: estimate Q^π(s,a) directly from experience:

Q^π(s,a) = E_π[G_t | S_t = s, A_t = a]

Visit every (s,a) pair in episodes, average the return following each visit.

Problem: if π is deterministic, many (s,a) pairs are never visited (in each state s, only the action a = π(s) is ever taken). Exploration is needed to ensure all (s,a) pairs are visited (see file 2).
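
The same first-visit machinery works for Q^π by keying on (state, action) pairs; a sketch (again with a hypothetical generate_episode that rolls out one episode under π):

from collections import defaultdict

def mc_q_prediction(generate_episode, num_episodes, gamma=1.0):
    """First-visit MC estimate of Q^pi(s, a), averaging returns per (s, a) pair."""
    returns = defaultdict(list)
    Q = defaultdict(float)
    for _ in range(num_episodes):
        episode = generate_episode()              # [(S_t, A_t, R_{t+1}), ...]
        pairs = [(s, a) for (s, a, _) in episode]
        G = 0.0
        for t in range(len(episode) - 1, -1, -1):
            s, a, r = episode[t]
            G = gamma * G + r
            if (s, a) in pairs[:t]:               # first visit to (s, a) only
                continue
            returns[(s, a)].append(G)
            Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])
    return Q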


Connection to DynamICCL

MC prediction is not directly used in DynamICCL (which uses TD/PPO), but the principle is foundational:
- GAE (Generalized Advantage Estimation) in PPO uses multi-step returns — a mix of MC (long horizon) and TD (bootstrap); the λ parameter interpolates between the two
- Monte Carlo tree search (MCTS) — a model-based form of MC — is used in AlphaZero for game playing, a cousin of DynamICCL's planning approach if a network simulation model were available
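
For reference, a standard single-episode GAE computation, which makes the λ interpolation explicit (λ = 0 gives TD(0) advantages, λ = 1 gives discounted Monte Carlo-style advantages); rewards[t] holds R_{t+1}, and values has length T+1 with values[T] = 0 at a terminal state:

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one episode."""
    T = len(rewards)
    advantages = [0.0] * T
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD error
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages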