Monte Carlo Prediction
Chapter 5 — Monte Carlo Methods
Book: Reinforcement Learning: An Introduction (Sutton & Barto, 2nd ed) Pages: 91–103
Why Monte Carlo?
Dynamic Programming (Ch. 4) requires a complete model of the environment p(s’, r | s, a). In most real problems:
- p is unknown (NCCL throughput dynamics, game opponent strategy, stock prices)
- Even if known, |S| is too large for full sweeps
Monte Carlo (MC) methods learn directly from experience — actual or simulated sequences of states, actions, and rewards.
Key distinction from DP:
- DP: backup V(s) from all possible successors (requires model)
- MC: backup V(s) from the actual return observed in one episode (no model needed)
Monte Carlo Prediction: Learning V^π
Problem: given policy π, estimate V^π(s) for all s, using experience only.
Core idea: V^π(s) = E[G_t | S_t = s]. Estimate this expectation by averaging observed returns from visits to s.
First-Visit vs Every-Visit MC
First-Visit MC Prediction:
Algorithm: First-Visit MC Prediction for V^π

```
Initialize:
    V(s) = 0 for all s ∈ S
    Returns(s) = empty list for all s ∈ S

Loop (episodes):
    Generate episode S_0, A_0, R_1, S_1, A_1, R_2, ..., S_T using π
    G ← 0
    For t = T-1, T-2, ..., 0:
        G ← γG + R_{t+1}                         # compute return backward
        If S_t not in {S_0, S_1, ..., S_{t-1}}:  # first visit to S_t this episode
            Returns(S_t).append(G)
            V(S_t) ← average(Returns(S_t))
```
Every-Visit MC: same but remove the “first visit” check — update V(S_t) for every visit.
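The procedure above can be sketched in Python. The environment here is a toy of my own choosing, not from the book: a five-state random-walk chain, starting in the middle, with reward +1 for exiting on the right and 0 on the left, so the true values are V(s) = (s+1)/6. Only the final reward is nonzero and γ = 1, so the return is the episode outcome.

```python
import random

def generate_episode(rng):
    """Random-walk episode: states 0..4, start at 2; step left/right with
    equal probability. Exiting right gives reward +1, exiting left gives 0.
    Returns [(S_t, R_{t+1}), ...] in time order."""
    s, episode = 2, []
    while True:
        s_next = s + rng.choice([-1, 1])
        if s_next == 5:
            episode.append((s, 1.0))   # right terminal: reward +1
            return episode
        if s_next == -1:
            episode.append((s, 0.0))   # left terminal: reward 0
            return episode
        episode.append((s, 0.0))
        s = s_next

def first_visit_mc(num_episodes, gamma=1.0, seed=0):
    """First-visit MC prediction: average the return following the first
    visit to each state in each episode."""
    rng = random.Random(seed)
    returns = {s: [] for s in range(5)}
    V = {s: 0.0 for s in range(5)}
    for _ in range(num_episodes):
        episode = generate_episode(rng)
        first_visit = {}
        for t, (s, _) in enumerate(episode):
            first_visit.setdefault(s, t)         # time of first visit to s
        G = 0.0
        for t in range(len(episode) - 1, -1, -1):  # backward pass
            s, r = episode[t]
            G = gamma * G + r
            if first_visit[s] == t:              # first-visit check
                returns[s].append(G)
                V[s] = sum(returns[s]) / len(returns[s])
    return V
```

Switching to every-visit MC here amounts to deleting the `first_visit[s] == t` check and appending G on every iteration.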
First-Visit vs Every-Visit Comparison
| | First-Visit | Every-Visit |
|---|---|---|
| Unbiased? | Yes (i.i.d. estimates per episode) | No (correlated estimates from the same episode) |
| Converges to V^π? | Yes (by LLN) | Yes (but updates within an episode are correlated) |
| Standard in S&B? | Yes | Less common |
Both converge to V^π as episodes → ∞.
Why Backward Computation of G?
Instead of computing G_t = R_{t+1} + γR_{t+2} + … directly (O(T²) total), use the recurrence:
G_{T-1} = R_T
G_{T-2} = R_{T-1} + γG_{T-1}
G_{T-3} = R_{T-2} + γG_{T-2}
...
G_0 = R_1 + γG_1
Total computation: O(T) per episode. This is the backward pass through the episode.
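A minimal sketch of the two ways to compute returns, to make the O(T) vs O(T²) contrast concrete (function names are mine, not the book's):

```python
def returns_backward(rewards, gamma):
    """O(T) backward pass using the recurrence G_t = R_{t+1} + γ G_{t+1}.
    rewards[t] holds R_{t+1}, the reward following step t."""
    G, out = 0.0, [0.0] * len(rewards)
    for t in range(len(rewards) - 1, -1, -1):
        G = rewards[t] + gamma * G
        out[t] = G
    return out

def returns_direct(rewards, gamma):
    """O(T^2) reference: G_t = Σ_k γ^k R_{t+k+1}, summed from scratch per t."""
    T = len(rewards)
    return [sum(gamma**k * rewards[t + k] for k in range(T - t))
            for t in range(T)]
```

Both produce identical returns; only the cost differs.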
Worked Example: Blackjack
Setup (S&B Example 5.1):
- State: (player sum 12–21, dealer’s showing card 1–10, player has usable ace: yes/no)
- Actions: hit (take another card) or stick (stop)
- Reward: +1 win, -1 loss, 0 draw; no intermediate rewards
- Episode terminates when player sticks or busts (sum > 21)
Policy: stick if sum ≥ 20, hit otherwise.
State space: 10 × 10 × 2 = 200 states
MC evaluation:
- Run many episodes (10,000+) under this policy
- For each episode, compute G_t (= final win/loss, since γ = 1 and there are no intermediate rewards)
- Average returns by state → V^π
Convergence: after ~500,000 episodes, V^π is well-estimated. Variance is high because the ±1 outcome of each episode is noisy, and each episode visits only a handful of the 200 states, so many episodes are needed to accumulate enough returns per state.
Key observation: states near 21 have high values (likely to win); states near 12 have lower values (likely to keep hitting, risk busting). Usable ace increases value (can reinterpret ace as 1 if bust).
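A compact, simplified simulator makes the evaluation loop concrete. This is my own sketch, with simplifying assumptions that deviate slightly from Example 5.1: an infinite deck, and naturals treated as an ordinary 21 rather than a special win.

```python
import random
from collections import defaultdict

def draw(rng):
    # Infinite deck: ranks 1..13, face cards count as 10.
    return min(rng.randint(1, 13), 10)

def best_sum(total, has_ace):
    # Count one ace as 11 when that does not bust the hand.
    return total + 10 if has_ace and total + 10 <= 21 else total

def dealer_play(show, rng):
    total, has_ace = show, show == 1
    c = draw(rng); total += c; has_ace = has_ace or c == 1
    while best_sum(total, has_ace) < 17:        # dealer hits below 17
        c = draw(rng); total += c; has_ace = has_ace or c == 1
    return best_sum(total, has_ace)

def play_episode(rng):
    """One episode under the 'stick on 20 or 21' policy. Returns the list
    of visited (player sum, dealer card, usable ace) states and the reward."""
    total, has_ace = 0, False
    for _ in range(2):
        c = draw(rng); total += c; has_ace = has_ace or c == 1
    show, visited = draw(rng), []
    while best_sum(total, has_ace) < 20:        # policy: hit below 20
        if best_sum(total, has_ace) >= 12:      # sums < 12 are trivial (always hit)
            visited.append((best_sum(total, has_ace), show,
                            has_ace and total + 10 <= 21))
        c = draw(rng); total += c; has_ace = has_ace or c == 1
        if best_sum(total, has_ace) > 21:
            return visited, -1.0                # player busts
    p = best_sum(total, has_ace)
    visited.append((p, show, has_ace and total + 10 <= 21))
    d = dealer_play(show, rng)
    return visited, 1.0 if (d > 21 or p > d) else (0.0 if p == d else -1.0)

def mc_evaluate(num_episodes, seed=0):
    returns, rng = defaultdict(list), random.Random(seed)
    for _ in range(num_episodes):
        visited, G = play_episode(rng)          # γ = 1: return = final reward
        for s in set(visited):                  # first-visit averaging
            returns[s].append(G)
    return {s: sum(g) / len(g) for s, g in returns.items()}
```

Running `mc_evaluate` for enough episodes reproduces the qualitative picture described above: states with a player sum of 21 come out with high values, while low sums under this policy do not.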
Advantages of MC over DP
- No model required: MC uses actual experience — just run episodes and observe returns
- Focuses on relevant states: only updates visited states — efficient for problems with large but rarely-visited state spaces
- Independence across episodes: estimating V(s) from one episode is independent of errors in V(other states) — no bootstrapping
- Unbiasedness: G_t is an unbiased estimate of V^π(s) (no approximation error, just variance)
Disadvantages and Limitations
Only for episodic tasks: MC requires complete episodes (need terminal state to compute G_t). Doesn’t work for continuing tasks.
High variance: G_t = R_{t+1} + γR_{t+2} + … depends on many random variables. Variance grows with episode length T:
Var[G_t] ≈ T · Var[R] (roughly, for uncorrelated rewards)

Slow learning: must wait until the end of the episode to update. For long episodes (T = 1000 steps), each update is delayed by 1000 steps.
Non-incremental learning: naive implementation stores all returns then averages. Online/incremental version uses:
V(S_t) ← V(S_t) + α[G_t - V(S_t)]

But G_t still requires waiting until the episode ends.
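With the step size α = 1/n (where n counts visits), this incremental rule reproduces the exact running average, so no list of returns needs to be stored. A minimal sketch:

```python
def incremental_mean(samples):
    """Running average via V <- V + (1/n)(G - V); after processing all
    samples this equals the batch mean, with O(1) memory."""
    V, n = 0.0, 0
    for G in samples:
        n += 1
        V += (G - V) / n
    return V
```

A constant α instead of 1/n weights recent returns more heavily, which is what nonstationary settings use.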
Bias-Variance Tradeoff: MC vs TD
MC (n = ∞ steps):
- Bias = 0: G_t is the true return (no bootstrap approximation)
- Variance = high: G_t depends on all T future rewards
TD(0) (n = 1 step, see Ch. 6):
- Bias > 0: uses V(S_{t+1}) as a proxy for the remaining return (bootstrap approximation error)
- Variance = low: depends only on R_{t+1} and V(S_{t+1})
n-step returns (Ch.7) interpolate:
G_t^{(n)} = R_{t+1} + γR_{t+2} + ... + γ^{n-1}R_{t+n} + γ^n V(S_{t+n})
This is the bias-variance tradeoff in action:
- n = 1 (TD): low variance, some bias
- n = T (MC): zero bias, high variance
- Optimal n: depends on the problem (often n = 10–100)
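The n-step return above can be sketched as a short function (my own helper, with the convention that `rewards[k]` holds R_{k+1} and `values[k]` holds V(S_k); when t + n reaches the terminal time, the formula truncates to the full MC return with no bootstrap term):

```python
def n_step_return(rewards, values, t, n, gamma):
    """G_t^(n) = R_{t+1} + γ R_{t+2} + ... + γ^{n-1} R_{t+n} + γ^n V(S_{t+n}),
    truncated at episode end (then it equals the full MC return G_t)."""
    T = len(rewards)
    G = 0.0
    for k in range(min(n, T - t)):      # discounted sum of up to n rewards
        G += gamma**k * rewards[t + k]
    if t + n < T:                       # bootstrap only if S_{t+n} is non-terminal
        G += gamma**n * values[t + n]
    return G
```

With n = 1 this is the TD(0) target; with n ≥ T - t it is the MC return.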
MC for V^π vs Q^π
Problem: to improve a policy from V^π, we need a model (to compute Q^π(s,a) = r(s,a) + γ Σ p(s’|s,a) V^π(s’)). But MC is model-free — no model!
Solution: estimate Q^π(s,a) directly from experience:
Q^π(s,a) = E_π[G_t | S_t = s, A_t = a]
Visit every (s,a) pair in episodes, average the return following each visit.
Problem: if π is deterministic, some (s,a) pairs may never be visited (π(s) = a’ ≠ a for most states). Need exploration to ensure all (s,a) pairs are visited (see file 2).
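A tiny sketch of both points, on a hypothetical one-state, two-action episodic MDP of my own construction (essentially a two-armed bandit): Q(s, a) is estimated by averaging the return following each visit to (s, a), and a stochastic behavior policy (uniform here) ensures every action is actually visited.

```python
import random

def mc_q_estimate(num_episodes, seed=0):
    """MC estimate of Q(s, a) for a one-state episodic MDP: action 0 pays
    +1 with prob 0.2, action 1 pays +1 with prob 0.7; γ = 1 and episodes
    last one step, so the terminal reward is the return. A uniform random
    policy guarantees both actions are visited (exploration)."""
    rng = random.Random(seed)
    returns = {0: [], 1: []}
    for _ in range(num_episodes):
        a = rng.randint(0, 1)                  # uniform behavior policy
        p = 0.2 if a == 0 else 0.7
        G = 1.0 if rng.random() < p else 0.0   # return = terminal reward
        returns[a].append(G)
    return {a: sum(g) / len(g) for a, g in returns.items()}
```

Under a deterministic policy that always picked action 1, `returns[0]` would stay empty and Q(s, 0) would never be estimated; that is exactly the exploration problem noted above.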
Connection to DynamICCL
MC prediction is not directly used in DynamICCL (which uses TD/PPO), but the principle is foundational:
- GAE (Generalized Advantage Estimation) in PPO uses multi-step returns, a mix of MC (long horizon) and TD (bootstrap); the λ parameter interpolates between the two
- Monte Carlo tree search (MCTS), a model-based form of MC, is used in AlphaZero for game playing; it would be a cousin of DynamICCL’s planning approach if a network simulation model were available