Monte Carlo Control
Chapter 5 — Monte Carlo Methods
Book: Reinforcement Learning: An Introduction (Sutton & Barto, 2nd ed) Pages: 99–109
The Control Problem with MC
Goal: find the optimal policy π* using only experience (no model).
GPI with MC:
1. Evaluation: estimate Q^π(s,a) using MC (average returns from (s,a))
2. Improvement: update π to be greedy w.r.t. Q^π
Challenge: if the policy is deterministic (π(s) = a for all s), many (s,a) pairs are never visited — Q^π(s, a’) for a’ ≠ π(s) is never updated. Policy can’t improve past what it’s already trying.
Solution: ensure all (s,a) pairs are tried → exploration.
Exploring Starts (Monte Carlo ES)
Assumption: every episode can start with any (s, a) pair with nonzero probability (“exploring starts”).
With this assumption, all (s,a) pairs are eventually visited.
Algorithm: Monte Carlo ES (Exploring Starts)
Initialize:
π(s) = arbitrary deterministic policy for all s
Q(s,a) = 0 for all s,a
Returns(s,a) = empty list for all s,a
Loop (episodes):
Choose S_0, A_0 randomly (any (s,a) with nonzero probability)
Generate episode from S_0, A_0 following π:
S_0, A_0, R_1, S_1, A_1, R_2, ..., S_T
G ← 0
For t = T-1, T-2, ..., 0:
G ← γG + R_{t+1}
If (S_t, A_t) not in {(S_0,A_0), ..., (S_{t-1},A_{t-1})}: ← first visit
Returns(S_t, A_t).append(G)
Q(S_t, A_t) ← average(Returns(S_t, A_t))
π(S_t) ← argmax_a Q(S_t, a) ← immediate greedy improvement
Convergence: if all (s,a) pairs visited infinitely often and returns averaged, Q^π_k → Q* and π_k → π*.
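A minimal runnable sketch of the MC ES pseudocode above. The one-step sampler `env_step(s, a) -> (next_state, reward, done)` and the `states`/`actions` lists are illustrative assumptions, not a particular environment API; the incremental mean replaces the Returns(s,a) lists but computes the same average.

```python
import random
from collections import defaultdict

def mc_es(env_step, states, actions, num_episodes=100_000, gamma=1.0):
    """First-visit Monte Carlo ES (exploring starts), tabular sketch.

    env_step(s, a) -> (next_state, reward, done) is a hypothetical one-step
    sampler; `states` lists the non-terminal states. Exploring starts means
    every (s, a) pair can begin an episode with nonzero probability.
    """
    Q = defaultdict(float)        # Q(s, a), zero-initialized
    N = defaultdict(int)          # number of first-visit returns averaged into Q(s, a)
    pi = {s: random.choice(actions) for s in states}   # arbitrary deterministic policy

    for _ in range(num_episodes):
        # Exploring start: pick any (s, a) pair at random.
        s, a = random.choice(states), random.choice(actions)
        episode, done = [], False
        while not done:
            s_next, r, done = env_step(s, a)
            episode.append((s, a, r))
            s = s_next
            if not done:
                a = pi[s]         # after the start, follow the current policy

        # Index of the first visit of each (s, a) pair in this episode.
        first_visit = {}
        for t, (s_t, a_t, _) in enumerate(episode):
            first_visit.setdefault((s_t, a_t), t)

        # Backward pass: first-visit return averaging + immediate greedy improvement.
        G = 0.0
        for t in reversed(range(len(episode))):
            s_t, a_t, r_t1 = episode[t]
            G = gamma * G + r_t1
            if first_visit[(s_t, a_t)] == t:
                N[(s_t, a_t)] += 1
                Q[(s_t, a_t)] += (G - Q[(s_t, a_t)]) / N[(s_t, a_t)]   # running mean of returns
                pi[s_t] = max(actions, key=lambda act: Q[(s_t, act)])  # greedy improvement
    return Q, pi
```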
Limitation: exploring starts is unrealistic — in many real applications, we can’t choose an arbitrary starting (s,a) pair. (Can’t say “start NCCL job with this specific algorithm and chunk size”.)
On-Policy MC Control (ε-Soft Policies)
Idea: use a soft policy — every action has probability at least ε/|A| of being selected, including non-greedy actions. This ensures all (s,a) pairs are eventually visited without exploring starts.
ε-soft policy:
π(a|s) ≥ ε / |A| for all s, a
ε-greedy policy (simplest ε-soft policy):
π(a|s) = (1-ε) + ε/|A| if a = argmax_{a'} Q(s,a') ← greedy action
ε / |A| otherwise ← random action
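A small sketch of this distribution (the function name and shapes are illustrative):

```python
import numpy as np

def epsilon_greedy_probs(Q_s, epsilon):
    """Action probabilities for one state: ε/|A| everywhere, plus (1-ε) on the greedy action.

    Q_s is a 1-D array of action values for a single state.
    """
    n = len(Q_s)
    probs = np.full(n, epsilon / n)          # every action gets at least ε/|A|
    probs[np.argmax(Q_s)] += 1.0 - epsilon   # greedy action gets the remaining mass
    return probs

# Example: probabilities sum to 1 and every action has probability ≥ ε/|A|.
print(epsilon_greedy_probs(np.array([0.1, 0.5, 0.2]), epsilon=0.3))  # ≈ [0.1, 0.8, 0.1]
```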
Algorithm: On-Policy First-Visit MC Control
Initialize:
Q(s,a) = 0 for all s,a
π(a|s) = 1/|A| (uniform random) ← start fully exploratory
Returns(s,a) = empty list
Loop (episodes):
Generate episode using π: S_0,A_0,R_1,...,S_T
G ← 0
For t = T-1, T-2, ..., 0:
G ← γG + R_{t+1}
If (S_t, A_t) first visit in episode:
Returns(S_t,A_t).append(G)
Q(S_t,A_t) ← average(Returns(S_t,A_t))
A* ← argmax_a Q(S_t, a) ← greedy action
For all a ∈ A(S_t):
π(a|S_t) ← ε/|A(S_t)| ← ε-greedy update
if a == A*: π(A*|S_t) += (1 - ε)
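A runnable sketch of the algorithm above, assuming a hypothetical `sample_episode(policy)` helper that plays one episode with the given policy and returns its (state, action, reward) triples. The explicit π(a|S_t) table from the pseudocode is represented implicitly by acting ε-greedily with respect to the current Q:

```python
import random
from collections import defaultdict

def on_policy_mc_control(sample_episode, n_actions, num_episodes=100_000,
                         gamma=1.0, epsilon=0.1):
    """On-policy first-visit MC control with an ε-greedy policy (tabular sketch).

    sample_episode(policy) -> [(S_0, A_0, R_1), (S_1, A_1, R_2), ...] is a
    hypothetical sampler that runs one episode following policy(s) -> action.
    """
    Q = defaultdict(lambda: [0.0] * n_actions)   # Q(s, ·), zero-initialized
    N = defaultdict(lambda: [0] * n_actions)     # first-visit counts per (s, a)

    def policy(s):
        # ε-greedy w.r.t. the current Q: explore with probability ε, else act greedily.
        if random.random() < epsilon:
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda a: Q[s][a])

    for _ in range(num_episodes):
        episode = sample_episode(policy)

        # Index of the first visit of each (s, a) pair in this episode.
        first_visit = {}
        for t, (s, a, _) in enumerate(episode):
            first_visit.setdefault((s, a), t)

        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            if first_visit[(s, a)] == t:
                N[s][a] += 1
                Q[s][a] += (G - Q[s][a]) / N[s][a]   # running average of first-visit returns
    return Q, policy
```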
What does this converge to? NOT the globally optimal policy π*, but the optimal ε-soft policy — best policy within the constraint that all actions have probability ≥ ε/|A|.
As ε → 0, this approaches π*.
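In practice the ε → 0 limit is approximated by decaying ε across episodes (a GLIE-style schedule). A hedged sketch of one common heuristic, not something prescribed by the chapter:

```python
def epsilon_schedule(k, eps0=1.0, eps_min=0.01):
    """Decaying exploration rate for episode k (1-indexed).

    ε_k = max(eps0 / k, eps_min) explores heavily early, then approaches greedy
    behaviour; setting eps_min = 0 recovers the ε → 0 limit in the notes.
    """
    return max(eps0 / k, eps_min)

# Example: ε shrinks as episodes accumulate.
print([round(epsilon_schedule(k), 3) for k in (1, 10, 100, 1000)])  # [1.0, 0.1, 0.01, 0.01]
```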
Worked Example: Blackjack Optimal Policy
Setup (S&B Example 5.3):
- Same blackjack MDP as prediction (5 million episodes)
- On-policy control with ε-greedy finds a near-optimal strategy
Converged policy:
Without usable ace:
Stick if sum ≥ 17 (all dealer upcards)
Also stick at lower sums (13-16, sometimes 12) when the dealer shows a weak upcard (2-6)
With usable ace:
Stick if sum ≥ 19 (all dealer upcards); stick at 18 against most upcards
Hit at lower sums (with a usable ace, a hit can never bust the hand)
Comparison to optimal (known): the ε-greedy policy is near-optimal but not exactly optimal (can’t always be greedy due to ε-soft constraint).
The Key Difficulty: On-Policy vs Off-Policy
On-policy: the policy that generates the data is the same policy being evaluated and improved.
- Simple, stable
- Limited: can only find the optimal ε-soft policy (not the true optimal)
Off-policy: evaluate/improve a target policy π while generating data with a different behavior policy b.
- More powerful: can find the true optimal π* while using exploratory behavior b
- More complex: needs importance sampling correction
This distinction is fundamental to RL and recurs throughout the book.
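To make the importance-sampling correction concrete: in its simplest (ordinary importance sampling) form, a return generated under b is reweighted by the product of per-step probability ratios π(A_t|S_t)/b(A_t|S_t). A minimal sketch, with `pi_probs` and `b_probs` as assumed callables:

```python
def importance_weighted_return(episode, pi_probs, b_probs, gamma=1.0):
    """Ordinary importance-sampling estimate of one episode's return under π.

    episode: list of (state, action, reward) generated by the behavior policy b.
    pi_probs(s, a) and b_probs(s, a) give π(a|s) and b(a|s); both are
    hypothetical callables for this sketch.
    """
    rho = 1.0        # ρ = Π_t π(A_t|S_t) / b(A_t|S_t)
    G = 0.0
    discount = 1.0
    for s, a, r in episode:
        rho *= pi_probs(s, a) / b_probs(s, a)   # requires b(a|s) > 0 wherever π(a|s) > 0
        G += discount * r
        discount *= gamma
    return rho * G   # unbiased but possibly high-variance estimate of v_π(S_0)
```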
MC Control Convergence Analysis
Convergence claim (MC ES): for finite MDPs with exploring starts and first-visit MC, if all (s,a) pairs are visited infinitely often and each policy is evaluated with infinitely many episodes, then Q converges to Q* and π converges to π*. (S&B note that convergence of the practical version, which improves the policy after every episode, seems inevitable but has not been formally proved.)
Proof sketch:
- Q(s,a) → Q^{π_k}(s,a) as episodes → ∞ (law of large numbers)
- π_{k+1} = greedy(Q^{π_k}) implies V^{π_{k+1}} ≥ V^{π_k} (policy improvement theorem)
- The sequence of policies improves monotonically in a finite policy space, so it converges
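The improvement step, spelled out (using exact action values Q^{π_k} and deterministic π_k):

Q^{π_k}(s, π_{k+1}(s)) = max_a Q^{π_k}(s,a) ≥ Q^{π_k}(s, π_k(s)) = V^{π_k}(s)   for all s
⟹ V^{π_{k+1}}(s) ≥ V^{π_k}(s) for all s   (by the policy improvement theorem)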
Caution: in practice, convergence is asymptotic — we can’t run infinitely many episodes. In approximate settings (neural nets), additional stability issues arise.
Sample Efficiency Comparison
On the 10-armed bandit (a single-state problem), MC prediction reduces to the sample-average action-value estimate from Ch. 2.
In full MDPs: MC has higher variance than TD methods because returns depend on the full episode trajectory. This means:
| Method | Samples to ε accuracy | Why |
|---|---|---|
| DP | 0 (has model) | Uses model directly |
| TD | ~1/((1-γ)²ε²) | Low-variance bootstrapped targets |
| MC | ~T/ε² (T = episode length) | High variance of the full return |
TD typically needs fewer samples than MC for the same accuracy — but MC is unbiased (no bootstrap error).
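A toy illustration of the variance gap (not from the book): on a length-T chain with unit-variance reward noise per step and γ = 1, the full MC return from the start accumulates T noise terms, while a one-step TD target built from a fixed (here, exact) value estimate carries only one.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 50                      # episode length of the toy chain
n_runs = 10_000

# Rewards: one zero-mean, unit-variance noise term per step.
rewards = rng.normal(0.0, 1.0, size=(n_runs, T))

mc_returns = rewards.sum(axis=1)     # full MC return G_0 (γ = 1)
td_targets = rewards[:, 0] + 0.0     # R_1 + V(S_1), with V fixed at its true value (0)

print(f"Var of MC return : {mc_returns.var():.1f}   (≈ T = {T})")
print(f"Var of TD target : {td_targets.var():.1f}   (≈ 1)")
```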
Connection to DynamICCL
On-policy MC control is the conceptual ancestor of PPO:
- Both: evaluate the current policy using actual rollouts (episodes/trajectories), then improve it
- MC control: average returns per (s,a) visit → greedy improvement
- PPO: GAE advantage estimates from the rollout → gradient step on the clipped surrogate
PPO is essentially MC control with: (1) function approximation, (2) gradient-based improvement instead of tabular greedy, (3) TD(λ) GAE instead of pure MC returns, (4) clipping to prevent large policy changes.
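A compact sketch of items (3) and (4): GAE advantages over one rollout and the clipped surrogate objective, in framework-free numpy (real PPO code uses autodiff for the gradient step; all names here are illustrative, not DynamICCL's actual implementation).

```python
import numpy as np

def gae_advantages(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout.

    rewards[t], dones[t] for steps 0..T-1; values[t] for states 0..T (bootstrap
    value included at the end). A sketch of the standard recursion.
    """
    T = len(rewards)
    adv = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]  # TD error
        gae = delta + gamma * lam * nonterminal * gae
        adv[t] = gae
    return adv

def ppo_clip_objective(logp_new, logp_old, adv, clip_eps=0.2):
    """Clipped surrogate objective (to be maximized), averaged over the rollout."""
    ratio = np.exp(logp_new - logp_old)                      # π_new(a|s) / π_old(a|s)
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return np.mean(np.minimum(ratio * adv, clipped * adv))   # pessimistic (clipped) bound
```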
In DynamICCL: each PPO rollout (collecting NCCL performance data across a training job segment) corresponds to one “episode” in the MC sense.