Monte Carlo Control

Chapter 5 — Monte Carlo Methods

Book: Reinforcement Learning: An Introduction (Sutton & Barto, 2nd ed.), pages 99–109


The Control Problem with MC

Goal: find the optimal policy π* using only experience (no model).

GPI with MC:
  1. Evaluation: estimate Q^π(s,a) with MC (average the returns observed from (s,a))
  2. Improvement: make π greedy w.r.t. Q^π

Challenge: if the policy is deterministic (one fixed action π(s) per state), many (s,a) pairs are never visited: Q^π(s, a′) for a′ ≠ π(s) is never updated, so the policy can’t improve beyond the actions it already takes.

Solution: ensure all (s,a) pairs are tried → exploration.


Exploring Starts (Monte Carlo ES)

Assumption: every episode can start with any (s, a) pair with nonzero probability (“exploring starts”).

With this assumption, all (s,a) pairs are eventually visited.

Algorithm: Monte Carlo ES (Exploring Starts)

Initialize:
  π(s) = arbitrary deterministic policy for all s
  Q(s,a) = 0 for all s,a
  Returns(s,a) = empty list for all s,a

Loop (episodes):
  Choose S_0, A_0 randomly (any (s,a) with nonzero probability)
  Generate episode from S_0, A_0 following π:
    S_0, A_0, R_1, S_1, A_1, R_2, ..., S_T

  G ← 0
  For t = T-1, T-2, ..., 0:
    G ← γG + R_{t+1}
    If (S_t, A_t) not in {(S_0,A_0), ..., (S_{t-1},A_{t-1})}:  ← first visit
      Returns(S_t, A_t).append(G)
      Q(S_t, A_t) ← average(Returns(S_t, A_t))
      π(S_t) ← argmax_a Q(S_t, a)          ← immediate greedy improvement
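The pseudocode above can be run as a small Python sketch. The corridor environment below is a made-up toy MDP for illustration (not from the book): states 0–2 are non-terminal, state 3 is terminal, action 1 moves right, action 0 moves left, and the only reward is +1 on entering the terminal state.

```python
import random
from collections import defaultdict

# Made-up corridor MDP (illustration only, not from S&B): states 0..2 are
# non-terminal, 3 is terminal; action 1 moves right, action 0 moves left
# (stuck at 0); reward +1 only on entering the terminal state.
N_STATES, ACTIONS, TERMINAL = 3, (0, 1), 3

def step(s, a):
    s2 = min(s + 1, TERMINAL) if a == 1 else max(s - 1, 0)
    return s2, (1.0 if s2 == TERMINAL else 0.0)

def mc_es(episodes=5000, gamma=0.9, max_len=20, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)                 # Q(s,a): average of stored returns
    returns = defaultdict(list)            # Returns(s,a)
    pi = {s: 0 for s in range(N_STATES)}   # arbitrary deterministic policy
    for _ in range(episodes):
        # Exploring start: any (s, a) pair with nonzero probability.
        s, a = rng.randrange(N_STATES), rng.choice(ACTIONS)
        traj = []                          # list of (S_t, A_t, R_{t+1})
        while len(traj) < max_len:         # cap in case π loops forever
            s2, r = step(s, a)
            traj.append((s, a, r))
            if s2 == TERMINAL:
                break
            s, a = s2, pi[s2]              # follow π after the exploring start
        first = {}                         # index of first visit to each (s,a)
        for t, (st, at, _) in enumerate(traj):
            first.setdefault((st, at), t)
        G = 0.0
        for t in range(len(traj) - 1, -1, -1):
            st, at, r = traj[t]
            G = gamma * G + r
            if first[(st, at)] == t:       # first-visit check
                returns[(st, at)].append(G)
                Q[(st, at)] = sum(returns[(st, at)]) / len(returns[(st, at)])
                pi[st] = max(ACTIONS, key=lambda a2: Q[(st, a2)])  # greedy step
    return pi, Q
```

With γ < 1 the shortest path to the terminal state has the highest return, so the learned policy is "right" in every state.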

Convergence: if all (s,a) pairs are visited infinitely often and returns are averaged, then Q_k → Q* and π_k → π*.

Limitation: exploring starts is unrealistic — in many real applications, we can’t choose an arbitrary starting (s,a) pair. (Can’t say “start NCCL job with this specific algorithm and chunk size”.)


On-Policy MC Control (ε-Soft Policies)

Idea: use a soft policy, one where every action has probability at least ε/|A| of being selected, even non-greedy actions. This guarantees all (s,a) pairs are visited without exploring starts.

ε-soft policy:

π(a|s) ≥ ε / |A|   for all s, a

ε-greedy policy (simplest ε-soft policy):

π(a|s) = (1-ε) + ε/|A|    if a = argmax_{a'} Q(s,a')   ← greedy action
         ε / |A|           otherwise                     ← random action
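The two cases above can be checked with a minimal helper that turns one state's Q-values into ε-greedy action probabilities (the function name is mine, not from the book; ties break toward the lowest index):

```python
def epsilon_greedy_probs(q_values, epsilon):
    """Return ε-greedy action probabilities for one state's Q-values."""
    n = len(q_values)
    greedy = max(range(n), key=q_values.__getitem__)  # argmax; ties -> lowest index
    probs = [epsilon / n] * n          # every action gets the ε/|A| floor
    probs[greedy] += 1.0 - epsilon     # greedy action gets the remaining mass
    return probs
```

For example, with ε = 0.3 over three actions, the greedy action gets (1 − ε) + ε/3 = 0.8 and each other action gets 0.1.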

Algorithm: On-Policy First-Visit MC Control

Initialize:
  Q(s,a) = 0 for all s,a
  π(a|s) = 1/|A| (uniform random)    ← start fully exploratory
  Returns(s,a) = empty list

Loop (episodes):
  Generate episode using π: S_0,A_0,R_1,...,S_T

  G ← 0
  For t = T-1, T-2, ..., 0:
    G ← γG + R_{t+1}
    If (S_t, A_t) first visit in episode:
      Returns(S_t,A_t).append(G)
      Q(S_t,A_t) ← average(Returns(S_t,A_t))

      A* ← argmax_a Q(S_t, a)                     ← greedy action
      For all a ∈ A(S_t):
        π(a|S_t) ← ε/|A(S_t)|                     ← ε-greedy update
        if a == A*: π(A*|S_t) += (1 - ε)
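A runnable sketch of this algorithm on a made-up corridor MDP (states 0–2 non-terminal, 3 terminal, action 1 moves right, +1 only on reaching the terminal state; an assumption for illustration, not from the book). Instead of storing the π table explicitly, actions are sampled ε-greedily from the current Q, which is behaviorally equivalent to the π update in the pseudocode; an incremental mean replaces the Returns lists.

```python
import random
from collections import defaultdict

def step(s, a):
    # Made-up corridor MDP: 0..2 non-terminal, 3 terminal; action 1 = right,
    # action 0 = left (stuck at 0); reward +1 only on entering state 3.
    s2 = min(s + 1, 3) if a == 1 else max(s - 1, 0)
    return s2, (1.0 if s2 == 3 else 0.0)

def on_policy_mc(episodes=5000, eps=0.1, gamma=0.9, max_len=30, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)
    n_visits = defaultdict(int)            # incremental mean replaces Returns lists

    def act(s):
        # ε-greedy draw from the current Q (equivalent to the tabular π update).
        if rng.random() < eps:
            return rng.choice((0, 1))
        return max((0, 1), key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s, traj = rng.randrange(3), []     # no exploring starts: actions explore
        while len(traj) < max_len:
            a = act(s)
            s2, r = step(s, a)
            traj.append((s, a, r))
            if s2 == 3:
                break
            s = s2
        first = {}
        for t, (st, at, _) in enumerate(traj):
            first.setdefault((st, at), t)
        G = 0.0
        for t in range(len(traj) - 1, -1, -1):
            st, at, r = traj[t]
            G = gamma * G + r
            if first[(st, at)] == t:       # first-visit update
                n_visits[(st, at)] += 1
                Q[(st, at)] += (G - Q[(st, at)]) / n_visits[(st, at)]
    return Q
```

After enough episodes, "right" dominates "left" in every state even though Q estimates the value of the ε-soft policy, not of π*.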

What does this converge to? NOT the globally optimal policy π*, but the optimal ε-soft policy — best policy within the constraint that all actions have probability ≥ ε/|A|.

As ε → 0, this approaches π*.
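The convergence claim rests on the ε-soft policy improvement argument (S&B Section 5.4): for the ε-greedy policy π′ with respect to q_π,

```latex
q_\pi(s,\pi'(s))
  = \sum_a \pi'(a\mid s)\, q_\pi(s,a)
  = \frac{\varepsilon}{|\mathcal{A}|}\sum_a q_\pi(s,a)
    + (1-\varepsilon)\max_a q_\pi(s,a) \\
\geq \frac{\varepsilon}{|\mathcal{A}|}\sum_a q_\pi(s,a)
    + (1-\varepsilon)\sum_a \frac{\pi(a\mid s)-\varepsilon/|\mathcal{A}|}{1-\varepsilon}\, q_\pi(s,a)
  = \sum_a \pi(a\mid s)\, q_\pi(s,a)
  = v_\pi(s)
```

The inequality holds because a max is at least any convex combination: the weights (π(a|s) − ε/|A|)/(1 − ε) are nonnegative (π is ε-soft) and sum to 1. The policy improvement theorem then gives v_{π′} ≥ v_π.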


Worked Example: Blackjack Optimal Policy

Setup (S&B Example 5.3, which solves blackjack with Monte Carlo ES):
  - Same blackjack MDP as in the prediction example (5 million episodes)
  - ε-greedy on-policy control finds a similar near-optimal strategy

Converged policy:

Without usable ace:
  Stick if sum ≥ 17 (all dealer upcards)
  Also stick at 13-16 when the dealer shows a weak upcard (2-6),
  and at 12 against some weak upcards

With usable ace:
  Stick if sum ≥ 19 (all dealer upcards)
  Stick at 18 except against the dealer's strongest upcards

Comparison to the known optimal policy: the ε-greedy result is near-optimal but not exactly optimal, since the ε-soft constraint forbids ever being fully greedy.


The Key Difficulty: On-Policy vs Off-Policy

On-policy: the policy being evaluated and improved is the same policy that generates the data.
  - Simple, stable
  - Limited: can only find the optimal ε-soft policy (not the true optimum)

Off-policy: evaluate/improve a target policy π while generating data with a different behavior policy b.
  - More powerful: can find the true optimal π* while using an exploratory behavior policy b
  - More complex: needs an importance sampling correction
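The importance sampling correction can be sketched in a few lines; the function and its callable arguments are hypothetical names for illustration:

```python
def importance_ratio(episode, target_prob, behavior_prob):
    """Ordinary importance-sampling ratio rho = prod_t pi(A_t|S_t) / b(A_t|S_t).

    episode: list of (state, action) pairs generated by the behavior policy b.
    target_prob(s, a), behavior_prob(s, a): probabilities under pi and b
    (hypothetical callables; b must have nonzero probability wherever pi does).
    """
    rho = 1.0
    for s, a in episode:
        rho *= target_prob(s, a) / behavior_prob(s, a)
    return rho  # weight the observed return G by rho to estimate E_pi[G]
```

For a deterministic target policy that always picks action 1 and a uniform two-action behavior policy, a three-step episode of all 1s gets weight (1/0.5)³ = 8: returns from b-trajectories that π would rarely produce are down-weighted, and vice versa.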

This distinction is fundamental to RL and recurs throughout the book.


MC Control Convergence Analysis

Theorem (convergence of MC ES): for finite MDPs with exploring starts and first-visit MC, if all (s,a) pairs are visited infinitely often, then Q converges to Q* and π converges to π*. (S&B note that a fully general formal proof for MC ES remains an open question.)

Proof sketch:
  - Q(s,a) → Q^{π_k}(s,a) as episodes → ∞ (law of large numbers)
  - π_{k+1} = greedy(Q^{π_k}) ⇒ V^{π_{k+1}} ≥ V^{π_k} (policy improvement theorem)
  - A monotonically improving sequence in a finite policy space must converge

Caution: in practice, convergence is asymptotic — we can’t run infinitely many episodes. In approximate settings (neural nets), additional stability issues arise.


Sample Efficiency Comparison

On a single-state MDP (the 10-armed bandit of Ch. 2), MC reduces to the sample-mean estimate.

In full MDPs: MC has higher variance than TD methods because returns depend on the full episode trajectory. This means:

Method   Samples to ε accuracy     Why
------   ---------------------     ---
DP       0 (has the model)         uses the model directly
TD       ~1/(ε²(1-γ)²)             low-variance bootstrapped targets
MC       ~T/ε²                     high variance from the full return

(Rough orders of magnitude, not tight bounds.)

TD typically needs fewer samples than MC for the same accuracy — but MC is unbiased (no bootstrap error).
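The variance claim can be checked with a made-up toy model (an assumption, not from the book): an episode of T i.i.d. ±1 rewards, so the undiscounted return has variance exactly T.

```python
import random

def return_variance(T, n=20000, gamma=1.0, seed=0):
    """Empirical variance of the MC return over n episodes of T ±1 reward steps."""
    rng = random.Random(seed)
    gs = []
    for _ in range(n):
        # One "episode": the return is the (discounted) sum of T coin-flip rewards.
        g = sum((gamma ** t) * rng.choice((-1.0, 1.0)) for t in range(T))
        gs.append(g)
    mean = sum(gs) / n
    return sum((g - mean) ** 2 for g in gs) / n
```

With γ = 1 the empirical variance grows roughly linearly in T, which is the intuition behind MC needing on the order of T/ε² samples where a one-step estimate needs ~1/ε².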


Connection to DynamICCL

On-policy MC control is the conceptual ancestor of PPO:
  - Both evaluate the current policy with actual rollouts (episodes/trajectories), then improve it
  - MC control: average returns per (s,a) visit → greedy improvement
  - PPO: GAE advantage estimates from the rollout → gradient step on a clipped surrogate

PPO is essentially MC control with: (1) function approximation, (2) gradient-based improvement instead of tabular greedy, (3) GAE (a TD(λ)-style return estimate) instead of pure MC returns, (4) clipping to prevent large policy changes.

In DynamICCL: each PPO rollout (collecting NCCL performance data across a training job segment) corresponds to one “episode” in the MC sense.