Off-Policy Monte Carlo and Importance Sampling
Chapter 5 — Monte Carlo Methods
Book: Reinforcement Learning: An Introduction (Sutton & Barto, 2nd ed) Pages: 104–122
The Off-Policy Problem
Motivation: we want to learn the optimal policy π* (target policy), but we need to explore to gather data, and exploration requires taking suboptimal actions. On-policy methods are stuck with a compromise: they can only learn the values of the best policy that still explores (e.g., ε-greedy), not the truly optimal policy.
Off-policy solution: separate the policy that generates data (behavior policy b) from the policy we want to learn about (target policy π).
Behavior policy b: generates episodes (must be exploratory)
Target policy π: the policy we want to evaluate/improve
Requirement: coverage — b(a|s) > 0 wherever π(a|s) > 0
(b explores all actions that π might take)
Examples:
- b = ε-greedy (explores), π = greedy (target)
- b = human play, π = learned optimal policy
- b = random, π = learned policy (Q-learning is this!)
Importance Sampling
Problem: episodes are generated under b, but we want to estimate E_π[G_t].
The distribution of trajectory τ = (S_t, A_t, R_{t+1}, …, S_T) under policy b:
Prob_b(τ) = Π_{k=t}^{T-1} b(A_k|S_k) · p(S_{k+1}|S_k,A_k)
Under π:
Prob_π(τ) = Π_{k=t}^{T-1} π(A_k|S_k) · p(S_{k+1}|S_k,A_k)
Importance sampling ratio (IS ratio):
ρ_{t:T-1} = Prob_π(τ) / Prob_b(τ)
= Π_{k=t}^{T-1} [π(A_k|S_k) / b(A_k|S_k)]
Note: the transition probabilities p(S_{k+1}|S_k,A_k) cancel! IS ratio depends only on the policies.
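A minimal sketch of the ratio computation for one trajectory (the function name and the (state, action) list format are illustrative assumptions, not from the book):

```python
def importance_ratio(trajectory, target_prob, behavior_prob):
    """Compute rho_{t:T-1} for one trajectory generated under b.

    trajectory: list of (state, action) pairs for steps t, ..., T-1.
    target_prob(s, a)   -> pi(a|s)
    behavior_prob(s, a) -> b(a|s); must be > 0 wherever pi(a|s) > 0 (coverage).
    """
    rho = 1.0
    for s, a in trajectory:
        rho *= target_prob(s, a) / behavior_prob(s, a)
    return rho
```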
Ordinary Importance Sampling (OIS):
V^π(s) ≈ V̂(s) = (Σ_{t ∈ T(s)} ρ_{t:T(t)-1} G_t) / |T(s)|
where T(s) = set of time steps when S_t = s, T(t) = terminal time of episode containing t.
Weighted Importance Sampling (WIS):
V̂(s) = (Σ_{t ∈ T(s)} ρ_{t:T(t)-1} G_t) / Σ_{t ∈ T(s)} ρ_{t:T(t)-1}
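A concrete sketch of both estimators from a batch of per-visit (ρ, G) pairs (function names and the list inputs are my own, illustrative):

```python
import numpy as np

def ordinary_is(rhos, returns):
    """OIS: average of rho*G over all visits; unbiased, possibly huge variance."""
    rhos, returns = np.asarray(rhos, float), np.asarray(returns, float)
    return np.sum(rhos * returns) / len(rhos)

def weighted_is(rhos, returns):
    """WIS: normalize by the total weight; biased but consistent, lower variance."""
    rhos, returns = np.asarray(rhos, float), np.asarray(returns, float)
    total = np.sum(rhos)
    return np.sum(rhos * returns) / total if total > 0 else 0.0
```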
Ordinary vs Weighted IS: Bias-Variance Tradeoff
| | Ordinary IS (OIS) | Weighted IS (WIS) |
|---|---|---|
| Bias | Unbiased | Biased (consistent — converges as n→∞) |
| Variance | Can be infinite (if ρ >> 1) | Bounded variance |
| Practical | Often unstable | More stable, preferred |
| Convergence | May fail for heavy-tailed ρ | Converges but biased |
OIS variance explosion: if π(a|s)/b(a|s) can be very large (e.g., π deterministic, b uniform), single trajectories dominate the estimate.
Example: k=10 arms, b = uniform (prob 1/10 each), π = greedy on arm 1 (prob 1 on arm 1). For a trajectory that takes arm 1 at every step: ρ = (1 / (1/10))^T = 10^T. For T=100: ρ = 10^100 — enormous variance!
WIS fix: normalize by the total weight, so single trajectories are bounded in their influence.
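A quick toy simulation of this effect (entirely hypothetical setup: a one-step problem with 10 actions, b uniform, π deterministic on action 0, reward mean 1 for action 0 and 0 otherwise). Across repeated runs one would typically see the OIS estimates fluctuate far more than the WIS estimates:

```python
import numpy as np

rng = np.random.default_rng(0)
n_runs, n_episodes, n_actions = 20, 100, 10

ois_estimates, wis_estimates = [], []
for _ in range(n_runs):
    actions = rng.integers(n_actions, size=n_episodes)              # b: uniform over actions
    rewards = (actions == 0).astype(float) + rng.normal(size=n_episodes)  # E[R | a=0]=1, else 0
    rhos = np.where(actions == 0, float(n_actions), 0.0)            # pi deterministic on action 0
    ois_estimates.append(np.mean(rhos * rewards))
    wis_estimates.append(np.sum(rhos * rewards) / max(np.sum(rhos), 1e-12))

print("true value under pi: 1.0")
print("OIS mean / std over runs:", np.mean(ois_estimates), np.std(ois_estimates))
print("WIS mean / std over runs:", np.mean(wis_estimates), np.std(wis_estimates))
```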
IS with Incremental Updates
Online (incremental) computation of WIS, as each episode completes:
Initialize: C(s,a) = 0, Q(s,a) = 0 for all s, a
For each episode:
    Generate episode using b: S_0, A_0, R_1, ..., S_T
    G ← 0
    W ← 1.0                                   # importance weight (initialized to 1)
    For t = T-1, T-2, ..., 0:
        G ← γG + R_{t+1}
        C(S_t, A_t) ← C(S_t, A_t) + W
        Q(S_t, A_t) ← Q(S_t, A_t) + (W / C(S_t, A_t)) · (G - Q(S_t, A_t))
        W ← W · π(A_t|S_t) / b(A_t|S_t)       # update IS ratio
        If W = 0: break                       # π would never take this action; stop updating
Note: we can stop the backward loop as soon as W = 0 (when π is deterministic this happens the first time A_t ≠ π(S_t)), since every earlier time step, processed later in the backward pass, would also receive zero weight and contribute nothing.
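A minimal Python sketch of this incremental update, assuming episodes are already collected under b as lists of (S_t, A_t, R_{t+1}) tuples and the two policies are probability lookup functions (all names here are illustrative):

```python
from collections import defaultdict

def off_policy_mc_prediction(episodes, target_prob, behavior_prob, gamma=1.0):
    """Every-visit weighted-IS estimation of Q^pi from episodes generated by b.

    episodes: iterable of episodes, each a list of (S_t, A_t, R_{t+1}) tuples.
    target_prob(s, a)   -> pi(a|s)
    behavior_prob(s, a) -> b(a|s), assumed > 0 wherever pi(a|s) > 0.
    """
    Q = defaultdict(float)   # action-value estimates
    C = defaultdict(float)   # cumulative importance weights
    for episode in episodes:
        G, W = 0.0, 1.0
        for s, a, r in reversed(episode):          # backward pass: t = T-1, ..., 0
            G = gamma * G + r
            C[(s, a)] += W
            Q[(s, a)] += (W / C[(s, a)]) * (G - Q[(s, a)])
            W *= target_prob(s, a) / behavior_prob(s, a)
            if W == 0.0:                           # pi never takes A_t: earlier steps get zero weight
                break
    return Q
```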
Off-Policy Control: Algorithm
Combining off-policy prediction with policy improvement:
Algorithm: Off-Policy MC Control (Weighted IS)
Initialize:
    Q(s,a) = 0, C(s,a) = 0 for all s, a
    π(s) = argmax_a Q(s,a)                    # arbitrary at start; deterministic target
Loop (episodes):
    b ← soft policy (e.g., ε-greedy w.r.t. Q, or uniform)
    Generate episode using b: S_0, A_0, R_1, ..., S_T
    G ← 0
    W ← 1.0
    For t = T-1, T-2, ..., 0:
        G ← γG + R_{t+1}
        C(S_t, A_t) ← C(S_t, A_t) + W
        Q(S_t, A_t) ← Q(S_t, A_t) + (W / C(S_t, A_t)) · (G - Q(S_t, A_t))
        π(S_t) ← argmax_a Q(S_t, a)           # greedy improvement
        If A_t ≠ π(S_t): break                # W would be 0 for all earlier steps
        W ← W / b(A_t|S_t)                    # π(A_t|S_t) = 1 since π is deterministic
Why break when A_t ≠ π(S_t): if the target policy π is deterministic and we took action A_t ≠ π(S_t), then π(A_t|S_t) = 0, so ρ = 0 and all earlier states get zero weight. The backward update terminates.
Convergence: converges to Q* and π* given:
1. Coverage: b(a|s) > 0 wherever π(a|s) > 0
2. All (s,a) pairs visited infinitely often under b
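A compact Python sketch of the control algorithm above, assuming a tabular problem with hashable states and a gym-like environment interface (the `env.reset`/`env.step` signature used here is my own simplification):

```python
import random
from collections import defaultdict

def off_policy_mc_control(env, actions, n_episodes, gamma=1.0, epsilon=0.1):
    """Off-policy every-visit MC control with weighted IS.

    env is assumed to expose: env.reset() -> s and env.step(a) -> (s_next, r, done).
    """
    Q = defaultdict(float)
    C = defaultdict(float)

    def greedy(s):                             # target policy pi: greedy w.r.t. current Q
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(n_episodes):
        # Behavior policy b: epsilon-greedy w.r.t. the current Q (soft, so coverage holds).
        episode, s, done = [], env.reset(), False
        while not done:
            a = random.choice(actions) if random.random() < epsilon else greedy(s)
            # Record b(a|s) now, while the behavior policy that chose `a` is current.
            b_prob = epsilon / len(actions) + (1 - epsilon) * (a == greedy(s))
            s_next, r, done = env.step(a)
            episode.append((s, a, r, b_prob))
            s = s_next

        G, W = 0.0, 1.0
        for s, a, r, b_prob in reversed(episode):   # backward pass: t = T-1, ..., 0
            G = gamma * G + r
            C[(s, a)] += W
            Q[(s, a)] += (W / C[(s, a)]) * (G - Q[(s, a)])
            if a != greedy(s):                 # pi(a|s) = 0: earlier steps would get zero weight
                break
            W /= b_prob                        # pi(a|s) = 1 for the greedy action
    return Q, greedy
```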
Practical Issues with Off-Policy MC
1. High Variance / Sample Inefficiency
IS ratios can be extremely large or small. For long episodes:
ρ_{t:T-1} = Π_{k=t}^{T-1} [π(A_k|S_k)/b(A_k|S_k)]
With T=100 steps and ratio per step of 0.9: ρ ≈ 0.9^100 ≈ 2.66×10^{-5}. With ratio of 1.1 per step: ρ ≈ 1.1^100 ≈ 13,780.
Variance grows exponentially with episode length — off-policy MC is extremely sample inefficient for long episodes.
2. Per-Decision IS (Partial Importance Sampling)
Better approach: weight each reward only by the IS ratios of the decisions that could have influenced it (steps t through k), instead of the full ratio ρ_{t:T-1}:
E_b[ρ_{t:T-1} G_t] = E_b[Σ_{k=t}^{T-1} ρ_{t:k} γ^{k-t} R_{k+1}],  where ρ_{t:k} = Π_{j=t}^{k} π(A_j|S_j)/b(A_j|S_j)
This “per-decision” IS has lower variance while remaining unbiased: the ratios beyond step k have expectation 1 under b, so dropping them leaves the expectation unchanged.
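A sketch of the per-decision correction for a single episode tail starting at time t (illustrative function name and input format; it returns the quantity that replaces ρ_{t:T-1}·G_t):

```python
def per_decision_is_return(steps, target_prob, behavior_prob, gamma=1.0):
    """Per-decision IS: weight each reward R_{k+1} only by the ratios up to step k.

    steps: list of (S_k, A_k, R_{k+1}) for k = t, ..., T-1, generated under b.
    Returns sum_k rho_{t:k} * gamma^(k-t) * R_{k+1}.
    """
    total, rho, discount = 0.0, 1.0, 1.0
    for s, a, r in steps:
        rho *= target_prob(s, a) / behavior_prob(s, a)   # rho_{t:k}
        total += rho * discount * r
        discount *= gamma
    return total
```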
3. Truncated IS
Cap IS ratio at some maximum: ρ̄ = min(ρ, c) for constant c.
- Reduces variance at cost of introducing bias
- Used in V-trace (IMPALA) for distributed RL — a production algorithm
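A minimal sketch of the capping idea; the constant c is a tuning parameter, and V-trace itself applies separate per-step caps (ρ̄ and c̄) rather than the single cap shown here:

```python
def truncated_is_weight(rho, c):
    """Cap the importance ratio at c: lower variance, bias introduced whenever rho > c."""
    return min(rho, c)

# Capped ratios can be dropped into the OIS/WIS estimators above in place of raw ratios.
print([truncated_is_weight(r, c=10.0) for r in [0.5, 3.0, 120.0]])   # -> [0.5, 3.0, 10.0]
```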
Conceptual Importance: Off-Policy = Fundamental RL Tool
Off-policy learning enables:
1. Experience replay: store past experiences (s,a,r,s’) from old policies; reuse for the current policy → more sample efficient
2. Learning from demonstrations: behavior policy = human expert; target policy = learned optimal
3. Q-learning (Ch.6): b = ε-greedy (behavior), π = greedy (target) — this is exactly off-policy learning!
4. Parallel experience collection: multiple actors with various policies contribute to a single learner
The idea of correcting for distributional mismatch (b vs π) via importance weights reappears throughout modern RL:
- PPO's clipped IS ratio (r = π/π_old clipped to [1-ε, 1+ε])
- V-trace in IMPALA
- ACER (Actor-Critic with Experience Replay)
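For reference, a minimal numpy sketch of PPO's clipped surrogate objective (the per-sample probabilities and advantages are placeholders, not a full PPO implementation):

```python
import numpy as np

def ppo_clip_objective(pi_new, pi_old, advantages, eps=0.2):
    """Clipped surrogate: mean of min(r*A, clip(r, 1-eps, 1+eps)*A) over samples."""
    r = np.asarray(pi_new) / np.asarray(pi_old)          # per-sample IS ratio pi / pi_old
    a = np.asarray(advantages)
    return np.mean(np.minimum(r * a, np.clip(r, 1 - eps, 1 + eps) * a))
```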
Connection to DynamICCL
Off-policy MC is the conceptual foundation for experience replay and off-policy learning in DynamICCL:
- If DynamICCL uses SAC (Soft Actor-Critic), it is an off-policy algorithm — it stores all experience in a replay buffer and samples from it (its Q-learning-style bootstrapped updates avoid the need for explicit IS corrections)
- PPO (which DynamICCL uses) is approximately on-policy but allows small ratio deviations (clipped IS) — a practical compromise between off-policy efficiency and on-policy stability
- The variance problem of off-policy MC (exponential IS ratio growth) is why PPO clips the ratio to [1-ε, 1+ε] — preventing the instability while allowing limited off-policy learning from the previous few policy versions