Off-Policy Monte Carlo and Importance Sampling

Chapter 5 — Monte Carlo Methods

Book: Reinforcement Learning: An Introduction (Sutton & Barto, 2nd ed) Pages: 104–122


The Off-Policy Problem

Motivation: we want to learn the optimal policy π* (target policy), but we need to explore to gather data, and exploration requires taking suboptimal actions. On-policy methods are stuck with a compromise: they can only find the best policy that still explores (e.g., the best ε-greedy policy), not the truly optimal one.

Off-policy solution: separate the policy that generates data (behavior policy b) from the policy we want to learn about (target policy π).

Behavior policy b: generates episodes (must be exploratory)
Target policy π:   the policy we want to evaluate/improve

Requirement: coverage — b(a|s) > 0 wherever π(a|s) > 0
            (b explores all actions that π might take)

Examples:
- b = ε-greedy (explores), π = greedy (target)
- b = human play, π = learned optimal policy
- b = random, π = learned policy (Q-learning is this!)


Importance Sampling

Problem: episodes are generated under b, but we want to estimate E_π[G_t].

The distribution of trajectory τ = (S_t, A_t, R_{t+1}, …, S_T) under policy b:

Prob_b(τ) = Π_{k=t}^{T-1} b(A_k|S_k) · p(S_{k+1}|S_k,A_k)

Under π:

Prob_π(τ) = Π_{k=t}^{T-1} π(A_k|S_k) · p(S_{k+1}|S_k,A_k)

Importance sampling ratio (IS ratio):

ρ_{t:T-1} = Prob_π(τ) / Prob_b(τ)
           = Π_{k=t}^{T-1} [π(A_k|S_k) / b(A_k|S_k)]

Note: the transition probabilities p(S_{k+1}|S_k,A_k) cancel! IS ratio depends only on the policies.
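
As a quick illustration, here is a minimal sketch of the ratio computation in Python (pi_prob and b_prob are assumed placeholder callables returning π(a|s) and b(a|s), not from the book):

def importance_ratio(states, actions, pi_prob, b_prob, t=0):
    """Product of pi(A_k|S_k) / b(A_k|S_k) for k = t, ..., T-1.
    The environment's transition probabilities never appear: they cancel."""
    rho = 1.0
    for s, a in zip(states[t:], actions[t:]):
        rho *= pi_prob(s, a) / b_prob(s, a)
    return rho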

Ordinary Importance Sampling (OIS):

V^π(s) ≈ V̂(s) = (Σ_{t ∈ T(s)} ρ_{t:T(t)-1} G_t) / |T(s)|

where T(s) = set of time steps when S_t = s, T(t) = terminal time of episode containing t.

Weighted Importance Sampling (WIS):

V̂(s) = (Σ_{t ∈ T(s)} ρ_{t:T(t)-1} G_t) / Σ_{t ∈ T(s)} ρ_{t:T(t)-1}
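
A minimal sketch of both estimators, assuming the data for state s has been collected as hypothetical (rho, G) pairs, one per visit under b:

def ordinary_is(samples):
    # Ordinary IS: average rho * G over all visits; unbiased but can have huge variance.
    return sum(rho * g for rho, g in samples) / len(samples)

def weighted_is(samples):
    # Weighted IS: normalize by the total weight; biased for finite n but consistent.
    total_w = sum(rho for rho, _ in samples)
    return sum(rho * g for rho, g in samples) / total_w if total_w > 0 else 0.0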

Ordinary vs Weighted IS: Bias-Variance Tradeoff

              Ordinary IS (OIS)                Weighted IS (WIS)
Bias          Unbiased                         Biased (consistent; bias → 0 as n → ∞)
Variance      Can be infinite (if ρ >> 1)      Bounded variance
Practical     Often unstable                   More stable, preferred
Convergence   May fail for heavy-tailed ρ      Converges to v_π in the limit

OIS variance explosion: if π(a|s)/b(a|s) can be very large (e.g., π deterministic, b uniform), single trajectories dominate the estimate.

Example: k = 10 arms, b = uniform (prob 1/10 each), π = greedy on arm 1 (prob 1 on arm 1). For a trajectory that pulls arm 1 at every one of its T steps, each factor is 1/(1/10) = 10, so ρ = 10^T. For T = 100: ρ = 10^100 (enormous variance!).

WIS fix: normalize by the total weight, so single trajectories are bounded in their influence.


IS with Incremental Updates

Online (incremental) computation of WIS, as each episode completes:

Initialize: C(s,a) = 0, Q(s,a) = 0 for all s,a

For each episode:
  Generate episode using b: S_0, A_0, R_1, ..., S_T
  G ← 0
  W ← 1.0    ← importance weight (initialized to 1)

  For t = T-1, T-2, ..., 0:
    G ← γG + R_{t+1}
    C(S_t, A_t) ← C(S_t, A_t) + W
    Q(S_t, A_t) ← Q(S_t, A_t) + (W / C(S_t, A_t)) · (G - Q(S_t, A_t))
    W ← W · π(A_t|S_t) / b(A_t|S_t)      ← update IS ratio
    If W = 0: break    ← π would never take this action; stop updating

Note: we can break early (W = 0) when π is deterministic and A_t ≠ π(S_t): since the loop sweeps backwards, every earlier time step would receive zero weight, so there is nothing left to update in this episode.
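
A minimal runnable sketch of this prediction loop, assuming episodes are given as lists of (S_t, A_t, R_{t+1}) tuples and that pi_prob(s, a) and b_prob(s, a) are placeholder callables for the two policies:

from collections import defaultdict

def off_policy_mc_prediction(episodes, pi_prob, b_prob, gamma=1.0):
    Q = defaultdict(float)   # Q(s, a) estimates
    C = defaultdict(float)   # cumulative importance weights
    for episode in episodes:
        G, W = 0.0, 1.0
        for s, a, r in reversed(episode):          # t = T-1, T-2, ..., 0
            G = gamma * G + r
            C[(s, a)] += W
            Q[(s, a)] += (W / C[(s, a)]) * (G - Q[(s, a)])
            W *= pi_prob(s, a) / b_prob(s, a)
            if W == 0.0:                           # pi never takes a; earlier steps get zero weight
                break
    return Q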


Off-Policy Control: Algorithm

Combining off-policy prediction with policy improvement:

Algorithm: Off-Policy MC Control (Weighted IS)

Initialize:
  Q(s,a) = 0, C(s,a) = 0 for all s,a
  π(s) = argmax_a Q(s,a)  (arbitrary at start; deterministic target)

Loop (episodes):
  b = soft policy (e.g., ε-greedy w.r.t. Q, or uniform)
  Generate episode: S_0, A_0, R_1, ..., S_T using b
  G ← 0
  W ← 1.0

  For t = T-1, T-2, ..., 0:
    G ← γG + R_{t+1}
    C(S_t, A_t) ← C(S_t, A_t) + W
    Q(S_t, A_t) ← Q(S_t, A_t) + (W/C(S_t, A_t))(G - Q(S_t, A_t))
    π(S_t) ← argmax_a Q(S_t, a)               ← greedy improvement
    If A_t ≠ π(S_t): break    ← W becomes 0 past this point
    W ← W / b(A_t|S_t)        ← π(A_t|S_t) = 1 (deterministic), so W ← W/b

Why break when A_t ≠ π(S_t): if the target policy π is deterministic and we took action A_t ≠ π(S_t), then π(A_t|S_t) = 0, so ρ = 0 and all earlier states get zero weight. The backward update terminates.

Convergence: converges to Q* and π* given:
1. Coverage: b(a|s) > 0 wherever π(a|s) > 0
2. All (s, a) pairs are visited infinitely often under b
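
A minimal runnable sketch of the control loop, under the same assumptions as the prediction sketch above, plus a placeholder generate_episode() that follows the soft behavior policy b and a finite action list:

from collections import defaultdict

def off_policy_mc_control(generate_episode, b_prob, actions, n_episodes, gamma=1.0):
    Q = defaultdict(float)
    C = defaultdict(float)
    pi = {}                                        # deterministic greedy target policy
    for _ in range(n_episodes):
        episode = generate_episode()               # list of (S_t, A_t, R_{t+1}) from b
        G, W = 0.0, 1.0
        for s, a, r in reversed(episode):
            G = gamma * G + r
            C[(s, a)] += W
            Q[(s, a)] += (W / C[(s, a)]) * (G - Q[(s, a)])
            pi[s] = max(actions, key=lambda x: Q[(s, x)])   # greedy improvement
            if a != pi[s]:                         # pi(a|s) = 0, so earlier weights vanish
                break
            W /= b_prob(s, a)                      # pi(a|s) = 1 for the greedy action
    return Q, pi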


Practical Issues with Off-Policy MC

1. High Variance / Sample Inefficiency

IS ratios can be extremely large or small. For long episodes:

ρ_{t:T-1} = Π_{k=t}^{T-1} [π(A_k|S_k)/b(A_k|S_k)]

With T=100 steps and ratio per step of 0.9: ρ ≈ 0.9^100 ≈ 2.66×10^{-5}. With ratio of 1.1 per step: ρ ≈ 1.1^100 ≈ 13,780.

Variance grows exponentially with episode length — off-policy MC is extremely sample inefficient for long episodes.
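
A quick numeric check of this growth/decay (plain Python):

T = 100
print(0.9 ** T)   # ~2.66e-05: returns get crushed toward zero
print(1.1 ** T)   # ~1.38e+04: a single trajectory can dominate the estimate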

2. Per-Decision IS (Partial Importance Sampling)

Better approach: instead of weighting the entire return G_t by ρ_{t:T-1}, weight each reward R_{k+1} only by the ratios of the steps up to and including k (a code sketch appears after the truncated-IS note below):

G̃_t = Σ_{k=t}^{T-1} γ^{k-t} ρ_{t:k} R_{k+1},   where ρ_{t:k} = Π_{j=t}^{k} π(A_j|S_j)/b(A_j|S_j)

Since E_b[ρ_{t:k} R_{k+1}] = E_b[ρ_{t:T-1} R_{k+1}] (ratios after step k average to 1), we have E_b[G̃_t] = E_b[ρ_{t:T-1} G_t]. This "per-decision" IS remains unbiased (when combined with ordinary IS) while having lower variance, because each reward is no longer multiplied by ratios from later, irrelevant steps.

3. Truncated IS

Cap IS ratio at some maximum: ρ̄ = min(ρ, c) for constant c.
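
A minimal sketch combining both ideas: a per-decision IS return with an optional cap on the cumulative ratio (truncated IS). The episode format and pi_prob / b_prob are the same assumed placeholders as in the prediction sketch:

def per_decision_is_return(episode, pi_prob, b_prob, gamma=1.0, cap=None):
    G, rho, discount = 0.0, 1.0, 1.0
    for s, a, r in episode:                        # forward pass: k = t, t+1, ...
        rho *= pi_prob(s, a) / b_prob(s, a)
        if cap is not None:
            rho = min(rho, cap)                    # truncated IS: rho_bar = min(rho, c)
        G += rho * discount * r                    # R_{k+1} weighted only by ratios up to step k
        discount *= gamma
    return G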


Conceptual Importance: Off-Policy = Fundamental RL Tool

Off-policy learning enables:
1. Experience replay: store past experiences (s, a, r, s') from old policies and reuse them for the current policy → more sample efficient
2. Learning from demonstrations: behavior policy = human expert; target policy = learned optimal policy
3. Q-learning (Ch. 6): b = ε-greedy (behavior), π = greedy (target); this is exactly off-policy learning!
4. Parallel experience collection: multiple actors with various policies contribute to a single learner

The idea of correcting for distributional mismatch (b vs π) via importance weights reappears throughout modern RL:
- PPO's clipped IS ratio: clip(π/π_old, 1-ε, 1+ε)
- V-trace in IMPALA
- ACER (Actor-Critic with Experience Replay)


Connection to DynamICCL

Off-policy MC is the conceptual foundation for experience replay and off-policy learning in DynamICCL:
- If DynamICCL uses SAC (Soft Actor-Critic), it is an off-policy algorithm: all experience is stored in a replay buffer and sampled from it for updates
- PPO (which DynamICCL uses) is approximately on-policy but allows small ratio deviations (clipped IS), a practical compromise between off-policy efficiency and on-policy stability
- The variance problem of off-policy MC (exponential IS-ratio growth) is why PPO clips the ratio to [1-ε, 1+ε]: it prevents the instability while allowing limited off-policy learning from the previous few policy versions