Planning with Learned Models and MCTS
Chapter 8 — Planning and Learning with Tabular Methods
Book: Reinforcement Learning: An Introduction (Sutton & Barto, 2nd ed.), pages 159–190
Types of Models
Distribution model: full probability distribution over next state and reward:
M̂(s', r | s, a) = P̂(S_{t+1}=s', R_{t+1}=r | S_t=s, A_t=a)
- Required for exact DP
- Expensive: must store P(s’|s,a) for all (s,a,s’) triples
- Enables sampling or exact expectation
Sample model: given (s,a), returns a sample (s', r) from M̂:
(s', r) ~ M̂(·, · | s, a)
- Cheaper: only stores enough to sample
- Used in Dyna-Q
- Sufficient for planning via simulation (MCTS, Dyna)
In practice: Dyna typically uses a deterministic sample model in the tabular setting; in deep RL, neural networks learn probabilistic models ("world models").
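A minimal tabular sketch of the two model types (the class names, dict-based storage, and method signatures below are illustrative, not from the book):

```python
import random
from collections import defaultdict

class DistributionModel:
    """Stores full transition probabilities: (s, a) -> {(s', r): prob}."""
    def __init__(self):
        self.p = defaultdict(dict)          # p[(s, a)][(s2, r)] = probability

    def expectation(self, s, a, v, gamma=0.99):
        """Exact one-step backup: sum over (s', r) of prob * (r + gamma * v[s'])."""
        return sum(prob * (r + gamma * v[s2])
                   for (s2, r), prob in self.p[(s, a)].items())

    def sample(self, s, a):
        """A distribution model can also be sampled."""
        outcomes, probs = zip(*self.p[(s, a)].items())
        return random.choices(outcomes, weights=probs, k=1)[0]

class SampleModel:
    """Only needs to produce samples; here, replay of observed transitions."""
    def __init__(self):
        self.observed = defaultdict(list)   # observed[(s, a)] = [(s2, r), ...]

    def update(self, s, a, r, s2):
        self.observed[(s, a)].append((s2, r))

    def sample(self, s, a):
        return random.choice(self.observed[(s, a)])
```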
Building the Model in Dyna
Deterministic model (simplest):
Model(s, a) = (R, S') ← stores last observed (r, s') for each (s,a) pair
When (s,a) is visited again: update Model(s,a) = new (R, S').
Stochastic model: maintain counts and transitions:
Model(s,a) = list of (R, S') observations
Sample from this list when planning
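A sketch of how the deterministic model feeds Dyna-Q's planning loop (the dict-based Q-table and hyperparameter values are illustrative assumptions):

```python
import random

def dyna_q_planning(Q, model, num_actions, n_steps, alpha=0.1, gamma=0.95):
    """Run n_steps of planning on simulated experience from the learned model.

    model: dict (s, a) -> (r, s2), last observed outcome (deterministic model)
    Q:     dict (s, a) -> value, defaulting to 0.0 for unseen pairs
    """
    visited = list(model.keys())
    for _ in range(n_steps):
        s, a = random.choice(visited)          # a previously observed pair
        r, s2 = model[(s, a)]                  # query the sample model
        # one-step Q-learning backup on the simulated transition
        best_next = max(Q.get((s2, b), 0.0) for b in range(num_actions))
        q_sa = Q.get((s, a), 0.0)
        Q[(s, a)] = q_sa + alpha * (r + gamma * best_next - q_sa)

# After each real step (s, a, r, s2):
#   model[(s, a)] = (r, s2)                    # model update
#   dyna_q_planning(Q, model, num_actions, n_steps=10)
```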
Neural network model (for large state spaces):
(R̂, Ŝ') = f_θ(s, a) ← deterministic prediction
or
(R̂, Ŝ', σ²) = f_θ(s, a) ← Gaussian prediction (for uncertainty)
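A minimal PyTorch sketch of the Gaussian-prediction variant (the architecture, layer sizes, and loss are illustrative assumptions):

```python
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    """Predicts (R̂, Ŝ') with a diagonal-Gaussian head for uncertainty."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mean = nn.Linear(hidden, state_dim + 1)     # [Ŝ', R̂]
        self.log_var = nn.Linear(hidden, state_dim + 1)  # per-dimension σ²

    def forward(self, s, a):
        h = self.body(torch.cat([s, a], dim=-1))
        return self.mean(h), self.log_var(h)

# Training: minimize Gaussian negative log-likelihood of the observed (s', r):
#   mean, log_var = model(s, a)
#   nll = 0.5 * (log_var + (target - mean) ** 2 / log_var.exp()).mean()
```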
Decision-Time Planning vs Background Planning
Background planning (Dyna): runs planning steps between real actions, updating Q(s,a) for states already visited.
- Updates the value function used for future decisions
- Runs in the "background" between environment interactions
Decision-time planning: at decision time, use planning to evaluate actions for the current state s.
- Plans only for the current state
- More expensive per decision, but can be more accurate
- Examples: MCTS, A* search, model predictive control
Trade-off:
- Background: better for frequent decisions in known states
- Decision-time: better for infrequent decisions in novel states
Monte Carlo Tree Search (MCTS)
MCTS combines decision-time planning with MC simulation:
Loop (simulations from current state s_0):
┌─────────────────────────────────────────────────────────┐
│ 1. SELECT: traverse tree from root using UCB │
│     a = argmax_a [Q(s,a) + c√(ln N(s)/N(s,a))]         │
│ │
│ 2. EXPAND: add new child node s_leaf │
│ │
│ 3. SIMULATE: fast rollout from s_leaf to terminal │
│ (random or heuristic policy) │
│ │
│ 4. BACKPROPAGATE: update Q(s,a) and N(s,a) up the tree │
│ Q(s,a) += [G - Q(s,a)] / N(s,a) │
└─────────────────────────────────────────────────────────┘
Select real action: A = argmax_a N(root, a) ← most visited action (visit counts are more robust than raw Q estimates)
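A compact UCT sketch tying the four phases together; `step(s, a) -> (s2, r, done)` is an assumed sample-model interface, and all hyperparameter values are illustrative:

```python
import math
import random
from collections import defaultdict

def mcts(root, step, actions, n_sims=1000, c=1.4, gamma=1.0, max_rollout=100):
    """Plain UCT. `step(s, a) -> (s2, r, done)` is an assumed sample model."""
    N = defaultdict(int)      # N[(s, a)]: action visit counts
    Ns = defaultdict(int)     # Ns[s]: state visit counts
    Q = defaultdict(float)    # Q[(s, a)]: mean return estimates
    expanded = set()

    def rollout(s):
        """3. SIMULATE: random policy until terminal (or depth cap)."""
        g, discount = 0.0, 1.0
        for _ in range(max_rollout):
            s, r, done = step(s, random.choice(actions))
            g += discount * r
            discount *= gamma
            if done:
                break
        return g

    for _ in range(n_sims):
        s, path, done = root, [], False
        # 1. SELECT: descend through already-expanded nodes via UCB
        while s in expanded and not done:
            a = max(actions, key=lambda a: (
                Q[(s, a)] + c * math.sqrt(math.log(Ns[s] + 1) / N[(s, a)])
                if N[(s, a)] else float("inf")))   # try unvisited actions first
            s2, r, done = step(s, a)
            path.append((s, a, r))
            s = s2
        # 2. EXPAND the leaf, then 3. SIMULATE from it
        expanded.add(s)
        g = 0.0 if done else rollout(s)
        # 4. BACKPROPAGATE the return up the visited path
        for (s_i, a_i, r_i) in reversed(path):
            g = r_i + gamma * g
            Ns[s_i] += 1
            N[(s_i, a_i)] += 1
            Q[(s_i, a_i)] += (g - Q[(s_i, a_i)]) / N[(s_i, a_i)]

    return max(actions, key=lambda a: N[(root, a)])   # most-visited action
```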
AlphaGo enhancement (Silver et al., 2016):
- Replace step 3 (the random rollout) with a neural network value estimate V(s_leaf; θ)
- Use a prior policy P(a|s; θ) from the neural network in the selection rule (PUCT):
  a = argmax_a [Q(s,a) + c P(a|s; θ) √(N(s)) / (1 + N(s,a))]
- Result: dramatically stronger than pure MCTS or the pure neural network alone
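The PUCT selection rule transcribes directly to code (the c_puct value and dict names are illustrative):

```python
import math

def puct_score(q, prior, n_parent, n_action, c_puct=1.5):
    """AlphaGo-style PUCT: exploit Q, explore in proportion to the NN prior."""
    return q + c_puct * prior * math.sqrt(n_parent) / (1 + n_action)

# Selection at node s:
#   a = max(actions, key=lambda a: puct_score(Q[(s, a)], P[(s, a)], Ns[s], N[(s, a)]))
```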
Rollout Algorithms
Rollout algorithm: a decision-time planning method that uses simulation rollouts from the current state.
For each action a ∈ A(s_current):
Simulate k rollouts starting with action a, then follow π_rollout
Q̂(s,a) ← mean return over k rollouts
Choose: a* = argmax_a Q̂(s,a)
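A sketch of this procedure, assuming the same `step(s, a) -> (s2, r, done)` sample-model interface as above:

```python
import random

def rollout_value(s, a, step, pi_rollout, k=20, gamma=0.99, horizon=200):
    """Q̂(s, a): mean return of k rollouts starting with `a`, then pi_rollout."""
    total = 0.0
    for _ in range(k):
        s2, r, done = step(s, a)              # fixed first action
        g, discount = r, gamma
        for _ in range(horizon):
            if done:
                break
            s2, r, done = step(s2, pi_rollout(s2))
            g += discount * r
            discount *= gamma
        total += g
    return total / k

def rollout_policy_action(s, step, actions, pi_rollout, k=20):
    """One step of policy improvement: act greedily w.r.t. Q̂ of pi_rollout."""
    return max(actions, key=lambda a: rollout_value(s, a, step, pi_rollout, k))
```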
Simple and powerful: if π_rollout is reasonably good, a rollout algorithm can improve dramatically over π_rollout in a single step. This is policy improvement in action: acting greedily with respect to the value of π_rollout yields a policy at least as good as π_rollout.
Connection to AlphaZero: AlphaZero's playouts replace the rollout return with a learned value estimate at the leaf. The better the value estimate, the fewer (and shorter) simulations are needed.
Heuristic Search as Planning
Classical AI search (A*, IDA*, beam search) can be viewed as decision-time planning:
A* heuristic: f(s) = g(s) + h(s)
g(s) = cost to reach s (known)
h(s) = estimated cost to goal (heuristic = V*(s) approximation)
If h(s) = V*(s) exactly: A* is equivalent to value iteration from the start state. If h(s) is admissible (h(s) ≤ actual cost): A* guarantees an optimal path.
RL connection: learning V*(s) is equivalent to learning the perfect heuristic for A*. Model-based RL = learning both M̂ and V*, then using them for planning.
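To make the connection concrete, a compact A* sketch; `neighbors(s) -> [(s2, cost), ...]` and `h(s)` are assumed interfaces:

```python
import heapq
from itertools import count

def a_star(start, goal, neighbors, h):
    """A* search with f(s) = g(s) + h(s). A better h focuses expansion;
    an admissible h guarantees the returned path is optimal."""
    tie = count()                            # tie-breaker so states never compare
    g = {start: 0.0}
    parent = {start: None}
    frontier = [(h(start), next(tie), start)]
    while frontier:
        _, _, s = heapq.heappop(frontier)
        if s == goal:                        # reconstruct path via parents
            path = []
            while s is not None:
                path.append(s)
                s = parent[s]
            return path[::-1]
        for s2, cost in neighbors(s):
            g2 = g[s] + cost
            if s2 not in g or g2 < g[s2]:    # found a cheaper route to s2
                g[s2] = g2
                parent[s2] = s
                heapq.heappush(frontier, (g2 + h(s2), next(tie), s2))
    return None                              # goal unreachable
```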
World Models (Neural Network Models)
Modern model-based RL (DreamerV2/V3, MBPO, PlaNet) uses neural world models:
World model:
Encoder: z_t = enc(o_t) (observation → latent state)
Dynamics: z_{t+1} = dyn(z_t, a_t) + noise (latent transition)
Reward: r̂_t = rew(z_t, a_t)
Decoder: ô_t = dec(z_t) (reconstruct observation)
Planning in latent space: rollout imagined trajectories entirely in the latent space, without running the real environment.
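A sketch of latent-space imagination; `policy`, `dyn`, and `rew` stand in for the learned components above, and the horizon value is illustrative:

```python
def imagine(z0, policy, dyn, rew, horizon=15):
    """Roll out an imagined trajectory purely in latent space (no real env).

    policy(z) -> a, dyn(z, a) -> z', rew(z, a) -> r̂ are assumed interfaces
    corresponding to the world-model components described above.
    """
    z, traj = z0, []
    for _ in range(horizon):
        a = policy(z)
        r = rew(z, a)
        z = dyn(z, a)          # latent transition; never touches the real env
        traj.append((z, a, r))
    return traj                # train the policy on these imagined returns
```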
DreamerV3 (Hafner et al., 2023): a state-of-the-art world-model RL agent. It trains its policy purely on imagined trajectories and achieves strong (in some domains superhuman) performance across a wide range of tasks.
Relevant to DynamICCL: if a neural model of NCCL throughput as a function of (network state, configuration) were trained, DreamerV3-style planning could find optimal configurations in seconds using imagined rollouts.
Chapter 8 Summary: Planning-Learning Spectrum
Pure Model-Free                                 Pure Model-Based
      │                                                │
  Q-learning                              DP with known model
  SARSA                                   MCTS + perfect simulator
      │                                                │
      │          Dyna (model-free + planning)          │
      │          Rollout algorithms                    │
      │          DreamerV3 (world models)              │
      └────────────────────────────────────────────────┘
The fundamental tradeoff:
- More model = more sample efficient (can plan)
- More model = more model error (wrong model → wrong policy)
- Model-free: no model error, but needs many real interactions
Current best practice:
- Small MDPs: full DP or Dyna
- Large discrete MDPs: MCTS + neural value/policy
- Large continuous MDPs: SAC/PPO (model-free) or DreamerV3 (world model)
- DynamICCL: currently model-free PPO; future direction = Dyna-PPO with a network simulator
Connection to DynamICCL
The Dyna architecture directly applies to DynamICCL:
DynamICCL-Dyna (proposed):
Real step: execute NCCL config, observe throughput (expensive!)
Model update: update neural network model of throughput
Planning: run n imagined rollouts from current state using model
→ n additional PPO gradient steps on imagined data
Expected result: roughly 10-50× better sample efficiency than pure model-free learning, contingent on model accuracy
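A hypothetical sketch of this loop; none of these APIs exist yet — every name (`env`, `throughput_model`, `ppo`) is a placeholder:

```python
def dyna_ppo_step(env, throughput_model, ppo, n_imagined=32):
    # Real step: execute an NCCL config and observe throughput (expensive!)
    state = env.observe_network_state()
    config = ppo.act(state)
    throughput = env.apply_config(config)         # real, slow interaction
    ppo.store(state, config, throughput)

    # Model update: fit the learned throughput model on the new observation
    throughput_model.fit(state, config, throughput)

    # Planning: imagined rollouts yield extra PPO gradient steps "for free"
    for _ in range(n_imagined):
        sim_state = throughput_model.sample_state()
        sim_config = ppo.act(sim_state)
        sim_throughput = throughput_model.predict(sim_state, sim_config)
        ppo.store(sim_state, sim_config, sim_throughput)
    ppo.update()
```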
This is the key architectural upgrade path for DynamICCL. The main challenge: building an accurate model of the mapping (network state, NCCL config) → throughput is non-trivial, since it must capture queuing effects, contention, and heterogeneous hardware.