Planning with Learned Models and MCTS
Chapter 8 — Planning and Learning with Tabular Methods
Book: Reinforcement Learning: An Introduction (Sutton & Barto, 2nd ed.), pages 159–190
Types of Models
Distribution model: full probability distribution over next state and reward:
M̂(s', r | s, a) = P̂(S_{t+1}=s', R_{t+1}=r | S_t=s, A_t=a)
- Required for exact DP
- Expensive: must store P(s’|s,a) for all (s,a,s’) triples
- Enables sampling or exact expectation
Sample model: given (s,a), returns a sample (s', r) from M̂:
(s', r) ~ M̂(·, · | s, a)
- Cheaper: only stores enough to sample
- Used in Dyna-Q
- Sufficient for planning via simulation (MCTS, Dyna)
In practice: Dyna typically uses a deterministic sample model in the tabular setting; in deep RL, neural networks learn probabilistic models ("world models").
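A minimal tabular sketch of the two model types (the class names, dict-based storage, and method signatures below are illustrative, not from the book):

```python
import random
from collections import defaultdict

class DistributionModel:
    """Stores full transition probabilities: (s, a) -> {(s', r): prob}."""
    def __init__(self):
        self.p = defaultdict(dict)          # p[(s, a)][(s2, r)] = probability

    def expectation(self, s, a, v, gamma=0.99):
        """Exact one-step backup: sum over (s', r) of prob * (r + gamma * v[s'])."""
        return sum(prob * (r + gamma * v[s2])
                   for (s2, r), prob in self.p[(s, a)].items())

    def sample(self, s, a):
        """A distribution model can also be sampled."""
        outcomes, probs = zip(*self.p[(s, a)].items())
        return random.choices(outcomes, weights=probs, k=1)[0]

class SampleModel:
    """Only needs to produce samples; here, replay of observed transitions."""
    def __init__(self):
        self.observed = defaultdict(list)   # observed[(s, a)] = [(s2, r), ...]

    def update(self, s, a, r, s2):
        self.observed[(s, a)].append((s2, r))

    def sample(self, s, a):
        return random.choice(self.observed[(s, a)])
```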
Building the Model in Dyna
Deterministic model (simplest):
Model(s, a) = (R, S') ← stores last observed (r, s') for each (s,a) pair
When (s,a) is visited again: update Model(s,a) = new (R, S').
Stochastic model: maintain counts and transitions:
Model(s,a) = list of (R, S') observations
Sample from this list when planning
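A sketch of how the deterministic model feeds Dyna-Q's planning loop (the dict-based Q-table and hyperparameter values are illustrative assumptions):

```python
import random

def dyna_q_planning(Q, model, num_actions, n_steps, alpha=0.1, gamma=0.95):
    """Run n_steps of planning on simulated experience from the learned model.

    model: dict (s, a) -> (r, s2), last observed outcome (deterministic model)
    Q:     dict (s, a) -> value, defaulting to 0.0 for unseen pairs
    """
    visited = list(model.keys())
    for _ in range(n_steps):
        s, a = random.choice(visited)          # a previously observed pair
        r, s2 = model[(s, a)]                  # query the sample model
        # one-step Q-learning backup on the simulated transition
        best_next = max(Q.get((s2, b), 0.0) for b in range(num_actions))
        q_sa = Q.get((s, a), 0.0)
        Q[(s, a)] = q_sa + alpha * (r + gamma * best_next - q_sa)

# After each real step (s, a, r, s2):
#   model[(s, a)] = (r, s2)                    # model update
#   dyna_q_planning(Q, model, num_actions, n_steps=10)
```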
Neural network model (for large state spaces):
(R̂, Ŝ') = f_θ(s, a) ← deterministic prediction
or
(R̂, Ŝ', σ²) = f_θ(s, a) ← Gaussian prediction (for uncertainty)
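A minimal PyTorch sketch of the Gaussian-prediction variant (the architecture, layer sizes, and loss are illustrative assumptions):

```python
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    """Predicts (R̂, Ŝ') with a diagonal-Gaussian head for uncertainty."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mean = nn.Linear(hidden, state_dim + 1)     # [Ŝ', R̂]
        self.log_var = nn.Linear(hidden, state_dim + 1)  # per-dimension σ²

    def forward(self, s, a):
        h = self.body(torch.cat([s, a], dim=-1))
        return self.mean(h), self.log_var(h)

# Training: minimize Gaussian negative log-likelihood of the observed (s', r):
#   mean, log_var = model(s, a)
#   nll = 0.5 * (log_var + (target - mean) ** 2 / log_var.exp()).mean()
```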
Decision-Time Planning vs Background Planning
Background planning (Dyna): runs planning steps between real actions, updating Q(s,a) for states already visited.
- Updates the value function used for future decisions
- Runs in the "background" between environment interactions
Decision-time planning: at decision time, use planning to evaluate actions for the current state s.
- Plans only for the current state
- More expensive per decision, but can be more accurate
- Examples: MCTS, A* search, model predictive control
Trade-off:
- Background: better for frequent decisions in known states
- Decision-time: better for infrequent decisions in novel states
Monte Carlo Tree Search (MCTS)
MCTS combines decision-time planning with MC simulation:
Loop (simulations from current state s_0):
┌─────────────────────────────────────────────────────────┐
│ 1. SELECT: traverse tree from root using UCB │
│     a = argmax_a [Q(s,a) + c√(ln N(s)/N(s,a))]         │
│ │
│ 2. EXPAND: add new child node s_leaf │
│ │
│ 3. SIMULATE: fast rollout from s_leaf to terminal │
│ (random or heuristic policy) │
│ │
│ 4. BACKPROPAGATE: update Q(s,a) and N(s,a) up the tree │
│ Q(s,a) += [G - Q(s,a)] / N(s,a) │
└─────────────────────────────────────────────────────────┘
Select real action: A = argmax_a N(root, a) ← most visited action (visit counts are more robust than raw Q estimates)
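A compact UCT sketch tying the four phases together; `step(s, a) -> (s2, r, done)` is an assumed sample-model interface, and all hyperparameter values are illustrative:

```python
import math
import random
from collections import defaultdict

def mcts(root, step, actions, n_sims=1000, c=1.4, gamma=1.0, max_rollout=100):
    """Plain UCT. `step(s, a) -> (s2, r, done)` is an assumed sample model."""
    N = defaultdict(int)      # N[(s, a)]: action visit counts
    Ns = defaultdict(int)     # Ns[s]: state visit counts
    Q = defaultdict(float)    # Q[(s, a)]: mean return estimates
    expanded = set()

    def rollout(s):
        """3. SIMULATE: random policy until terminal (or depth cap)."""
        g, discount = 0.0, 1.0
        for _ in range(max_rollout):
            s, r, done = step(s, random.choice(actions))
            g += discount * r
            discount *= gamma
            if done:
                break
        return g

    for _ in range(n_sims):
        s, path, done = root, [], False
        # 1. SELECT: descend through already-expanded nodes via UCB
        while s in expanded and not done:
            a = max(actions, key=lambda a: (
                Q[(s, a)] + c * math.sqrt(math.log(Ns[s] + 1) / N[(s, a)])
                if N[(s, a)] else float("inf")))   # try unvisited actions first
            s2, r, done = step(s, a)
            path.append((s, a, r))
            s = s2
        # 2. EXPAND the leaf, then 3. SIMULATE from it
        expanded.add(s)
        g = 0.0 if done else rollout(s)
        # 4. BACKPROPAGATE the return up the visited path
        for (s_i, a_i, r_i) in reversed(path):
            g = r_i + gamma * g
            Ns[s_i] += 1
            N[(s_i, a_i)] += 1
            Q[(s_i, a_i)] += (g - Q[(s_i, a_i)]) / N[(s_i, a_i)]

    return max(actions, key=lambda a: N[(root, a)])   # most-visited action
```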
AlphaGo enhancement (Silver et al., 2016):
- Replace step 3 (the random rollout) with a neural network value estimate V(s_leaf; θ)
- Use a prior policy P(a|s; θ) from the neural network in the selection rule (PUCT):
  a = argmax_a [Q(s,a) + c P(a|s; θ) √(N(s)) / (1 + N(s,a))]
- Result: dramatically stronger than pure MCTS or the pure neural network alone
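The PUCT selection rule transcribes directly to code (the c_puct value and dict names are illustrative):

```python
import math

def puct_score(q, prior, n_parent, n_action, c_puct=1.5):
    """AlphaGo-style PUCT: exploit Q, explore in proportion to the NN prior."""
    return q + c_puct * prior * math.sqrt(n_parent) / (1 + n_action)

# Selection at node s:
#   a = max(actions, key=lambda a: puct_score(Q[(s, a)], P[(s, a)], Ns[s], N[(s, a)]))
```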
Rollout Algorithms
Rollout algorithm: a decision-time planning method that uses simulation rollouts from the current state.
For each action a ∈ A(s_current):
Simulate k rollouts starting with action a, then follow π_rollout
Q̂(s,a) ← mean return over k rollouts
Choose: a* = argmax_a Q̂(s,a)
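A sketch of this procedure, assuming the same `step(s, a) -> (s2, r, done)` sample-model interface as above:

```python
import random

def rollout_value(s, a, step, pi_rollout, k=20, gamma=0.99, horizon=200):
    """Q̂(s, a): mean return of k rollouts starting with `a`, then pi_rollout."""
    total = 0.0
    for _ in range(k):
        s2, r, done = step(s, a)              # fixed first action
        g, discount = r, gamma
        for _ in range(horizon):
            if done:
                break
            s2, r, done = step(s2, pi_rollout(s2))
            g += discount * r
            discount *= gamma
        total += g
    return total / k

def rollout_policy_action(s, step, actions, pi_rollout, k=20):
    """One step of policy improvement: act greedily w.r.t. Q̂ of pi_rollout."""
    return max(actions, key=lambda a: rollout_value(s, a, step, pi_rollout, k))
```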
Simple and powerful: if π_rollout is reasonably good, a rollout algorithm can improve dramatically over π_rollout in a single step. This is policy improvement in action: acting greedily with respect to the value of π_rollout yields a policy at least as good as π_rollout.
Connection to AlphaZero: AlphaZero's playouts replace the rollout return with a learned value estimate at the leaf. The better the value estimate, the fewer (and shorter) simulations are needed.
Heuristic Search as Planning
Classical AI search (A*, IDA*, beam search) can be viewed as decision-time planning:
A* heuristic: f(s) = g(s) + h(s)
g(s) = cost to reach s (known)
h(s) = estimated cost to goal (heuristic = V*(s) approximation)
If h(s) = V*(s) exactly: A* is equivalent to value iteration from the start state. If h(s) is admissible (h(s) ≤ actual cost): A* guarantees an optimal path.
RL connection: learning V*(s) is equivalent to learning the perfect heuristic for A*. Model-based RL = learning both M̂ and V*, then using them for planning.
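To make the connection concrete, a compact A* sketch; `neighbors(s) -> [(s2, cost), ...]` and `h(s)` are assumed interfaces:

```python
import heapq
from itertools import count

def a_star(start, goal, neighbors, h):
    """A* search with f(s) = g(s) + h(s). A better h focuses expansion;
    an admissible h guarantees the returned path is optimal."""
    tie = count()                            # tie-breaker so states never compare
    g = {start: 0.0}
    parent = {start: None}
    frontier = [(h(start), next(tie), start)]
    while frontier:
        _, _, s = heapq.heappop(frontier)
        if s == goal:                        # reconstruct path via parents
            path = []
            while s is not None:
                path.append(s)
                s = parent[s]
            return path[::-1]
        for s2, cost in neighbors(s):
            g2 = g[s] + cost
            if s2 not in g or g2 < g[s2]:    # found a cheaper route to s2
                g[s2] = g2
                parent[s2] = s
                heapq.heappush(frontier, (g2 + h(s2), next(tie), s2))
    return None                              # goal unreachable
```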
World Models (Neural Network Models)
Modern model-based RL (DreamerV2/V3, MBPO, PlaNet) uses neural world models:
World model:
Encoder: z_t = enc(o_t) (observation → latent state)
Dynamics: z_{t+1} = dyn(z_t, a_t) + noise (latent transition)
Reward: r̂_t = rew(z_t, a_t)
Decoder: ô_t = dec(z_t) (reconstruct observation)
Planning in latent space: rollout imagined trajectories entirely in the latent space, without running the real environment.
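A sketch of latent-space imagination; `policy`, `dyn`, and `rew` stand in for the learned components above, and the horizon value is illustrative:

```python
def imagine(z0, policy, dyn, rew, horizon=15):
    """Roll out an imagined trajectory purely in latent space (no real env).

    policy(z) -> a, dyn(z, a) -> z', rew(z, a) -> r̂ are assumed interfaces
    corresponding to the world-model components described above.
    """
    z, traj = z0, []
    for _ in range(horizon):
        a = policy(z)
        r = rew(z, a)
        z = dyn(z, a)          # latent transition; never touches the real env
        traj.append((z, a, r))
    return traj                # train the policy on these imagined returns
```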
DreamerV3 (Hafner et al., 2023): a state-of-the-art world-model RL agent. It trains its policy purely on imagined trajectories and achieves strong (in some domains superhuman) performance across a wide range of tasks.
Relevant to DynamICCL: if a neural model of NCCL throughput as a function of (network state, configuration) were trained, DreamerV3-style planning could find optimal configurations in seconds using imagined rollouts.
Chapter 8 Summary: Planning-Learning Spectrum
Pure Model-Free                                 Pure Model-Based
      │                                                │
  Q-learning                              DP with known model
  SARSA                                   MCTS + perfect simulator
      │                                                │
      │          Dyna (model-free + planning)          │
      │          Rollout algorithms                    │
      │          DreamerV3 (world models)              │
      └────────────────────────────────────────────────┘
The fundamental tradeoff:
- More model = more sample efficient (can plan)
- More model = more model error (wrong model → wrong policy)
- Model-free: no model error, but needs many real interactions
Current best practice:
- Small MDPs: full DP or Dyna
- Large discrete MDPs: MCTS + neural value/policy
- Large continuous MDPs: SAC/PPO (model-free) or DreamerV3 (world model)
- DynamICCL: currently model-free PPO; future direction = Dyna-PPO with a network simulator
Connection to DynamICCL
The Dyna architecture directly applies to DynamICCL:
DynamICCL-Dyna (proposed):
Real step: execute NCCL config, observe throughput (expensive!)
Model update: update neural network model of throughput
Planning: run n imagined rollouts from current state using model
→ n additional PPO gradient steps on imagined data
Expected result: roughly 10-50× better sample efficiency than pure model-free learning, contingent on model accuracy
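A hypothetical sketch of this loop; none of these APIs exist yet — every name (`env`, `throughput_model`, `ppo`) is a placeholder:

```python
def dyna_ppo_step(env, throughput_model, ppo, n_imagined=32):
    # Real step: execute an NCCL config and observe throughput (expensive!)
    state = env.observe_network_state()
    config = ppo.act(state)
    throughput = env.apply_config(config)         # real, slow interaction
    ppo.store(state, config, throughput)

    # Model update: fit the learned throughput model on the new observation
    throughput_model.fit(state, config, throughput)

    # Planning: imagined rollouts yield extra PPO gradient steps "for free"
    for _ in range(n_imagined):
        sim_state = throughput_model.sample_state()
        sim_config = ppo.act(sim_state)
        sim_throughput = throughput_model.predict(sim_state, sim_config)
        ppo.store(sim_state, sim_config, sim_throughput)
    ppo.update()
```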
This is the key architectural upgrade path for DynamICCL. The main challenge: building an accurate model of the mapping (network state, NCCL config) → throughput is non-trivial, since it must capture queuing effects, contention, and heterogeneous hardware.