Dyna: Integrating Planning, Acting, and Learning
Chapter 8 — Planning and Learning with Tabular Methods
Book: Reinforcement Learning: An Introduction (Sutton & Barto, 2nd ed) Pages: 159–175
Two Ways to Use Experience
Direct RL (model-free): use experience to directly improve the policy/value function.
- Q-learning, SARSA: each (s,a,r,s') tuple updates Q immediately
- Pros: simple, no model error. Cons: sample-inefficient.

Model-based RL: use experience to build a model M̂(s'|s,a), then use M̂ to generate simulated experience for additional updates.
- Pros: sample-efficient (one real experience → many simulated updates). Cons: model error can hurt.
Dyna (Sutton, 1990): do both simultaneously!
The Dyna Architecture
Real experience (s,a,r,s')
        │
        ├──► Direct RL: Q-learning update of Q
        │
        ▼
Model learning: M̂(s'|s,a), r̂(s,a)
        │
        ▼
Simulated experience: sample a previously seen (s,a) → M̂ → (r̂, ŝ')
        │
        ▼
Planning: Q-learning updates from simulated experience (n per real step)
        │
        ▼
Policy: ε-greedy on Q → next real action
Two uses of experience:
1. Learning: update the model M̂ from real (s,a,r,s')
2. Planning: use M̂ to generate simulated experience → update Q (multiple times per real step)
Algorithm: Dyna-Q
Initialize: Q(s,a) = 0, Model(s,a) = empty for all s,a
Loop (episodes):
Initialize S
For each step:
(a) A ← ε-greedy(Q, S) ← choose action
(b) Execute A; observe R, S' ← real environment step
(c) Q(S,A) ← Q(S,A) + α[R + γ max_{a'} Q(S',a') - Q(S,A)] ← direct RL
(d) Model(S,A) ← (R, S') ← model learning (deterministic model)
(e) Loop n times: ← planning
s ← random previously-visited state
a ← random action previously taken in s
R̂, Ŝ' ← Model(s,a) ← simulated experience
Q(s,a) ← Q(s,a) + α[R̂ + γ max_{a''} Q(Ŝ',a'') - Q(s,a)] ← planning update
(f) S ← S'
The n planning steps (step e) are the key novelty. Each planning step updates Q using simulated (model-generated) experience — essentially running Q-learning on the model.
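To make the pseudocode concrete, here is a minimal tabular Dyna-Q sketch in Python. The `env.reset()`/`env.step()` interface, hashable states, and the hyperparameter defaults are assumptions for illustration, not from the book:

```python
import random
from collections import defaultdict

def dyna_q(env, n_actions, episodes=50, n_planning=50,
           alpha=0.1, gamma=0.95, eps=0.1, seed=0):
    """Tabular Dyna-Q (sketch). `env` is assumed to expose reset() -> s
    and step(a) -> (s_next, r, done), with hashable states."""
    rng = random.Random(seed)
    Q = defaultdict(lambda: [0.0] * n_actions)   # Q[s][a]
    model = {}                                   # (s, a) -> (r, s_next)

    def eps_greedy(s):
        if rng.random() < eps:
            return rng.randrange(n_actions)
        return max(range(n_actions), key=lambda a: Q[s][a])

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = eps_greedy(s)                            # (a) choose action
            s_next, r, done = env.step(a)                # (b) real step
            target = r if done else r + gamma * max(Q[s_next])
            Q[s][a] += alpha * (target - Q[s][a])        # (c) direct RL
            model[(s, a)] = (r, s_next)                  # (d) model learning
            seen = list(model)                           # previously seen (s,a)
            for _ in range(n_planning):                  # (e) planning
                ps, pa = rng.choice(seen)
                pr, ps_next = model[(ps, pa)]
                Q[ps][pa] += alpha * (pr + gamma * max(Q[ps_next])
                                      - Q[ps][pa])       # planning update
            s = s_next                                   # (f)
    return Q
```

Terminal transitions work out because `Q[s_next]` for a never-updated terminal state defaults to zeros, so `max(Q[s_next])` contributes nothing to the target.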
Dyna-Q Performance
Experiment (S&B Example 8.1, Dyna Maze):
- 47-state gridworld maze; goal in the upper right, start in the lower left
- Reward: +1 at the goal, 0 elsewhere; discounted episodic task (the book uses γ = 0.95)
Comparison of Dyna-Q with different n (approximate steps in the second episode):
- n = 0 (pure model-free Q-learning): ~800 steps
- n = 5 (5 planning steps per real step): ~400 steps
- n = 50 (50 planning steps per real step): ~80 steps
Interpretation: with n = 50, every real step triggers 51 Q updates (1 direct + 50 planning), so the model "amplifies" the value extracted from each real experience.
Sample efficiency: in this experiment, Dyna-Q with n = 50 reaches a near-optimal policy in roughly 10× fewer real environment steps than pure Q-learning.
Dyna-Q+: Exploration Bonus for Changing Environments
Problem: Dyna-Q doesn’t re-explore parts of the state space that it has “learned” — if the environment changes, the model becomes stale.
Dyna-Q+ adds an exploration bonus, applied during planning, for state-action pairs that haven't been tried recently:
- Planning reward: r + κ√τ, where τ = number of real time steps since (s,a) was last tried
- κ: small exploration-bonus constant (e.g., κ = 0.001)
This encourages periodically revisiting long-untried state-action pairs, letting the agent detect whether the environment has changed.
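In code, the only change from Dyna-Q is in the planning update. A hedged sketch; the `last_tried` bookkeeping (real time step at which each (s,a) was last executed) and hyperparameter values are illustrative assumptions:

```python
import math

def dyna_q_plus_planning_update(Q, model, last_tried, t, s, a,
                                alpha=0.1, gamma=0.95, kappa=0.001):
    """One Dyna-Q+ planning update (sketch). `last_tried[(s, a)]` and the
    current time step `t` are assumed to be maintained in the main loop."""
    r, s_next = model[(s, a)]
    tau = t - last_tried[(s, a)]              # steps since (s,a) last tried
    r_bonus = r + kappa * math.sqrt(tau)      # exploration bonus κ√τ
    Q[s][a] += alpha * (r_bonus + gamma * max(Q[s_next]) - Q[s][a])
```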
Shortcut maze experiment (S&B Example 8.3):
- Original maze: optimal path ≈ 18 steps to the goal
- After 3000 steps, a shortcut opens (a shorter, ~6-step path appears)
- Standard Dyna-Q: keeps using the old path for a long time (in the book's run it never finds the shortcut)
- Dyna-Q+: discovers the new shortcut within ~200 steps of it opening
Dyna-Q+ is more robust to nonstationary environments — important for DynamICCL where network topology and workloads change during training.
Prioritized Sweeping
Problem with random planning steps: uniformly sampling (s,a) pairs wastes planning updates on pairs whose Bellman error is already near zero, i.e., whose values won't change.
Prioritized Sweeping: focus planning on states where the value will change the most.
Priority Queue PQ (ordered by max Bellman error)
Real step:
Execute (S,A), observe (R, S')
Update Q(S,A), model
Compute P = |R + γ max Q(S',a') - Q(S,A)|
If P > θ: insert (S,A) into PQ with priority P
Planning loop (n iterations):
(s,a) ← PQ.pop_max() ← highest priority (s,a)
Update Q(s,a) from model
    For each (s̄, ā) that the model predicts leads to s (predecessor pairs):
        P̄ = |r̄ + γ max_a Q(s,a) - Q(s̄,ā)|, where r̄ = model's reward for (s̄,ā)
        If P̄ > θ: insert (s̄, ā) into PQ with priority P̄
Effect: focuses updates on “important” (s,a) pairs — those where the value is currently wrong.
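A sketch of the planning loop using Python's `heapq` as a max-priority queue. The `preds` predecessor index and the `push` helper are illustrative assumptions about bookkeeping done during real steps:

```python
import heapq
import itertools

_tie = itertools.count()   # tie-breaker so the heap never compares states

def push(pq, priority, s, a):
    # heapq is a min-heap, so negate the priority for max-first popping.
    heapq.heappush(pq, (-priority, next(_tie), s, a))

def prioritized_sweep(Q, model, preds, pq, n,
                      alpha=0.1, gamma=0.95, theta=1e-4):
    """Up to n planning updates, always popping the (s,a) pair with the
    largest Bellman error. `preds[s]` lists (s_bar, a_bar, r_bar)
    transitions known (from the model) to lead to s."""
    for _ in range(n):
        if not pq:
            break
        _, _, s, a = heapq.heappop(pq)                   # highest priority
        r, s_next = model[(s, a)]
        Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
        # Values of predecessors of s may now be wrong: queue them.
        for s_bar, a_bar, r_bar in preds.get(s, ()):
            p = abs(r_bar + gamma * max(Q[s]) - Q[s_bar][a_bar])
            if p > theta:
                push(pq, p, s_bar, a_bar)
```

On each real step, after the direct update, the main loop would compute P = |R + γ max_a' Q(S',a') - Q(S,A)| and call push(pq, P, S, A) if P > θ.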
Speedup: in maze experiments, prioritized sweeping typically converges with roughly 5-10× fewer updates than planning with uniformly random (s,a) samples (as in Dyna-Q).
Connection to deep RL: Prioritized Experience Replay (PER, Schaul et al., 2015) applies the same idea to DQN — samples transitions with large TD error more frequently.
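The core of PER's proportional variant fits in a few lines. A sketch using the paper's α and ε notation; the function name and defaults are hypothetical, and the importance-sampling weights PER uses to correct the induced bias are omitted:

```python
import random

def sample_prioritized(td_errors, batch_size, alpha=0.6, eps=1e-3):
    """Proportional prioritized sampling (sketch): P(i) ∝ (|δ_i| + eps)^α,
    where δ_i is transition i's last observed TD error."""
    prios = [(abs(d) + eps) ** alpha for d in td_errors]
    return random.choices(range(len(td_errors)), weights=prios, k=batch_size)
```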
When Does Planning Help?
Planning helps most when:
1. Real environment steps are expensive (robot interaction, GPU training jobs)
2. Simulated steps are cheap (a simple, fast model)
3. Model accuracy is high

Planning hurts when:
1. The model is inaccurate (model errors compound over many simulated steps)
2. Simulated steps are expensive (complex model, slow simulator)
3. The real environment is highly stochastic or nonstationary, so the model keeps going stale
For DynamICCL: real environment steps (NCCL training steps) are very expensive. If a fast, accurate model of network throughput could be built, planning would give massive sample-efficiency gains; this is the "NCCL simulator" research direction.
Connection to DynamICCL
| Dyna Concept | DynamICCL Analog |
|---|---|
| Real experience | NCCL performance measurements during training |
| Model M̂(s'\|s,a) | Learned model of NCCL/network performance (the "NCCL simulator") |
| Planning steps | Imagined NCCL rollouts using the simulator |
| Dyna-Q | DreamerV3/MBPO if applied to NCCL tuning |
| Prioritized sweeping | PER in a DQN-based NCCL optimizer |
| Dyna-Q+ bonus | UCB/entropy bonus in PPO for exploration |
Current DynamICCL (model-free PPO) doesn't use planning. Upgrading to a model-based "Dyna-PPO" could dramatically reduce the number of real training jobs needed to learn a good policy, making this a key research direction.