Dyna: Integrating Planning, Acting, and Learning

Chapter 8 — Planning and Learning with Tabular Methods

Book: Reinforcement Learning: An Introduction (Sutton & Barto, 2nd ed.), pp. 159–175


Two Ways to Use Experience

Direct RL (model-free): use experience to improve the policy/value function directly.
- Q-learning, SARSA: each (s,a,r,s') tuple updates Q immediately
- Pros: simple, no model error. Cons: sample inefficient

Model-based RL: use experience to build a model M̂(s'|s,a), then use M̂ to generate simulated experience for additional updates.
- Pros: sample efficient (one real experience → many simulated updates)
- Cons: model error can hurt

Dyna (Sutton, 1990): do both simultaneously!


The Dyna Architecture

┌─────────────────────────────────────────────────────────────┐
│                                                             │
│  Real Experience (s,a,r,s')  →  Direct RL Learning         │
│                 │                                           │
│                 │         Direct RL (Q-learning)           │
│                 ▼                                           │
│  Model Update: M̂(s'|s,a), r̂(s,a)                          │
│                 │                                           │
│                 ▼                                           │
│  Simulated Experience: sample (s,a) → M̂ → (r,s')          │
│                 │                                           │
│                 │         Planning (RL on model)            │
│                 ▼                                           │
│  Q(s,a) updated from simulated experience (n times)        │
│                                                             │
│  Policy: ε-greedy on Q → next real action                  │
└─────────────────────────────────────────────────────────────┘

Two uses of real experience:
1. Direct RL: update Q directly from each real (s,a,r,s')
2. Model learning: update M̂ from the same real (s,a,r,s')
Planning then runs RL on simulated experience drawn from M̂, updating Q many extra times per real step.


Algorithm: Dyna-Q

Initialize: Q(s,a) = 0, Model(s,a) = empty for all s,a

Loop (episodes):
  Initialize S

  For each step:
    (a) A ← ε-greedy(Q, S)           ← choose action
    (b) Execute A; observe R, S'      ← real environment step
    (c) Q(S,A) ← Q(S,A) + α[R + γ max_{a'} Q(S',a') - Q(S,A)]  ← direct RL
    (d) Model(S,A) ← (R, S')         ← model learning (deterministic model)
    (e) Loop n times:                 ← planning
          s ← random previously-visited state
          a ← random action previously taken in s
          R̂, Ŝ' ← Model(s,a)         ← simulated experience
          Q(s,a) ← Q(s,a) + α[R̂ + γ max_{a''} Q(Ŝ',a'') - Q(s,a)]  ← planning update
    (f) S ← S'

The n planning steps (step e) are the key novelty. Each planning step updates Q using simulated (model-generated) experience — essentially running Q-learning on the model.
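To make the loop concrete, here is a minimal tabular Dyna-Q sketch in Python (mine, not the book's code). It assumes a deterministic, episodic environment exposed through a hypothetical interface: env.reset() → state, env.step(a) → (next_state, reward, done), plus env.n_actions, with hashable states.

import random
from collections import defaultdict

def dyna_q(env, n_planning=50, episodes=50, alpha=0.1, gamma=0.95, eps=0.1):
    """Tabular Dyna-Q sketch for a deterministic environment."""
    Q = defaultdict(float)   # Q[(s, a)] -> action value
    model = {}               # model[(s, a)] -> (r, s', done); deterministic model

    def epsilon_greedy(s):
        if random.random() < eps:
            return random.randrange(env.n_actions)
        return max(range(env.n_actions), key=lambda a: Q[(s, a)])

    def backup(s, a, r, s2, done):
        # One Q-learning update, shared by direct RL and planning.
        best_next = 0.0 if done else max(Q[(s2, b)] for b in range(env.n_actions))
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = epsilon_greedy(s)              # (a) choose action
            s2, r, done = env.step(a)          # (b) real environment step
            backup(s, a, r, s2, done)          # (c) direct RL
            model[(s, a)] = (r, s2, done)      # (d) model learning
            seen = list(model.keys())          # previously observed (s, a) pairs
            for _ in range(n_planning):        # (e) planning on simulated experience
                ps, pa = random.choice(seen)
                pr, ps2, pdone = model[(ps, pa)]
                backup(ps, pa, pr, ps2, pdone)
            s = s2                             # (f)
    return Q

Storing the done flag in the model is a small extension of the book's pseudocode so terminal transitions bootstrap correctly; in the tabular maze it changes nothing because Q at the goal stays zero.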


Dyna-Q Performance

Experiment (S&B Example 8.1, Dyna Maze):
- 47-state gridworld maze; goal in the upper right; each episode starts at the lower left
- Reward: +1 at the goal, 0 elsewhere; discounted, episodic task (γ = 0.95)

Comparison of Dyna-Q with different n:

n=0  (pure model-free Q-learning):       ~800 steps to complete episode 2
n=5  (5 planning steps per real step):   ~400 steps
n=50 (50 planning steps per real step):   ~80 steps

Interpretation: 50 planning steps effectively use each real experience 51 times (1 direct + 50 planning). The model “amplifies” each real experience.

Sample efficiency: in this maze, Dyna-Q with n=50 reaches a near-optimal policy with roughly an order of magnitude fewer real environment steps than pure Q-learning; the extra work is all cheap planning computation on the model.
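A comparison like this can be reproduced with the Dyna-Q sketch above; DynaMaze here is a hypothetical gridworld class implementing the assumed reset/step/n_actions interface.

# Hypothetical maze environment; any tabular env with the assumed interface works.
for n in (0, 5, 50):
    Q = dyna_q(DynaMaze(), n_planning=n, episodes=50)
    # to plot steps-per-episode, count env.step() calls per episode inside dyna_q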


Dyna-Q+: Exploration Bonus for Changing Environments

Problem: Dyna-Q doesn’t re-explore parts of the state space that it has “learned” — if the environment changes, the model becomes stale.

Dyna-Q+: adds an exploration bonus to actions that haven’t been taken recently:

Bonus reward (applied to the simulated reward in planning updates): r + κ√τ,   where τ = number of real time steps since (s,a) was last tried
κ: small exploration bonus constant (e.g., κ = 0.001)

This encourages revisiting old state-action pairs, so the agent can notice when the environment has changed.
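A sketch of how the bonus slots into the planning backup (variable names are mine, not the book's): last_visit[(s, a)] holds the real time step at which (s, a) was last executed, and t is the current real time step.

import math

def planning_update_plus(Q, model, last_visit, t, s, a,
                         alpha, gamma, kappa, n_actions):
    """One Dyna-Q+ planning backup: the simulated reward gets a kappa*sqrt(tau) bonus."""
    r, s2 = model[(s, a)]
    tau = t - last_visit[(s, a)]            # real steps since (s, a) was last tried
    r_plus = r + kappa * math.sqrt(tau)     # exploration bonus on the modeled reward
    best_next = max(Q[(s2, b)] for b in range(n_actions))
    Q[(s, a)] += alpha * (r_plus + gamma * best_next - Q[(s, a)])

The book's Dyna-Q+ additionally lets planning consider actions never yet tried from a visited state (initially modeled as leading back to the same state with reward 0); this sketch omits that detail.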

Shortcut maze experiment (S&B Example 8.3):
- Original maze: optimal path = 18 steps from start to goal
- After 3000 steps: a shortcut opens (a new 6-step path appears)
- Standard Dyna-Q: keeps using the old 18-step path for a long time
- Dyna-Q+: discovers the new shortcut within ~200 steps

Dyna-Q+ is more robust to nonstationary environments — important for DynamICCL where network topology and workloads change during training.


Prioritized Sweeping

Problem with random planning steps: wastes time updating states with small Bellman error.

Prioritized Sweeping: focus planning on states where the value will change the most.

Priority Queue PQ (ordered by max Bellman error)

Real step:
  Execute (S,A), observe (R, S')
  Update Q(S,A), model
  Compute P = |R + γ max Q(S',a') - Q(S,A)|
  If P > θ: insert (S,A) into PQ with priority P

Planning loop (n iterations):
  (s,a) ← PQ.pop_max()     ← highest priority (s,a)
  Update Q(s,a) from model
  For each (s̄, ā) predicted by the model to lead to s:
    Compute priority P̄ = |r̄ + γ max_{a'} Q(s,a') - Q(s̄,ā)|   ← r̄ = model's reward for (s̄,ā)
    If P̄ > θ: insert (s̄, ā) into PQ with priority P̄

Effect: focuses updates on “important” (s,a) pairs — those where the value is currently wrong.
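A minimal version of the planning loop using Python's heapq, under the same tabular assumptions as the Dyna-Q sketch above; predecessors is a hypothetical dict mapping each state s to the (s̄, ā, r̄) triples observed to lead to it.

import heapq

def prioritized_planning(Q, model, predecessors, pq, n, alpha, gamma, theta, n_actions):
    """Pop up to n highest-priority (s, a) pairs, back them up, and push any
    predecessor whose Bellman error now exceeds theta. pq holds (-P, s, a) tuples."""
    for _ in range(n):
        if not pq:
            break                               # queue empty: nothing left worth fixing
        _, s, a = heapq.heappop(pq)             # largest Bellman error first
        r, s2 = model[(s, a)]
        best_next = max(Q[(s2, b)] for b in range(n_actions))
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        for s_bar, a_bar, r_bar in predecessors.get(s, ()):
            best_s = max(Q[(s, b)] for b in range(n_actions))
            p = abs(r_bar + gamma * best_s - Q[(s_bar, a_bar)])
            if p > theta:
                heapq.heappush(pq, (-p, s_bar, a_bar))

A full implementation would also deduplicate queue entries and refresh stale priorities; the sketch keeps only the core idea.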

Speedup: on maze tasks, prioritized sweeping typically converges with 5-10× fewer updates than Dyna-Q's uniform random planning.

Connection to deep RL: Prioritized Experience Replay (PER, Schaul et al., 2015) applies the same idea to DQN — samples transitions with large TD error more frequently.
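In its simplest (proportional, no sum-tree) form, the PER sampling rule looks like this; alpha_per is the prioritization exponent from the paper (0 recovers uniform replay), and importance-sampling weights are omitted for brevity.

import numpy as np

def sample_replay_indices(td_errors, batch_size, alpha_per=0.6, eps=1e-6):
    """Sample transition indices with probability proportional to |TD error|**alpha_per."""
    p = (np.abs(np.asarray(td_errors, dtype=float)) + eps) ** alpha_per
    p /= p.sum()
    return np.random.choice(len(p), size=batch_size, p=p)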


When Does Planning Help?

Planning helps most when:
1. Real environment steps are expensive (robot interaction, GPU training jobs)
2. Simulated steps are cheap (simple model)
3. Model accuracy is high

Planning hurts when:
1. The model is inaccurate (model errors compound over many simulated steps)
2. Simulated steps are expensive (complex model, slow simulator)
3. The real environment has many random/nonstationary components

For DynamICCL: real environment steps (NCCL training steps) are very expensive. If a fast accurate model of network throughput could be built, planning would give massive sample efficiency gains — this is the “NCCL simulator” research direction.


Connection to DynamICCL

Dyna Concept             DynamICCL Analog
Real experience          NCCL performance measurements during training
Model M̂(s'|s,a)          Learned model of NCCL throughput (the "NCCL simulator")
Planning steps           Imagined NCCL rollouts using the simulator
Dyna-Q                   DreamerV3/MBPO, if applied to NCCL tuning
Prioritized sweeping     PER in a DQN-based NCCL optimizer
Dyna-Q+ bonus            UCB/entropy bonus in PPO for exploration

Current DynamICCL (model-free PPO) doesn't use planning. Upgrading to Dyna-PPO (model-based) could dramatically reduce the number of real training jobs needed to learn a good policy, making it a key research direction.