Taxonomy of RL Solution Methods
The Big Picture
Sutton & Barto organize RL methods along several dimensions. Understanding this taxonomy lets you know where any algorithm fits and why it was designed the way it was.
Dimension 1: Model-Based vs. Model-Free
| | Model-Based | Model-Free |
|---|---|---|
| Knows P(s’\|s,a), R(s,a)? | Yes (or learns it) | No |
| Examples | Dynamic Programming, Dyna, AlphaZero | Q-learning, SARSA, Policy Gradients |
| Pros | Sample efficient; can plan | No model to learn or specify |
| Cons | Model errors propagate; model can be expensive to learn | Less sample efficient |
| Book chapters | Ch.4 (DP), Ch.8 (Dyna, hybrid) | Ch.5–6 (MC, TD) |
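A minimal tabular sketch of the contrast, using Dyna-Q as the hybrid case: the same Q-learning backup is driven once by real experience (model-free) and then repeatedly by a learned one-step model (planning). The interface and constants here are illustrative, not DynamICCL code.

```python
import random
from collections import defaultdict

ALPHA, GAMMA, N_PLANNING = 0.1, 0.95, 10
Q = defaultdict(float)        # Q[(s, a)]
model = {}                    # learned one-step model: (s, a) -> (r, s')

def q_update(s, a, r, s_next, actions):
    """Model-free Q-learning backup from a single transition."""
    target = r + GAMMA * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])

def dyna_step(s, a, r, s_next, actions):
    """Dyna-Q: learn from real experience, then plan with the learned model."""
    q_update(s, a, r, s_next, actions)      # direct RL (model-free) update
    model[(s, a)] = (r, s_next)             # update the (deterministic) model
    for _ in range(N_PLANNING):             # planning: replay simulated transitions
        ps, pa = random.choice(list(model))
        pr, ps_next = model[(ps, pa)]
        q_update(ps, pa, pr, ps_next, actions)
```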
Dimension 2: Tabular vs. Approximate
| | Tabular | Approximate |
|---|---|---|
| State space | Small, discrete | Large or continuous |
| Representation | Table V[s] or Q[s][a] | Function approximator (neural net, linear) |
| Examples | All of Part I | Part II (Ch.9–13) |
| Convergence | Guaranteed (for many methods) | More fragile |
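The representational difference, sketched: a table stores one independent number per (state, action), while an approximator shares parameters across states so updates generalize. The feature map below is purely hypothetical.

```python
import numpy as np

# Tabular: one independent entry per (state, action); no generalization.
n_states, n_actions = 25, 4
Q_table = np.zeros((n_states, n_actions))

def q_tabular(s, a):
    return Q_table[s, a]

# Approximate: shared weights over a feature vector phi(s), so an update
# for one state also changes the estimate for similar states.
n_features = 8
w = np.zeros((n_features, n_actions))

def phi(s):
    """Hypothetical fixed feature encoding of a state index."""
    rng = np.random.default_rng(s)
    return rng.standard_normal(n_features)

def q_approx(s, a):
    return phi(s) @ w[:, a]
```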
Dimension 3: On-Policy vs. Off-Policy
| | On-Policy | Off-Policy |
|---|---|---|
| Learns V/Q for | Current behavior policy | Different target policy |
| Behavior = Target policy? | Yes | No |
| Examples | SARSA, REINFORCE, A2C | Q-learning, DQN, TD(0) with IS |
| Pros | Simpler, more stable | Can reuse experience; learn optimal while exploring |
| Cons | Less sample efficient | Variance from importance sampling |
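The distinction is easiest to see in the one-step targets: SARSA bootstraps from the action the behavior policy actually takes next, while Q-learning bootstraps from the greedy (target-policy) action. A minimal sketch, with Q as a dict of dicts and an illustrative discount:

```python
GAMMA = 0.99

def sarsa_target(r, s_next, a_next, Q):
    # On-policy: bootstrap from the action the behavior policy actually chose.
    return r + GAMMA * Q[s_next][a_next]

def q_learning_target(r, s_next, Q):
    # Off-policy: bootstrap from the greedy action, whatever the behavior policy does next.
    return r + GAMMA * max(Q[s_next].values())

def td_update(Q, s, a, target, alpha=0.1):
    # Both methods then move Q(s, a) toward their respective target.
    Q[s][a] += alpha * (target - Q[s][a])
```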
Dimension 4: Backup Depth (TD vs. MC vs. DP)
| | DP | TD | Monte Carlo |
|---|---|---|---|
| Bootstraps? | Yes (uses current V estimates) | Yes | No (uses actual returns) |
| Requires model? | Yes | No | No |
| Updates after | Each sweep (planning; no environment interaction) | Each step | Each episode |
| Bias | None (exact given the model) | Some (from bootstrapping) | Zero (unbiased) |
| Variance | None (no sampling) | Low | High |
n-step methods (Ch.7) interpolate between TD (n=1) and MC (n=∞).
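As a sketch, the n-step return makes the interpolation concrete: the first n rewards are sampled, and the tail is bootstrapped from the value estimate. Indexing conventions here are illustrative.

```python
GAMMA = 0.99

def n_step_return(rewards, values, t, n):
    """G_{t:t+n}: n sampled rewards plus a bootstrapped tail.

    n = 1 gives the TD(0) target; n >= remaining episode length gives the
    Monte Carlo return (no bootstrap term). rewards[k] is the reward received
    on leaving step k; values[k] is the current estimate of V(s_k).
    """
    T = len(rewards)                          # episode length
    horizon = min(t + n, T)
    g = sum(GAMMA ** (k - t) * rewards[k] for k in range(t, horizon))
    if horizon < T:                           # bootstrap unless the episode ended first
        g += GAMMA ** (horizon - t) * values[horizon]
    return g
```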
Dimension 5: Value vs. Policy vs. Actor-Critic
| Approach | Learns | How to act | Chapter |
|---|---|---|---|
| Value-based | V(s) or Q(s,a) | Greedy w.r.t. Q | Ch.4–6, 9–11 |
| Policy gradient | π(a|s;θ) directly | Sample from π | Ch.13 |
| Actor-Critic | Both V and π | Sample from π; V serves as the critic/baseline | Ch.13 (cf. A2C, PPO) |
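A rough one-step actor-critic sketch with linear features, to make the division of labor concrete: the critic learns V and supplies the TD error, which the actor uses as its baselined policy-gradient signal. Feature size and step sizes are illustrative.

```python
import numpy as np

ALPHA_W, ALPHA_THETA, GAMMA = 0.05, 0.01, 0.99
n_features, n_actions = 8, 4
w = np.zeros(n_features)                     # critic: V(s) ≈ w · x(s)
theta = np.zeros((n_actions, n_features))    # actor: softmax-policy preferences

def policy_probs(x, theta):
    prefs = theta @ x
    prefs -= prefs.max()                     # numerical stability
    p = np.exp(prefs)
    return p / p.sum()

def actor_critic_step(x, a, r, x_next, done, w, theta):
    """One-step actor-critic update; returns new critic and actor parameters."""
    v = w @ x
    v_next = 0.0 if done else w @ x_next
    delta = r + GAMMA * v_next - v           # TD error = advantage estimate
    w = w + ALPHA_W * delta * x              # critic: semi-gradient TD(0)
    p = policy_probs(x, theta)
    grad_log_pi = -np.outer(p, x)            # ∇_θ log π(a|x) for a softmax policy
    grad_log_pi[a] += x
    theta = theta + ALPHA_THETA * delta * grad_log_pi
    return w, theta
```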
The Full Book Roadmap
Part I: Tabular Methods
├── Ch.2 Multi-Armed Bandits (no state, pure exploration-exploitation)
├── Ch.3 Finite MDPs (formal framework)
├── Ch.4 Dynamic Programming (model-based, tabular, planning)
├── Ch.5 Monte Carlo Methods (model-free, tabular, MC backup)
├── Ch.6 Temporal-Difference (model-free, tabular, TD backup)
│ ├── TD(0): on-policy prediction
│ ├── SARSA: on-policy control
│ └── Q-learning: off-policy control
├── Ch.7 n-step Bootstrapping (unified TD↔MC spectrum)
└── Ch.8 Planning + Learning (Dyna: model-based + model-free hybrid)
Part II: Approximate Methods
├── Ch.9 On-policy prediction w/ approx (value function approximation)
├── Ch.10 On-policy control w/ approx (ε-greedy + function approx)
├── Ch.11 Off-policy w/ approx (importance sampling, gradient TD)
├── Ch.12 Eligibility Traces (TD(λ): unifies TD and MC)
└── Ch.13 Policy Gradient Methods (REINFORCE, actor-critic, PPO)
Part III: Looking Deeper
├── Ch.14 Psychology
├── Ch.15 Neuroscience
├── Ch.16 Applications
└── Ch.17 Frontiers
Generalized Policy Iteration (GPI)
The unifying concept underlying all RL methods:
GPI:
┌─────────────────────────────────────────┐
│ Policy Evaluation: V → V^π │
│ ↕ (alternating) │
│ Policy Improvement: π → greedy(V) │
└─────────────────────────────────────────┘
Any algorithm that alternates between evaluating a policy and improving it is an instance of GPI. They differ only in:
- How thoroughly each evaluation step is done (1 step = TD; full = DP; episode = MC)
- What representation is used for V/Q (table vs. function approximator)
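A minimal model-based (DP-style) instance of GPI, assuming a tabular MDP given as P[s][a] = list of (prob, next_state, reward) triples: evaluation here runs to convergence, but swapping in a single TD backup or an episode of MC returns gives the other members of the family.

```python
import numpy as np

GAMMA = 0.9

def evaluate(policy, P, V, tol=1e-8):
    """Policy evaluation to convergence (the 'full' DP choice for this step)."""
    while True:
        delta = 0.0
        for s in range(len(P)):
            v_new = sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][policy[s]])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V

def improve(P, V):
    """Greedy improvement with respect to the current value estimates."""
    return [int(np.argmax([sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])
                           for a in range(len(P[s]))]))
            for s in range(len(P))]

def gpi(P):
    V, policy = np.zeros(len(P)), [0] * len(P)
    while True:                              # alternate evaluation and improvement
        V = evaluate(policy, P, V)
        new_policy = improve(P, V)
        if new_policy == policy:             # policy stable ⇒ done
            return policy, V
        policy = new_policy
```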
The Prediction-Control Distinction
Prediction problem: given a fixed policy π, estimate V^π or Q^π.
- Ch.4: DP policy evaluation
- Ch.5: MC prediction
- Ch.6: TD(0) prediction
- Ch.9: function approximation prediction
Control problem: find the optimal policy π*.
- Ch.4: policy/value iteration
- Ch.5: MC control (ε-greedy)
- Ch.6: SARSA (on-policy), Q-learning (off-policy)
- Ch.10-11: approximate control
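For the prediction side, a TD(0) sketch for a fixed policy, assuming a toy environment with reset() -> s and step(a) -> (s_next, r, done); the control methods listed above differ only in also improving the policy while V or Q is being estimated.

```python
GAMMA, ALPHA = 0.99, 0.1

def td0_prediction(env, policy, V, episodes=100):
    """Prediction: estimate V^pi for a fixed policy pi with TD(0) backups."""
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)       # assumed toy-env interface
            target = r + (0.0 if done else GAMMA * V[s_next])
            V[s] += ALPHA * (target - V[s])     # TD(0) backup toward the target
            s = s_next
    return V
```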
Exploration Strategies Spectrum
Pure Exploitation ◄─────────────────────────────► Pure Exploration
Greedy ε-greedy UCB Boltzmann Random
(ε=0) (ε=0.1) (conf bound) (T→∞) (ε=1)
Principled exploration:
UCB → count-based (Ch.2)
Posterior sampling / Thompson sampling → Bayesian RL
Intrinsic motivation (count-based, curiosity) → Ch.17
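Sketches of the parametric points on this spectrum, bandit-style, acting on a vector of action-value estimates q; the exploration constants are illustrative.

```python
import numpy as np

def epsilon_greedy(q, eps, rng):
    """With probability eps explore uniformly, otherwise exploit the argmax."""
    if rng.random() < eps:
        return int(rng.integers(len(q)))
    return int(np.argmax(q))

def ucb(q, counts, t, c=2.0):
    """Upper confidence bound: optimism bonus shrinks as an action is tried more.
    counts[a] is how many times action a has been selected so far."""
    bonus = c * np.sqrt(np.log(t + 1) / (counts + 1e-8))
    return int(np.argmax(q + bonus))

def boltzmann(q, temperature, rng):
    """Softmax over values: T -> 0 approaches greedy, T -> ∞ approaches uniform."""
    prefs = np.asarray(q) / temperature
    prefs -= prefs.max()                     # numerical stability
    p = np.exp(prefs)
    return int(rng.choice(len(q), p=p / p.sum()))
```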
DynamICCL Placement in Taxonomy
DynamICCL’s RL component is:
- Model-free (no pre-specified NCCL dynamics model)
- Approximate (continuous/large state space)
- On-policy (PPO) or off-policy (SAC)
- Actor-Critic (policy gradient with value baseline)
- Prediction: estimates V(state) = expected future throughput
- Control: learns π(NCCL params | network state)