Taxonomy of RL Solution Methods

The Big Picture

Sutton & Barto organize RL methods along several dimensions. Understanding this taxonomy lets you know where any algorithm fits and why it was designed the way it was.


Dimension 1: Model-Based vs. Model-Free

                           Model-Based                                   Model-Free
Knows P(s'|s,a), R(s,a)?   Yes (or learns it)                            No
Examples                   Dynamic Programming, Dyna, AlphaZero          Q-learning, SARSA, Policy Gradients
Pros                       Sample-efficient; can plan                    No need to learn or know a model
Cons                       Model errors propagate; expensive to learn    Less sample-efficient

Covered in Ch.4 (DP) and Ch.8 (Dyna, a model-based/model-free hybrid).
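
A minimal sketch of the two update styles on a toy tabular problem; the sizes and the uniform dummy model (P, R) below are illustrative assumptions, not anything from the book:

```python
import numpy as np

n_states, n_actions, gamma, alpha = 3, 2, 0.9, 0.1
Q = np.zeros((n_states, n_actions))

# Model-free (Q-learning): update from one *sampled* transition (s, a, r, s2).
def q_learning_update(s, a, r, s2):
    Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])

# Model-based (one DP-style backup): uses the known P(s'|s,a) and R(s,a)
# to average over *all* successor states at once -- no sampling needed.
P = np.full((n_states, n_actions, n_states), 1.0 / n_states)  # dummy model
R = np.ones((n_states, n_actions))                            # dummy rewards

def planned_backup(s, a):
    Q[s, a] = R[s, a] + gamma * (P[s, a] @ Q.max(axis=1))
```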

Dimension 2: Tabular vs. Approximate

                 Tabular                         Approximate
State space      Small, discrete                 Large or continuous
Representation   Table V[s] or Q[s][a]           Function approximator (linear, neural net)
Examples         All of Part I                   Part II (Ch.9–13)
Convergence      Guaranteed (for many methods)   More fragile
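
The representational difference in code; a sketch where the feature map phi is a made-up stand-in for illustration only:

```python
import numpy as np

# Tabular: one independent entry per (state, action) -- exact, but only
# feasible when the state space is small and discrete.
Q_table = np.zeros((10, 4))            # 10 states, 4 actions

# Approximate: linear q(s, a; w) = w . phi(s, a); updating w for one
# (s, a) generalizes to every pair with similar features.
d = 8                                  # illustrative feature dimension
w = np.zeros(d)

def phi(s, a):
    # Stand-in feature map; a real one would encode problem structure.
    rng = np.random.default_rng(abs(hash((s, a))) % 2**32)
    return rng.standard_normal(d)

def q_hat(s, a):
    return w @ phi(s, a)

def semi_gradient_update(s, a, target, alpha=0.01):
    global w
    w += alpha * (target - q_hat(s, a)) * phi(s, a)   # cf. Ch.9-10
```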

Dimension 3: On-Policy vs. Off-Policy

                            On-Policy                       Off-Policy
Learns V/Q for              The current behavior policy     A different target policy
Behavior = target policy?   Yes                             No
Examples                    SARSA, REINFORCE, A2C           Q-learning, DQN, TD(0) with importance sampling
Pros                        Simpler, more stable            Can reuse experience; learn the optimal policy while exploring
Cons                        Less sample-efficient           Variance from importance sampling
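
The split is easiest to see in the one-line difference between the SARSA and Q-learning targets; a sketch, assuming Q is a NumPy array of shape (n_states, n_actions):

```python
# SARSA (on-policy): the bootstrap target uses a2, the action the behavior
# policy actually takes next -- behavior and target policy coincide.
def sarsa_update(Q, s, a, r, s2, a2, alpha=0.1, gamma=0.9):
    Q[s, a] += alpha * (r + gamma * Q[s2, a2] - Q[s, a])

# Q-learning (off-policy): the target uses the greedy action at s2, whatever
# the behavior policy does next -- it learns the greedy policy while exploring.
def q_learning_update(Q, s, a, r, s2, alpha=0.1, gamma=0.9):
    Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
```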

Dimension 4: Backup Depth (TD vs. MC vs. DP)

                  DP                          TD                          Monte Carlo
Bootstraps?       Yes (uses V estimates)      Yes (uses V estimates)      No (uses actual returns)
Requires model?   Yes                         No                          No
Updates           Full sweeps over states     After each step             After each episode
Bias              None (exact, given model)   Some (from bootstrapping)   Zero (unbiased)
Variance          Low                         Low                         High

n-step methods (Ch.7) interpolate between TD (n=1) and MC (n=∞).
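
A sketch of the n-step return behind that interpolation; the indexing convention (rewards[t] is the reward received on the transition out of states[t]) is an assumption of this snippet:

```python
# n-step return (cf. Ch.7). V maps state -> current value estimate.
def n_step_return(rewards, states, V, t, n, gamma=0.9):
    T = len(rewards)                          # episode length
    horizon = min(t + n, T)
    G = sum(gamma ** (k - t) * rewards[k] for k in range(t, horizon))
    if horizon < T:                           # bootstrap unless episode ended
        G += gamma ** (horizon - t) * V[states[horizon]]
    return G

# n = 1 recovers the TD(0) target; n >= T recovers the Monte Carlo return.
```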


Dimension 5: Value vs. Policy vs. Actor-Critic

Approach          Learns               How to act                                Chapter
Value-based       V(s) or Q(s,a)       Act greedily w.r.t. Q                     Ch.4–6, 9–11
Policy gradient   π(a|s;θ) directly    Sample actions from π                     Ch.13
Actor-critic      Both V and π         Sample from π; critic V is the baseline   Ch.13 (A2C and PPO are modern descendants)
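
A one-step actor-critic sketch in the spirit of Ch.13, with a tabular softmax actor; the state/action counts are illustrative assumptions:

```python
import numpy as np

n_states, n_actions = 10, 4                  # illustrative sizes
theta = np.zeros((n_states, n_actions))      # actor: softmax preferences
V = np.zeros(n_states)                       # critic

def pi(s):
    p = np.exp(theta[s] - theta[s].max())
    return p / p.sum()

# The critic's TD error scores the action just taken; the actor shifts
# log-probability mass in that direction.
def actor_critic_update(s, a, r, s2, done, a_v=0.1, a_pi=0.01, gamma=0.9):
    delta = r + (0.0 if done else gamma * V[s2]) - V[s]   # TD error
    V[s] += a_v * delta                                   # critic step
    grad_log_pi = -pi(s)
    grad_log_pi[a] += 1.0            # gradient of log softmax w.r.t. theta[s]
    theta[s] += a_pi * delta * grad_log_pi                # actor step
```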

The Full Book Roadmap

Part I: Tabular Methods
├── Ch.2  Multi-Armed Bandits        (no state, pure exploration-exploitation)
├── Ch.3  Finite MDPs                (formal framework)
├── Ch.4  Dynamic Programming        (model-based, tabular, planning)
├── Ch.5  Monte Carlo Methods        (model-free, tabular, MC backup)
├── Ch.6  Temporal-Difference        (model-free, tabular, TD backup)
│   ├── TD(0): on-policy prediction
│   ├── SARSA: on-policy control
│   └── Q-learning: off-policy control
├── Ch.7  n-step Bootstrapping       (unified TD↔MC spectrum)
└── Ch.8  Planning + Learning        (Dyna: model-based + model-free hybrid)

Part II: Approximate Methods
├── Ch.9   On-policy prediction w/ approx  (value function approximation)
├── Ch.10  On-policy control w/ approx     (ε-greedy + function approx)
├── Ch.11  Off-policy w/ approx            (importance sampling, gradient TD)
├── Ch.12  Eligibility Traces              (TD(λ): unifies TD and MC)
└── Ch.13  Policy Gradient Methods        (REINFORCE, actor-critic, PPO)

Part III: Looking Deeper
├── Ch.14  Psychology
├── Ch.15  Neuroscience
├── Ch.16  Applications
└── Ch.17  Frontiers

Generalized Policy Iteration (GPI)

The unifying concept underlying all RL methods:

GPI:
  ┌──────────────────────────────────────┐
  │  Policy Evaluation:   V → V^π        │
  │        ↕  (alternating)              │
  │  Policy Improvement:  π → greedy(V)  │
  └──────────────────────────────────────┘

Any algorithm that alternates between evaluating a policy and improving it is an instance of GPI. They differ only in:

  1. How thoroughly each evaluation step is done (1 step = TD; full = DP; episode = MC)
  2. What representation is used for V/Q (table vs. function approximator)
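
As a loop skeleton (a sketch; evaluate and improve are placeholders you fill in per the two choices above, e.g. one TD step vs. a full DP sweep for evaluation, and (ε-)greedification for improvement):

```python
def gpi(V, pi, evaluate, improve, iterations=100):
    for _ in range(iterations):
        V = evaluate(pi, V)      # push V toward V^pi
        pi = improve(V, pi)      # push pi toward greedy(V)
    return V, pi
```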

The Prediction-Control Distinction

Prediction problem: given a fixed policy π, estimate V^π or Q^π.

Control problem: find the optimal policy π*.
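
Control is prediction plus improvement, cycled per GPI. A tabular TD(0) prediction sketch for a fixed policy; the env interface used here (reset() → s, step(a) → (s2, r, done)) is a generic stand-in, not any specific library's API:

```python
from collections import defaultdict

def td0_prediction(env, pi, episodes=500, alpha=0.1, gamma=0.9):
    V = defaultdict(float)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            s2, r, done = env.step(pi(s))     # pi is fixed; only V changes
            target = r + (0.0 if done else gamma * V[s2])
            V[s] += alpha * (target - V[s])
            s = s2
    return V
```

For control, estimate Q instead of V and improve the policy (e.g. ε-greedily) as the estimates change.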


Exploration Strategies Spectrum

Pure Exploitation ◄─────────────────────────────► Pure Exploration

     Greedy       ε-greedy         UCB         Boltzmann     Random
     (ε=0)        (ε=0.1)     (conf. bound)      (T→∞)        (ε=1)

Principled exploration:
  UCB → count-based (Ch.2)
  Posterior sampling / Thompson sampling → Bayesian RL
  Intrinsic motivation (count-based, curiosity) → Ch.17
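
Three points on the spectrum above as action-selection rules; a sketch assuming q holds action-value estimates for the current state and counts holds per-action visit counts:

```python
import numpy as np

rng = np.random.default_rng()

def eps_greedy(q, eps=0.1):
    if rng.random() < eps:
        return int(rng.integers(len(q)))       # explore uniformly
    return int(np.argmax(q))                   # exploit

def ucb(q, counts, t, c=2.0):
    # Optimism bonus shrinks as an action's visit count grows (cf. Ch.2).
    return int(np.argmax(q + c * np.sqrt(np.log(t + 1) / (counts + 1e-8))))

def boltzmann(q, temperature=1.0):
    # T -> 0 approaches greedy; T -> infinity approaches uniform random.
    z = (q - q.max()) / temperature
    p = np.exp(z)
    return int(rng.choice(len(q), p=p / p.sum()))
```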

DynamICCL's Placement in the Taxonomy

DynamICCL’s RL component is: