Taxonomy of RL Solution Methods
The Big Picture
Sutton & Barto organize RL methods along several dimensions. Understanding this taxonomy lets you know where any algorithm fits and why it was designed the way it was.
Dimension 1: Model-Based vs. Model-Free
| | Model-Based | Model-Free |
|---|---|---|
| Knows P(s’\|s,a), R(s,a)? | Yes (or learns it) | No |
| Examples | Dynamic Programming, Dyna, AlphaZero | Q-learning, SARSA, Policy Gradients |
| Pros | Sample efficient; can plan | No model to learn or specify |
| Cons | Model errors propagate; model can be expensive to learn | Less sample efficient |
| Book chapters | Ch.4 (DP), Ch.8 (Dyna, hybrid) | Ch.5–6 (MC, TD) |
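A minimal tabular sketch of the contrast, using Dyna-Q as the hybrid case: the same Q-learning backup is driven once by real experience (model-free) and then repeatedly by a learned one-step model (planning). The interface and constants here are illustrative, not DynamICCL code.

```python
import random
from collections import defaultdict

ALPHA, GAMMA, N_PLANNING = 0.1, 0.95, 10
Q = defaultdict(float)        # Q[(s, a)]
model = {}                    # learned one-step model: (s, a) -> (r, s')

def q_update(s, a, r, s_next, actions):
    """Model-free Q-learning backup from a single transition."""
    target = r + GAMMA * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])

def dyna_step(s, a, r, s_next, actions):
    """Dyna-Q: learn from real experience, then plan with the learned model."""
    q_update(s, a, r, s_next, actions)      # direct RL (model-free) update
    model[(s, a)] = (r, s_next)             # update the (deterministic) model
    for _ in range(N_PLANNING):             # planning: replay simulated transitions
        ps, pa = random.choice(list(model))
        pr, ps_next = model[(ps, pa)]
        q_update(ps, pa, pr, ps_next, actions)
```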
Dimension 2: Tabular vs. Approximate
| | Tabular | Approximate |
|---|---|---|
| State space | Small, discrete | Large or continuous |
| Representation | Table V[s] or Q[s][a] | Function approximator (neural net, linear) |
| Examples | All of Part I | Part II (Ch.9–13) |
| Convergence | Guaranteed (for many methods) | More fragile |
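The representational difference, sketched: a table stores one independent number per (state, action), while an approximator shares parameters across states so updates generalize. The feature map below is purely hypothetical.

```python
import numpy as np

# Tabular: one independent entry per (state, action); no generalization.
n_states, n_actions = 25, 4
Q_table = np.zeros((n_states, n_actions))

def q_tabular(s, a):
    return Q_table[s, a]

# Approximate: shared weights over a feature vector phi(s), so an update
# for one state also changes the estimate for similar states.
n_features = 8
w = np.zeros((n_features, n_actions))

def phi(s):
    """Hypothetical fixed feature encoding of a state index."""
    rng = np.random.default_rng(s)
    return rng.standard_normal(n_features)

def q_approx(s, a):
    return phi(s) @ w[:, a]
```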
Dimension 3: On-Policy vs. Off-Policy
| | On-Policy | Off-Policy |
|---|---|---|
| Learns V/Q for | Current behavior policy | Different target policy |
| Behavior = Target policy? | Yes | No |
| Examples | SARSA, REINFORCE, A2C | Q-learning, DQN, TD(0) with IS |
| Pros | Simpler, more stable | Can reuse experience; learn optimal while exploring |
| Cons | Less sample efficient | Variance from importance sampling |
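The distinction is easiest to see in the one-step targets: SARSA bootstraps from the action the behavior policy actually takes next, while Q-learning bootstraps from the greedy (target-policy) action. A minimal sketch, with Q as a dict of dicts and an illustrative discount:

```python
GAMMA = 0.99

def sarsa_target(r, s_next, a_next, Q):
    # On-policy: bootstrap from the action the behavior policy actually chose.
    return r + GAMMA * Q[s_next][a_next]

def q_learning_target(r, s_next, Q):
    # Off-policy: bootstrap from the greedy action, whatever the behavior policy does next.
    return r + GAMMA * max(Q[s_next].values())

def td_update(Q, s, a, target, alpha=0.1):
    # Both methods then move Q(s, a) toward their respective target.
    Q[s][a] += alpha * (target - Q[s][a])
```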
Dimension 4: Backup Depth (TD vs. MC vs. DP)
| | DP | TD | Monte Carlo |
|---|---|---|---|
| Bootstraps? | Yes (uses current V estimates) | Yes | No (uses actual returns) |
| Requires model? | Yes | No | No |
| Updates after | Each sweep (planning; no environment interaction) | Each step | Each episode |
| Bias | None (exact given the model) | Some (from bootstrapping) | Zero (unbiased) |
| Variance | None (no sampling) | Low | High |
n-step methods (Ch.7) interpolate between TD (n=1) and MC (n=∞).
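As a sketch, the n-step return makes the interpolation concrete: the first n rewards are sampled, and the tail is bootstrapped from the value estimate. Indexing conventions here are illustrative.

```python
GAMMA = 0.99

def n_step_return(rewards, values, t, n):
    """G_{t:t+n}: n sampled rewards plus a bootstrapped tail.

    n = 1 gives the TD(0) target; n >= remaining episode length gives the
    Monte Carlo return (no bootstrap term). rewards[k] is the reward received
    on leaving step k; values[k] is the current estimate of V(s_k).
    """
    T = len(rewards)                          # episode length
    horizon = min(t + n, T)
    g = sum(GAMMA ** (k - t) * rewards[k] for k in range(t, horizon))
    if horizon < T:                           # bootstrap unless the episode ended first
        g += GAMMA ** (horizon - t) * values[horizon]
    return g
```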
Dimension 5: Value vs. Policy vs. Actor-Critic
| Approach | Learns | How to act | Chapter |
|---|---|---|---|
| Value-based | V(s) or Q(s,a) | Greedy w.r.t. Q | Ch.4–6, 9–11 |
| Policy gradient | π(a|s;θ) directly | Sample from π | Ch.13 |
| Actor-Critic | Both V and π | Sample from π; V serves as the critic/baseline | Ch.13 (cf. A2C, PPO) |
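A rough one-step actor-critic sketch with linear features, to make the division of labor concrete: the critic learns V and supplies the TD error, which the actor uses as its baselined policy-gradient signal. Feature size and step sizes are illustrative.

```python
import numpy as np

ALPHA_W, ALPHA_THETA, GAMMA = 0.05, 0.01, 0.99
n_features, n_actions = 8, 4
w = np.zeros(n_features)                     # critic: V(s) ≈ w · x(s)
theta = np.zeros((n_actions, n_features))    # actor: softmax-policy preferences

def policy_probs(x, theta):
    prefs = theta @ x
    prefs -= prefs.max()                     # numerical stability
    p = np.exp(prefs)
    return p / p.sum()

def actor_critic_step(x, a, r, x_next, done, w, theta):
    """One-step actor-critic update; returns new critic and actor parameters."""
    v = w @ x
    v_next = 0.0 if done else w @ x_next
    delta = r + GAMMA * v_next - v           # TD error = advantage estimate
    w = w + ALPHA_W * delta * x              # critic: semi-gradient TD(0)
    p = policy_probs(x, theta)
    grad_log_pi = -np.outer(p, x)            # ∇_θ log π(a|x) for a softmax policy
    grad_log_pi[a] += x
    theta = theta + ALPHA_THETA * delta * grad_log_pi
    return w, theta
```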
The Full Book Roadmap
Part I: Tabular Methods
├── Ch.2 Multi-Armed Bandits (no state, pure exploration-exploitation)
├── Ch.3 Finite MDPs (formal framework)
├── Ch.4 Dynamic Programming (model-based, tabular, planning)
├── Ch.5 Monte Carlo Methods (model-free, tabular, MC backup)
├── Ch.6 Temporal-Difference (model-free, tabular, TD backup)
│ ├── TD(0): on-policy prediction
│ ├── SARSA: on-policy control
│ └── Q-learning: off-policy control
├── Ch.7 n-step Bootstrapping (unified TD↔MC spectrum)
└── Ch.8 Planning + Learning (Dyna: model-based + model-free hybrid)
Part II: Approximate Methods
├── Ch.9 On-policy prediction w/ approx (value function approximation)
├── Ch.10 On-policy control w/ approx (ε-greedy + function approx)
├── Ch.11 Off-policy w/ approx (importance sampling, gradient TD)
├── Ch.12 Eligibility Traces (TD(λ): unifies TD and MC)
└── Ch.13 Policy Gradient Methods (REINFORCE, actor-critic, PPO)
Part III: Looking Deeper
├── Ch.14 Psychology
├── Ch.15 Neuroscience
├── Ch.16 Applications
└── Ch.17 Frontiers
Generalized Policy Iteration (GPI)
The unifying concept underlying all RL methods:
GPI:
┌─────────────────────────────────────────┐
│ Policy Evaluation: V → V^π │
│ ↕ (alternating) │
│ Policy Improvement: π → greedy(V) │
└─────────────────────────────────────────┘
Any algorithm that alternates between evaluating a policy and improving it is an instance of GPI. They differ only in:
- How thoroughly each evaluation step is done (1 step = TD; full = DP; episode = MC)
- What representation is used for V/Q (table vs. function approximator)
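A minimal model-based (DP-style) instance of GPI, assuming a tabular MDP given as P[s][a] = list of (prob, next_state, reward) triples: evaluation here runs to convergence, but swapping in a single TD backup or an episode of MC returns gives the other members of the family.

```python
import numpy as np

GAMMA = 0.9

def evaluate(policy, P, V, tol=1e-8):
    """Policy evaluation to convergence (the 'full' DP choice for this step)."""
    while True:
        delta = 0.0
        for s in range(len(P)):
            v_new = sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][policy[s]])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V

def improve(P, V):
    """Greedy improvement with respect to the current value estimates."""
    return [int(np.argmax([sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])
                           for a in range(len(P[s]))]))
            for s in range(len(P))]

def gpi(P):
    V, policy = np.zeros(len(P)), [0] * len(P)
    while True:                              # alternate evaluation and improvement
        V = evaluate(policy, P, V)
        new_policy = improve(P, V)
        if new_policy == policy:             # policy stable ⇒ done
            return policy, V
        policy = new_policy
```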
The Prediction-Control Distinction
Prediction problem: given a fixed policy π, estimate V^π or Q^π.
- Ch.4: DP policy evaluation
- Ch.5: MC prediction
- Ch.6: TD(0) prediction
- Ch.9: function approximation prediction
Control problem: find the optimal policy π*.
- Ch.4: policy/value iteration
- Ch.5: MC control (ε-greedy)
- Ch.6: SARSA (on-policy), Q-learning (off-policy)
- Ch.10-11: approximate control
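For the prediction side, a TD(0) sketch for a fixed policy, assuming a toy environment with reset() -> s and step(a) -> (s_next, r, done); the control methods listed above differ only in also improving the policy while V or Q is being estimated.

```python
GAMMA, ALPHA = 0.99, 0.1

def td0_prediction(env, policy, V, episodes=100):
    """Prediction: estimate V^pi for a fixed policy pi with TD(0) backups."""
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)       # assumed toy-env interface
            target = r + (0.0 if done else GAMMA * V[s_next])
            V[s] += ALPHA * (target - V[s])     # TD(0) backup toward the target
            s = s_next
    return V
```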
Exploration Strategies Spectrum
Pure Exploitation ◄─────────────────────────────► Pure Exploration
Greedy ε-greedy UCB Boltzmann Random
(ε=0) (ε=0.1) (conf bound) (T→∞) (ε=1)
Principled exploration:
UCB → count-based (Ch.2)
Posterior sampling / Thompson sampling → Bayesian RL
Intrinsic motivation (count-based, curiosity) → Ch.17
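Sketches of the parametric points on this spectrum, bandit-style, acting on a vector of action-value estimates q; the exploration constants are illustrative.

```python
import numpy as np

def epsilon_greedy(q, eps, rng):
    """With probability eps explore uniformly, otherwise exploit the argmax."""
    if rng.random() < eps:
        return int(rng.integers(len(q)))
    return int(np.argmax(q))

def ucb(q, counts, t, c=2.0):
    """Upper confidence bound: optimism bonus shrinks as an action is tried more.
    counts[a] is how many times action a has been selected so far."""
    bonus = c * np.sqrt(np.log(t + 1) / (counts + 1e-8))
    return int(np.argmax(q + bonus))

def boltzmann(q, temperature, rng):
    """Softmax over values: T -> 0 approaches greedy, T -> ∞ approaches uniform."""
    prefs = np.asarray(q) / temperature
    prefs -= prefs.max()                     # numerical stability
    p = np.exp(prefs)
    return int(rng.choice(len(q), p=p / p.sum()))
```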
DynamICCL Placement in Taxonomy
DynamICCL’s RL component is:
- Model-free (no pre-specified NCCL dynamics model)
- Approximate (continuous/large state space)
- On-policy (PPO) or off-policy (SAC)
- Actor-Critic (policy gradient with value baseline)
- Prediction: estimates V(state) = expected future throughput
- Control: learns π(NCCL params | network state)