Optimality, Approximation, and the Limits of Exact RL
Chapter 3 — Finite MDPs
Book: Reinforcement Learning: An Introduction (Sutton & Barto, 2nd ed) Pages: 62–72
The Optimality Gap and Approximate Solutions
Finding a truly optimal policy π* requires solving the Bellman optimality equations exactly — computationally feasible only for small MDPs.
In practice:
- Chess: ~10^47 states → exact solution impossible
- Go: ~10^170 states → exact solution impossible
- DynamICCL: continuous bandwidth/topology features → uncountably many states
Real RL agents must approximate optimal behavior, accepting some performance loss in exchange for tractability.
Types of approximation:
1. Sample-based: only visit states that actually occur (Monte Carlo, TD)
2. Function approximation: use V(s; θ) ≈ V*(s) with parameters θ (neural nets, linear)
3. Policy search: directly search over policy space rather than value space
Memory and Computation Constraints
Even with small state spaces, the agent may face:
Memory constraint: a table Q(s, a) requires |S| × |A| entries. For |S| = 10^6, |A| = 100: 10^8 entries = 400 MB (for float32). For |S| = 10^9: infeasible without function approximation.
Computation constraint: must select actions quickly. Computing argmax_a Q(s, a) is O(|A|) — manageable. But policy iteration requires O(|S|² · |A|) per iteration.
Online constraint: the agent interacts with the environment in real time, spending most computation on states actually visited.
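A quick back-of-the-envelope check of the memory and computation numbers above; a minimal Python sketch assuming float32 entries (4 bytes each):

```python
import numpy as np

# Size of a tabular Q(s, a): one float32 entry per state-action pair.
def q_table_bytes(n_states, n_actions, bytes_per_entry=4):
    return n_states * n_actions * bytes_per_entry

print(q_table_bytes(10**6, 100) / 1e6, "MB")  # 400.0 MB -> still tabular
print(q_table_bytes(10**9, 100) / 1e9, "GB")  # 400.0 GB -> function approximation territory

# Action selection stays cheap even for large tables:
# argmax over the |A| entries of one row is O(|A|).
q_row = np.random.rand(100)          # Q(s, ·) for the current state
best_action = int(np.argmax(q_row))
```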
Three Approximation Strategies
1. Tabular Methods (Ch.4–8)
Store exact values in a table, but:
- Only update states that are actually visited (see the sketch below)
- Use sample-based methods to avoid needing a full model of the environment
- Convergence guarantees still hold
Limit: only works when |S|, |A| are small enough to fit in memory.
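As an illustration of the tabular, sample-based strategy, here is a minimal Q-learning sketch (Q-learning itself is Ch.6 material). It assumes a Gym-style `env.reset()` / `env.step()` interface with the classic 4-tuple step return and hashable (discrete) states; all hyperparameters are placeholders.

```python
from collections import defaultdict
import random

def tabular_q_learning(env, episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = defaultdict(float)                  # Q[(s, a)]: entries created only when visited
    actions = list(range(env.action_space.n))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # ε-greedy behavior policy
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a_: Q[(s, a_)])
            s_next, r, done, _ = env.step(a)
            # Sample-based backup: no transition model p(s', r | s, a) is needed
            bootstrap = 0.0 if done else max(Q[(s_next, a_)] for a_ in actions)
            Q[(s, a)] += alpha * (r + gamma * bootstrap - Q[(s, a)])
            s = s_next
    return Q
```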
2. Linear Function Approximation (Ch.9–11)
Represent V(s; θ) = θ^T φ(s), where φ(s) is a feature vector (see the sketch below).
- Parameters: |θ| = d features, with d << |S|
- Convergence proofs exist for on-policy TD with linear approximation
Limit: expressiveness bounded by feature design; may not capture complex value landscapes.
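A minimal sketch of linear semi-gradient TD(0) prediction under a fixed policy, assuming some hand-designed feature map `phi` (tile coding, polynomials, domain features) of dimension `d`:

```python
import numpy as np

def linear_td0(transitions, phi, d, alpha=0.05, gamma=0.99):
    """transitions: iterable of (s, r, s_next, done) generated by following π."""
    theta = np.zeros(d)
    for s, r, s_next, done in transitions:
        v_s = theta @ phi(s)                      # V(s; θ) = θᵀ φ(s)
        v_next = 0.0 if done else theta @ phi(s_next)
        td_error = r + gamma * v_next - v_s
        theta += alpha * td_error * phi(s)        # semi-gradient TD(0) update
    return theta
```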
3. Deep Neural Network Approximation (DQN, PPO, SAC)
V(s; θ) = NeuralNet(s; θ)
- Can approximate arbitrary functions
- Universal approximation theorem guarantees expressiveness
- No convergence guarantees; requires careful engineering
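A minimal sketch of the neural-network case, assuming PyTorch is available; the architecture, state dimension, and learning rate are illustrative placeholders, and the update shown is a semi-gradient TD(0) step with a detached bootstrap target:

```python
import torch
import torch.nn as nn

value_net = nn.Sequential(           # V(s; θ) as a small MLP
    nn.Linear(8, 64), nn.ReLU(),     # 8-dimensional state assumed for illustration
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 1),
)
optimizer = torch.optim.Adam(value_net.parameters(), lr=3e-4)

def td0_update(s, r, s_next, gamma=0.99):
    """One semi-gradient TD(0) step. s, s_next: (batch, 8) tensors; r: (batch, 1) tensor."""
    target = r + gamma * value_net(s_next).detach()   # no gradient through the bootstrap target
    loss = (value_net(s) - target).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```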
When does approximation fail? The deadly triad (Ch.11):
- Off-policy learning + function approximation + bootstrapping can diverge
The Prediction vs. Control Distinction
Prediction problem: given a fixed policy π, estimate V^π or Q^π.
- Solved by: iterative policy evaluation (Ch.4), MC prediction (Ch.5), TD(0) (Ch.6)
- “How good is this policy?”
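To make the prediction problem concrete, here is a minimal iterative-policy-evaluation sketch for a finite MDP given as explicit arrays; the layout `P[s, a, s']` (transition probabilities), `R[s, a]` (expected rewards), and `pi[s, a]` (the fixed policy) is an assumption of this sketch:

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.99, tol=1e-8):
    n_states = P.shape[0]
    V = np.zeros(n_states)
    while True:
        # Expected one-step backup under π:
        # V(s) = Σ_a π(a|s) [ R(s,a) + γ Σ_s' P(s'|s,a) V(s') ]
        q = R + gamma * np.einsum("sap,p->sa", P, V)
        V_new = np.einsum("sa,sa->s", pi, q)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```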
Control problem: find the optimal policy π*.
- Solved by: value iteration (Ch.4), SARSA (Ch.6), Q-learning (Ch.6), PPO (Ch.13)
- “What’s the best policy?”
Most RL algorithms alternate between prediction and control — this is Generalized Policy Iteration (GPI).
Generalized Policy Iteration (GPI)
The unifying concept of all RL:
GPI Loop:

    Policy Evaluation:   V → V^π         (make the value estimate consistent with the current π)
              ↕  (interleaved)
    Policy Improvement:  π → greedy(V)   (make the policy greedy w.r.t. the current value estimate)
Policy improvement theorem: if π’ is the greedy policy w.r.t. V^π (i.e., π’(s) = argmax_a Q^π(s,a)), then V^{π’}(s) ≥ V^π(s) for all s.
Proof sketch:
    V^π(s) ≤ Q^π(s, π'(s))                                          (π' is greedy w.r.t. Q^π)
           = E[R_{t+1} + γ V^π(S_{t+1}) | S_t = s, A_t = π'(s)]
           ≤ E[R_{t+1} + γ Q^π(S_{t+1}, π'(S_{t+1})) | S_t = s, A_t = π'(s)]
           = E[R_{t+1} + γ R_{t+2} + γ² V^π(S_{t+2}) | S_t = s, first two actions from π']
           ≤ …   (keep unrolling, replacing π by π' one more step each time)
           ≤ V^{π'}(s) ∎
GPI convergence: if each evaluation is sufficiently accurate and improvement is applied, the sequence of policies {π_0, π_1, …} converges to π*. The two arrows (evaluation and improvement) are “pulling in opposite directions,” but their equilibrium is exactly V* and π*.
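The improvement half of GPI is just this greedy step. A minimal sketch, reusing the `P[s, a, s']`, `R[s, a]` array convention from the policy-evaluation sketch above; by the policy improvement theorem the returned policy is at least as good as the one that produced `V`:

```python
import numpy as np

def greedy_policy(P, R, V, gamma=0.99):
    Q = R + gamma * np.einsum("sap,p->sa", P, V)   # Q^π(s, a) computed from V^π
    pi_new = np.zeros_like(Q)
    pi_new[np.arange(Q.shape[0]), np.argmax(Q, axis=1)] = 1.0  # deterministic greedy policy
    return pi_new
```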
Value Iteration vs Policy Iteration
| | Policy Iteration | Value Iteration |
|---|---|---|
| Each iteration | Evaluate to convergence, then improve | One Bellman optimality backup, implicit improvement |
| Updates per policy | Many (until V^π converges) | One |
| Convergence | Finite number of iterations (finitely many deterministic policies) | Asymptotic (error contracts by a factor of γ per sweep) |
| When to use | Small MDP, full evaluation affordable | Larger MDP, want fastest convergence to V* |
Both converge to V* for finite MDPs; value iteration is a special case of policy iteration with truncated evaluation (1 step).
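A minimal value-iteration sketch in the same explicit-model setup, making the "truncated evaluation" view concrete: each sweep applies one Bellman optimality backup to every state, and the greedy policy is read off once V has (approximately) converged:

```python
import numpy as np

def value_iteration(P, R, gamma=0.99, tol=1e-8):
    V = np.zeros(P.shape[0])
    while True:
        Q = R + gamma * np.einsum("sap,p->sa", P, V)   # Q(s, a) under the current V
        V_new = Q.max(axis=1)                          # Bellman optimality backup
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)             # ≈ V*, plus the greedy policy
        V = V_new
```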
Optimality and the Real World
S&B’s important observation: exact optimality is almost never necessary or achievable.
Reasons:
1. The MDP model is always an approximation of reality (modeling error)
2. The reward function is designed by humans (may not perfectly capture the true objective)
3. Computational constraints prevent finding V* exactly
4. Nonstationary environments: V* changes, so tracking it exactly wastes effort
Satisficing over optimizing: in practice, a policy that is “good enough” (e.g., 90% of optimal) and robust across a range of conditions is more valuable than the exact π* for the assumed model.
This is especially true for DynamICCL: the “optimal” NCCL configuration for a fixed network state is less important than finding a robust configuration that performs well across diverse network states and workloads.
The Exploration-Exploitation Tradeoff in MDPs
In the bandit setting (Ch.2): exploitation vs exploration is a simple tradeoff.
In MDPs, exploration is structurally harder:
1. Credit assignment: a bad outcome may be due to actions taken many steps earlier
2. State-space coverage: must explore enough states to learn a good policy everywhere
3. Distributional shift: as the policy improves, the distribution of visited states changes
Common approaches (the first two are sketched below):
- ε-greedy: random actions with probability ε
- Boltzmann/softmax: π(a|s) ∝ exp(Q(s,a)/τ)
- UCB-style: exploration bonus added to Q(s,a)
- Entropy regularization (PPO, SAC): add H[π(·|s)] to the reward
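A minimal sketch of the first two rules, applied to the vector of action values Q(s, ·) for the current state:

```python
import numpy as np

def epsilon_greedy(q_values, epsilon=0.1):
    if np.random.rand() < epsilon:
        return int(np.random.randint(len(q_values)))  # explore: uniform random action
    return int(np.argmax(q_values))                   # exploit: greedy action

def softmax_action(q_values, tau=1.0):
    z = (q_values - np.max(q_values)) / tau           # shift by the max for numerical stability
    probs = np.exp(z) / np.exp(z).sum()               # π(a|s) ∝ exp(Q(s,a)/τ)
    return int(np.random.choice(len(q_values), p=probs))
```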
Summary: Chapter 3 Key Takeaways
- MDPs formalize RL with state, action, reward, transition dynamics
- Markov property enables Bellman equations — value of state depends only on successor states
- V^π satisfies a linear system — unique solution exists for any fixed π
- V* satisfies the Bellman optimality equation — a nonlinear system, solved by value/policy iteration
- Q* is the key object for model-free RL — argmax_a Q*(s,a) gives optimal actions without knowing dynamics
- GPI is universal — all RL algorithms are instances of alternating evaluation and improvement
- Exact optimality is rarely needed — approximate methods are sufficient for practical problems
Connection to DynamICCL
| MDP Concept | DynamICCL Realization |
|---|---|
| State S | Network state: BW, topology, pending ops, algorithm history |
| Action A | NCCL params: algorithm, chunk size, pipeline depth |
| Reward R | Throughput improvement over baseline |
| γ = 0.99 | Long-horizon planning (job completion time matters) |
| Q*(s,a) | Optimal NCCL config for each network state |
| GPI | PPO: alternates between policy rollout (eval) and PPO update (improve) |
| Approximation | Neural network Q/V function for continuous state space |