Optimality, Approximation, and the Limits of Exact RL

Chapter 3 — Finite MDPs

Book: Reinforcement Learning: An Introduction (Sutton & Barto, 2nd ed) Pages: 62–72


The Optimality Gap and Approximate Solutions

Finding a truly optimal policy π* requires solving the Bellman optimality equations exactly — computationally feasible only for small MDPs.

In practice:
- Chess: ~10^47 states → exact solution impossible
- Go: ~10^170 states → exact solution impossible
- DynamICCL: continuous bandwidth/topology features → uncountably many states

Real RL agents must approximate optimal behavior, accepting some performance loss in exchange for tractability.

Types of approximation:
1. Sample-based: only visit states that actually occur (Monte Carlo, TD)
2. Function approximation: use V(s; θ) ≈ V*(s) with parameters θ (neural nets, linear)
3. Policy search: directly search over policy space rather than value space


Memory and Computation Constraints

Even with small state spaces, the agent may face:

Memory constraint: a table Q(s, a) requires |S| × |A| entries. For |S| = 10^6, |A| = 100: 10^8 entries = 400 MB (for float32). For |S| = 10^9: infeasible without function approximation.

Computation constraint: must select actions quickly. Computing argmax_a Q(s, a) is O(|A|) — manageable. But policy iteration requires O(|S|² · |A|) per iteration.

Online constraint: the agent interacts with the environment in real time, spending most computation on states actually visited.
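
The memory arithmetic above is easy to check in a few lines (a back-of-envelope sketch; `q_table_bytes` is an illustrative helper, not a library call):

```python
# Back-of-envelope memory cost of a tabular Q(s, a); float32 = 4 bytes per entry.
def q_table_bytes(n_states: int, n_actions: int, bytes_per_entry: int = 4) -> int:
    return n_states * n_actions * bytes_per_entry

print(q_table_bytes(10**6, 100) / 1e6)  # 400.0 -- MB, as in the text
print(q_table_bytes(10**9, 100) / 1e9)  # 400.0 -- GB: infeasible as one table
```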


Three Approximation Strategies

1. Tabular Methods (Ch.4–8)

Store exact values in a table, but:
- Only update states that are visited
- Use sample-based methods to avoid needing the full model
- Convergence guarantees still hold

Limit: only works when |S|, |A| are small enough to fit in memory.
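
As a concrete (if toy) illustration of the tabular regime, here is a minimal Q-learning sketch. The environment (a 4-state chain) and all hyperparameters are invented for illustration:

```python
import random

# Tabular Q-learning on a hypothetical 4-state chain: actions 0 = left,
# 1 = right; reward 1 on reaching the terminal state 3.
N_STATES, ACTIONS = 4, [0, 1]
GAMMA, ALPHA, EPS = 0.9, 0.5, 0.1

def step(s, a):
    s2 = max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)
    reward = 1.0 if s2 == N_STATES - 1 else 0.0
    return s2, reward, s2 == N_STATES - 1

Q = [[0.0] * len(ACTIONS) for _ in range(N_STATES)]  # the full |S| x |A| table
random.seed(0)
for _ in range(500):
    s, done = 0, False
    while not done:
        # ε-greedy behavior policy
        if random.random() < EPS:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[s][act])
        s2, r, done = step(s, a)
        target = r if done else r + GAMMA * max(Q[s2])
        Q[s][a] += ALPHA * (target - Q[s][a])        # Q-learning backup
        s = s2

policy = [max(ACTIONS, key=lambda act: Q[s][act]) for s in range(N_STATES)]
print(policy[:3])  # non-terminal states all prefer action 1 (right)
```

Note that only visited (s, a) pairs are ever updated, and no transition model is needed: the backup uses sampled (s, a, r, s') tuples.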

2. Linear Function Approximation (Ch.9–11)

Represent V(s; θ) = θ^T φ(s) where φ(s) is a feature vector.
- Parameters: |θ| = d features (d << |S|)
- Convergence proofs exist for linear approximation with TD

Limit: expressiveness bounded by feature design; may not capture complex value landscapes.
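
A minimal sketch of semi-gradient TD(0) with linear features, on the 5-state random walk (S&B Example 6.2): states 1..5, terminating at 0 and 6, reward 1 only on the right exit. The true values V^π(s) = s/6 happen to be exactly linear in s, so φ(s) = (1, s) suffices; step size and episode count here are illustrative choices:

```python
import random

# Linear TD(0) prediction for the fixed uniform-random policy.
GAMMA, ALPHA = 1.0, 0.01

def phi(s):
    return (1.0, float(s))           # feature vector: bias + state index

theta = [0.0, 0.0]

def v(s):
    f = phi(s)
    return theta[0] * f[0] + theta[1] * f[1]

random.seed(1)
for _ in range(5000):
    s = 3                            # start each episode in the middle
    while True:
        s2 = s + random.choice((-1, 1))      # uniform-random policy
        r = 1.0 if s2 == 6 else 0.0
        done = s2 in (0, 6)
        target = r + (0.0 if done else GAMMA * v(s2))
        delta = target - v(s)                # TD error
        f = phi(s)
        theta[0] += ALPHA * delta * f[0]     # semi-gradient: theta += alpha*delta*phi(s)
        theta[1] += ALPHA * delta * f[1]
        if done:
            break
        s = s2

print(v(3))  # close to the true value 0.5
```

Here the feature design was easy because the true value function is linear; the "expressiveness" limit bites exactly when it is not.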

3. Deep Neural Network Approximation (DQN, PPO, SAC)

V(s; θ) = NeuralNet(s; θ)
- Can approximate arbitrary functions
- Universal approximation theorem guarantees expressiveness
- No convergence guarantees; requires careful engineering

When does approximation fail? The deadly triad (Ch.11):
- Off-policy learning + function approximation + bootstrapping can diverge


The Prediction vs. Control Distinction

Prediction problem: given a fixed policy π, estimate V^π or Q^π.
- Solved by: iterative policy evaluation (Ch.4), MC prediction (Ch.5), TD(0) (Ch.6)
- “How good is this policy?”

Control problem: find the optimal policy π*.
- Solved by: value iteration (Ch.4), SARSA (Ch.6), Q-learning (Ch.6), PPO (Ch.13)
- “What’s the best policy?”

Most RL algorithms alternate between prediction and control — this is Generalized Policy Iteration (GPI).


Generalized Policy Iteration (GPI)

The unifying concept of all RL:

GPI Loop:
  ┌──────────────────────────────────────────┐
  │                                          │
  │   Policy Evaluation:    V → V^π          │
  │          ↕  (interleaved)                │
  │   Policy Improvement:   π → greedy(V)    │
  │                                          │
  └──────────────────────────────────────────┘

Policy improvement theorem: if π' is the greedy policy w.r.t. V^π (i.e., π'(s) = argmax_a Q^π(s,a)), then V^{π'}(s) ≥ V^π(s) for all s.

Proof sketch:

V^π(s) ≤ Q^π(s, π'(s))                    ← π' greedy: π'(s) = argmax_a Q^π(s,a)
       = E[R_{t+1} + γV^π(S_{t+1}) | S_t=s, A_t=π'(s)]
       ≤ E[R_{t+1} + γQ^π(S_{t+1}, π'(S_{t+1})) | ...]
       = E[R_{t+1} + γR_{t+2} + γ²V^π(S_{t+2}) | ...]   (unroll repeatedly)
       ≤ V^{π'}(s)                         ∎

GPI convergence: if each evaluation is sufficiently accurate and improvement is applied repeatedly, the sequence of policies {π_0, π_1, …} converges to π*. The two arrows (eval + improve) are “pulling in opposite directions” but their equilibrium is exactly V* and π*.
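
The policy improvement theorem can be checked numerically on a toy MDP (the two-state dynamics below are invented for illustration):

```python
# Numerical check of the policy improvement theorem.
# P[s][a] lists (prob, next_state, reward) triples.
GAMMA = 0.9
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 2.0)], 1: [(1.0, 1, 0.0)]},
}

def evaluate(pi, sweeps=500):
    """Iterative policy evaluation: repeated Bellman expectation backups."""
    V = {s: 0.0 for s in P}
    for _ in range(sweeps):
        V = {s: sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][pi[s]]) for s in P}
    return V

def q(V, s, a):
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])

pi = {0: 0, 1: 1}                           # a deliberately poor policy (stays put)
V_pi = evaluate(pi)
pi2 = {s: max(P[s], key=lambda a: q(V_pi, s, a)) for s in P}  # greedy improvement
V_pi2 = evaluate(pi2)
print(all(V_pi2[s] >= V_pi[s] for s in P))  # True: the greedy policy dominates
```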


Value Iteration vs Policy Iteration

|                    | Policy Iteration                      | Value Iteration                          |
|--------------------|---------------------------------------|------------------------------------------|
| Each iteration     | Eval to convergence, then improve     | One Bellman backup, implicit improve     |
| Updates per policy | Many (until V converges)              | One                                      |
| Convergence        | Finite (policy space is finite)       | Asymptotic (error contracts by factor γ) |
| When to use        | Small MDP, full evaluation affordable | Larger MDP, want fastest V* convergence  |

Both converge to V* for finite MDPs; value iteration is a special case of policy iteration with truncated evaluation (1 step).
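
Value iteration is short enough to sketch in full; the two-state MDP below is invented for illustration, and the greedy policy is extracted at the end:

```python
# Value iteration on a hypothetical two-state MDP.
# P[s][a] lists (prob, next_state, reward); THRESH is the stopping tolerance.
GAMMA, THRESH = 0.9, 1e-10
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 2.0)], 1: [(1.0, 1, 0.0)]},
}

def backup(V, s, a):
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])

V = {s: 0.0 for s in P}
while True:
    # one Bellman optimality backup per state (evaluation truncated to 1 step)
    V_new = {s: max(backup(V, s, a) for a in P[s]) for s in P}
    delta = max(abs(V_new[s] - V[s]) for s in P)
    V = V_new
    if delta < THRESH:
        break

pi_star = {s: max(P[s], key=lambda a: backup(V, s, a)) for s in P}
print(pi_star)  # {0: 1, 1: 0}: cycle between the states, collecting both rewards
```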


Optimality and the Real World

S&B’s important observation: exact optimality is almost never necessary or achievable.

Reasons:
1. The MDP model is always an approximation of reality (modeling error)
2. The reward function is designed by humans (it may not perfectly capture the true objective)
3. Computational constraints prevent finding V* exactly
4. Nonstationary environments: V* changes, so tracking it exactly wastes effort

Satisficing over optimizing: in practice, finding a policy that is “good enough” (e.g., 90% optimal) and does so robustly across a range of conditions is more valuable than finding π* for the assumed model.

This is especially true for DynamICCL: the “optimal” NCCL configuration for a fixed network state is less important than finding a robust configuration that performs well across diverse network states and workloads.


The Exploration-Exploitation Tradeoff in MDPs

In the bandit setting (Ch.2): exploitation vs exploration is a simple tradeoff.

In MDPs, exploration is structurally harder:
1. Credit assignment: a bad outcome may be due to actions taken many steps ago
2. State space coverage: must explore enough states to learn a good policy everywhere
3. Distributional shift: as the policy improves, the visited state distribution changes

Common approaches:
- ε-greedy: random actions with probability ε
- Boltzmann/softmax: π(a|s) ∝ exp(Q(s,a)/τ)
- UCB-style: exploration bonus added to Q(s,a)
- Entropy regularization (PPO, SAC): add H[π(·|s)] to the reward
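
The first two rules can be sketched over a list of action values (helper names here are illustrative, not a library API):

```python
import math
import random

def epsilon_greedy(q, eps, rng):
    if rng.random() < eps:
        return rng.randrange(len(q))              # explore: uniform random action
    return max(range(len(q)), key=q.__getitem__)  # exploit: argmax_a Q(s,a)

def softmax_probs(q, tau):
    m = max(q)                                    # subtract max for numerical stability
    z = [math.exp((x - m) / tau) for x in q]      # Boltzmann weights exp(Q/tau)
    total = sum(z)
    return [x / total for x in z]

q = [1.0, 2.0, 0.5]
print(epsilon_greedy(q, 0.0, random.Random(0)))  # 1: with eps=0 this is pure exploitation
print(softmax_probs(q, 1.0))                     # highest probability on action 1
```

Lowering τ makes softmax approach the greedy policy; raising it approaches uniform random.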


Summary: Chapter 3 Key Takeaways

  1. MDPs formalize RL with state, action, reward, transition dynamics
  2. Markov property enables Bellman equations — value of state depends only on successor states
  3. V^π satisfies a linear system — unique solution exists for any fixed π
  4. V* satisfies the Bellman optimality equation — a nonlinear system, solved by value/policy iteration
  5. Q* is the key object for model-free RL — argmax_a Q*(s,a) gives optimal actions without knowing dynamics
  6. GPI is universal — all RL algorithms are instances of alternating evaluation and improvement
  7. Exact optimality is rarely needed — approximate methods are sufficient for practical problems

Connection to DynamICCL

| MDP Concept   | DynamICCL Realization                                                   |
|---------------|-------------------------------------------------------------------------|
| State S       | Network state: BW, topology, pending ops, algorithm history             |
| Action A      | NCCL params: algorithm, chunk size, pipeline depth                      |
| Reward R      | Throughput improvement over baseline                                    |
| γ = 0.99      | Long-horizon planning (job completion time matters)                     |
| Q*(s,a)       | Optimal NCCL config for each network state                              |
| GPI           | PPO: alternates between policy rollout (eval) and PPO update (improve)  |
| Approximation | Neural network Q/V function for continuous state space                  |