Chapter 16 — Applications and Case Studies
Book: Reinforcement Learning: An Introduction (Sutton & Barto, 2nd ed.), pages 395–414
Overview
Chapter 16 presents landmark RL applications demonstrating the generality and power of the methods developed in Parts I and II.
TD-Gammon (Tesauro, 1992)
Problem: Learn to play backgammon at world-class level.
Approach:
- Neural network V(board state; θ) trained by self-play TD learning
- 3-layer MLP with ~80 hidden units
- TD(λ) with λ = 0.7; self-play for millions of games
Algorithm:

```
Initialize θ randomly
For each self-play game:
    For each position s in the game:
        update θ by a semi-gradient TD step toward V(s'; θ)
        (V(terminal) = 1 for a win, 0 for a loss)
```
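A minimal sketch of the update loop above, using a linear-sigmoid value function in place of TD-Gammon's MLP; the feature vectors and hyperparameter values are illustrative, not Tesauro's:

```python
import numpy as np

def td_lambda_episode(positions, outcome, theta, alpha=0.1, lam=0.7):
    """One self-play game's worth of semi-gradient TD(lambda) updates.

    positions: list of feature vectors x(s), one per position in the game
    outcome:   1.0 for a win, 0.0 for a loss (value of the terminal state)
    A linear value function V(s) = sigmoid(theta . x(s)) stands in for
    TD-Gammon's MLP; the gradient is taken with respect to theta only.
    """
    z = np.zeros_like(theta)                          # eligibility trace
    for t in range(len(positions)):
        x = positions[t]
        v = 1.0 / (1.0 + np.exp(-theta @ x))          # V(s_t)
        if t + 1 < len(positions):
            x_next = positions[t + 1]
            v_next = 1.0 / (1.0 + np.exp(-theta @ x_next))
        else:
            v_next = outcome                          # terminal value
        delta = v_next - v                            # TD error (gamma = 1)
        grad = v * (1.0 - v) * x                      # dV/dtheta for sigmoid
        z = lam * z + grad                            # decay and accumulate trace
        theta += alpha * delta * z                    # semi-gradient TD update
    return theta
```

Each position's value is pulled toward the next position's value, and the final position's value toward the game outcome, with the trace spreading credit backward through the game.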
Results:
- After 300,000 games: expert level (roughly top-10 human strength)
- After 1.5M games: world-class (competitive with the best human players)
- Discovered novel strategic insights not previously known to human experts
Why it worked:
- Self-play generates unlimited training data
- TD learns to predict the winning probability at each position
- The neural network generalizes across the ~10^20 possible positions
Historical significance: first demonstration that RL + neural networks could reach superhuman performance on a complex game.
Samuel’s Checkers Player (1959)
Historical first: Arthur Samuel built a checkers-playing program in 1959 using a form of temporal difference learning (before the term was coined).
Method:
- Polynomial feature-based value function V(s; θ)
- Learned by playing against itself
- Combined “rote learning” (memorizing evaluated positions) with “generalization learning” (adjusting the value-function weights)
Lesson: the core ideas of TD learning and self-play predate modern deep learning by 60 years. The scalability of neural networks was the missing ingredient.
Watson’s Daily-Double Wagering (IBM Jeopardy!)
IBM’s Watson decided how much to bet on Daily-Double clues using a state-value function learned TD-Gammon-style over millions of simulated games (work led by Tesauro, TD-Gammon’s creator). Together with the board-game systems, this demonstrates: 1. Simulation-based TD learning scales beyond board games 2. Feature engineering is important (or: neural networks replace it)
AlphaGo and AlphaZero
AlphaGo (Silver et al., 2016):
- Supervised learning from human expert games → policy initialization
- REINFORCE-style policy gradient + value network → self-play refinement
- MCTS with the learned policy as prior and the value network (plus fast rollouts) to evaluate tree expansions
AlphaZero (Silver et al., 2017):
- No human games needed — pure self-play from random initialization
- MCTS with PUCT (a UCB-style selection rule weighted by a policy prior) + a single neural policy/value network
- Achieves superhuman chess, Go, and shogi within hours of training
Key insight: MCTS + neural networks is a powerful combination of planning (model-based search using the game rules) and learning (value/policy estimates from self-play).
Architecture:

```
s_t → [ResNet, ~20 residual blocks] → policy head π(a|s) + value head V(s)

MCTS (N simulations per move):
- Select:   argmax_a  Q(s,a) + c · P(a|s) · √(Σ_b N(s,b)) / (1 + N(s,a))
- Evaluate: V(leaf) from the value head
- Backup:   update Q and N along the path
- Act:      choose the real move in proportion to N(s,a)^{1/τ}
```
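The select step can be sketched as below; the node layout (`P`, `N`, `W` dictionaries) and the `c_puct` constant are illustrative choices, not AlphaZero's actual implementation:

```python
import math

def puct_select(node, c_puct=1.5):
    """Pick the action maximizing Q(s,a) + c * P(a|s) * sqrt(N(s)) / (1 + N(s,a)).

    node is a dict with:
      'P': prior probabilities from the policy head, {action: prob}
      'N': visit counts per action
      'W': total backed-up value per action (so Q = W / N)
    """
    total_n = sum(node['N'].values())
    best_a, best_score = None, -float('inf')
    for a, prior in node['P'].items():
        n = node['N'].get(a, 0)
        q = node['W'].get(a, 0.0) / n if n > 0 else 0.0
        # the +1 under the sqrt keeps exploration nonzero at a fresh node
        u = c_puct * prior * math.sqrt(total_n + 1) / (1 + n)
        if q + u > best_score:
            best_a, best_score = a, q + u
    return best_a
```

The exploration bonus `u` shrinks as an action accumulates visits, so search effort shifts from the prior's favorites toward actions whose backed-up Q-values prove strong.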
Relevance to DynamICCL: AlphaZero’s architecture (ResNet policy + value + MCTS) could be applied to NCCL if a fast simulator exists.
Elevator Dispatching
Problem: control k elevators in a building to minimize passenger waiting time.
State: floor positions of all elevators, waiting passengers per floor, time of day
Action: for each elevator — go up, go down, open doors, wait
Reward: negative passenger waiting time
RL approach (Crites & Barto, 1996):
- 4 elevators, 10 floors (simulated office building)
- Q-learning with neural-network function approximation, one agent per elevator
- After roughly 1 million training steps: beat the best heuristic by 15–25% during peak hours
Why RL outperformed heuristics: the optimal policy adapts dynamically to current system state — hard to capture with rule-based approaches.
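The backup underlying this kind of dispatcher can be sketched in tabular form (Crites & Barto actually used neural-network approximation; the state encoding here is a made-up placeholder):

```python
from collections import defaultdict

def q_learning_step(Q, state, action, reward, next_state, actions,
                    alpha=0.1, gamma=0.99):
    """One Q-learning backup: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).

    Q is a defaultdict mapping (state, action) -> value; states and actions
    are any hashable encoding (e.g. a tuple of elevator floors plus pending
    hall calls). Only the update rule is shown, not the full simulator.
    """
    best_next = max(Q[(next_state, a)] for a in actions)
    td_error = reward + gamma * best_next - Q[(state, action)]
    Q[(state, action)] += alpha * td_error
    return Q

# toy usage: empty table, one backup after a costly waiting period
Q = defaultdict(float)
q_learning_step(Q, ("floors", 3, 7), "up", -3.0, ("floors", 4, 7),
                ["up", "down", "open"])
```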
Robot Locomotion (Raibert/Boston Dynamics)
Classic approach: hand-engineered control laws (PID + state machines).
RL approach (recent):
- Policy-gradient methods (PPO or SAC) for quadruped locomotion
- Reward: forward velocity − energy cost − fall penalty
- Sim-to-real transfer: train in simulation, deploy on the robot
Key challenge: sim-to-real gap — simulated physics differs from real hardware.
Solutions:
- Domain randomization: randomize simulation parameters so the policy becomes robust to variations
- Residual policies: an RL layer on top of a hand-designed base controller
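Domain randomization can be as simple as resampling physics parameters each episode; the parameter names and ranges below are illustrative, not taken from any specific paper:

```python
import random

def randomized_sim_params(rng=random):
    """Sample one episode's physics parameters for domain randomization.

    Each training episode runs under different mass/friction/latency, so the
    learned policy cannot overfit to a single simulator configuration and
    (ideally) the real robot looks like just another sample from this range.
    """
    return {
        "mass_scale":    rng.uniform(0.8, 1.2),   # +/-20% body mass
        "friction":      rng.uniform(0.5, 1.5),   # ground friction coefficient
        "motor_latency": rng.uniform(0.0, 0.02),  # seconds of actuation delay
        "push_force":    rng.uniform(0.0, 50.0),  # random external push (N)
    }
```

In a training loop these parameters would be passed to the simulator's reset before each rollout.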
Dynamic Channel Assignment (DCA)
Problem: assign channels to wireless calls to minimize blocking (calls that can’t connect due to no available channel).
State: pattern of channel assignments across base stations
Action: assign or reassign a channel to a call
Reward: −1 for each blocked call
RL approach (Singh & Bertsekas, 1997):
- Simulated 7×7 array of cells sharing 70 channels
- TD(0) with a linear value function over handcrafted availability/packing features
- Results: 1–5% improvement over the heuristic baselines, including fixed channel assignment
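A sketch of the greedy afterstate evaluation used in TD-based DCA; `state_features_fn` and `theta` are placeholders for the handcrafted features and learned weights:

```python
import numpy as np

def choose_channel(state_features_fn, free_channels, theta):
    """Greedy channel pick via afterstate values.

    For each admissible channel, compute features of the configuration that
    would result from assigning it, score with a linear value function
    V(s) = theta . x(s), and pick the best-scoring channel. The feature
    function and weights here are hypothetical stand-ins; Singh & Bertsekas
    used handcrafted quantities such as per-cell availability and packing.
    """
    scores = [(float(theta @ state_features_fn(ch)), ch)
              for ch in free_channels]
    return max(scores)[1]   # channel whose afterstate has the highest value
```

The same value function, trained by TD(0) on the −1-per-blocked-call reward, both scores assignments online and improves from the resulting experience.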
Connection to DynamICCL: this is the closest historical precedent! DCA = assigning communication resources to improve network performance. NCCL optimization is essentially a generalized DCA for collective operations in distributed DL training.
Revenue Management
Problem: airline seat pricing — how to price seats to maximize revenue given uncertain demand.
RL approach: learn optimal price as a function of time-to-departure and available seats.
Results: RL-based dynamic pricing outperforms fixed price rules by 5-10%.
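One way to make the pricing problem concrete is a finite-horizon dynamic program over (seats remaining, time to departure); the demand model below is a toy assumption, not airline data:

```python
def optimal_pricing(seats, horizon, prices, demand_prob):
    """Finite-horizon DP for single-leg seat pricing.

    V[t][s]: maximum expected revenue with s seats left and t periods to go.
    demand_prob(p): assumed probability that one customer arrives and accepts
    price p in a period. Each period we either sell one seat at the chosen
    price (moving to s-1 seats) or carry the inventory forward.
    """
    V = [[0.0] * (seats + 1) for _ in range(horizon + 1)]
    for t in range(1, horizon + 1):
        for s in range(1, seats + 1):
            V[t][s] = max(
                demand_prob(p) * (p + V[t - 1][s - 1])      # sale at price p
                + (1 - demand_prob(p)) * V[t - 1][s]        # no sale
                for p in prices
            )
    return V
```

An RL agent learns essentially this value function from observed demand instead of a known `demand_prob`, which is what makes the approach attractive when demand is uncertain.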
Chapter 16 Key Lessons
- Self-play generates an effectively unlimited stream of training data → TD-Gammon, AlphaZero
- MCTS + neural networks = planning + learning; dominant for combinatorial games
- RL outperforms heuristics in complex sequential decision problems
- Simulation is critical for data generation when real-world interaction is expensive
- Domain randomization bridges sim-to-real gap
- Dynamic Channel Assignment is the historical precedent for DynamICCL
Connection to DynamICCL
| Application | DynamICCL Analog |
|---|---|
| Elevator dispatching | NCCL scheduling (dynamic resource assignment) |
| DCA | Channel assignment → NCCL algorithm/buffer assignment |
| Robot locomotion | Multi-GPU training loop control |
| Revenue management | Dynamic bandwidth allocation → throughput maximization |
| AlphaZero self-play | Not directly applicable (no adversary in NCCL) |
| Sim-to-real transfer | Train NCCL agent in network simulator → deploy on real cluster |
DynamICCL is directly in the tradition of the DCA work — applying RL to network resource allocation. The key advances since then: neural-network function approximation and modern policy-gradient methods.