Chapter 16 — Applications and Case Studies

Book: Reinforcement Learning: An Introduction (Sutton & Barto, 2nd ed) Pages: 395–414


Overview

Chapter 16 presents landmark RL applications that demonstrate the generality and power of the methods developed in Parts I and II.


TD-Gammon (Tesauro, 1992)

Problem: Learn to play backgammon at world-class level.

Approach:
- Neural network V(board state; θ) trained by self-play TD learning
- 3-layer MLP, ~80 hidden units
- TD(λ) with λ = 0.7; self-play over millions of games

Algorithm:

Initialize θ randomly
For each game (self-play):
  For each position s in game:
    V(s; θ) ← semi-gradient TD update toward V(s'; θ)
    (V(terminal) = 1 for win, 0 for loss)
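
A minimal numpy sketch of this semi-gradient TD(λ) update (not Tesauro's code; the 198-feature board encoding and the helper names are illustrative assumptions):

import numpy as np

# Illustrative sizes: ~200 input features, 80 hidden units as in the notes above.
N_FEATURES, N_HIDDEN = 198, 80
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(N_HIDDEN, N_FEATURES))
W2 = rng.normal(scale=0.1, size=N_HIDDEN)

def value_and_grads(x):
    """V(s; θ) = sigmoid(W2 · tanh(W1 x)); returns the value and ∂V/∂θ."""
    h = np.tanh(W1 @ x)
    v = 1.0 / (1.0 + np.exp(-(W2 @ h)))
    dv = v * (1.0 - v)                                   # derivative of the sigmoid output
    gW1 = np.outer(dv * W2 * (1.0 - h ** 2), x)
    gW2 = dv * h
    return v, gW1, gW2

def td_lambda_episode(positions, outcome, alpha=0.1, lam=0.7):
    """One self-play game: positions is a list of feature vectors, outcome is 1 (win) or 0 (loss)."""
    global W1, W2
    zW1, zW2 = np.zeros_like(W1), np.zeros_like(W2)      # eligibility traces
    for t, x in enumerate(positions):
        v, gW1, gW2 = value_and_grads(x)
        # TD target: the next position's predicted value, or the game outcome at the end.
        v_next = value_and_grads(positions[t + 1])[0] if t + 1 < len(positions) else outcome
        delta = v_next - v                               # no discounting (γ = 1)
        zW1, zW2 = lam * zW1 + gW1, lam * zW2 + gW2
        W1 += alpha * delta * zW1
        W2 += alpha * delta * zW2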

Results:
- After 300,000 games: expert-level play (roughly top-10 human strength)
- After 1.5 million games: world-class play, competitive with the best human players
- Discovered novel strategic ideas not previously known to human experts

Why it worked:
- Self-play generates unlimited training data
- TD learning predicts the probability of winning from each position
- The neural network generalizes across the ~10^20 possible positions

Historical significance: the first demonstration that RL + neural networks could reach world-class performance on a complex game.


Samuel’s Checkers Player (1959)

Historical first: Arthur Samuel built a checkers-playing program in 1959 using a form of temporal difference learning (before the term was coined).

Method:
- Polynomial (hand-crafted feature) value function V(s; θ)
- Learned by playing against itself
- Combined "rote learning" (storing evaluated positions) with "generalization learning" (adjusting the value function's weights)
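
A minimal sketch of the generalization-learning idea with a linear feature-based value function (the feature vector and update rule here are illustrative assumptions, not Samuel's exact scheme):

import numpy as np

def linear_value(features, theta):
    """V(s; θ) = θ · φ(s), where φ(s) holds hand-crafted board features (piece advantage, mobility, ...)."""
    return float(theta @ features)

def td_update(theta, features, features_next, reward, alpha=0.01, gamma=1.0):
    """Move V(s) toward r + γ V(s'): the bootstrapping at the heart of Samuel-style generalization learning."""
    delta = reward + gamma * linear_value(features_next, theta) - linear_value(features, theta)
    return theta + alpha * delta * features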

Lesson: the core ideas of TD learning and self-play predate modern deep learning by 60 years. The scalability of neural networks was the missing ingredient.


IBM Watson's Jeopardy! Wagering / Game Playing

Watson's Daily-Double wagering strategy (developed by Tesauro and colleagues as a successor to the TD-Gammon work) and other game-playing systems demonstrate:
  1. Self-play + TD learning scales to complex games
  2. Feature engineering is important (or: neural networks replace it)


AlphaGo and AlphaZero

AlphaGo (Silver et al., 2016):
- Supervised learning from human expert games → policy network initialization
- Policy-gradient (REINFORCE-style) self-play refinement + a value network trained on self-play outcomes
- MCTS using the learned policy as a prior and the value network for leaf evaluation

AlphaZero (Silver et al., 2017):
- No human games needed: pure self-play from random initialization
- MCTS with a PUCT (UCB-style) selection rule + a single neural network with policy and value heads
- Reaches superhuman play in chess, shogi, and Go within hours to days of training

Key insight: MCTS + neural networks combine planning (model-based search using the game rules) with learning (value and policy estimates from self-play).

Architecture:

s_t → [ResNet 20 layers] → policy head π(a|s) + value head V(s)

MCTS: N simulations
  - Select: max Q(s,a) + c P(a|s) √(N(s))/(1+N(s,a))
  - Evaluate: V(leaf) from value head
  - Backup: update Q, N along path

Move actually played: sampled with probability proportional to N(s,a)^{1/τ}
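
A minimal sketch of the PUCT selection step and the visit-count-based move choice described above (the node fields and the c_puct constant are illustrative assumptions, not the published implementation):

import math
import random

def select_action(node, c_puct=1.5):
    """Pick the child maximizing Q(s,a) + c · P(a|s) · sqrt(N(s)) / (1 + N(s,a))."""
    total_visits = sum(child.visit_count for child in node.children.values())
    best_action, best_score = None, -float("inf")
    for action, child in node.children.items():
        q = child.total_value / child.visit_count if child.visit_count else 0.0
        u = c_puct * child.prior * math.sqrt(total_visits) / (1 + child.visit_count)
        if q + u > best_score:
            best_action, best_score = action, q + u
    return best_action

def play_action(root, tau=1.0):
    """Sample the move actually played with probability proportional to N(s,a)^(1/τ)."""
    actions = list(root.children)
    weights = [root.children[a].visit_count ** (1.0 / tau) for a in actions]
    return random.choices(actions, weights=weights)[0]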

Relevance to DynamICCL: AlphaZero’s architecture (ResNet policy + value + MCTS) could be applied to NCCL if a fast simulator exists.


Elevator Dispatching

Problem: control k elevators in a building to minimize passenger waiting time.

State: floor positions of all elevators, waiting passengers per floor, time of day
Action: for each elevator: go up, go down, open doors, or wait
Reward: negative passenger waiting time

RL approach (Crites & Barto, 1996):
- 4 elevator cars serving a 10-floor building, with one Q-learning agent per car
- Q-learning with neural-network function approximation over the very large state space
- After roughly 1 million training steps: beats the best heuristic dispatchers by 15-25% during peak traffic
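
A minimal sketch of the state encoding and a one-step Q-learning update for this formulation (the feature choices and linear approximator are illustrative assumptions, not the Crites & Barto architecture):

import numpy as np

ACTIONS = ["up", "down", "open_doors", "wait"]

def encode_state(elevator_floors, waiting_per_floor, hour_of_day, n_floors=10):
    """Concatenate elevator positions, per-floor queue lengths, and time of day into one feature vector."""
    return np.concatenate([
        np.array(elevator_floors, dtype=float) / n_floors,
        np.array(waiting_per_floor, dtype=float),
        [hour_of_day / 24.0],
    ])

def q_learning_step(theta, x, a_idx, waiting_time, x_next, alpha=0.01, gamma=0.99):
    """theta holds one weight row per action; reward is the negative waiting time accrued since the last decision."""
    reward = -waiting_time
    target = reward + gamma * np.max(theta @ x_next)
    delta = target - (theta @ x)[a_idx]
    theta[a_idx] += alpha * delta * x
    return theta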

Why RL outperformed heuristics: the optimal policy adapts dynamically to current system state — hard to capture with rule-based approaches.


Robot Locomotion (Raibert/Boston Dynamics)

Classic approach: hand-engineered control laws (PID + state machines).

RL approach (recent):
- Policy optimization (e.g., PPO or SAC) for quadruped locomotion
- Reward: forward velocity - energy cost - fall penalty
- Sim-to-real transfer: train in simulation, then deploy on the physical robot

Key challenge: sim-to-real gap — simulated physics differs from real hardware.

Solutions:
- Domain randomization: randomize simulation parameters (e.g., friction, masses, latency) so the policy becomes robust to variations
- Residual policies: an RL layer learned on top of a hand-designed base controller
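
A minimal sketch of domain randomization over simulator physics parameters (the parameter names and ranges are illustrative assumptions; real setups randomize many more quantities):

import random

def sample_sim_params():
    """Draw a fresh set of physics parameters at the start of every training episode."""
    return {
        "friction":      random.uniform(0.5, 1.25),   # ground friction coefficient
        "mass_scale":    random.uniform(0.8, 1.2),    # multiplier on each link's nominal mass
        "motor_latency": random.uniform(0.0, 0.04),   # seconds of actuation delay
        "push_force":    random.uniform(0.0, 50.0),   # magnitude of random external pushes (N)
    }

def locomotion_reward(forward_velocity, energy_used, fell, w_energy=0.005, fall_penalty=10.0):
    """Reward: forward velocity - energy cost - fall penalty, as in the notes above."""
    return forward_velocity - w_energy * energy_used - (fall_penalty if fell else 0.0)

# Training-loop outline: re-randomize the physics each episode so the learned
# policy cannot overfit to one particular simulator configuration.
# for episode in range(num_episodes):
#     sim.reset(**sample_sim_params())
#     ...collect a rollout, update the policy with PPO/SAC...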


Dynamic Channel Assignment (DCA)

Problem: assign channels to wireless calls to minimize blocking (calls that can’t connect due to no available channel).

State: pattern of channel assignments across base stations
Action: assign/reassign a channel to a call
Reward: -1 for each blocked call

RL approach (Singh & Bertsekas, 1997):
- 7×7 cellular array sharing 70 channels under reuse constraints
- TD learning with a learned value function over channel-assignment configurations
- Results: 1-5% lower call blocking than the best heuristics (e.g., Fixed Channel Assignment)
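
A minimal sketch of value-based channel selection for this formulation (the config_features encoding and linear value function are illustrative assumptions, not the Singh & Bertsekas architecture):

import numpy as np

def config_features(assignment, n_cells=49, n_channels=70):
    """Illustrative features: per-cell channel usage counts plus overall utilization.
    assignment is a set of (cell, channel) pairs currently in use."""
    usage = np.zeros(n_cells)
    for cell, channel in assignment:
        usage[cell] += 1
    return np.concatenate([usage / n_channels, [len(assignment) / (n_cells * n_channels)]])

def choose_channel(theta, assignment, cell, free_channels):
    """Greedy afterstate selection: pick the channel whose resulting configuration has the highest value."""
    best_channel, best_value = None, -np.inf
    for channel in free_channels:
        after = assignment | {(cell, channel)}
        value = float(theta @ config_features(after))
        if value > best_value:
            best_channel, best_value = channel, value
    return best_channel

def td0_update(theta, x, x_next, reward, alpha=0.01, gamma=1.0):
    """TD(0) update: reward is -1 when a call is blocked, 0 otherwise."""
    delta = reward + gamma * float(theta @ x_next) - float(theta @ x)
    return theta + alpha * delta * x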

Connection to DynamICCL: this is the closest historical precedent! DCA = assigning communication resources to improve network performance. NCCL optimization is essentially a generalized DCA for collective operations in distributed DL training.


Revenue Management

Problem: airline seat pricing — how to price seats to maximize revenue given uncertain demand.

RL approach: learn optimal price as a function of time-to-departure and available seats.
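
A minimal sketch of learning a pricing policy over (days-to-departure, seats-remaining) states (the price grid, toy demand model, and tabular Q-learning are illustrative assumptions):

import random
from collections import defaultdict

PRICES = [100, 150, 200, 300]             # candidate fares (illustrative)
Q = defaultdict(float)                     # Q[(days_left, seats_left, price)]

def simulate_day(price, customers=5):
    """Toy demand model: each arriving customer buys with probability min(1, 100 / price)."""
    return sum(random.random() < min(1.0, 100.0 / price) for _ in range(customers))

def q_learning_episode(horizon=30, capacity=100, alpha=0.1, eps=0.1):
    """One simulated booking window; returns total revenue for this episode."""
    seats, revenue = capacity, 0.0
    for days_left in range(horizon, 0, -1):
        state = (days_left, seats)
        if random.random() < eps:
            price = random.choice(PRICES)                        # explore
        else:
            price = max(PRICES, key=lambda p: Q[state + (p,)])   # exploit
        sold = min(simulate_day(price), seats)
        reward = sold * price
        seats -= sold
        next_state = (days_left - 1, seats)
        done = days_left == 1 or seats == 0
        best_next = 0.0 if done else max(Q[next_state + (p,)] for p in PRICES)
        Q[state + (price,)] += alpha * (reward + best_next - Q[state + (price,)])
        revenue += reward
    return revenue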

Results: RL-based dynamic pricing outperforms fixed price rules by 5-10%.


Chapter 16 Key Lessons

  1. Self-play enables unlimited supervised training signal → TD-Gammon, AlphaZero
  2. MCTS + neural networks = planning + learning; dominant for combinatorial games
  3. RL outperforms heuristics in complex sequential decision problems
  4. Simulation is critical for data generation when real-world interaction is expensive
  5. Domain randomization bridges sim-to-real gap
  6. Dynamic Channel Assignment is the historical precedent for DynamICCL

Connection to DynamICCL

Application: DynamICCL analog
- Elevator dispatching: NCCL scheduling (dynamic resource assignment)
- DCA: channel assignment → NCCL algorithm/buffer assignment
- Robot locomotion: multi-GPU training loop control
- Revenue management: dynamic bandwidth allocation → throughput maximization
- AlphaZero self-play: not directly applicable (no adversary in NCCL)
- Sim-to-real transfer: train the NCCL agent in a network simulator → deploy on the real cluster

DynamICCL is directly in the tradition of the DCA work: applying RL to network resource allocation. The key advances since then are neural-network function approximation and modern policy-gradient methods.