Frontiers of Reinforcement Learning

Chapter 17 — Frontiers

Book: Reinforcement Learning: An Introduction (Sutton & Barto, 2nd ed.), pages 415–444


Overview

Chapter 17 surveys the major open frontiers in RL research, pointing toward the next generation of methods. These are the research directions from which DynamICCL’s future development should draw.


1. General Value Functions (GVFs) and Predictive Knowledge

Standard RL: learn one value function V^π(s) = E_π[Σ_{k=0}^∞ γ^k R_{t+k+1} | S_t = s].

GVFs: learn many value functions with different pseudo-reward functions, discount factors, and policies:

V^{π,γ,c}(s) = E_π[Σ_{k=0}^∞ γ^k · c(S_{t+k+1}, A_{t+k+1}) | S_t = s]

where c = “cumulant” (arbitrary scalar signal, not just extrinsic reward).

Why GVFs? The agent builds a rich predictive model of the world by learning many predictions simultaneously:
- “How many steps until I leave this room?”
- “What will my energy level be in 10 steps?”
- “Will the communication buffer overflow in the next minute?”

Horde architecture: many “demons”, each an independent learner of one GVF, trained in parallel with off-policy TD from the same stream of behavior experience. Used in robotics for building rich predictive world models.
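A minimal sketch of a single Horde-style demon in a tabular setting. The cumulant, target policy, and behavior policy below (cumulant, target_pi, behavior_b) are illustrative placeholders, and the loop that supplies (s, a, s_next) transitions is assumed to exist elsewhere:

import numpy as np

# One Horde "demon": a tabular GVF learned with off-policy TD(0).
# Each demon has its own cumulant, discount, and target policy.
n_states, n_actions = 10, 2
gamma = 0.9                        # this demon's discount
alpha = 0.1                        # step size
v = np.zeros(n_states)             # estimate of V^{pi,gamma,c}

def cumulant(s_next, a):
    # Example pseudo-reward: 1 whenever state 0 is entered.
    return 1.0 if s_next == 0 else 0.0

def target_pi(a, s):               # pi(a|s): probability under the target policy
    return 1.0 / n_actions

def behavior_b(a, s):              # b(a|s): probability under the behavior policy
    return 1.0 / n_actions

def demon_update(s, a, s_next):
    rho = target_pi(a, s) / behavior_b(a, s)        # importance-sampling ratio
    td_error = cumulant(s_next, a) + gamma * v[s_next] - v[s]
    v[s] += alpha * rho * td_error                  # off-policy TD(0) update

Many such demons, each with its own (cumulant, discount, target policy), can learn in parallel from the same behavior data.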

Relevance to DynamICCL: DynamICCL could learn GVFs predicting:
- Future bandwidth utilization (helps predict congestion)
- Time until job checkpoint
- Expected collective operations in the next N steps


2. Temporal Abstraction: Options and Hierarchical RL

Problem with flat RL: the agent must make a decision at every time step, which makes credit assignment difficult for long-horizon rewards.

Options framework (Sutton, Precup & Singh, 1999): macro-actions that execute for multiple steps.

Option definition: (π_o, β_o, I_o)
- π_o: option policy (what to do while the option is executing)
- β_o(s): termination probability in state s
- I_o: initiation set (states where the option can be started)

Semi-Markov Decision Process: because options take variable numbers of time steps, decision-making over options forms an SMDP; Q-learning extends naturally to this level (see the sketch after the examples below).

Hierarchy: options can themselves invoke sub-options → hierarchical RL.

Examples:
- “Walk to the kitchen” option composed of atomic actions
- NCCL: “execute ring allreduce for this batch” as an option
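Putting the pieces above together, here is a minimal SMDP Q-learning sketch over options. The execute_option helper is a stub with an invented signature, not from the book; the update rule itself is the standard SMDP backup in which the tail value is discounted by γ^duration:

import numpy as np

n_states, n_options = 20, 3
gamma, alpha = 0.99, 0.1
Q = np.zeros((n_states, n_options))    # value of starting option o in state s

def execute_option(env, s, option):
    """Run the option's policy pi_o until beta_o terminates it.
    Returns (discounted_reward, s_next, duration). Stub for illustration."""
    raise NotImplementedError

def smdp_q_update(s, o, discounted_reward, s_next, duration):
    # One SMDP backup: the option ran for `duration` steps, so the tail
    # value is discounted by gamma ** duration rather than gamma.
    target = discounted_reward + (gamma ** duration) * Q[s_next].max()
    Q[s, o] += alpha * (target - Q[s, o])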

MAXQ (Dietterich, 2000): factored value decomposition:

Q(parent, s, a) = V(a, s) + C(parent, s, a)

where V(a, s) is the value of executing subtask a from s, and C(parent, s, a) is the completion value: the expected return for finishing the parent task after subtask a completes. Subtask values and completion functions are learned separately, making subtasks modular and reusable.


3. Exploration: Count-Based and Curiosity-Driven

Classic exploration (ε-greedy, UCB): sufficient for small state spaces.

Large state spaces: exploration itself must generalize across states. Two main approaches:

Count-Based Exploration

Maintain pseudo-counts n̂(s) from density model ρ(s):

n̂(s) ≈ ρ(s) × total_steps    ← estimated count from density model
r_bonus(s) = β / √n̂(s)        ← exploration bonus

With neural density models (Bellemare et al., 2016), pseudo-count bonuses scale to Atari-level state spaces.
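A small sketch of a count-based bonus using static hashing (SimHash-style, in the spirit of Tang et al., 2017) as a simpler stand-in for a learned density model; the projection size and β value are arbitrary illustration choices:

import numpy as np

state_dim, code_bits, beta = 8, 16, 0.1
rng = np.random.default_rng(0)
A = rng.normal(size=(code_bits, state_dim))    # fixed random projection
counts = {}                                    # visit counts per hash code

def hash_code(s):
    # Discretize a continuous state vector into a binary code.
    return tuple((A @ s > 0).astype(int).tolist())

def exploration_bonus(s):
    code = hash_code(s)
    counts[code] = counts.get(code, 0) + 1
    return beta / np.sqrt(counts[code])         # bonus shrinks with visits

# usage: r_total = r_extrinsic + exploration_bonus(state_vector)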

Curiosity / Intrinsic Motivation

ICM (Intrinsic Curiosity Module, Pathak et al., 2017):

Forward model: φ̂_{t+1} = f(φ(s_t), a_t; θ_F)    ← predict next state's learned features
r_intrinsic = ||φ̂_{t+1} - φ(s_{t+1})||²          ← surprise = intrinsic reward

Rewards the agent for transitions it cannot yet predict, which drives exploration. Predicting in a learned feature space φ (trained jointly with an inverse model) keeps the bonus focused on aspects of the state the agent can influence rather than on unpredictable noise.
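A linear sketch of forward-model curiosity in the spirit of ICM, assuming a fixed random projection in place of the learned feature encoder (the real ICM learns φ via an inverse model); the class and parameter names are illustrative:

import numpy as np

class ForwardCuriosity:
    def __init__(self, state_dim=8, n_actions=4, feat_dim=16, lr=1e-2, seed=1):
        rng = np.random.default_rng(seed)
        # Fixed random feature encoder standing in for the learned phi.
        self.phi = rng.normal(size=(feat_dim, state_dim)) / np.sqrt(state_dim)
        # Linear forward model: predicts next features from (features, one-hot action).
        self.W = np.zeros((feat_dim, feat_dim + n_actions))
        self.n_actions, self.lr = n_actions, lr

    def intrinsic_reward(self, s, a, s_next):
        f, f_next = self.phi @ s, self.phi @ s_next
        x = np.concatenate([f, np.eye(self.n_actions)[a]])
        err = self.W @ x - f_next                  # forward-model prediction error
        self.W -= self.lr * np.outer(err, x)       # train the forward model (MSE gradient)
        return float(err @ err)                    # surprise = intrinsic reward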

RND (Random Network Distillation, Burda et al., 2018):

Fixed random network: e(s) = random_network(s)
Learned predictor:    ê(s; φ) = predictor(s; φ)
r_intrinsic = ||e(s) - ê(s)||²    ← prediction error

Simpler and more stable than forward-model curiosity: the target is a fixed random network, so there are no environment dynamics to model and no moving prediction target.
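An analogous linear sketch of RND, assuming linear maps in place of the deep target and predictor networks and omitting the observation/reward normalization used in practice:

import numpy as np

class RandomNetworkDistillation:
    def __init__(self, state_dim=8, embed_dim=16, lr=1e-2, seed=2):
        rng = np.random.default_rng(seed)
        self.target = rng.normal(size=(embed_dim, state_dim))    # fixed, never trained
        self.predictor = np.zeros((embed_dim, state_dim))        # trained to match target
        self.lr = lr

    def intrinsic_reward(self, s):
        err = self.predictor @ s - self.target @ s
        self.predictor -= self.lr * np.outer(err, s)   # fit predictor to the fixed target
        return float(err @ err)                        # large for rarely seen states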

NCCL application: an intrinsic-motivation bonus for trying rarely used NCCL configurations encourages thorough exploration of the configuration space.


4. Learning with Human Feedback (RLHF)

RLHF (Ziegler et al., 2019; Ouyang et al., 2022 [InstructGPT]):
1. Collect human preference data: show two outputs, ask which is better
2. Train a reward model r_φ(s, a) from the preferences
3. Optimize the LM policy with PPO to maximize r_φ
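A sketch of step 2, assuming a linear reward model over hand-built features and the pairwise logistic (Bradley–Terry) loss commonly used for preference learning; function names and the dataset format are illustrative:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def preference_grad(phi, x_preferred, x_rejected):
    # Loss: -log sigmoid(r(preferred) - r(rejected)), with r(x) = phi . x
    p = sigmoid(phi @ x_preferred - phi @ x_rejected)
    return -(1.0 - p) * (x_preferred - x_rejected)      # gradient of the loss w.r.t. phi

def train_reward_model(preference_pairs, dim, lr=0.1, epochs=50):
    phi = np.zeros(dim)
    for _ in range(epochs):
        for x_pref, x_rej in preference_pairs:          # each pair: (preferred, rejected) features
            phi -= lr * preference_grad(phi, x_pref, x_rej)
    return phi                                          # r_phi(x) = phi @ x feeds the policy-optimization stage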

Relevance: the “reward model” extends RL to settings where reward is hard to specify programmatically but humans can evaluate outcomes.

DynamICCL extension: human experts could label “good” vs. “bad” NCCL configuration decisions based on observed system behavior; a reward model is trained from these labels; PPO then optimizes the NCCL policy to maximize the reward model’s output.


5. Multi-Task and Transfer Learning

Problem: learning a new task from scratch is slow. Can the agent reuse knowledge from related tasks?

Transfer approaches:
1. Fine-tuning: train on the source task, fine-tune on the target task
2. Universal value functions (UVFs): V(s, g; θ) conditioned on a goal g
3. Meta-RL: learn to learn, with an inner loop (task-specific adaptation) and an outer loop (meta-parameters)
4. MAML (Model-Agnostic Meta-Learning): find θ that can be fine-tuned quickly on a new task
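A first-order MAML sketch on toy 1-D regression tasks (supervised rather than RL, but it makes the inner-loop / outer-loop structure explicit); the task distribution and step sizes are arbitrary illustration choices:

import numpy as np

rng = np.random.default_rng(3)
alpha, beta = 0.1, 0.01          # inner (adaptation) and outer (meta) step sizes
theta = 0.0                      # meta-parameter: shared initialization

def task_grad(theta, w_task, xs):
    # Gradient of MSE for the model y_hat = theta * x on the task y = w_task * x.
    return float(np.mean(2.0 * (theta * xs - w_task * xs) * xs))

for meta_step in range(1000):
    w_task = rng.uniform(0.5, 2.0)                                    # sample a task
    xs_adapt, xs_eval = rng.normal(size=16), rng.normal(size=16)
    theta_task = theta - alpha * task_grad(theta, w_task, xs_adapt)   # inner loop: adapt to the task
    meta_grad = task_grad(theta_task, w_task, xs_eval)                # first-order approximation
    theta -= beta * meta_grad                                         # outer loop: meta-update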

For DynamICCL: different models (ResNet, Transformer, BERT) have different communication patterns. Transferring knowledge from one model to another could dramatically reduce the time needed to train the NCCL-tuning policy for a new model.


6. Multi-Agent RL (MARL)

Standard RL: single agent. Real systems: multiple agents interacting.

Cooperative MARL (centralized training, decentralized execution):
- Each agent has its own observation o_i and action a_i
- A central critic conditions on all agents’ information during training
- Agents act independently on their local observations at execution time

Algorithms:
- MAPPO (Multi-Agent PPO): PPO with a centralized value function; agents typically share policy parameters
- QMIX: factored joint value Q_joint = f_mono(Q_1, …, Q_n) with a monotonic mixing network
- MADDPG: deterministic policy gradients with a centralized critic
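A minimal sketch of the QMIX idea: per-agent utilities are combined by a mixer constrained to be monotonic in each input (non-negative weights), so each agent's greedy local action is consistent with the joint greedy action. Real QMIX produces the weights with a hypernetwork conditioned on the global state; here they are fixed for illustration:

import numpy as np

n_agents = 4
rng = np.random.default_rng(4)
w = np.abs(rng.normal(size=n_agents))    # non-negative weights => monotonic mixing
b = 0.0                                  # state-dependent bias in real QMIX

def q_joint(per_agent_q):
    # per_agent_q: shape (n_agents,), each agent's Q-value for its chosen action.
    return float(w @ np.asarray(per_agent_q) + b)

Monotonicity is what enables decentralized execution: maximizing each Q_i individually also maximizes Q_joint.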

NCCL relevance: in a multi-GPU distributed training setup, each GPU is an agent choosing local NCCL parameters. Cooperative MARL could coordinate them.


7. Offline / Batch RL

Standard RL: agent must interact with environment to collect data.

Offline RL: fixed dataset D = {(s,a,r,s’)} from past experience; learn a good policy without further environment interaction.

Challenge: distribution shift — optimal policy may take actions not well-represented in D.

Algorithms:
- BCQ (Batch-Constrained Q-learning): constrain the policy to actions close to those in the dataset
- CQL (Conservative Q-Learning): learn conservative Q-values that lower-bound the true values, preventing overestimation of out-of-distribution actions
- IQL (Implicit Q-Learning): fits value functions without ever querying Q-values of out-of-distribution actions
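A sketch of the CQL penalty for discrete actions, added on top of an ordinary TD loss; the single-state interface and the alpha_cql value are simplifications for illustration:

import numpy as np

def cql_penalty(q_values_s, a_data):
    # q_values_s: Q(s, .) for one dataset state, shape (n_actions,).
    # Push down a soft maximum over all actions (including out-of-distribution
    # ones) and push up the Q-value of the action actually taken in the dataset.
    m = q_values_s.max()
    logsumexp = m + np.log(np.sum(np.exp(q_values_s - m)))
    return logsumexp - q_values_s[a_data]

def cql_loss(q_values_s, a_data, td_loss, alpha_cql=1.0):
    return td_loss + alpha_cql * cql_penalty(q_values_s, a_data)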

DynamICCL application: offline RL from historical cluster logs could learn a good NCCL policy from thousands of past training jobs without running new experiments.


8. Model-Based RL: World Models

World model (Ha & Schmidhuber, 2018; DreamerV1-3):

Encoder:    z_t = enc(o_t)          ← observation → latent state
Dynamics:   z_{t+1} ~ p(z|z_t, a_t) ← transition in latent space
Reward:     r̂_t = rew(z_t, a_t)
Decode:     ô_t = dec(z_t)          ← for visualization/auxiliary loss

Policy learning entirely in imagination:
1. Collect some real experience
2. Train the world model on the real experience
3. Generate “imagined” trajectories by rolling out the world model
4. Train the policy on the imagined trajectories
5. Repeat
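A structure-only sketch of that loop; every helper below is a trivially stubbed placeholder with an invented name, not DreamerV3's actual API:

def collect_experience(policy, steps):           # 1. interact with the real environment (stub)
    return [("obs", "action", 0.0)] * steps

def train_world_model(model, buffer):            # 2. fit encoder, dynamics, reward heads (stub)
    return model

def imagine_rollouts(model, policy, horizon):    # 3. roll the policy forward in latent space (stub)
    return [("latent", "action", 0.0)] * horizon

def train_policy_on(policy, imagined):           # 4. actor-critic update on imagined data (stub)
    return policy

model, policy, buffer = {}, {}, []
for iteration in range(10):                      # 5. repeat
    buffer += collect_experience(policy, steps=100)
    model = train_world_model(model, buffer)
    imagined = imagine_rollouts(model, policy, horizon=15)
    policy = train_policy_on(policy, imagined)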

DreamerV3 achieves strong results across a wide range of environments while requiring comparatively few real environment interactions.


9. Reward Design and Inverse RL

Reward design challenge: specifying a reward that captures the true objective without reward hacking.

Inverse RL (IRL): infer the reward function from expert demonstrations:

Given: expert trajectories τ_1, ..., τ_n
Find: reward function R* such that expert behavior is optimal under R*

Maximum entropy IRL (Ziebart et al., 2008):

R* = argmax_R Σ_i log P_R(τ_i) - λ ||R||²
P_R(τ) ∝ exp(Σ_t R(s_t, a_t))    ← Boltzmann distribution over trajectories

GAIL (Generative Adversarial Imitation Learning): combines IRL with GANs:
- Discriminator D(s, a): distinguishes expert state–action pairs from the learned policy’s
- Generator (the policy): tries to produce pairs that fool D
- Policy reward derived from the discriminator, e.g. reward = log D(s, a) when D(s, a) is the probability of “expert”
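A linear sketch of the GAIL-style reward signal, assuming a logistic discriminator over hand-built (s, a) features; the class and function names are illustrative, and a neural discriminator would replace the linear one in practice:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class Discriminator:
    def __init__(self, dim, lr=0.05):
        self.w = np.zeros(dim)
        self.lr = lr

    def prob_expert(self, x):        # D(s, a): probability the pair came from the expert
        return sigmoid(self.w @ x)

    def update(self, expert_x, policy_x):
        # Gradient ascent on log D(expert) + log(1 - D(policy)).
        self.w += self.lr * (1.0 - self.prob_expert(expert_x)) * expert_x
        self.w -= self.lr * self.prob_expert(policy_x) * policy_x

def gail_reward(disc, x):
    # The policy is rewarded for pairs the discriminator considers expert-like.
    return float(np.log(disc.prob_expert(x) + 1e-12))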

NCCL application: infer reward function from expert human engineers’ NCCL configuration choices.


Chapter 17 Key Frontiers Summary

Frontier                            | Description                      | DynamICCL Relevance
GVFs / Predictive knowledge         | Many predictions simultaneously  | Predict future network state
Options / HRL                       | Temporal abstraction             | Macro-NCCL actions per job phase
Count-based / curiosity exploration | Generalized exploration          | Novel NCCL config exploration
RLHF                                | Human preference reward          | Expert NCCL preference labeling
Multi-task / meta-RL                | Transfer across tasks            | Transfer across model architectures
MARL                                | Multiple cooperating agents      | Multi-GPU cooperative NCCL
Offline RL                          | Learn from historical logs       | Historical cluster log training
World models                        | Plan in simulation               | NCCL network simulator
Inverse RL                          | Infer reward from experts        | Learn NCCL objective from engineers

The Grand Challenge

Sutton’s bitter lesson (2019): the biggest lesson of 70 years of AI research is that general methods that leverage computation ultimately win out over methods that build in human knowledge. RL combined with deep learning is the archetype of such a scalable, general method.

Reward is enough (Silver et al., 2021): an extension of the reward hypothesis claiming that intelligence and its associated abilities (perception, knowledge, planning, social intelligence) can be understood as emerging from the maximization of a single scalar reward signal. DynamICCL is a microcosm of this: complex NCCL optimization behavior emerges from maximizing a throughput reward.


Connection to DynamICCL: Future Research Directions

  1. World model: build neural network NCCL simulator → DreamerV3-style training
  2. Options: hierarchical control (job-level NCCL strategy → step-level execution)
  3. Offline RL: mine historical cluster logs for cold-start policy
  4. MARL: multi-GPU cooperative NCCL optimization
  5. Curiosity: intrinsic reward for novel NCCL configs → thorough exploration
  6. GVFs: predict congestion, memory pressure, collective operation density