Reinforcement Learning Fundamentals
Chapter 22 — Reinforcement Learning
Book: Artificial Intelligence: A Modern Approach (Russell & Norvig, 4th ed.)
Pages: 810–867
The RL Problem
An agent interacts with an environment and receives rewards, but has no access to a model of that environment. It must learn from experience to maximize cumulative reward.
Key distinction from supervised learning: no labeled examples — only reward signals (often sparse, delayed).
Key distinction from planning (MDPs): transition model P(s'|s,a) and reward R(s,a) are unknown — must be learned from experience.
RL Framework
Agent-environment interaction:
State s_t → Action a_t → Reward r_t + Next state s_{t+1} → ...
Goal: find policy π that maximizes:
G_t = Σ_{k=0}^∞ γᵏ · r_{t+k+1} -- return (discounted sum of future rewards)
Value functions (predictions of future reward):
V^π(s) = E_π[G_t | S_t=s]
Q^π(s,a) = E_π[G_t | S_t=s, A_t=a]
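The return G_t can be computed by folding rewards backwards through the episode. A minimal sketch (the reward sequence and γ below are illustrative, not from the text):

```python
def discounted_return(rewards, gamma):
    """G_t = sum_k gamma^k * r_{t+k+1}, computed backwards over a finite episode."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g          # recursion: G_t = r_{t+1} + gamma * G_{t+1}
    return g

# Example episode with rewards [1, 0, 2] and gamma = 0.5:
print(discounted_return([1.0, 0.0, 2.0], gamma=0.5))  # 1 + 0.5*0 + 0.25*2 = 1.5
```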
Passive RL: Policy Evaluation
Agent follows a fixed policy π and learns V^π from experience.
Temporal Difference (TD) Learning
V(s) ← V(s) + α [r + γ·V(s') - V(s)]
TD target: r + γ·V(s') — one-step bootstrapped estimate
TD error: δ = r + γ·V(s') - V(s)
TD(0) updates immediately after each step; it converges to V^π given a suitably decreasing step size α.
Monte Carlo: update only at the end of an episode with the actual return G_t. Higher variance, but unbiased.
TD(λ): interpolates between TD(0) and MC using eligibility traces.
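As a sketch, TD(0) evaluating a fixed policy on an invented three-state chain (deterministic right-moving policy, reward 1 on reaching the terminal state; all values here are assumptions for illustration):

```python
def td0_evaluate(episodes=200, alpha=0.1, gamma=1.0):
    """TD(0) evaluation of a fixed 'always move right' policy on states 0->1->2."""
    V = [0.0, 0.0, 0.0]                        # state 2 is terminal
    for _ in range(episodes):
        s = 0
        while s != 2:
            s_next = s + 1                     # fixed policy: move right
            r = 1.0 if s_next == 2 else 0.0    # reward only on reaching terminal
            bootstrap = 0.0 if s_next == 2 else V[s_next]
            V[s] += alpha * (r + gamma * bootstrap - V[s])   # TD(0) update
            s = s_next
    return V

V = td0_evaluate()
print(V)   # both non-terminal values approach 1.0 (the true V under gamma=1)
```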
Active RL: Policy Improvement
Agent must also choose actions to maximize reward.
Exploration vs. Exploitation Tradeoff
- Exploit: choose action with highest known value
- Explore: try actions to learn better values
ε-greedy: with prob ε choose random action; else choose argmax Q.
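A minimal ε-greedy selector (the Q-values below are invented for illustration):

```python
import random

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                    # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])   # exploit

random.seed(0)
q = [0.1, 0.5, 0.3]
picks = [epsilon_greedy(q, epsilon=0.1) for _ in range(1000)]
print(picks.count(1) / 1000)   # mostly the greedy action (index 1)
```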
Q-Learning (Off-Policy TD Control)
Q(s,a) ← Q(s,a) + α [r + γ·max_{a'} Q(s',a') - Q(s,a)]
Off-policy: target policy = greedy (best known), behavior policy = ε-greedy.
Convergence: Q-learning converges to Q* with probability 1 if:
1. Every (s,a) pair is visited infinitely often
2. The learning rates α satisfy Σα = ∞ and Σα² < ∞
Proof sketch: Q-learning update = contraction mapping toward Q*; same as value iteration but with samples.
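A sketch of tabular Q-learning on an invented four-state chain (actions left/right, reward 1 at the right end; all hyperparameters are assumptions):

```python
import random

def q_learning(n_states=4, episodes=500, alpha=0.5, gamma=0.9, eps=0.1, seed=1):
    """Tabular Q-learning; state n_states-1 is terminal with reward 1 on entry."""
    random.seed(seed)
    Q = [[0.0, 0.0] for _ in range(n_states)]       # actions: 0 = left, 1 = right
    for _ in range(episodes):
        s = 0
        while s != n_states - 1:
            if random.random() < eps:               # epsilon-greedy behavior policy
                a = random.randrange(2)
            else:
                a = 1 if Q[s][1] >= Q[s][0] else 0
            s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
            r = 1.0 if s2 == n_states - 1 else 0.0
            bootstrap = 0.0 if s2 == n_states - 1 else max(Q[s2])
            Q[s][a] += alpha * (r + gamma * bootstrap - Q[s][a])  # off-policy max backup
            s = s2
    return Q

Q = q_learning()
print([max(q) for q in Q[:-1]])   # approaches [0.81, 0.9, 1.0] = gamma^(steps-to-goal - 1)
```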
SARSA (On-Policy TD Control)
Q(s,a) ← Q(s,a) + α [r + γ·Q(s',a') - Q(s,a)]
Where a' is the action actually taken (sampled from the current ε-greedy policy).
On-policy: learns Q^π, the value of the policy actually being followed (including its exploration), not the greedy policy.
Q-learning is more aggressive (it learns the optimal greedy path even when exploration makes following it risky); SARSA is safer (it accounts for its own exploration, so it learns to avoid dangerous actions — e.g. in cliff walking, SARSA prefers the longer path away from the cliff edge).
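For contrast, a SARSA sketch on the same kind of invented chain — the backup uses the action a' actually sampled from the ε-greedy policy, not the max:

```python
import random

def sarsa(n_states=4, episodes=500, alpha=0.5, gamma=0.9, eps=0.1, seed=1):
    """Tabular SARSA on a left/right chain; state n_states-1 is terminal."""
    random.seed(seed)
    Q = [[0.0, 0.0] for _ in range(n_states)]

    def eps_greedy(s):
        if random.random() < eps:
            return random.randrange(2)
        return 1 if Q[s][1] >= Q[s][0] else 0

    for _ in range(episodes):
        s = 0
        a = eps_greedy(s)
        while True:
            s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
            r = 1.0 if s2 == n_states - 1 else 0.0
            if s2 == n_states - 1:
                Q[s][a] += alpha * (r - Q[s][a])   # terminal: no bootstrap
                break
            a2 = eps_greedy(s2)                    # the action actually taken next
            Q[s][a] += alpha * (r + gamma * Q[s2][a2] - Q[s][a])  # on-policy backup
            s, a = s2, a2
    return Q

Q = sarsa()
print([max(q) for q in Q[:-1]])   # right action preferred in every non-terminal state
```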
Deep Q-Networks (DQN)
Q-table doesn’t scale to large state spaces. Approximate Q(s,a) with a neural network:
Q(s,a; θ) ≈ Q*(s,a)
DQN innovations (Mnih et al., 2015):
1. Experience replay: store transitions in a replay buffer; sample random minibatches
   - Breaks temporal correlations → stabilizes training
2. Target network: a separate parameter set θ⁻ for computing targets, updated slowly
   - Prevents chasing a moving target
Loss: L(θ) = E_{(s,a,r,s')~D}[(r + γ·max_{a'} Q(s',a';θ⁻) - Q(s,a;θ))²]
DQN achieved human-level performance on 49 Atari games from raw pixels.
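The experience-replay idea can be sketched in plain Python (no network; capacity and transition contents are invented):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of (s, a, r, s_next, done) transitions."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)    # oldest transitions evicted first

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # A uniformly random minibatch breaks the temporal correlation
        # between consecutive environment steps.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=100)
for t in range(150):                  # push more than capacity
    buf.push(t, 0, 0.0, t + 1, False)
print(len(buf), buf.buffer[0][0])     # 100 50 -- first 50 transitions evicted
```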
Policy Gradient Methods
Instead of learning V/Q and deriving policy, directly optimize policy parameters θ.
Policy gradient theorem:
∇_θ J(θ) = E_π[∇_θ log π(a|s;θ) · Q^π(s,a)]
REINFORCE (Monte Carlo policy gradient):
θ ← θ + α · G_t · ∇_θ log π(a_t|s_t;θ)
High variance — reduce with baseline b(s):
θ ← θ + α · (G_t - b(s_t)) · ∇_θ log π(a_t|s_t;θ)
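A sketch of REINFORCE with a running-average baseline on a two-armed bandit (episodes are one step long, so G_t = r; the bandit and all hyperparameters are invented):

```python
import math
import random

def reinforce(steps=2000, alpha=0.1, seed=0):
    """Softmax policy over two arms; arm 1 pays 1.0, arm 0 pays 0.0."""
    random.seed(seed)
    theta = [0.0, 0.0]
    baseline = 0.0
    for _ in range(steps):
        e = [math.exp(t - max(theta)) for t in theta]
        p = [x / sum(e) for x in e]            # pi(a; theta) = softmax(theta)
        a = 0 if random.random() < p[0] else 1
        r = 1.0 if a == 1 else 0.0             # one-step episode: G_t = r
        baseline += 0.05 * (r - baseline)      # running-average baseline b(s)
        adv = r - baseline                     # (G_t - b)
        for k in range(2):                     # grad log pi wrt theta_k = 1[k=a] - p[k]
            theta[k] += alpha * adv * ((1.0 if k == a else 0.0) - p[k])
    e = [math.exp(t - max(theta)) for t in theta]
    return [x / sum(e) for x in e]

p = reinforce()
print(p)   # probability mass concentrates on the better arm (index 1)
```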
Actor-Critic Methods
Combine policy gradient (actor) with value function approximation (critic):
Actor: θ ← θ + α · δ · ∇_θ log π(a|s;θ)
Critic: w ← w + β · δ · ∇_w V(s;w)
TD error: δ = r + γ·V(s';w) - V(s;w)
Advantage function: A(s,a) = Q(s,a) - V(s) — reduces variance vs. raw Q.
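A one-step actor-critic sketch on an invented single-state, two-action problem; the TD error δ plays the role of the advantage:

```python
import math
import random

def actor_critic(steps=3000, alpha=0.05, beta=0.1, seed=0):
    """Single state, two actions; arm 1 pays 1.0, arm 0 pays 0.2. Episodes
    are one step long, so the TD target is just r (the gamma term vanishes)."""
    random.seed(seed)
    theta = [0.0, 0.0]   # actor parameters (softmax logits)
    v = 0.0              # critic's value estimate for the lone state
    for _ in range(steps):
        e = [math.exp(t - max(theta)) for t in theta]
        p = [x / sum(e) for x in e]
        a = 0 if random.random() < p[0] else 1
        r = 1.0 if a == 1 else 0.2
        delta = r - v                          # TD error delta = r + gamma*V(s') - V(s)
        for k in range(2):                     # actor: theta += alpha*delta*grad log pi
            theta[k] += alpha * delta * ((1.0 if k == a else 0.0) - p[k])
        v += beta * delta                      # critic: move V(s) toward the target
    e = [math.exp(t - max(theta)) for t in theta]
    return [x / sum(e) for x in e]

p = actor_critic()
print(p)   # actor concentrates on the better action (index 1)
```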
Modern variants:
- A3C/A2C: asynchronous (A3C) or synchronous (A2C) parallel actors
- PPO (Proximal Policy Optimization): clipped ratio objective → stable training
  L_CLIP = E[min(rₜ·Â, clip(rₜ, 1-ε, 1+ε)·Â)] where rₜ = π_new(a|s)/π_old(a|s)
- SAC (Soft Actor-Critic): maximum entropy RL — maximize E[R] + H(π)
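The PPO clipped term for a single (ratio, advantage) pair, as a sketch (the numbers are illustrative; real PPO averages this over a batch and negates it as a loss):

```python
def ppo_clip_term(ratio, advantage, eps=0.2):
    """min(r*A, clip(r, 1-eps, 1+eps)*A) -- a pessimistic (lower) bound."""
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped * advantage)

# Positive advantage: large ratios are clipped, limiting the update size.
print(ppo_clip_term(1.5, 1.0))    # ratio clipped at 1.2
# Negative advantage: min() keeps the worse (more negative) term.
print(ppo_clip_term(0.5, -1.0))   # -0.8
```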
Summary Table
| Algorithm | Type | Model | On/Off policy |
|---|---|---|---|
| TD(0) / TD(λ) | Value | Free | On |
| Q-learning | Value | Free | Off |
| SARSA | Value | Free | On |
| DQN | Value (deep) | Free | Off |
| REINFORCE | Policy gradient | Free | On |
| A3C / PPO | Actor-critic | Free | On |
| SAC | Actor-critic + entropy | Free | Off |
| Dyna | Model-based | Learned | Both |
Connection to DynamICCL
DynamICCL = RL for NCCL optimization:
- State: observed bandwidth, latency, GPU utilization, current NCCL parameters
- Action: choose algorithm (ring/tree/recursive), buffer size, chunk size
- Reward: throughput improvement over baseline
- Algorithm: PPO or SAC for continuous/discrete parameter spaces
- Exploration: ε-greedy over algorithm choices; continuous exploration in SAC via entropy