Reinforcement Learning Fundamentals

Chapter 22: Reinforcement Learning
Source: Artificial Intelligence: A Modern Approach (Russell & Norvig, 4th ed.), pages 810–867


The RL Problem

An agent interacts with an environment and receives rewards, but has no access to a model of that environment. It must learn from experience to maximize cumulative reward.

Key distinction from supervised learning: no labeled examples — only reward signals (often sparse, delayed).

Key distinction from planning (MDPs): the transition model P(s'|s,a) and reward R(s,a) are unknown and must be learned from experience.


RL Framework

Agent-environment interaction:

State s_t → Action a_t → Reward r_{t+1}, Next state s_{t+1} → ...

Goal: find policy π that maximizes:

G_t = Σ_{k=0}^∞ γᵏ · r_{t+k+1}    -- return (discounted sum of future rewards)

Value functions (predictions of future reward):

V^π(s) = E_π[G_t | S_t=s]
Q^π(s,a) = E_π[G_t | S_t=s, A_t=a]
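
For concreteness, here is a tiny Python sketch (not from the chapter) of computing the return from a recorded reward sequence; V^π(s) and Q^π(s,a) are expectations of exactly this quantity under π.

```python
def discounted_return(rewards, gamma=0.99):
    """Compute G_t = sum_k gamma^k * r_{t+k+1} for t = 0, given a reward sequence."""
    g = 0.0
    # Accumulate backwards using the recursion G_t = r_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 0.0, 10.0], gamma=0.9))  # 1 + 0.9**3 * 10 = 8.29
```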

Passive RL: Policy Evaluation

Agent follows a fixed policy π and learns V^π from experience.

Temporal Difference (TD) Learning

V(s) ← V(s) + α [r + γ·V(s') - V(s)]

TD target: r + γ·V(s')   (one-step bootstrapped estimate)
TD error:  δ = r + γ·V(s') - V(s)

TD(0) updates immediately after each step. Converges to V^π provided α decreases appropriately over time (Σα = ∞, Σα² < ∞).

Monte Carlo: update only at the end of an episode, using the actual return G_t. Unbiased, but higher variance than TD.

TD(λ): interpolates between TD(0) and MC using eligibility traces.
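
A minimal tabular TD(0) evaluation sketch, assuming a gym-style environment interface (reset()/step()) and a fixed policy function; both are placeholders, not part of the chapter.

```python
from collections import defaultdict

def td0_evaluate(env, policy, episodes=1000, alpha=0.1, gamma=0.99):
    """Tabular TD(0) policy evaluation.

    Assumes env.reset() -> state and env.step(a) -> (state, reward, done),
    and policy(state) -> action; these interfaces are illustrative assumptions.
    """
    V = defaultdict(float)                      # V(s), initialized to 0
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            target = r + (0.0 if done else gamma * V[s_next])   # TD target
            V[s] += alpha * (target - V[s])                     # V(s) += alpha * TD error
            s = s_next
    return V
```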


Active RL: Policy Improvement

Agent must also choose actions to maximize reward.

Exploration vs. Exploitation Tradeoff

ε-greedy: with probability ε choose a random action; otherwise choose argmax_a Q(s,a).
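
A small illustrative helper for ε-greedy action selection, assuming Q is stored as a dict keyed by (state, action) pairs (an assumption for this sketch, not from the chapter).

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)                              # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))      # exploit
```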


Q-Learning (Off-Policy TD Control)

Q(s,a) ← Q(s,a) + α [r + γ·max_{a'} Q(s',a') - Q(s,a)]

Off-policy: target policy = greedy (best known), behavior policy = ε-greedy.

Convergence: Q converges to Q* with probability 1 if:
1. Every (s,a) pair is visited infinitely often
2. The learning rates α satisfy Σα = ∞, Σα² < ∞

Proof sketch: the Q-learning update is a (sampled) contraction mapping toward Q*; it is essentially value iteration performed with samples instead of the known model.
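
A tabular Q-learning sketch under the same assumed gym-style interface as above; the environment and action list are placeholders for illustration.

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=5000, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning with an epsilon-greedy behavior policy."""
    Q = defaultdict(float)                                   # Q[(s, a)]
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Behavior policy: epsilon-greedy over the current Q estimates.
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q[(s, x)])
            s_next, r, done = env.step(a)
            # Target policy: greedy (max over a'), hence "off-policy".
            best_next = 0.0 if done else max(Q[(s_next, x)] for x in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```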


SARSA (On-Policy TD Control)

Q(s,a) ← Q(s,a) + α [r + γ·Q(s',a') - Q(s,a)]

Where a' is the action actually taken in s' (sampled from the current ε-greedy policy).

On-policy: learns Q^π, the value of the policy actually being followed (including its exploration), not of the greedy policy.

Q-learning is more aggressive: it learns the optimal policy even if following it during exploration is risky. SARSA is safer: it learns to avoid actions whose exploratory consequences are costly (the classic cliff-walking example).
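
For contrast with the Q-learning sketch above, a SARSA sketch under the same assumed interface; the only substantive change is that the target uses Q(s', a') for the next action actually chosen rather than max_{a'} Q(s', a').

```python
import random
from collections import defaultdict

def sarsa(env, actions, episodes=5000, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular SARSA with an epsilon-greedy policy (gym-style env assumed)."""
    def behave(Q, s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda x: Q[(s, x)])

    Q = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()
        a = behave(Q, s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = behave(Q, s_next)                 # the action actually taken next
            target = r if done else r + gamma * Q[(s_next, a_next)]   # no max: on-policy
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s_next, a_next
    return Q
```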


Deep Q-Networks (DQN)

Q-table doesn’t scale to large state spaces. Approximate Q(s,a) with a neural network:

Q(s,a; θ) ≈ Q*(s,a)

DQN innovations (Mnih et al., 2015):
1. Experience replay: store transitions in a replay buffer; sample random minibatches
   - Breaks temporal correlations → stabilizes training
2. Target network: a separate, slowly updated copy θ⁻ used to compute targets
   - Prevents chasing a moving target

Loss: L(θ) = E_{(s,a,r,s')~D}[(r + γ·max_{a'} Q(s',a';θ⁻) - Q(s,a;θ))²]

DQN achieved human-level performance on 49 Atari games from raw pixels.
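
A minimal PyTorch sketch of the loss above with a target network, using dummy tensors that stand in for replay-buffer samples; this shows the mechanics only, not the original Mnih et al. implementation (no replay buffer, exploration, or training loop).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, n_actions, gamma = 4, 2, 0.99

q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())          # theta^- starts as a copy of theta
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

# A dummy minibatch standing in for samples (s, a, r, s', done) from the buffer D.
s      = torch.randn(32, obs_dim)
a      = torch.randint(n_actions, (32,))
r      = torch.randn(32)
s_next = torch.randn(32, obs_dim)
done   = torch.zeros(32)

with torch.no_grad():                                    # targets use theta^-, no gradient
    target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)     # Q(s, a; theta) for the taken actions
loss = F.mse_loss(q_sa, target)

optimizer.zero_grad()
loss.backward()
optimizer.step()
# Periodically: target_net.load_state_dict(q_net.state_dict())
```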


Policy Gradient Methods

Instead of learning V/Q and deriving policy, directly optimize policy parameters θ.

Policy gradient theorem:

∇_θ J(θ) = E_π[∇_θ log π(a|s;θ) · Q^π(s,a)]

REINFORCE (Monte Carlo policy gradient):

θ ← θ + α · G_t · ∇_θ log π(a_t|s_t;θ)

High variance — reduce with baseline b(s):

θ ← θ + α · (G_t - b(s_t)) · ∇_θ log π(a_t|s_t;θ)
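
A PyTorch sketch of a REINFORCE update with a simple mean-return baseline on one fabricated episode; the network size, episode length, and data are placeholders.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

obs_dim, n_actions, gamma = 4, 2, 0.99
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

states  = torch.randn(10, obs_dim)          # s_0 .. s_9 from one fake episode
actions = torch.randint(n_actions, (10,))   # a_0 .. a_9
rewards = [1.0] * 10                        # r_1 .. r_10

# Compute returns backwards: G_t = r_{t+1} + gamma * G_{t+1}.
returns, g = [], 0.0
for r in reversed(rewards):
    g = r + gamma * g
    returns.append(g)
returns = torch.tensor(list(reversed(returns)))
baseline = returns.mean()                   # crude baseline b(s) = average return

log_probs = Categorical(logits=policy(states)).log_prob(actions)
loss = -((returns - baseline) * log_probs).sum()   # negate: gradient ascent on J(theta)

optimizer.zero_grad()
loss.backward()
optimizer.step()
```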

Actor-Critic Methods

Combine policy gradient (actor) with value function approximation (critic):

TD error: δ = r + γ·V(s';w) - V(s;w)
Actor:    θ ← θ + α · δ · ∇_θ log π(a|s;θ)
Critic:   w ← w + β · δ · ∇_w V(s;w)

Advantage function: A(s,a) = Q(s,a) - V(s) — reduces variance vs. raw Q.
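
A one-step actor-critic update in PyTorch on a single fabricated transition, using the TD error δ as the advantage estimate; the networks and data are illustrative placeholders, not a full training loop.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

obs_dim, n_actions, gamma = 4, 2, 0.99
actor  = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
critic = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, 1))
actor_opt  = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

s, s_next, r = torch.randn(obs_dim), torch.randn(obs_dim), 1.0   # one transition (s, a, r, s')

dist = Categorical(logits=actor(s))
a = dist.sample()

v_s, v_next = critic(s).squeeze(), critic(s_next).squeeze()
with torch.no_grad():
    delta = r + gamma * v_next - v_s          # TD error, used as a one-sample advantage

actor_loss  = -delta * dist.log_prob(a)                       # raise log-prob when delta > 0
critic_loss = (r + gamma * v_next.detach() - v_s).pow(2)      # regress V(s) toward the TD target

actor_opt.zero_grad();  actor_loss.backward();  actor_opt.step()
critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
```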

Modern variants:
- A3C / A2C: many parallel actors (asynchronous / synchronous) feeding a shared learner
- PPO (Proximal Policy Optimization): clipped ratio objective → stable training (see the clipped-loss sketch below)
  L_CLIP = E[min(rₜ·Â, clip(rₜ, 1-ε, 1+ε)·Â)]   where rₜ = π_new(a|s) / π_old(a|s)
- SAC (Soft Actor-Critic): maximum entropy RL: maximize E[R] + H(π)
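
A sketch of the PPO clipped surrogate loss on dummy log-probabilities and advantages; the clipping coefficient and batch size are arbitrary choices for illustration.

```python
import torch

eps = 0.2
log_prob_new = torch.randn(32, requires_grad=True)   # log pi_new(a_t | s_t), per sample
log_prob_old = torch.randn(32)                       # log pi_old(a_t | s_t), fixed
advantage    = torch.randn(32)                       # A-hat, e.g. from GAE

ratio = torch.exp(log_prob_new - log_prob_old)       # r_t = pi_new / pi_old
clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
# Negate because optimizers minimize; L_CLIP itself is maximized.
ppo_loss = -torch.min(ratio * advantage, clipped * advantage).mean()
ppo_loss.backward()
```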


Summary Table

Algorithm       Type                     Model     On/Off policy
TD(0) / TD(λ)   Value                    Free      On
Q-learning      Value                    Free      Off
SARSA           Value                    Free      On
DQN             Value (deep)             Free      Off
REINFORCE       Policy gradient          Free      On
A3C / PPO       Actor-critic             Free      On
SAC             Actor-critic + entropy   Free      Off
Dyna            Model-based              Learned   Both

Connection to DynamICCL

DynamICCL = RL for NCCL optimization:
- State: observed bandwidth, latency, GPU utilization, current NCCL parameters
- Action: choose algorithm (ring/tree/recursive), buffer size, chunk size
- Reward: throughput improvement over baseline
- Algorithm: PPO or SAC for continuous/discrete parameter spaces
- Exploration: ε-greedy over algorithm choices; continuous exploration in SAC via entropy
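
A purely hypothetical sketch of how such a formulation might be encoded; every class, field, and function name below is a placeholder for illustration, not an actual DynamICCL or NCCL API.

```python
from dataclasses import dataclass

@dataclass
class CollectiveState:
    bandwidth_gbps: float        # observed bus bandwidth
    latency_us: float            # observed latency
    gpu_util: float              # GPU utilization in [0, 1]
    algorithm: int               # current algorithm choice (e.g. 0=ring, 1=tree, 2=recursive)
    buffer_size: int             # current buffer size in bytes
    chunk_size: int              # current chunk size in bytes

@dataclass
class TuningAction:
    algorithm: int               # discrete choice: ring / tree / recursive
    buffer_size: int             # next buffer size to try
    chunk_size: int              # next chunk size to try

def reward(throughput_gbps: float, baseline_gbps: float) -> float:
    """Reward = throughput improvement over the untuned baseline."""
    return throughput_gbps - baseline_gbps
```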