Reinforcement Learning Fundamentals

Chapter 22: Reinforcement Learning
Source: Artificial Intelligence: A Modern Approach (Russell & Norvig, 4th ed.), pages 810–867


The RL Problem

An agent interacts with an environment and receives rewards, but has no access to a model of that environment. It must learn from experience to maximize cumulative reward.

Key distinction from supervised learning: no labeled examples — only reward signals (often sparse, delayed).

Key distinction from planning (MDPs): the transition model P(s'|s,a) and reward R(s,a) are unknown and must be learned from experience.


RL Framework

Agent-environment interaction:

State s_t → Action a_t → Reward r_{t+1}, Next state s_{t+1} → ...

Goal: find policy π that maximizes:

G_t = Σ_{k=0}^∞ γᵏ · r_{t+k+1}    -- return (discounted sum of future rewards)

Value functions (predictions of future reward):

V^π(s) = E_π[G_t | S_t=s]
Q^π(s,a) = E_π[G_t | S_t=s, A_t=a]
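
For concreteness, here is a tiny Python sketch (not from the chapter) of computing the return from a recorded reward sequence; V^π(s) and Q^π(s,a) are expectations of exactly this quantity under π.

```python
def discounted_return(rewards, gamma=0.99):
    """Compute G_t = sum_k gamma^k * r_{t+k+1} for t = 0, given a reward sequence."""
    g = 0.0
    # Accumulate backwards using the recursion G_t = r_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 0.0, 10.0], gamma=0.9))  # 1 + 0.9**3 * 10 = 8.29
```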

Passive RL: Policy Evaluation

Agent follows a fixed policy π and learns V^π from experience.

Temporal Difference (TD) Learning

V(s) ← V(s) + α [r + γ·V(s') - V(s)]

TD target: r + γ·V(s')   (one-step bootstrapped estimate)
TD error:  δ = r + γ·V(s') - V(s)

TD(0) updates immediately after each step. Converges to V^π provided α decreases appropriately over time (Σα = ∞, Σα² < ∞).

Monte Carlo: update only at the end of an episode, using the actual return G_t. Unbiased, but higher variance than TD.

TD(λ): interpolates between TD(0) and MC using eligibility traces.
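
A minimal tabular TD(0) evaluation sketch, assuming a gym-style environment interface (reset()/step()) and a fixed policy function; both are placeholders, not part of the chapter.

```python
from collections import defaultdict

def td0_evaluate(env, policy, episodes=1000, alpha=0.1, gamma=0.99):
    """Tabular TD(0) policy evaluation.

    Assumes env.reset() -> state and env.step(a) -> (state, reward, done),
    and policy(state) -> action; these interfaces are illustrative assumptions.
    """
    V = defaultdict(float)                      # V(s), initialized to 0
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            target = r + (0.0 if done else gamma * V[s_next])   # TD target
            V[s] += alpha * (target - V[s])                     # V(s) += alpha * TD error
            s = s_next
    return V
```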


Active RL: Policy Improvement

Agent must also choose actions to maximize reward.

Exploration vs. Exploitation Tradeoff

ε-greedy: with probability ε choose a random action; otherwise choose argmax_a Q(s,a).
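
A small illustrative helper for ε-greedy action selection, assuming Q is stored as a dict keyed by (state, action) pairs (an assumption for this sketch, not from the chapter).

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)                              # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))      # exploit
```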


Q-Learning (Off-Policy TD Control)

Q(s,a) ← Q(s,a) + α [r + γ·max_{a'} Q(s',a') - Q(s,a)]

Off-policy: target policy = greedy (best known), behavior policy = ε-greedy.

Convergence: Q converges to Q* with probability 1 if:
1. Every (s,a) pair is visited infinitely often
2. The learning rates α satisfy Σα = ∞, Σα² < ∞

Proof sketch: the Q-learning update is a (sampled) contraction mapping toward Q*; it is essentially value iteration performed with samples instead of the known model.
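
A tabular Q-learning sketch under the same assumed gym-style interface as above; the environment and action list are placeholders for illustration.

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=5000, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning with an epsilon-greedy behavior policy."""
    Q = defaultdict(float)                                   # Q[(s, a)]
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Behavior policy: epsilon-greedy over the current Q estimates.
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q[(s, x)])
            s_next, r, done = env.step(a)
            # Target policy: greedy (max over a'), hence "off-policy".
            best_next = 0.0 if done else max(Q[(s_next, x)] for x in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```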


SARSA (On-Policy TD Control)

Q(s,a) ← Q(s,a) + α [r + γ·Q(s',a') - Q(s,a)]

Where a' is the action actually taken in s' (sampled from the current ε-greedy policy).

On-policy: learns Q^π, the value of the policy actually being followed (including its exploration), not of the greedy policy.

Q-learning is more aggressive: it learns the optimal policy even if following it during exploration is risky. SARSA is safer: it learns to avoid actions whose exploratory consequences are costly (the classic cliff-walking example).
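
For contrast with the Q-learning sketch above, a SARSA sketch under the same assumed interface; the only substantive change is that the target uses Q(s', a') for the next action actually chosen rather than max_{a'} Q(s', a').

```python
import random
from collections import defaultdict

def sarsa(env, actions, episodes=5000, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular SARSA with an epsilon-greedy policy (gym-style env assumed)."""
    def behave(Q, s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda x: Q[(s, x)])

    Q = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()
        a = behave(Q, s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = behave(Q, s_next)                 # the action actually taken next
            target = r if done else r + gamma * Q[(s_next, a_next)]   # no max: on-policy
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s_next, a_next
    return Q
```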


Deep Q-Networks (DQN)

Q-table doesn’t scale to large state spaces. Approximate Q(s,a) with a neural network:

Q(s,a; θ) ≈ Q*(s,a)

DQN innovations (Mnih et al., 2015):
1. Experience replay: store transitions in a replay buffer; sample random minibatches
   - Breaks temporal correlations → stabilizes training
2. Target network: a separate, slowly updated copy θ⁻ used to compute targets
   - Prevents chasing a moving target

Loss: L(θ) = E_{(s,a,r,s')~D}[(r + γ·max_{a'} Q(s',a';θ⁻) - Q(s,a;θ))²]

DQN achieved human-level performance on 49 Atari games from raw pixels.
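
A minimal PyTorch sketch of the loss above with a target network, using dummy tensors that stand in for replay-buffer samples; this shows the mechanics only, not the original Mnih et al. implementation (no replay buffer, exploration, or training loop).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, n_actions, gamma = 4, 2, 0.99

q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())          # theta^- starts as a copy of theta
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

# A dummy minibatch standing in for samples (s, a, r, s', done) from the buffer D.
s      = torch.randn(32, obs_dim)
a      = torch.randint(n_actions, (32,))
r      = torch.randn(32)
s_next = torch.randn(32, obs_dim)
done   = torch.zeros(32)

with torch.no_grad():                                    # targets use theta^-, no gradient
    target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)     # Q(s, a; theta) for the taken actions
loss = F.mse_loss(q_sa, target)

optimizer.zero_grad()
loss.backward()
optimizer.step()
# Periodically: target_net.load_state_dict(q_net.state_dict())
```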


Policy Gradient Methods

Instead of learning V/Q and deriving policy, directly optimize policy parameters θ.

Policy gradient theorem:

∇_θ J(θ) = E_π[∇_θ log π(a|s;θ) · Q^π(s,a)]

REINFORCE (Monte Carlo policy gradient):

θ ← θ + α · G_t · ∇_θ log π(a_t|s_t;θ)

High variance — reduce with baseline b(s):

θ ← θ + α · (G_t - b(s_t)) · ∇_θ log π(a_t|s_t;θ)
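
A PyTorch sketch of a REINFORCE update with a simple mean-return baseline on one fabricated episode; the network size, episode length, and data are placeholders.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

obs_dim, n_actions, gamma = 4, 2, 0.99
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

states  = torch.randn(10, obs_dim)          # s_0 .. s_9 from one fake episode
actions = torch.randint(n_actions, (10,))   # a_0 .. a_9
rewards = [1.0] * 10                        # r_1 .. r_10

# Compute returns backwards: G_t = r_{t+1} + gamma * G_{t+1}.
returns, g = [], 0.0
for r in reversed(rewards):
    g = r + gamma * g
    returns.append(g)
returns = torch.tensor(list(reversed(returns)))
baseline = returns.mean()                   # crude baseline b(s) = average return

log_probs = Categorical(logits=policy(states)).log_prob(actions)
loss = -((returns - baseline) * log_probs).sum()   # negate: gradient ascent on J(theta)

optimizer.zero_grad()
loss.backward()
optimizer.step()
```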

Actor-Critic Methods

Combine policy gradient (actor) with value function approximation (critic):

TD error: δ = r + γ·V(s';w) - V(s;w)
Actor:    θ ← θ + α · δ · ∇_θ log π(a|s;θ)
Critic:   w ← w + β · δ · ∇_w V(s;w)

Advantage function: A(s,a) = Q(s,a) - V(s) — reduces variance vs. raw Q.
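
A one-step actor-critic update in PyTorch on a single fabricated transition, using the TD error δ as the advantage estimate; the networks and data are illustrative placeholders, not a full training loop.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

obs_dim, n_actions, gamma = 4, 2, 0.99
actor  = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
critic = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, 1))
actor_opt  = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

s, s_next, r = torch.randn(obs_dim), torch.randn(obs_dim), 1.0   # one transition (s, a, r, s')

dist = Categorical(logits=actor(s))
a = dist.sample()

v_s, v_next = critic(s).squeeze(), critic(s_next).squeeze()
with torch.no_grad():
    delta = r + gamma * v_next - v_s          # TD error, used as a one-sample advantage

actor_loss  = -delta * dist.log_prob(a)                       # raise log-prob when delta > 0
critic_loss = (r + gamma * v_next.detach() - v_s).pow(2)      # regress V(s) toward the TD target

actor_opt.zero_grad();  actor_loss.backward();  actor_opt.step()
critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
```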

Modern variants:
- A3C / A2C: many parallel actors (asynchronous / synchronous) feeding a shared learner
- PPO (Proximal Policy Optimization): clipped ratio objective → stable training (see the clipped-loss sketch below)
  L_CLIP = E[min(rₜ·Â, clip(rₜ, 1-ε, 1+ε)·Â)]   where rₜ = π_new(a|s) / π_old(a|s)
- SAC (Soft Actor-Critic): maximum entropy RL: maximize E[R] + H(π)
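
A sketch of the PPO clipped surrogate loss on dummy log-probabilities and advantages; the clipping coefficient and batch size are arbitrary choices for illustration.

```python
import torch

eps = 0.2
log_prob_new = torch.randn(32, requires_grad=True)   # log pi_new(a_t | s_t), per sample
log_prob_old = torch.randn(32)                       # log pi_old(a_t | s_t), fixed
advantage    = torch.randn(32)                       # A-hat, e.g. from GAE

ratio = torch.exp(log_prob_new - log_prob_old)       # r_t = pi_new / pi_old
clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
# Negate because optimizers minimize; L_CLIP itself is maximized.
ppo_loss = -torch.min(ratio * advantage, clipped * advantage).mean()
ppo_loss.backward()
```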


Summary Table

Algorithm       Type                     Model     On/Off policy
TD(0) / TD(λ)   Value                    Free      On
Q-learning      Value                    Free      Off
SARSA           Value                    Free      On
DQN             Value (deep)             Free      Off
REINFORCE       Policy gradient          Free      On
A3C / PPO       Actor-critic             Free      On
SAC             Actor-critic + entropy   Free      Off
Dyna            Model-based              Learned   Both

Connection to DynamICCL

DynamICCL = RL for NCCL optimization:
- State: observed bandwidth, latency, GPU utilization, current NCCL parameters
- Action: choose algorithm (ring/tree/recursive), buffer size, chunk size
- Reward: throughput improvement over baseline
- Algorithm: PPO or SAC for continuous/discrete parameter spaces
- Exploration: ε-greedy over algorithm choices; continuous exploration in SAC via entropy
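
A purely hypothetical sketch of how such a formulation might be encoded; every class, field, and function name below is a placeholder for illustration, not an actual DynamICCL or NCCL API.

```python
from dataclasses import dataclass

@dataclass
class CollectiveState:
    bandwidth_gbps: float        # observed bus bandwidth
    latency_us: float            # observed latency
    gpu_util: float              # GPU utilization in [0, 1]
    algorithm: int               # current algorithm choice (e.g. 0=ring, 1=tree, 2=recursive)
    buffer_size: int             # current buffer size in bytes
    chunk_size: int              # current chunk size in bytes

@dataclass
class TuningAction:
    algorithm: int               # discrete choice: ring / tree / recursive
    buffer_size: int             # next buffer size to try
    chunk_size: int              # next chunk size to try

def reward(throughput_gbps: float, baseline_gbps: float) -> float:
    """Reward = throughput improvement over the untuned baseline."""
    return throughput_gbps - baseline_gbps
```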