Advanced RL: Model-Based, Reward Shaping, and Applications
Model-Based RL
Learn a model of the environment (transition + reward), then plan with it.
Dyna architecture (Sutton, 1990):
while learning:
    -- Real experience
    take action a in state s, observe r, s'
    update Q(s,a) from real experience
    update model: M(s,a) ← (r, s')
    -- Simulated experience (planning)
    for n steps:
        s̃, ã ← random previously seen (state, action)
        r̃, s̃' ← M(s̃, ã)
        update Q(s̃,ã) from simulated experience
Benefit: n simulated steps per real step → much faster learning in data-limited settings.
Model error: simulated experience can be misleading if the model is wrong → requires a careful balance of real vs. simulated updates.
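A minimal tabular Dyna-Q sketch of the loop above, assuming a Gym-style environment (v0.26+ reset/step API) with hashable discrete states; the hyperparameter names and values are illustrative:

```python
import random
from collections import defaultdict

def dyna_q(env, episodes=200, alpha=0.1, gamma=0.95, eps=0.1, n_planning=10):
    """Tabular Dyna-Q: one real update per step, then n_planning simulated updates."""
    Q = defaultdict(float)   # Q[(s, a)]
    model = {}               # model[(s, a)] = (r, s', terminal) -- deterministic model
    actions = list(range(env.action_space.n))

    def greedy(s):
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s, _ = env.reset()
        done = False
        while not done:
            a = random.choice(actions) if random.random() < eps else greedy(s)
            s2, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            # Direct RL update from real experience
            target = r + (0.0 if terminated else gamma * Q[(s2, greedy(s2))])
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            # Model learning
            model[(s, a)] = (r, s2, terminated)
            # Planning: simulated updates from previously seen (state, action) pairs
            for _ in range(n_planning):
                (ps, pa), (pr, ps2, pterm) = random.choice(list(model.items()))
                ptarget = pr + (0.0 if pterm else gamma * Q[(ps2, greedy(ps2))])
                Q[(ps, pa)] += alpha * (ptarget - Q[(ps, pa)])
            s = s2
    return Q
```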
World Models
A deep network learns a compressed latent representation z_t and a forward model:
z_t = Encoder(s_t)
z_{t+1} = TransitionModel(z_t, a_t)
r_t = RewardModel(z_t, a_t)
Planning/RL in the latent space z — much cheaper than planning in raw observation space.
Examples: DreamerV3, MuZero (model-based MCTS + RL).
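A toy latent world model matching the three equations above, written with PyTorch. It is deterministic (unlike DreamerV3's stochastic recurrent state-space model), and the layer sizes, action encoding, and loss are illustrative assumptions:

```python
import torch
import torch.nn as nn

class LatentWorldModel(nn.Module):
    """Minimal latent world model: encoder, transition head, and reward head."""
    def __init__(self, obs_dim, act_dim, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        self.transition = nn.Sequential(nn.Linear(latent_dim + act_dim, 128), nn.ReLU(),
                                        nn.Linear(128, latent_dim))
        self.reward = nn.Sequential(nn.Linear(latent_dim + act_dim, 128), nn.ReLU(),
                                    nn.Linear(128, 1))

    def forward(self, obs, action):
        z = self.encoder(obs)                 # z_t = Encoder(s_t)
        za = torch.cat([z, action], dim=-1)   # action as a float/one-hot vector
        z_next = self.transition(za)          # z_{t+1} = TransitionModel(z_t, a_t)
        r_hat = self.reward(za).squeeze(-1)   # r_t = RewardModel(z_t, a_t)
        return z, z_next, r_hat

def world_model_loss(model, obs, action, reward, next_obs):
    """Train by predicting the next latent (encoded next observation) and the reward."""
    z, z_next_pred, r_hat = model(obs, action)
    with torch.no_grad():
        z_next_target = model.encoder(next_obs)
    return (nn.functional.mse_loss(z_next_pred, z_next_target)
            + nn.functional.mse_loss(r_hat, reward))
```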
Reward Shaping
Sparse rewards (goal achieved or not) cause slow learning. Reward shaping adds auxiliary rewards.
Potential-based shaping (Ng, Harada, Russell, 1999):
r'(s,a,s') = r(s,a,s') + γ·F(s') - F(s)
Where F: S → R is any potential function.
Guarantee: potential-based shaping preserves optimal policy (any optimal policy under r’ is also optimal under r).
Common potentials: distance to goal, heuristic value estimates, human demonstrations.
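A minimal sketch of potential-based shaping as a reward wrapper. `potential` can be any function F: S → ℝ; the 1-D goal-distance example at the end is purely illustrative:

```python
def shaped_reward(r, s, s_next, potential, gamma, done=False):
    """Potential-based shaping: r' = r + gamma * F(s') - F(s).
    Treating the potential of terminal states as 0 keeps the optimal policy unchanged."""
    f_next = 0.0 if done else potential(s_next)
    return r + gamma * f_next - potential(s)

# Illustrative potential for a 1-D navigation task: negative distance to the goal.
goal = 10.0
potential = lambda s: -abs(s - goal)
```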
Reward Learning (Inverse RL)
When the reward function is unknown, infer it from expert demonstrations.
Inverse RL (IRL): find a reward R under which the expert policy π* is optimal. The problem is ill-posed (many rewards, including R ≡ 0, make the expert optimal), which motivates the max-entropy formulation below.
Maximum entropy IRL: find R that maximizes the likelihood of demonstrations under the max-entropy (least committed) distribution over trajectories.
Applications: autonomous driving, robotics, RLHF (RL from Human Feedback for LLMs, where a reward model is learned from human preference comparisons).
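A sketch of the MaxEnt IRL gradient step for a linear reward R(s) = wᵀφ(s). Computing the learner's expected state-visitation frequencies under the current reward (soft value iteration plus a forward pass) is the expensive part and is omitted here; treat the argument names as illustrative:

```python
import numpy as np

def maxent_irl_step(w, features, expert_svf, learner_svf, lr=0.1):
    """One gradient-ascent step on the demonstration log-likelihood.
    features:    (n_states, n_features) matrix of state features phi(s)
    expert_svf:  (n_states,) state-visitation frequencies of the demonstrations
    learner_svf: (n_states,) expected visitation frequencies under the current reward
    Gradient = expert feature expectations - learner feature expectations."""
    grad = features.T @ (expert_svf - learner_svf)
    return w + lr * grad
```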
Policy Constraints and Safety
Constrained MDP (CMDP): maximize reward subject to constraints on auxiliary signals C_i:
max_π E[Σ γᵗ r_t]
s.t. E[Σ γᵗ C_i(s_t,a_t)] ≥ d_i for all i
Lagrangian approach: introduce a multiplier λᵢ per constraint and run dual ascent on λᵢ (see the sketch below).
CPO (Constrained Policy Optimization): a trust-region method that enforces the constraints at each policy update.
Applications: safe RL for robotics, medical, financial systems.
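A minimal sketch of the dual-ascent update on a single multiplier for the Lagrangian approach above, assuming a constraint of the form E[Σ γᵗ C(s_t,a_t)] ≥ d; names and the learning rate are illustrative:

```python
def lagrange_multiplier_step(lmbda, constraint_return, d, lr=0.01):
    """Dual ascent on lambda for the constraint  E[sum_t gamma^t C_t] >= d.
    The multiplier grows while the constraint is violated and is clipped at 0."""
    violation = d - constraint_return   # > 0 means the constraint is violated
    return max(0.0, lmbda + lr * violation)

# The policy step then maximizes the Lagrangian instead of the plain return:
#   L(pi, lambda) = E[sum_t gamma^t r_t] + lambda * (E[sum_t gamma^t C_t] - d)
```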
Hierarchical RL (HRL)
Decompose problems into subgoals:
Options framework (see the code sketch after this list): an option is a triple ⟨Iₒ, πₒ, βₒ⟩ with:
- Iₒ: initiation set (states where the option can start)
- πₒ: the option's internal policy
- βₒ: termination condition (probability of ending in each state)
High-level policy selects options; low-level policies execute them.
MAXQ: decompose Q*(s,a) into a hierarchy of subtask value functions.
Feudal RL: manager sets goals for worker; worker receives intrinsic reward for achieving goals.
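A minimal sketch of the options framework: a container for ⟨Iₒ, πₒ, βₒ⟩ and an executor that runs one option to termination, assuming a Gym-style environment. A high-level policy would pick among options whose initiation set contains the current state and apply an SMDP-style Q-update with discount γ^steps:

```python
import random
from dataclasses import dataclass
from typing import Callable

@dataclass
class Option:
    """An option <I_o, pi_o, beta_o>."""
    can_start: Callable[[object], bool]   # I_o: initiation set
    policy: Callable[[object], int]       # pi_o: option's internal policy
    beta: Callable[[object], float]       # beta_o: termination probability per state

def run_option(env, state, option, gamma=0.99):
    """Execute the option until it terminates; return the discounted return,
    the final state, the number of elapsed steps, and whether the episode ended."""
    ret, discount, steps, done = 0.0, 1.0, 0, False
    while not done:
        action = option.policy(state)
        state, r, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        ret += discount * r
        discount *= gamma
        steps += 1
        if random.random() < option.beta(state):
            break
    return ret, state, steps, done
```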
Multi-Task and Transfer RL
Multi-task: train single policy on multiple tasks → generalization.
Transfer: train on source tasks → fine-tune on target task.
Domain randomization: vary environment parameters during training → the policy becomes robust to the variation (see the sketch below).
Key challenge: negative transfer (source task misleads target learning).
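A sketch of domain randomization for the setting above: resample environment parameters at the start of every episode so the learned policy must cope with the variation. The `make_env` factory and the parameter names/ranges are hypothetical:

```python
import random

def sample_randomized_env(make_env):
    """Draw fresh environment parameters for the next training episode."""
    params = {
        "friction":      random.uniform(0.5, 1.5),
        "mass_scale":    random.uniform(0.8, 1.2),
        "obs_noise_std": random.uniform(0.0, 0.05),
    }
    return make_env(**params)

# Training loop: call env = sample_randomized_env(make_env) before each episode.
```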
Key Applications
| Domain | Algorithm | Achievement |
|---|---|---|
| Atari 2600 | DQN | Human-level on 49 games |
| Go/Chess | AlphaZero | Superhuman |
| Robotics | PPO, SAC | Dexterous manipulation |
| Dota 2 | PPO + self-play | Beats world champions |
| LLM alignment | PPO + RLHF | GPT-4, Claude |
RL Theory: Key Results
PAC-MDP (Probably Approximately Correct in MDPs): the R-MAX algorithm acts ε-optimally on all but a number of steps polynomial in |S|, |A|, 1/ε, 1/δ, and 1/(1-γ).
Sample complexity: with access to a generative model, Õ(|S|·|A|/((1-γ)³·ε²)) samples suffice to find an ε-optimal policy, and this rate is minimax-optimal.
Regret bounds: the UCRL2 algorithm achieves Õ(D·|S|·√(|A|·T)) regret over T steps in MDPs with diameter D (the matching lower bound is Ω(√(D·|S|·|A|·T))).
Connection to DynamICCL
- Dyna: learn NCCL transition model from real runs, plan with it → much faster adaptation
- Reward shaping: potential = estimated performance improvement → faster convergence in sparse throughput signals
- Constrained MDP: constrain NCCL parameter changes to be within safe bounds (no large sudden changes)
- HRL: high-level policy selects communication algorithm; low-level policy tunes buffer/chunk sizes
- PPO: the go-to algorithm for DynamICCL, given its stable optimization of continuous parameters