Advanced RL: Model-Based, Reward Shaping, and Applications

Model-Based RL

Learn a model of the environment (transition + reward), then plan with it.

Dyna architecture (Sutton, 1990):

while learning:
    -- Real experience
    take action a in state s, observe r, s'
    update Q(s,a) from real experience
    update model: M(s,a) ← (r, s')
    -- Simulated experience (planning)
    for n steps:
        s̃, ã ← random previously seen (state, action)
        r̃, s̃' ← M(s̃, ã)
        update Q(s̃,ã) from simulated experience

Benefit: n simulated steps per real step → much faster learning in data-limited settings.

Model error: simulated experience is misleading when the learned model is wrong, so the mix of real and simulated updates must be balanced carefully.
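A minimal tabular Dyna-Q sketch in Python, assuming a Gymnasium-style discrete environment with hashable states; the hyperparameters and the deterministic-model assumption are illustrative, not prescriptive:

import random
from collections import defaultdict

def dyna_q(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1, n_planning=10):
    Q = defaultdict(float)      # tabular action values, keyed by (state, action)
    model = {}                  # learned deterministic model: (s, a) -> (r, s', done)
    n_actions = env.action_space.n

    def greedy(s):
        return max(range(n_actions), key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s, _ = env.reset()
        done = False
        while not done:
            # epsilon-greedy action in the real environment
            a = random.randrange(n_actions) if random.random() < epsilon else greedy(s)
            s2, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            # Q-learning update from the real transition
            target = r + (0.0 if terminated else gamma * Q[(s2, greedy(s2))])
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            # record the transition in the model
            model[(s, a)] = (r, s2, terminated)
            # planning: n_planning updates from simulated (replayed) transitions
            for _ in range(n_planning):
                (ps, pa), (pr, ps2, pterm) = random.choice(list(model.items()))
                ptarget = pr + (0.0 if pterm else gamma * Q[(ps2, greedy(ps2))])
                Q[(ps, pa)] += alpha * (ptarget - Q[(ps, pa)])
            s = s2
    return Q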


World Models

Deep network learns a compressed representation z_t and forward model:

z_t = Encoder(s_t)
z_{t+1} = TransitionModel(z_t, a_t)
r_t = RewardModel(z_t, a_t)

Planning/RL in the latent space z — much cheaper than planning in raw observation space.
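A hedged PyTorch sketch of these three components plus a latent "imagination" rollout; the MLP architecture, dimensions, and the imagine() helper are illustrative assumptions, not the DreamerV3 or MuZero design:

import torch
import torch.nn as nn

class WorldModel(nn.Module):
    def __init__(self, obs_dim, act_dim, latent_dim=32, hidden=256):
        super().__init__()
        # z_t = Encoder(s_t)
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, latent_dim))
        # z_{t+1} = TransitionModel(z_t, a_t)
        self.transition = nn.Sequential(
            nn.Linear(latent_dim + act_dim, hidden), nn.ReLU(), nn.Linear(hidden, latent_dim))
        # r_t = RewardModel(z_t, a_t)
        self.reward = nn.Sequential(
            nn.Linear(latent_dim + act_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def imagine(self, obs, policy, horizon):
        # roll a trajectory forward entirely in latent space: no environment steps
        z = self.encoder(obs)
        rewards = []
        for _ in range(horizon):
            a = policy(z)                  # actor operates on latents, not raw observations
            za = torch.cat([z, a], dim=-1)
            rewards.append(self.reward(za))
            z = self.transition(za)
        return torch.stack(rewards)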

Examples: DreamerV3, MuZero (model-based MCTS + RL).


Reward Shaping

Sparse rewards (goal achieved or not) cause slow learning. Reward shaping adds auxiliary rewards.

Potential-based shaping (Ng, Harada, Russell, 1999):

r'(s,a,s') = r(s,a,s') + γ·F(s') - F(s)

Where F: S → R is any potential function.

Guarantee: potential-based shaping leaves the set of optimal policies unchanged (a policy is optimal under r' if and only if it is optimal under r).

Common potentials: distance to goal, heuristic value estimates, human demonstrations.
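A minimal sketch of potential-based shaping as an environment wrapper, using negative distance to goal as F; the env.goal attribute and the Gymnasium-style API are assumptions for illustration:

import numpy as np

class PotentialShapedEnv:
    # replaces r with r + gamma*F(s') - F(s), where F(s) = -||s - goal||
    def __init__(self, env, gamma=0.99):
        self.env, self.gamma = env, gamma

    def F(self, s):
        return -np.linalg.norm(np.asarray(s, dtype=float) - np.asarray(self.env.goal, dtype=float))

    def reset(self, **kwargs):
        s, info = self.env.reset(**kwargs)
        self._last_F = self.F(s)
        return s, info

    def step(self, a):
        s2, r, terminated, truncated, info = self.env.step(a)
        F2 = self.F(s2)
        shaped_r = r + self.gamma * F2 - self._last_F
        self._last_F = F2
        return s2, shaped_r, terminated, truncated, info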


Reward Learning (Inverse RL)

When the reward function is unknown, infer it from expert demonstrations.

Inverse RL (IRL): find a reward R under which the expert policy π* is optimal. The problem is ill-posed: many rewards (e.g. the all-zero reward) make the same behavior optimal.

Maximum entropy IRL: find R that maximizes the likelihood of demonstrations under the max-entropy (least committed) distribution over trajectories.
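A sketch of one gradient step of linear maximum-entropy IRL, where R(s) = w·φ(s); computing the learner's trajectory distribution (soft value iteration / partition function) is omitted, and the function and argument names are illustrative:

import numpy as np

def maxent_irl_step(w, expert_trajs, learner_trajs, phi, lr=0.01):
    # w: reward weights with R(s) = w . phi(s)
    # expert_trajs: state sequences from demonstrations
    # learner_trajs: state sequences sampled from the current soft-optimal policy
    def feature_expectation(trajs):
        return np.mean([np.sum([phi(s) for s in traj], axis=0) for traj in trajs], axis=0)

    # gradient of the demonstration log-likelihood under the max-entropy trajectory model:
    # expert feature expectations minus the learner's expected features
    grad = feature_expectation(expert_trajs) - feature_expectation(learner_trajs)
    return w + lr * grad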

Applications: autonomous driving, robotics, RLHF (RL from Human Feedback for LLMs).


Policy Constraints and Safety

Constrained MDP (CMDP): maximize reward subject to cost constraints, one per cost signal cᵢ with budget dᵢ.

max_π E[Σ γᵗ r_t]
s.t.  E[Σ γᵗ cᵢ(s_t,a_t)] ≤ dᵢ   for all i

Lagrangian approach: introduce a multiplier λᵢ per constraint; alternate policy updates on the penalized reward r − Σᵢ λᵢ·cᵢ with dual ascent on λᵢ (sketch below).
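A sketch of the dual-ascent step, assuming a single cost constraint E[Σ γᵗ c(s_t,a_t)] ≤ d; the policy update itself (e.g. a PPO step on the penalized reward) is abstracted away and the names are illustrative:

def dual_ascent_step(lmbda, avg_episode_cost, cost_limit, lr_dual=0.01):
    # lambda grows while the constraint is violated (average cost above the limit)
    # and decays toward zero once it is satisfied; it is kept non-negative
    return max(0.0, lmbda + lr_dual * (avg_episode_cost - cost_limit))

# inside the training loop (policy_step is assumed to maximize E[r - lmbda*c]
# with any policy-gradient method):
#   lmbda = dual_ascent_step(lmbda, measured_cost, cost_limit)
#   policy_step(penalized_reward=lambda s, a: r(s, a) - lmbda * c(s, a))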

CPO (Constrained Policy Optimization): a trust-region method that approximately enforces the cost constraints at every policy update.

Applications: safe RL for robotics, medical decision-making, and financial systems.


Hierarchical RL (HRL)

Decompose problems into subgoals:

Options framework: an option is a triple ⟨Iₒ, πₒ, βₒ⟩ with:
- Iₒ: initiation set (states where the option can start)
- πₒ: the option’s policy
- βₒ: termination condition

High-level policy selects options; low-level policies execute them.
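A minimal sketch of the option abstraction and an SMDP-style execution loop; the dataclass fields mirror ⟨Iₒ, πₒ, βₒ⟩, and the Gymnasium-style step API is an assumption:

import random
from dataclasses import dataclass
from typing import Callable

@dataclass
class Option:
    initiation: Callable   # I_o: state -> bool, where the option may be invoked
    policy: Callable       # pi_o: state -> action
    termination: Callable  # beta_o: state -> probability of terminating

def run_option(env, s, option, gamma=0.99):
    # execute one option until it terminates; return the discounted return,
    # the resulting state, and the duration (for an SMDP-level update)
    g, discount, k, done = 0.0, 1.0, 0, False
    while not done:
        a = option.policy(s)
        s, r, terminated, truncated, _ = env.step(a)
        done = terminated or truncated
        g += discount * r
        discount *= gamma
        k += 1
        if random.random() < option.termination(s):
            break
    return g, s, k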

MAXQ: recursively decompose the value function Q(s,a) into subtask value functions and completion functions.

Feudal RL: manager sets goals for worker; worker receives intrinsic reward for achieving goals.


Multi-Task and Transfer RL

Multi-task: train single policy on multiple tasks → generalization.

Transfer: train on source tasks → fine-tune on target task.

Domain randomization: vary environment parameters during training → policy is robust to variation.
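A small sketch of domain randomization, assuming the simulator exposes settable physics parameters; the parameter names and ranges here are invented for illustration:

import random

def randomized_reset(env):
    # resample dynamics parameters before every episode, so the trained policy
    # must work across the whole range rather than one fixed setting
    env.unwrapped.friction = random.uniform(0.5, 1.5)         # hypothetical parameter
    env.unwrapped.link_mass_scale = random.uniform(0.8, 1.2)  # hypothetical parameter
    env.unwrapped.obs_noise_std = random.uniform(0.0, 0.05)   # hypothetical parameter
    return env.reset()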

Key challenge: negative transfer (source task misleads target learning).


Key Applications

Domain           | Algorithm        | Achievement
-----------------|------------------|---------------------------
Atari 2600       | DQN              | Human-level on 49 games
Go/Chess         | AlphaZero        | Superhuman
Robotics         | PPO, SAC         | Dexterous manipulation
Dota 2           | PPO + self-play  | Beats world champions
Protein folding  | AlphaFold        | SOTA structure prediction
LLM alignment    | PPO + RLHF       | GPT-4, Claude

RL Theory: Key Results

PAC-MDP (Probably Approximately Correct): the R-MAX algorithm finds an ε-optimal policy with high probability in time polynomial in |S|, |A|, 1/ε, 1/δ, and 1/(1-γ).

Sample complexity: with a generative model, Õ(|S|·|A|/((1-γ)³·ε²)) samples suffice to find an ε-optimal policy, matching the lower bound; vanilla Q-learning incurs an extra 1/(1-γ) factor.

Regret bounds: the UCRL2 algorithm achieves Õ(D·|S|·√(|A|·T)) regret over T steps in communicating MDPs with diameter D.

