Advanced RL: Model-Based, Reward Shaping, and Applications
Model-Based RL
Learn a model of the environment (transition + reward), then plan with it.
Dyna architecture (Sutton, 1990):
while learning:
    -- Real experience
    take action a in state s, observe r, s'
    update Q(s,a) from real experience
    update model: M(s,a) ← (r, s')
    -- Simulated experience (planning)
    for n steps:
        s̃, ã ← random previously seen (state, action)
        r̃, s̃' ← M(s̃, ã)
        update Q(s̃,ã) from simulated experience
Benefit: n simulated steps per real step → much faster learning in data-limited settings.
Model error: simulated experience can be misleading if the model is wrong → requires a careful balance of real vs. simulated updates.
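A minimal tabular Dyna-Q sketch of the loop above, assuming a Gym-style environment (v0.26+ reset/step API) with hashable discrete states; the hyperparameter names and values are illustrative:

```python
import random
from collections import defaultdict

def dyna_q(env, episodes=200, alpha=0.1, gamma=0.95, eps=0.1, n_planning=10):
    """Tabular Dyna-Q: one real update per step, then n_planning simulated updates."""
    Q = defaultdict(float)   # Q[(s, a)]
    model = {}               # model[(s, a)] = (r, s', terminal) -- deterministic model
    actions = list(range(env.action_space.n))

    def greedy(s):
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s, _ = env.reset()
        done = False
        while not done:
            a = random.choice(actions) if random.random() < eps else greedy(s)
            s2, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            # Direct RL update from real experience
            target = r + (0.0 if terminated else gamma * Q[(s2, greedy(s2))])
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            # Model learning
            model[(s, a)] = (r, s2, terminated)
            # Planning: simulated updates from previously seen (state, action) pairs
            for _ in range(n_planning):
                (ps, pa), (pr, ps2, pterm) = random.choice(list(model.items()))
                ptarget = pr + (0.0 if pterm else gamma * Q[(ps2, greedy(ps2))])
                Q[(ps, pa)] += alpha * (ptarget - Q[(ps, pa)])
            s = s2
    return Q
```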
World Models
A deep network learns a compressed latent representation z_t and a forward model:
z_t = Encoder(s_t)
z_{t+1} = TransitionModel(z_t, a_t)
r_t = RewardModel(z_t, a_t)
Planning/RL in the latent space z — much cheaper than planning in raw observation space.
Examples: DreamerV3, MuZero (model-based MCTS + RL).
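A toy latent world model matching the three equations above, written with PyTorch. It is deterministic (unlike DreamerV3's stochastic recurrent state-space model), and the layer sizes, action encoding, and loss are illustrative assumptions:

```python
import torch
import torch.nn as nn

class LatentWorldModel(nn.Module):
    """Minimal latent world model: encoder, transition head, and reward head."""
    def __init__(self, obs_dim, act_dim, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        self.transition = nn.Sequential(nn.Linear(latent_dim + act_dim, 128), nn.ReLU(),
                                        nn.Linear(128, latent_dim))
        self.reward = nn.Sequential(nn.Linear(latent_dim + act_dim, 128), nn.ReLU(),
                                    nn.Linear(128, 1))

    def forward(self, obs, action):
        z = self.encoder(obs)                 # z_t = Encoder(s_t)
        za = torch.cat([z, action], dim=-1)   # action as a float/one-hot vector
        z_next = self.transition(za)          # z_{t+1} = TransitionModel(z_t, a_t)
        r_hat = self.reward(za).squeeze(-1)   # r_t = RewardModel(z_t, a_t)
        return z, z_next, r_hat

def world_model_loss(model, obs, action, reward, next_obs):
    """Train by predicting the next latent (encoded next observation) and the reward."""
    z, z_next_pred, r_hat = model(obs, action)
    with torch.no_grad():
        z_next_target = model.encoder(next_obs)
    return (nn.functional.mse_loss(z_next_pred, z_next_target)
            + nn.functional.mse_loss(r_hat, reward))
```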
Reward Shaping
Sparse rewards (goal achieved or not) cause slow learning. Reward shaping adds auxiliary rewards.
Potential-based shaping (Ng, Harada, Russell, 1999):
r'(s,a,s') = r(s,a,s') + γ·F(s') - F(s)
Where F: S → R is any potential function.
Guarantee: potential-based shaping preserves optimal policy (any optimal policy under r’ is also optimal under r).
Common potentials: distance to goal, heuristic value estimates, human demonstrations.
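A minimal sketch of potential-based shaping as a reward wrapper. `potential` can be any function F: S → ℝ; the 1-D goal-distance example at the end is purely illustrative:

```python
def shaped_reward(r, s, s_next, potential, gamma, done=False):
    """Potential-based shaping: r' = r + gamma * F(s') - F(s).
    Treating the potential of terminal states as 0 keeps the optimal policy unchanged."""
    f_next = 0.0 if done else potential(s_next)
    return r + gamma * f_next - potential(s)

# Illustrative potential for a 1-D navigation task: negative distance to the goal.
goal = 10.0
potential = lambda s: -abs(s - goal)
```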
Reward Learning (Inverse RL)
When the reward function is unknown, infer it from expert demonstrations.
Inverse RL (IRL): find a reward R under which the expert policy π* is optimal. The problem is ill-posed (many rewards, including R ≡ 0, make the expert optimal), which motivates the max-entropy formulation below.
Maximum entropy IRL: find R that maximizes the likelihood of demonstrations under the max-entropy (least committed) distribution over trajectories.
Applications: autonomous driving, robotics, RLHF (RL from Human Feedback for LLMs, where a reward model is learned from human preference comparisons).
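A sketch of the MaxEnt IRL gradient step for a linear reward R(s) = wᵀφ(s). Computing the learner's expected state-visitation frequencies under the current reward (soft value iteration plus a forward pass) is the expensive part and is omitted here; treat the argument names as illustrative:

```python
import numpy as np

def maxent_irl_step(w, features, expert_svf, learner_svf, lr=0.1):
    """One gradient-ascent step on the demonstration log-likelihood.
    features:    (n_states, n_features) matrix of state features phi(s)
    expert_svf:  (n_states,) state-visitation frequencies of the demonstrations
    learner_svf: (n_states,) expected visitation frequencies under the current reward
    Gradient = expert feature expectations - learner feature expectations."""
    grad = features.T @ (expert_svf - learner_svf)
    return w + lr * grad
```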
Policy Constraints and Safety
Constrained MDP (CMDP): maximize reward subject to constraints on auxiliary signals C_i:
max_π E[Σ γᵗ r_t]
s.t. E[Σ γᵗ C_i(s_t,a_t)] ≥ d_i for all i
Lagrangian approach: introduce a multiplier λᵢ per constraint and run dual ascent on λᵢ (see the sketch below).
CPO (Constrained Policy Optimization): a trust-region method that enforces the constraints at each policy update.
Applications: safe RL for robotics, medical, financial systems.
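A minimal sketch of the dual-ascent update on a single multiplier for the Lagrangian approach above, assuming a constraint of the form E[Σ γᵗ C(s_t,a_t)] ≥ d; names and the learning rate are illustrative:

```python
def lagrange_multiplier_step(lmbda, constraint_return, d, lr=0.01):
    """Dual ascent on lambda for the constraint  E[sum_t gamma^t C_t] >= d.
    The multiplier grows while the constraint is violated and is clipped at 0."""
    violation = d - constraint_return   # > 0 means the constraint is violated
    return max(0.0, lmbda + lr * violation)

# The policy step then maximizes the Lagrangian instead of the plain return:
#   L(pi, lambda) = E[sum_t gamma^t r_t] + lambda * (E[sum_t gamma^t C_t] - d)
```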
Hierarchical RL (HRL)
Decompose problems into subgoals:
Options framework (see the code sketch after this list): an option is a triple ⟨Iₒ, πₒ, βₒ⟩ with:
- Iₒ: initiation set (states where the option can start)
- πₒ: the option's internal policy
- βₒ: termination condition (probability of ending in each state)
High-level policy selects options; low-level policies execute them.
MAXQ: decompose Q*(s,a) into a hierarchy of subtask value functions.
Feudal RL: manager sets goals for worker; worker receives intrinsic reward for achieving goals.
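A minimal sketch of the options framework: a container for ⟨Iₒ, πₒ, βₒ⟩ and an executor that runs one option to termination, assuming a Gym-style environment. A high-level policy would pick among options whose initiation set contains the current state and apply an SMDP-style Q-update with discount γ^steps:

```python
import random
from dataclasses import dataclass
from typing import Callable

@dataclass
class Option:
    """An option <I_o, pi_o, beta_o>."""
    can_start: Callable[[object], bool]   # I_o: initiation set
    policy: Callable[[object], int]       # pi_o: option's internal policy
    beta: Callable[[object], float]       # beta_o: termination probability per state

def run_option(env, state, option, gamma=0.99):
    """Execute the option until it terminates; return the discounted return,
    the final state, the number of elapsed steps, and whether the episode ended."""
    ret, discount, steps, done = 0.0, 1.0, 0, False
    while not done:
        action = option.policy(state)
        state, r, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        ret += discount * r
        discount *= gamma
        steps += 1
        if random.random() < option.beta(state):
            break
    return ret, state, steps, done
```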
Multi-Task and Transfer RL
Multi-task: train single policy on multiple tasks → generalization.
Transfer: train on source tasks → fine-tune on target task.
Domain randomization: vary environment parameters during training → the policy becomes robust to the variation (see the sketch below).
Key challenge: negative transfer (source task misleads target learning).
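A sketch of domain randomization for the setting above: resample environment parameters at the start of every episode so the learned policy must cope with the variation. The `make_env` factory and the parameter names/ranges are hypothetical:

```python
import random

def sample_randomized_env(make_env):
    """Draw fresh environment parameters for the next training episode."""
    params = {
        "friction":      random.uniform(0.5, 1.5),
        "mass_scale":    random.uniform(0.8, 1.2),
        "obs_noise_std": random.uniform(0.0, 0.05),
    }
    return make_env(**params)

# Training loop: call env = sample_randomized_env(make_env) before each episode.
```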
Key Applications
| Domain | Algorithm | Achievement |
|---|---|---|
| Atari 2600 | DQN | Human-level on 49 games |
| Go/Chess | AlphaZero | Superhuman |
| Robotics | PPO, SAC | Dexterous manipulation |
| Dota 2 | PPO + self-play | Beats world champions |
| LLM alignment | PPO + RLHF | GPT-4, Claude |
RL Theory: Key Results
PAC-MDP (Probably Approximately Correct in MDPs): the R-MAX algorithm acts ε-optimally on all but a number of steps polynomial in |S|, |A|, 1/ε, 1/δ, and 1/(1-γ).
Sample complexity: with access to a generative model, Õ(|S|·|A|/((1-γ)³·ε²)) samples suffice to find an ε-optimal policy, and this rate is minimax-optimal.
Regret bounds: the UCRL2 algorithm achieves Õ(D·|S|·√(|A|·T)) regret over T steps in MDPs with diameter D (the matching lower bound is Ω(√(D·|S|·|A|·T))).
Connection to DynamICCL
- Dyna: learn NCCL transition model from real runs, plan with it → much faster adaptation
- Reward shaping: potential = estimated performance improvement → faster convergence in sparse throughput signals
- Constrained MDP: constrain NCCL parameter changes to be within safe bounds (no large sudden changes)
- HRL: high-level policy selects communication algorithm; low-level policy tunes buffer/chunk sizes
- PPO: the go-to algorithm for DynamICCL, given its stable optimization of continuous parameters