Planning Under Uncertainty
Classical vs. Uncertain Planning
Classical planning assumes:
- Fully observable: the agent knows the exact state
- Deterministic: each action has exactly one outcome
- Static: the world changes only through the agent's actions
Real-world planning violates all three. This file covers extensions for uncertainty.
Conformant Planning (Sensorless)
No observations — the agent must plan without knowing the state.
Belief-state planning (from Ch.4):
- State = belief state (a set of possible world states)
- Action applied to a belief state: B' = {RESULT(s, a) : s ∈ B}
- Goal: reach a belief state in which all members satisfy the goal
Conformant plan: works regardless of the actual initial state.
Example: a robot must clean all squares but doesn't know which are dirty.
- Plan: clean square 1, then square 2 (regardless of initial dirt)
- May be suboptimal, but guaranteed to achieve the goal (see the sketch below)
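As a concrete illustration, here is a minimal Python sketch of the sensorless belief-state update for this example. The two-square world, the state encoding, and action names like `suck1` are assumptions made for illustration, not from the text:

```python
from itertools import product

# A state is a pair (dirt1, dirt2); True means that square is dirty.
ALL_STATES = frozenset(product([True, False], repeat=2))

def result(state, action):
    """Deterministic transition model: 'suck1'/'suck2' clean one square."""
    d1, d2 = state
    if action == "suck1":
        return (False, d2)
    if action == "suck2":
        return (d1, False)
    return state

def update_belief(belief, action):
    """B' = {RESULT(s, a) : s in B} -- the sensorless belief update."""
    return frozenset(result(s, action) for s in belief)

def is_goal(belief):
    """Goal test on a belief state: every possible world is fully clean."""
    return all(not d1 and not d2 for d1, d2 in belief)

# Conformant plan: succeeds for ANY initial dirt configuration.
belief = ALL_STATES                     # total ignorance about the dirt
for action in ["suck1", "suck2"]:       # the two-step plan from the example
    belief = update_belief(belief, action)
print(is_goal(belief))                  # True: the goal is guaranteed
```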
Contingency Planning (Partial Observability)
With sensors, the plan can branch based on observations:
```
[action_1,
 if percept_A: [action_2a, subplan_A],
 if percept_B: [action_2b, subplan_B]]
```
- Uses AND-OR search over belief states (Ch.4); a minimal sketch follows this list
- Produces a conditional plan (policy tree)
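A compact sketch of AND-OR search producing such a conditional plan. The `problem` interface (`initial`, `is_goal`, `actions`, `results`) is a hypothetical stand-in, loosely following the Ch.4 formulation:

```python
# OR nodes: the agent chooses an action. AND nodes: the environment
# "chooses" an outcome/percept, and the plan must handle every one.

def and_or_search(problem):
    return or_search(problem, problem.initial, path=[])

def or_search(problem, state, path):
    """Return a plan [action, contingency_dict] or None on failure."""
    if problem.is_goal(state):
        return []                      # empty plan: already at the goal
    if state in path:
        return None                    # loop detected: give up this branch
    for action in problem.actions(state):
        subplans = and_search(problem, problem.results(state, action),
                              [state] + path)
        if subplans is not None:
            return [action, subplans]  # conditional plan node
    return None

def and_search(problem, states, path):
    """Every possible outcome state must have a working subplan."""
    plan = {}
    for s in states:                   # one branch per observed outcome
        subplan = or_search(problem, s, path)
        if subplan is None:
            return None
        plan[s] = subplan              # "if we observe s, do subplan"
    return plan
```

The returned plan alternates actions with dictionaries mapping observed outcomes to subplans, which is exactly the branching `if percept: subplan` structure shown above.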
Replanning
A pragmatic alternative to full contingency planning:
- Execute a classical plan
- If an unexpected state occurs (monitoring detects a failure):
  - Diagnose: what went wrong?
  - Replan from the current state
Advantages: simpler than building full contingency plans upfront; often sufficient for well-behaved environments.
Disadvantages: may fail if recovery is hard (reactive replanning may be too slow).
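A minimal sketch of the execute-monitor-replan loop. The `planner` and `env` interfaces (`current_state`, `predict`, `execute`) are hypothetical stand-ins, not a real API:

```python
def execute_with_replanning(planner, env, goal):
    """planner(state, goal) returns a list of actions (a classical plan)."""
    plan = planner(env.current_state(), goal)   # initial classical plan
    while plan:
        action = plan.pop(0)
        predicted = env.predict(action)         # state the plan expects next
        env.execute(action)
        actual = env.current_state()            # execution monitoring
        if actual != predicted:                 # something went wrong
            # Diagnosis is implicit here: trust the observation and
            # replan from the actual current state.
            plan = planner(actual, goal)
    return env.current_state()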
Probabilistic Planning (MDPs)
When action outcomes are stochastic with known probabilities:
P(s' | s, a) = probability of reaching s' from s by action a
The optimal plan maximizes expected utility, i.e., the expected (possibly discounted) sum of rewards.
This is exactly a Markov Decision Process (MDP) (Chapter 17).
Stochastic planning maps onto MDPs/POMDPs in the most general settings:
- Stochastic actions: MDP (value-iteration sketch below)
- Stochastic actions + partial observability: POMDP
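A minimal value-iteration sketch for a known MDP. The dictionary shapes `P[s][a]` (list of `(prob, s')` pairs), `R[s][a]` (immediate reward), and `actions[s]` (nonempty list of available actions) are assumptions for this sketch:

```python
def value_iteration(states, actions, P, R, gamma=0.95, eps=1e-6):
    """Compute optimal state values for a known MDP by repeated
    Bellman optimality backups until values stop changing."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # Bellman backup: best expected return over available actions
            best = max(
                R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                for a in actions[s]
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < eps:        # converged: values changed less than eps
            return V
```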
Continuous Domains
Real-world planning often involves continuous state and action spaces:
Example: robot motion planning — state = (x, y, θ), actions = (v, ω) continuous.
Approaches (an RRT sketch follows this list):
- Sampling-based: RRT (Rapidly-exploring Random Trees), PRM (Probabilistic Roadmaps)
- Optimization-based: trajectory optimization (iLQR, CHOMP)
- Model-based RL: learn a dynamics model, then plan in the learned model
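A minimal RRT sketch, simplified to a holonomic point robot in an obstacle-free unit square. The constants `START`, `GOAL`, `STEP` are illustrative assumptions; a real planner would add collision checking and the (x, y, θ) / (v, ω) dynamics:

```python
import math
import random

START, GOAL, STEP, GOAL_TOL = (0.1, 0.1), (0.9, 0.9), 0.05, 0.05

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def steer(src, dst):
    """Move from src toward dst by at most STEP."""
    d = dist(src, dst)
    if d <= STEP:
        return dst
    t = STEP / d
    return (src[0] + t * (dst[0] - src[0]), src[1] + t * (dst[1] - src[1]))

def rrt(max_iters=5000):
    parent = {START: None}                 # tree stored as child -> parent
    for _ in range(max_iters):
        sample = (random.random(), random.random())  # random free config
        nearest = min(parent, key=lambda n: dist(n, sample))
        new = steer(nearest, sample)       # extend the tree toward the sample
        parent[new] = nearest
        if dist(new, GOAL) < GOAL_TOL:     # close enough: extract the path
            path = [new]
            while parent[path[-1]] is not None:
                path.append(parent[path[-1]])
            return path[::-1]
    return None                            # no path found within the budget

path = rrt()
print(len(path) if path else "no path")
```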
Multi-Agent Planning
Multiple agents coordinating (or competing):
- Cooperative: a joint plan maximizes team reward
- Competitive: each agent maximizes its own reward (game theory)
- Mixed: some agents cooperative, some competitive
The joint action space grows exponentially with the number of agents (illustrated below), so multi-agent planning is solved approximately via:
- Decentralized planning
- Communication protocols
- Centralized training, decentralized execution (CTDE) in MARL
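A tiny illustration of the blow-up: with n agents and k actions each, the joint action space has k^n elements (the action names here are made up):

```python
from itertools import product

ACTIONS = ["left", "right", "stay"]        # k = 3 actions per agent
n_agents = 5
joint = list(product(ACTIONS, repeat=n_agents))
print(len(joint))                          # k**n = 3**5 = 243 joint actions
```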
Connection to RL
| Planning Setting | RL Analog |
|---|---|
| Classical planning | Deterministic MDP |
| Stochastic planning | MDP (model known) |
| Contingency planning (partial observability) | POMDP |
| Replanning | Model-based RL |
| Online planning | MCTS + learned policy/value networks (AlphaZero) |
Key distinction: classical planning has an explicit model (PDDL domain). RL learns the model (or policy directly) from experience.
Dyna architecture (Sutton): combine RL (learning from real experience) with planning (simulating experience using the learned model); it bridges the two worlds, as sketched below.
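A minimal Dyna-Q sketch showing that combination. The `env` interface (`reset`, `step`, `actions`) is a hypothetical Gym-like stand-in, not a real library API:

```python
import random
from collections import defaultdict

def dyna_q(env, episodes=100, alpha=0.1, gamma=0.95, eps=0.1, n_plan=10):
    Q = defaultdict(float)                 # Q[(state, action)] values
    model = {}                             # model[(s, a)] = (r, s', done)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection in the real environment
            acts = env.actions(s)
            if random.random() < eps:
                a = random.choice(acts)
            else:
                a = max(acts, key=lambda a: Q[(s, a)])
            s2, r, done = env.step(a)
            # (1) direct RL: Q-learning update from the real transition
            best = 0.0 if done else max(Q[(s2, a2)] for a2 in env.actions(s2))
            Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])
            # (2) model learning: remember what this transition did
            model[(s, a)] = (r, s2, done)
            # (3) planning: replay n_plan simulated transitions from the model
            for _ in range(n_plan):
                (ps, pa), (pr, ps2, pdone) = random.choice(list(model.items()))
                pbest = 0.0 if pdone else max(Q[(ps2, a2)]
                                              for a2 in env.actions(ps2))
                Q[(ps, pa)] += alpha * (pr + gamma * pbest - Q[(ps, pa)])
            s = s2
    return Q
```

Steps (1)-(3) in the loop are the Dyna recipe: every real step also feeds the model, and the model is replayed for extra "free" updates between real steps.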
For DynamICCL: RL learns to plan NCCL parameter adjustments online, using experience from actual training runs — exactly the Dyna setting without a pre-specified domain model.