Planning Under Uncertainty
Classical vs. Uncertain Planning
Classical planning assumes:
- Fully observable: the agent knows the exact state
- Deterministic: each action has exactly one outcome
- Static: the world changes only through the agent's actions
Real-world planning violates all three. This file covers extensions for uncertainty.
Conformant Planning (Sensorless)
No observations — the agent must plan without knowing the state.
Belief-state planning (from Ch.4):
- State = belief state (a set of possible world states)
- Action applied to a belief state: B' = {RESULT(s, a) : s ∈ B}
- Goal: reach a belief state in which all members satisfy the goal
Conformant plan: works regardless of the actual initial state.
Example: a robot must clean all squares but doesn't know which are dirty.
- Plan: clean square 1, then square 2 (regardless of initial dirt)
- May be suboptimal, but guaranteed to achieve the goal (see the sketch below)
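As a concrete illustration, here is a minimal Python sketch of the sensorless belief-state update for this example. The two-square world, the state encoding, and action names like `suck1` are assumptions made for illustration, not from the text:

```python
from itertools import product

# A state is a pair (dirt1, dirt2); True means that square is dirty.
ALL_STATES = frozenset(product([True, False], repeat=2))

def result(state, action):
    """Deterministic transition model: 'suck1'/'suck2' clean one square."""
    d1, d2 = state
    if action == "suck1":
        return (False, d2)
    if action == "suck2":
        return (d1, False)
    return state

def update_belief(belief, action):
    """B' = {RESULT(s, a) : s in B} -- the sensorless belief update."""
    return frozenset(result(s, action) for s in belief)

def is_goal(belief):
    """Goal test on a belief state: every possible world is fully clean."""
    return all(not d1 and not d2 for d1, d2 in belief)

# Conformant plan: succeeds for ANY initial dirt configuration.
belief = ALL_STATES                     # total ignorance about the dirt
for action in ["suck1", "suck2"]:       # the two-step plan from the example
    belief = update_belief(belief, action)
print(is_goal(belief))                  # True: the goal is guaranteed
```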
Contingency Planning (Partial Observability)
With sensors, the plan can branch based on observations:
```
[action_1,
 if percept_A: [action_2a, subplan_A],
 if percept_B: [action_2b, subplan_B]]
```
- Uses AND-OR search over belief states (Ch.4); a minimal sketch follows this list
- Produces a conditional plan (policy tree)
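A compact sketch of AND-OR search producing such a conditional plan. The `problem` interface (`initial`, `is_goal`, `actions`, `results`) is a hypothetical stand-in, loosely following the Ch.4 formulation:

```python
# OR nodes: the agent chooses an action. AND nodes: the environment
# "chooses" an outcome/percept, and the plan must handle every one.

def and_or_search(problem):
    return or_search(problem, problem.initial, path=[])

def or_search(problem, state, path):
    """Return a plan [action, contingency_dict] or None on failure."""
    if problem.is_goal(state):
        return []                      # empty plan: already at the goal
    if state in path:
        return None                    # loop detected: give up this branch
    for action in problem.actions(state):
        subplans = and_search(problem, problem.results(state, action),
                              [state] + path)
        if subplans is not None:
            return [action, subplans]  # conditional plan node
    return None

def and_search(problem, states, path):
    """Every possible outcome state must have a working subplan."""
    plan = {}
    for s in states:                   # one branch per observed outcome
        subplan = or_search(problem, s, path)
        if subplan is None:
            return None
        plan[s] = subplan              # "if we observe s, do subplan"
    return plan
```

The returned plan alternates actions with dictionaries mapping observed outcomes to subplans, which is exactly the branching `if percept: subplan` structure shown above.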
Replanning
A pragmatic alternative to full contingency planning:
- Execute a classical plan
- If an unexpected state occurs (monitoring detects a failure):
  - Diagnose: what went wrong?
  - Replan from the current state
Advantages: simpler than building full contingency plans upfront; often sufficient for well-behaved environments.
Disadvantages: may fail if recovery is hard (reactive replanning may be too slow).
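A minimal sketch of the execute-monitor-replan loop. The `planner` and `env` interfaces (`current_state`, `predict`, `execute`) are hypothetical stand-ins, not a real API:

```python
def execute_with_replanning(planner, env, goal):
    """planner(state, goal) returns a list of actions (a classical plan)."""
    plan = planner(env.current_state(), goal)   # initial classical plan
    while plan:
        action = plan.pop(0)
        predicted = env.predict(action)         # state the plan expects next
        env.execute(action)
        actual = env.current_state()            # execution monitoring
        if actual != predicted:                 # something went wrong
            # Diagnosis is implicit here: trust the observation and
            # replan from the actual current state.
            plan = planner(actual, goal)
    return env.current_state()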
Probabilistic Planning (MDPs)
When action outcomes are stochastic with known probabilities:
P(s' | s, a) = probability of reaching s' from s by action a
The optimal plan maximizes expected utility, i.e., the expected (possibly discounted) sum of rewards.
This is exactly a Markov Decision Process (MDP) (Chapter 17).
Stochastic planning maps onto MDPs/POMDPs in the most general settings:
- Stochastic actions: MDP (value-iteration sketch below)
- Stochastic actions + partial observability: POMDP
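A minimal value-iteration sketch for a known MDP. The dictionary shapes `P[s][a]` (list of `(prob, s')` pairs), `R[s][a]` (immediate reward), and `actions[s]` (nonempty list of available actions) are assumptions for this sketch:

```python
def value_iteration(states, actions, P, R, gamma=0.95, eps=1e-6):
    """Compute optimal state values for a known MDP by repeated
    Bellman optimality backups until values stop changing."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # Bellman backup: best expected return over available actions
            best = max(
                R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                for a in actions[s]
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < eps:        # converged: values changed less than eps
            return V
```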
Continuous Domains
Real-world planning often involves continuous state and action spaces:
Example: robot motion planning — state = (x, y, θ), actions = (v, ω) continuous.
Approaches (an RRT sketch follows this list):
- Sampling-based: RRT (Rapidly-exploring Random Trees), PRM (Probabilistic Roadmaps)
- Optimization-based: trajectory optimization (iLQR, CHOMP)
- Model-based RL: learn a dynamics model, then plan in the learned model
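A minimal RRT sketch, simplified to a holonomic point robot in an obstacle-free unit square. The constants `START`, `GOAL`, `STEP` are illustrative assumptions; a real planner would add collision checking and the (x, y, θ) / (v, ω) dynamics:

```python
import math
import random

START, GOAL, STEP, GOAL_TOL = (0.1, 0.1), (0.9, 0.9), 0.05, 0.05

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def steer(src, dst):
    """Move from src toward dst by at most STEP."""
    d = dist(src, dst)
    if d <= STEP:
        return dst
    t = STEP / d
    return (src[0] + t * (dst[0] - src[0]), src[1] + t * (dst[1] - src[1]))

def rrt(max_iters=5000):
    parent = {START: None}                 # tree stored as child -> parent
    for _ in range(max_iters):
        sample = (random.random(), random.random())  # random free config
        nearest = min(parent, key=lambda n: dist(n, sample))
        new = steer(nearest, sample)       # extend the tree toward the sample
        parent[new] = nearest
        if dist(new, GOAL) < GOAL_TOL:     # close enough: extract the path
            path = [new]
            while parent[path[-1]] is not None:
                path.append(parent[path[-1]])
            return path[::-1]
    return None                            # no path found within the budget

path = rrt()
print(len(path) if path else "no path")
```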
Multi-Agent Planning
Multiple agents coordinating (or competing):
- Cooperative: a joint plan maximizes team reward
- Competitive: each agent maximizes its own reward (game theory)
- Mixed: some agents cooperative, some competitive
The joint action space grows exponentially with the number of agents (illustrated below), so multi-agent planning is solved approximately via:
- Decentralized planning
- Communication protocols
- Centralized training, decentralized execution (CTDE) in MARL
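A tiny illustration of the blow-up: with n agents and k actions each, the joint action space has k^n elements (the action names here are made up):

```python
from itertools import product

ACTIONS = ["left", "right", "stay"]        # k = 3 actions per agent
n_agents = 5
joint = list(product(ACTIONS, repeat=n_agents))
print(len(joint))                          # k**n = 3**5 = 243 joint actions
```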
Connection to RL
| Planning Setting | RL Analog |
|---|---|
| Classical planning | Deterministic MDP |
| Stochastic planning | MDP (model known) |
| Contingency planning (partial observability) | POMDP |
| Replanning | Model-based RL |
| Online planning | MCTS + learned policy/value networks (AlphaZero) |
Key distinction: classical planning has an explicit model (PDDL domain). RL learns the model (or policy directly) from experience.
Dyna architecture (Sutton): combine RL (learning from real experience) with planning (simulating experience using the learned model); it bridges the two worlds, as sketched below.
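A minimal Dyna-Q sketch showing that combination. The `env` interface (`reset`, `step`, `actions`) is a hypothetical Gym-like stand-in, not a real library API:

```python
import random
from collections import defaultdict

def dyna_q(env, episodes=100, alpha=0.1, gamma=0.95, eps=0.1, n_plan=10):
    Q = defaultdict(float)                 # Q[(state, action)] values
    model = {}                             # model[(s, a)] = (r, s', done)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection in the real environment
            acts = env.actions(s)
            if random.random() < eps:
                a = random.choice(acts)
            else:
                a = max(acts, key=lambda a: Q[(s, a)])
            s2, r, done = env.step(a)
            # (1) direct RL: Q-learning update from the real transition
            best = 0.0 if done else max(Q[(s2, a2)] for a2 in env.actions(s2))
            Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])
            # (2) model learning: remember what this transition did
            model[(s, a)] = (r, s2, done)
            # (3) planning: replay n_plan simulated transitions from the model
            for _ in range(n_plan):
                (ps, pa), (pr, ps2, pdone) = random.choice(list(model.items()))
                pbest = 0.0 if pdone else max(Q[(ps2, a2)]
                                              for a2 in env.actions(ps2))
                Q[(ps, pa)] += alpha * (pr + gamma * pbest - Q[(ps, pa)])
            s = s2
    return Q
```

Steps (1)-(3) in the loop are the Dyna recipe: every real step also feeds the model, and the model is replayed for extra "free" updates between real steps.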
For DynamICCL: RL learns to plan NCCL parameter adjustments online, using experience from actual training runs — exactly the Dyna setting without a pre-specified domain model.