Planning Under Uncertainty

Classical vs. Uncertain Planning

Classical planning assumes:

- Fully observable: the agent knows the exact state
- Deterministic: each action has exactly one outcome
- Static: the world doesn't change except via the agent's actions

Real-world planning violates all three. This file covers the extensions that handle uncertainty.


Conformant Planning (Sensorless)

No observations — the agent must plan without knowing the state.

Belief-state planning (from Ch. 4):

- State = belief state (set of possible worlds)
- Action applied to a belief state: B' = {RESULT(s, a) : s ∈ B}
- Goal: reach a belief state where all members satisfy the goal

Conformant plan: works regardless of the actual initial state.

Example: a robot must clean all squares, but doesn't know which are dirty.

- Plan: clean square 1, then square 2 (regardless of initial dirt)
- May be suboptimal, but guaranteed to achieve the goal (sketched in code below)
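
A minimal sketch of belief-state progression for this example; the two-square world, the clean1/clean2 action names, and the set-of-frozensets encoding are illustrative assumptions, not from the text:

    # Illustrative sensorless vacuum world: a world state is the set of dirty
    # squares (a frozenset); the belief state is the set of possible worlds.

    def result(state, action):
        # Deterministic transitions: "clean1"/"clean2" remove that square's dirt.
        return state - {1} if action == "clean1" else state - {2}

    def update_belief(belief, action):
        # B' = {RESULT(s, a) : s in B} -- progress every possible world.
        return {result(s, action) for s in belief}

    def goal_reached(belief):
        # Every member of the belief state must satisfy the goal (no dirt left).
        return all(not s for s in belief)

    # Initial belief: any of the four dirt configurations is possible.
    belief = {frozenset(), frozenset({1}), frozenset({2}), frozenset({1, 2})}
    for action in ["clean1", "clean2"]:        # the conformant plan
        belief = update_belief(belief, action)
    print(goal_reached(belief))                # True for every initial state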


Contingency Planning (Partial Observability)

With sensors, the plan can branch based on observations:

[action_1,
  if percept_A: [action_2a, subplan_A],
  if percept_B: [action_2b, subplan_B]
]
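
A rough sketch of executing such a nested plan; the list/dict encoding and the percept_A/percept_B handling are made-up conventions for illustration:

    # Hypothetical encoding: a plan is a list whose items are either action
    # names or a dict mapping each possible percept to a subplan.

    def execute_contingent(plan, do_action, get_percept):
        for step in plan:
            if isinstance(step, dict):           # branch point: observe, then
                subplan = step[get_percept()]    # pick the matching subplan
                execute_contingent(subplan, do_action, get_percept)
            else:
                do_action(step)                  # ordinary action

    plan = ["action_1",
            {"percept_A": ["action_2a"],
             "percept_B": ["action_2b"]}]
    execute_contingent(plan, do_action=print, get_percept=lambda: "percept_A")
    # prints action_1, then action_2a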

Replanning

A pragmatic alternative to full contingency planning:

  1. Execute a classical plan
  2. If an unexpected state occurs (monitoring detects failure):
     - Diagnose: what went wrong?
     - Replan from the current state

Advantages: simpler than building full contingency plans upfront; often sufficient for well-behaved environments.

Disadvantages: may fail if recovery is hard (reactive replanning may be too slow).
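
A toy execute-monitor-replan loop; the 1-D corridor world, the slip probability, and the plan_to helper are all illustrative assumptions:

    import random

    # Toy 1-D corridor: state is an integer position. "right" usually moves +1
    # but sometimes slips (no effect), so the monitor catches the mismatch
    # between the model's prediction and the observed state and replans.

    def plan_to(state, goal):
        return ["right"] * (goal - state)       # trivial classical plan

    def execute(state, action):
        return state + 1 if random.random() < 0.8 else state   # 20% slip

    def run_with_replanning(state, goal, max_replans=50):
        for _ in range(max_replans):
            plan = plan_to(state, goal)         # (re)plan from the current state
            for action in plan:
                expected = state + 1            # model predicts success
                state = execute(state, action)
                if state != expected:           # monitoring detects the failure
                    break                       # diagnose + replan from `state`
            else:
                return state == goal            # plan finished as expected
        return False

    print(run_with_replanning(0, 5))            # True (with high probability)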


Probabilistic Planning (MDPs)

When action outcomes are stochastic with known probabilities:

P(s' | s, a) = probability of reaching s' from s by action a

The optimal plan maximizes expected utility: the expected (possibly discounted) sum of rewards.

This is exactly a Markov Decision Process (MDP) (Chapter 17).

For the most general settings, stochastic planning maps onto MDPs/POMDPs:

- Stochastic actions: MDP
- Stochastic actions + partial observability: POMDP
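
A minimal value-iteration sketch for a known-model MDP; the two-state transition table, rewards, and discount are made up for illustration:

    # Value iteration on a toy 2-state MDP with known P(s'|s,a) and rewards.
    P = {  # P[s][a] = list of (next_state, probability) pairs
        0: {"stay": [(0, 1.0)], "go": [(1, 0.9), (0, 0.1)]},
        1: {"stay": [(1, 1.0)], "go": [(0, 0.9), (1, 0.1)]},
    }
    R = {0: 0.0, 1: 1.0}       # reward for being in each state
    gamma = 0.9                # discount factor

    V = {0: 0.0, 1: 0.0}
    for _ in range(100):       # repeated Bellman optimality backups
        V = {s: R[s] + gamma * max(sum(p * V[s2] for s2, p in P[s][a])
                                   for a in P[s])
             for s in P}

    # Greedy policy with respect to the converged values.
    policy = {s: max(P[s], key=lambda a: sum(p * V[s2] for s2, p in P[s][a]))
              for s in P}
    print(V, policy)           # state 0 prefers "go", state 1 prefers "stay"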


Continuous Domains

Real-world planning often involves continuous state and action spaces:

Example: robot motion planning — state = (x, y, θ), actions = (v, ω) continuous.

Approaches:

- Sampling-based: RRT (Rapidly-exploring Random Trees; see the sketch below), PRM (Probabilistic Roadmap)
- Optimization-based: trajectory optimization (iLQR, CHOMP)
- Model-based RL: learn a dynamics model, plan in the learned model
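
A bare-bones 2-D RRT sketch; the obstacle-free unit square, step size, iteration budget, and goal tolerance are arbitrary illustrative choices:

    import math
    import random

    # RRT: repeatedly sample a random configuration, find the nearest tree
    # node, and extend a fixed step toward the sample; stop near the goal.

    def rrt(start, goal, step=0.05, iters=5000, goal_tol=0.05):
        nodes, parent = [start], {start: None}
        for _ in range(iters):
            sample = (random.random(), random.random())           # random config
            near = min(nodes, key=lambda n: math.dist(n, sample)) # nearest node
            d = math.dist(near, sample)
            if d == 0:
                continue
            new = (near[0] + step * (sample[0] - near[0]) / d,    # steer one
                   near[1] + step * (sample[1] - near[1]) / d)    # step toward it
            nodes.append(new)
            parent[new] = near
            if math.dist(new, goal) < goal_tol:                   # reached goal
                path = [new]
                while parent[path[-1]] is not None:               # walk back up
                    path.append(parent[path[-1]])
                return path[::-1]                                 # root-to-goal
        return None

    path = rrt((0.1, 0.1), (0.9, 0.9))
    print(len(path) if path else "no path found")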


Multi-Agent Planning

Multiple agents coordinating (or competing):

- Cooperative: a joint plan maximizes team reward
- Competitive: each agent maximizes its own reward (game theory)
- Mixed: some agents cooperative, some competitive

Multi-agent planning is exponential in the number of agents (the joint action space is the Cartesian product of the agents' individual action spaces) — solved approximately via:

- Decentralized planning
- Communication protocols
- Centralized training, decentralized execution (CTDE) in MARL


Connection to RL

Planning Setting        RL Analog
--------------------    -----------------------------------
Classical planning      Deterministic MDP
Stochastic planning     MDP (model known)
POMDP                   Partially Observable MDP
Replanning              Model-based RL
Online planning         MCTS + policy gradient (AlphaZero)

Key distinction: classical planning has an explicit model (PDDL domain). RL learns the model (or policy directly) from experience.

Dyna architecture (Sutton): combine RL (learning from experience) with planning (simulating experience using the learned model) — bridges the two worlds.
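
A compact tabular Dyna-Q sketch following that loop; the env interface (reset/step/actions) and all hyperparameters are hypothetical placeholders:

    import random
    from collections import defaultdict

    # Dyna-Q: after each real step, (a) Q-learn from the real transition,
    # (b) store it in a learned model, (c) replay n simulated transitions.

    def dyna_q(env, episodes=100, n_planning=10, alpha=0.1, gamma=0.95, eps=0.1):
        Q = defaultdict(float)                  # Q[(state, action)]
        model = {}                              # model[(s, a)] = (r, s', done)

        def backup(s, a, r, s2, done):
            target = r + (0.0 if done else
                          gamma * max(Q[(s2, b)] for b in env.actions))
            Q[(s, a)] += alpha * (target - Q[(s, a)])

        for _ in range(episodes):
            s, done = env.reset(), False
            while not done:
                a = (random.choice(env.actions) if random.random() < eps
                     else max(env.actions, key=lambda b: Q[(s, b)]))
                s2, r, done = env.step(a)
                backup(s, a, r, s2, done)       # (a) direct RL
                model[(s, a)] = (r, s2, done)   # (b) model learning
                for _ in range(n_planning):     # (c) planning from the model
                    (ps, pa), (pr, ps2, pd) = random.choice(list(model.items()))
                    backup(ps, pa, pr, ps2, pd)
                s = s2
        return Q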

For DynamICCL: RL learns to plan NCCL parameter adjustments online, using experience from actual training runs — exactly the Dyna setting without a pre-specified domain model.