Online Search Agents and Unknown Environments
Offline vs. Online Search
| Offline | Online |
|---|---|
| Full map known before search | Map discovered during execution |
| Computes full plan, then executes | Interleaves computation and execution |
| Can backtrack virtually | Backtracking may be impossible (physical world) |
Online search is necessary when:
- The environment is unknown and must be explored
- Actions have costs (backtracking is expensive)
- The agent must act in real time
Online Search Problem
Formulation:
- Actions: ACTIONS(s), the finite set available at the current state
- Step-cost function: c(s, a, s'), known only after the action is taken and s' is observed
- Goal test: GOAL-TEST(s), recognized only when the goal is reached
The agent cannot compute future costs in advance — it learns the map as it explores.
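To make the formulation concrete, here is a minimal Python sketch of an online search environment. The `OnlineMaze` class and its method names are illustrative assumptions, not a standard API; the point is that `actions()` and `goal_test()` are only answerable at the current state, and c(s, a, s') is revealed only by executing the action.

```python
# Minimal sketch of an online search environment (hypothetical interface).
class OnlineMaze:
    def __init__(self, grid, start, goal):
        self.grid, self.state, self.goal = grid, start, goal

    def actions(self):
        """Actions available at the *current* state (known only locally)."""
        r, c = self.state
        moves = {"up": (r - 1, c), "down": (r + 1, c),
                 "left": (r, c - 1), "right": (r, c + 1)}
        return [a for a, (nr, nc) in moves.items()
                if 0 <= nr < len(self.grid) and 0 <= nc < len(self.grid[0])
                and self.grid[nr][nc] != "#"]

    def execute(self, action):
        """Take the action for real; only now are s' and c(s, a, s') revealed."""
        dr, dc = {"up": (-1, 0), "down": (1, 0),
                  "left": (0, -1), "right": (0, 1)}[action]
        r, c = self.state
        self.state = (r + dr, c + dc)
        return self.state, 1  # unit step cost in this toy example

    def goal_test(self):
        return self.state == self.goal
```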
Competitive Ratio
For online search, optimality is replaced by competitive ratio:
Competitive ratio = (cost of online solution) / (cost of optimal offline solution)
- Ratio = 1: optimal (same as if map known in advance) — usually unachievable
- Ratio = c: at most c times worse than optimal
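Example: if the shortest path with full knowledge of the map costs 10, and the online agent incurs a total cost of 25 while exploring, its competitive ratio on that instance is 25/10 = 2.5.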
For a connected graph with b actions per state, the competitive ratio of any online algorithm is at least b: an adversary can always place the goal behind the last branch the agent happens to try, so exploring dead ends first is unavoidable.
LRTA* (Learning Real-Time A*)
A practical online search algorithm that:
1. Learns H-values (heuristic cost-to-goal estimates) from experience
2. Makes locally optimal moves w.r.t. the current H-estimates
3. Updates H-values based on the actual costs observed
function LRTA*-AGENT(s') returns an action
  -- s' is the current state (percept)
  -- persistent: result, a table mapping (state, action) to the observed successor
  --             H, a table of cost estimates indexed by state
  --             s, a, the previous state and action, initially null
  if GOAL-TEST(s') then return stop
  if s' is new (not in H) then H[s'] ← h(s')   -- initialize from heuristic
  if s is not null then
    result[s, a] ← s'                          -- record the observed transition
    -- Update the H value of the previous state
    H[s] ← min over each b in ACTIONS(s) of LRTA*-COST(s, b, result[s, b], H)
  -- Choose the action minimizing LRTA*-COST from the current state
  a ← argmin over b in ACTIONS(s') of LRTA*-COST(s', b, result[s', b], H)
  s ← s'
  return a

function LRTA*-COST(s, a, s', H) returns a cost estimate
  if s' is undefined then return h(s)          -- (s, a) not yet tried: optimistic guess
  else return c(s, a, s') + H[s']
Here c(s, a, s') is the step cost of taking action a in state s and arriving in s'.
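Below is a runnable Python sketch of the agent, assuming the `OnlineMaze`-style interface from earlier (`state`, `actions()`, `execute()`, `goal_test()` are illustrative, not standard). It keeps the `result` and `cost` tables explicitly rather than as persistent variables.

```python
# Runnable sketch of LRTA*, assuming the illustrative OnlineMaze interface
# above and a safely explorable space (actions() is never empty).
def lrta_star(env, h, max_steps=10_000):
    H = {}              # learned cost-to-goal estimates, H[s]
    result = {}         # result[(s, a)] -> observed successor s'
    cost = {}           # cost[(s, a)]   -> observed step cost c(s, a, s')
    known_actions = {}  # actions discovered at each visited state
    prev_s, prev_a = None, None

    def lrta_cost(s, a):
        # Untried (s, a): optimistically fall back to the raw heuristic h(s).
        if (s, a) not in result:
            return h(s)
        return cost[(s, a)] + H[result[(s, a)]]

    for _ in range(max_steps):
        s = env.state
        if env.goal_test():
            return H                       # reached the goal; stop
        if s not in H:
            H[s] = h(s)                    # initialize from heuristic
        known_actions.setdefault(s, env.actions())
        if prev_s is not None:
            result[(prev_s, prev_a)] = s   # record the observed transition
            # Back up: best one-step lookahead through any neighbor of prev_s.
            H[prev_s] = min(lrta_cost(prev_s, b) for b in known_actions[prev_s])
        # Greedy move w.r.t. current estimates (ties broken arbitrarily).
        a = min(known_actions[s], key=lambda b: lrta_cost(s, b))
        prev_s, prev_a = s, a
        _, c = env.execute(a)
        cost[(s, a)] = c                   # step cost learned only now
    return H

# Example run on a toy maze (hypothetical layout; Manhattan-distance heuristic):
maze = OnlineMaze(["S..", ".#.", "..G"], start=(0, 0), goal=(2, 2))
H = lrta_star(maze, h=lambda s: abs(s[0] - 2) + abs(s[1] - 2))
```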
Key Insight
LRTA* updates H[s] to the minimum estimated cost to the goal through any neighbor. Because each backup uses actually observed costs, the estimates only become more accurate over time, and with repeated trials they converge to the exact cost-to-goal values (for states along optimal paths, given an admissible h).
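For example, if the cheapest option from s is an action observed to cost 1 that leads to a neighbor with H[s'] = 4, the backup sets H[s] ← 1 + 4 = 5.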
Properties of LRTA*
| Property | Value |
|---|---|
| Complete (finite state space) | Yes — guaranteed to find goal |
| Optimal | No — greedy local moves; the executed path may revisit states and exceed optimal cost |
| H convergence | Yes — after sufficient exploration |
| Memory | O(states visited) |
LRTA* is complete as long as:
1. The state space is finite
2. The state space is safely explorable (no dead ends from which the goal is unreachable)
3. Step costs are positive and bounded
Exploration vs. Exploitation
Online search faces the fundamental exploration-exploitation tradeoff:
- Exploit: follow current best estimates → fast if the estimates are good
- Explore: gather information → improves future decisions
LRTA* is greedy (exploitative), but its H-value updates force exploration: each backup makes a visited state look more expensive, pushing the agent out of local minima when it gets trapped.
More sophisticated approaches (ε-greedy, UCB) come from the RL literature (Ch. 22, multi-armed bandits); a minimal sketch follows.
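A hedged sketch of grafting ε-greedy selection onto LRTA*'s action-selection step (the `epsilon` knob and `rng` argument are illustrative additions, not part of standard LRTA*):

```python
import random

def epsilon_greedy_action(s, actions, lrta_cost, epsilon=0.1, rng=random):
    """With probability epsilon take a random action (explore);
    otherwise take the estimated-cheapest one (exploit)."""
    if rng.random() < epsilon:
        return rng.choice(list(actions))
    return min(actions, key=lambda b: lrta_cost(s, b))
```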
Connection to RL
Online search = online RL without rewards (only goal/non-goal).
LRTA* updating H-values = TD learning updating value function estimates.
The Bellman update in RL, V(s) ← r + γ · V(s'), closely mirrors the LRTA* cost update, with rewards playing the role of negated costs (see the comparison below).
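Side by side (the correspondence assumes V(s) = −H[s], r = −c, and γ = 1):

```
LRTA*:  H[s] ← min over a of [ c(s, a, s') + H[s'] ]     (costs, minimized)
TD/VI:  V(s) ← max over a of [ r(s, a, s') + γ · V(s') ]  (rewards, maximized)
```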
DynamICCL uses online RL: the agent discovers NCCL performance characteristics during actual training runs (there is no offline simulator) and continuously adapts its policy, a direct application of online search principles.