Online Search Agents and Unknown Environments
Offline vs. Online Search
| Offline | Online |
|---|---|
| Full map known before search | Map discovered during execution |
| Computes full plan, then executes | Interleaves computation and execution |
| Can backtrack virtually | Backtracking may be impossible (physical world) |
Online search is necessary when:
- The environment is unknown and must be explored
- Actions have costs (backtracking is expensive)
- The agent must act in real time
Online Search Problem
Formulation:
- Actions: ACTIONS(s), the finite set available at the current state
- Step-cost function: c(s, a, s'), known only after the action is taken and s' is observed
- Goal test: GOAL-TEST(s), recognized only when the goal is reached
The agent cannot compute future costs in advance — it learns the map as it explores.
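To make the formulation concrete, here is a minimal Python sketch of an online search environment. The `OnlineMaze` class and its method names are illustrative assumptions, not a standard API; the point is that `actions()` and `goal_test()` are only answerable at the current state, and c(s, a, s') is revealed only by executing the action.

```python
# Minimal sketch of an online search environment (hypothetical interface).
class OnlineMaze:
    def __init__(self, grid, start, goal):
        self.grid, self.state, self.goal = grid, start, goal

    def actions(self):
        """Actions available at the *current* state (known only locally)."""
        r, c = self.state
        moves = {"up": (r - 1, c), "down": (r + 1, c),
                 "left": (r, c - 1), "right": (r, c + 1)}
        return [a for a, (nr, nc) in moves.items()
                if 0 <= nr < len(self.grid) and 0 <= nc < len(self.grid[0])
                and self.grid[nr][nc] != "#"]

    def execute(self, action):
        """Take the action for real; only now are s' and c(s, a, s') revealed."""
        dr, dc = {"up": (-1, 0), "down": (1, 0),
                  "left": (0, -1), "right": (0, 1)}[action]
        r, c = self.state
        self.state = (r + dr, c + dc)
        return self.state, 1  # unit step cost in this toy example

    def goal_test(self):
        return self.state == self.goal
```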
Competitive Ratio
For online search, optimality is replaced by competitive ratio:
Competitive ratio = (cost of online solution) / (cost of optimal offline solution)
- Ratio = 1: optimal (same as if map known in advance) — usually unachievable
- Ratio = c: at most c times worse than optimal
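Example: if the shortest path with full knowledge of the map costs 10, and the online agent incurs a total cost of 25 while exploring, its competitive ratio on that instance is 25/10 = 2.5.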
For a connected graph with b actions per state, the competitive ratio of any online algorithm is at least b: an adversary can always place the goal behind the last branch the agent happens to try, so exploring dead ends first is unavoidable.
LRTA* (Learning Real-Time A*)
A practical online search algorithm that:
1. Learns H-values (heuristic cost-to-goal estimates) from experience
2. Makes locally optimal moves w.r.t. the current H-estimates
3. Updates H-values based on the actual costs observed
function LRTA*-AGENT(s') returns an action
  -- s' is the current state (percept)
  -- persistent: result, a table mapping (state, action) to the observed successor
  --             H, a table of cost estimates indexed by state
  --             s, a, the previous state and action, initially null
  if GOAL-TEST(s') then return stop
  if s' is new (not in H) then H[s'] ← h(s')   -- initialize from heuristic
  if s is not null then
    result[s, a] ← s'                          -- record the observed transition
    -- Update the H value of the previous state
    H[s] ← min over each b in ACTIONS(s) of LRTA*-COST(s, b, result[s, b], H)
  -- Choose the action minimizing LRTA*-COST from the current state
  a ← argmin over b in ACTIONS(s') of LRTA*-COST(s', b, result[s', b], H)
  s ← s'
  return a

function LRTA*-COST(s, a, s', H) returns a cost estimate
  if s' is undefined then return h(s)          -- (s, a) not yet tried: optimistic guess
  else return c(s, a, s') + H[s']
Here c(s, a, s') is the step cost of taking action a in state s and arriving in s'.
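Below is a runnable Python sketch of the agent, assuming the `OnlineMaze`-style interface from earlier (`state`, `actions()`, `execute()`, `goal_test()` are illustrative, not standard). It keeps the `result` and `cost` tables explicitly rather than as persistent variables.

```python
# Runnable sketch of LRTA*, assuming the illustrative OnlineMaze interface
# above and a safely explorable space (actions() is never empty).
def lrta_star(env, h, max_steps=10_000):
    H = {}              # learned cost-to-goal estimates, H[s]
    result = {}         # result[(s, a)] -> observed successor s'
    cost = {}           # cost[(s, a)]   -> observed step cost c(s, a, s')
    known_actions = {}  # actions discovered at each visited state
    prev_s, prev_a = None, None

    def lrta_cost(s, a):
        # Untried (s, a): optimistically fall back to the raw heuristic h(s).
        if (s, a) not in result:
            return h(s)
        return cost[(s, a)] + H[result[(s, a)]]

    for _ in range(max_steps):
        s = env.state
        if env.goal_test():
            return H                       # reached the goal; stop
        if s not in H:
            H[s] = h(s)                    # initialize from heuristic
        known_actions.setdefault(s, env.actions())
        if prev_s is not None:
            result[(prev_s, prev_a)] = s   # record the observed transition
            # Back up: best one-step lookahead through any neighbor of prev_s.
            H[prev_s] = min(lrta_cost(prev_s, b) for b in known_actions[prev_s])
        # Greedy move w.r.t. current estimates (ties broken arbitrarily).
        a = min(known_actions[s], key=lambda b: lrta_cost(s, b))
        prev_s, prev_a = s, a
        _, c = env.execute(a)
        cost[(s, a)] = c                   # step cost learned only now
    return H

# Example run on a toy maze (hypothetical layout; Manhattan-distance heuristic):
maze = OnlineMaze(["S..", ".#.", "..G"], start=(0, 0), goal=(2, 2))
H = lrta_star(maze, h=lambda s: abs(s[0] - 2) + abs(s[1] - 2))
```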
Key Insight
LRTA* updates H[s] to the minimum estimated cost to the goal through any neighbor. Because each backup uses actually observed costs, the estimates only become more accurate over time, and with repeated trials they converge to the exact cost-to-goal values (for states along optimal paths, given an admissible h).
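For example, if the cheapest option from s is an action observed to cost 1 that leads to a neighbor with H[s'] = 4, the backup sets H[s] ← 1 + 4 = 5.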
Properties of LRTA*
| Property | Value |
|---|---|
| Complete (finite state space) | Yes — guaranteed to find goal |
| Optimal | No — greedy local moves; the executed path may revisit states and exceed optimal cost |
| H convergence | Yes — after sufficient exploration |
| Memory | O(states visited) |
LRTA* is complete as long as:
1. The state space is finite
2. The state space is safely explorable (no dead ends from which the goal is unreachable)
3. Step costs are positive and bounded
Exploration vs. Exploitation
Online search faces the fundamental exploration-exploitation tradeoff:
- Exploit: follow current best estimates → fast if the estimates are good
- Explore: gather information → improves future decisions
LRTA* is greedy (exploitative), but its H-value updates force exploration: each backup makes a visited state look more expensive, pushing the agent out of local minima when it gets trapped.
More sophisticated approaches (ε-greedy, UCB) come from the RL literature (Ch. 22, multi-armed bandits); a minimal sketch follows.
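A hedged sketch of grafting ε-greedy selection onto LRTA*'s action-selection step (the `epsilon` knob and `rng` argument are illustrative additions, not part of standard LRTA*):

```python
import random

def epsilon_greedy_action(s, actions, lrta_cost, epsilon=0.1, rng=random):
    """With probability epsilon take a random action (explore);
    otherwise take the estimated-cheapest one (exploit)."""
    if rng.random() < epsilon:
        return rng.choice(list(actions))
    return min(actions, key=lambda b: lrta_cost(s, b))
```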
Connection to RL
Online search = online RL without rewards (only goal/non-goal).
LRTA* updating H-values = TD learning updating value function estimates.
The Bellman update in RL, V(s) ← r + γ · V(s'), closely mirrors the LRTA* cost update, with rewards playing the role of negated costs (see the comparison below).
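Side by side (the correspondence assumes V(s) = −H[s], r = −c, and γ = 1):

```
LRTA*:  H[s] ← min over a of [ c(s, a, s') + H[s'] ]     (costs, minimized)
TD/VI:  V(s) ← max over a of [ r(s, a, s') + γ · V(s') ]  (rewards, maximized)
```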
DynamICCL uses online RL: the agent discovers NCCL performance characteristics during actual training runs (there is no offline simulator) and continuously adapts its policy, a direct application of online search principles.