Online Search Agents and Unknown Environments

| Offline | Online |
|---|---|
| Full map known before search | Map discovered during execution |
| Computes full plan, then executes | Interleaves computation and execution |
| Can backtrack virtually | Backtracking may be impossible (physical world) |

Online search is necessary when:
- The environment is unknown and must be explored
- Actions have costs (backtracking is expensive)
- The agent must act in real time


Online Search Problem

Formulation:
- Actions: ACTIONS(s), the finite set available at the current state
- Step-cost function: c(s, a, s'), known only after the action is taken and s' is observed
- Goal test: the goal is recognized only when it is reached

The agent cannot compute future costs in advance — it learns the map as it explores.
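The formulation above can be sketched as an interface in which the agent can only query actions at its current state, and the step cost is revealed only by executing the action. This is a minimal illustrative sketch (the class and method names are assumptions, not a standard API):

```python
# Hypothetical online search problem: a grid world revealed one step at a time.
class OnlineGridProblem:
    def __init__(self, width, height, goal):
        self.width, self.height, self.goal = width, height, goal

    def actions(self, s):
        # Actions available at the *current* state only.
        x, y = s
        moves = {"up": (x, y + 1), "down": (x, y - 1),
                 "left": (x - 1, y), "right": (x + 1, y)}
        return [a for a, (nx, ny) in moves.items()
                if 0 <= nx < self.width and 0 <= ny < self.height]

    def step(self, s, a):
        # Executing the action reveals the successor and the step cost c(s, a, s').
        x, y = s
        dx, dy = {"up": (0, 1), "down": (0, -1),
                  "left": (-1, 0), "right": (1, 0)}[a]
        return (x + dx, y + dy), 1  # uniform step cost of 1

    def goal_test(self, s):
        return s == self.goal
```

There is no method for enumerating successors of arbitrary states: the agent learns the map only by moving through it.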


Competitive Ratio

For online search, optimality is replaced by competitive ratio:

Competitive ratio = (cost of online solution) / (cost of optimal offline solution)

In general, no online algorithm can guarantee a bounded competitive ratio: an adversary can hide the goal behind arbitrarily long dead ends that the agent must explore, and if some actions are irreversible the ratio can be infinite.
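A small worked example of the ratio (the maze and its costs are hypothetical numbers for illustration):

```python
# An online agent walks into a dead-end corridor of length 3, walks back,
# then takes the real path of length 4. An offline planner with the full
# map goes straight to the goal.
online_cost = 3 + 3 + 4   # dead end in, dead end out, then the real path
offline_cost = 4          # optimal path when the map is known in advance

competitive_ratio = online_cost / offline_cost
print(competitive_ratio)  # 2.5
```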


LRTA* (Learning Real-Time A*)

A practical online search algorithm that:
1. Learns H-values (heuristic estimates) from experience
2. Makes locally optimal moves w.r.t. current H-estimates
3. Updates H-values based on actual costs observed

function LRTA*-AGENT(s') returns an action
    -- s' is the current state (percept)
    persistent: result, a table mapping (state, action) to the observed successor
                H, a table of cost-to-goal estimates, indexed by state
                s, a, the previous state and action (initially null)
    if GOAL-TEST(s') then return stop
    if s' is a new state (not in H) then H[s'] ← h(s')   -- initialize from heuristic
    if s is not null then
        result[s, a] ← s'
        -- Update H value of previous state
        H[s] ← min over each b in ACTIONS(s) of:
                   LRTA*-COST(s, b, result[s, b], H)
    -- Choose the action minimizing estimated cost from s'
    a ← argmin over b in ACTIONS(s') of LRTA*-COST(s', b, result[s', b], H)
    s ← s'
    return a

function LRTA*-COST(s, a, s', H)
    if s' is undefined then return h(s)
    else return c(s, a, s') + H[s']

Where c(s, a, s') is the step cost of taking action a in state s leading to s'.
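The pseudocode above can be condensed into a runnable per-trial loop. This is a minimal sketch on an empty grid with uniform step costs and a Manhattan heuristic; the function and helper names are illustrative:

```python
# Minimal LRTA* trial on an empty width x height grid, uniform step cost 1.
def neighbors(s, width, height):
    x, y = s
    cand = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
    return [(nx, ny) for nx, ny in cand if 0 <= nx < width and 0 <= ny < height]

def manhattan(s, goal):
    return abs(s[0] - goal[0]) + abs(s[1] - goal[1])

def lrta_star_trial(start, goal, width, height, H, max_steps=1000):
    """Run one trial; H is the persistent table of cost-to-goal estimates."""
    s, cost = start, 0
    for _ in range(max_steps):
        if s == goal:
            return cost
        if s not in H:
            H[s] = manhattan(s, goal)          # initialize from the heuristic
        # LRTA*-COST: step cost (1 here) plus estimated cost from the successor
        def lrta_cost(s2):
            return 1 + H.get(s2, manhattan(s2, goal))
        succs = neighbors(s, width, height)
        # Update H[s]: best estimated cost to goal through any neighbor
        H[s] = min(lrta_cost(s2) for s2 in succs)
        s = min(succs, key=lrta_cost)          # locally optimal move
        cost += 1
    raise RuntimeError("step limit exceeded")
```

With a perfect heuristic (Manhattan distance on an obstacle-free grid), a trial from (0, 0) to (3, 3) walks straight to the goal in 6 steps, and H[(0, 0)] is updated to the exact cost 6.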

Key Insight

LRTA* updates H[s] to be at least as large as the best estimated cost to the goal through any neighbor. This makes H-values consistent over time — they converge to exact values with repeated trials.


Properties of LRTA*

| Property | Value |
|---|---|
| Complete (finite state space) | Yes — guaranteed to find goal |
| Optimal | No — may revisit states |
| H convergence | Yes — after sufficient exploration |
| Memory | O(states visited) |

LRTA* is complete as long as:
1. The state space is finite
2. The state space is safely explorable (no dead ends from which the goal is unreachable)
3. Step costs are positive (bounded away from zero)
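Completeness and H-convergence can be demonstrated on a maze where the heuristic actively misleads. In this self-contained sketch (the layout and names are illustrative), a wall forces a detour, so Manhattan distance underestimates; repeated trials drive the H-values up to the exact costs:

```python
# A 3x3 grid with a wall: the goal (2, 0) is a Manhattan distance of 2 from
# the start (0, 0), but the shortest real path around the wall takes 6 steps.
WALLS = {(1, 0), (1, 1)}
GOAL = (2, 0)

def succs(s):
    x, y = s
    cand = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
    return [(nx, ny) for nx, ny in cand
            if 0 <= nx < 3 and 0 <= ny < 3 and (nx, ny) not in WALLS]

def h(s):
    return abs(s[0] - GOAL[0]) + abs(s[1] - GOAL[1])

def trial(start, H, max_steps=100):
    s, cost = start, 0
    while s != GOAL:
        if cost >= max_steps:
            raise RuntimeError("step limit exceeded")
        estimate = lambda s2: 1 + H.get(s2, h(s2))
        H[s] = min(estimate(s2) for s2 in succs(s))   # LRTA* update
        s = min(succs(s), key=estimate)               # locally best move
        cost += 1
    return cost

H = {}
costs = [trial((0, 0), H) for _ in range(4)]
print(costs, H[(0, 0)])
```

Every trial reaches the goal (completeness), and after the second trial H[(0, 0)] has risen from the heuristic value 2 to the exact cost-to-goal of 6.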


Exploration vs. Exploitation

Online search faces the fundamental exploration-exploitation tradeoff:
- Exploit: follow current best estimates → fast if estimates are good
- Explore: gather information → improves future decisions

LRTA* is greedy (exploitative) but forces exploration through backtracking when trapped.

More sophisticated approaches (ε-greedy, UCB) come from the RL literature (Ch.22, multi-armed bandits).
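As a taste of those approaches, here is a minimal sketch of ε-greedy action selection (the function name and setup are illustrative, not a specific library API): with probability ε the agent explores a random action, otherwise it exploits its current best estimate.

```python
import random

def epsilon_greedy(estimates, eps, rng=random):
    """estimates: dict mapping action -> estimated value (higher is better)."""
    if rng.random() < eps:
        return rng.choice(list(estimates))        # explore: random action
    return max(estimates, key=estimates.get)      # exploit: current best

random.seed(0)
Q = {"left": 0.2, "right": 0.9}
picks = [epsilon_greedy(Q, eps=0.1) for _ in range(1000)]
print(picks.count("right"))  # mostly "right", with occasional exploratory "left"
```

Setting eps=0 recovers a purely greedy agent like LRTA*'s move rule; raising eps trades short-term cost for information.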


Connection to RL

Online search = online RL without rewards (only goal/non-goal).

LRTA* updating H-values = TD learning updating value function estimates.

The Bellman update in RL (V(s) ← r + γ · V(s')) is essentially the LRTA* cost update, with step costs replaced by (negated) rewards and a discount factor.
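The parallel can be made concrete with two tabular one-step updates (an illustrative sketch; the state names and step sizes are assumptions): LRTA* lowers a cost-to-goal table H via a min over successors, while tabular TD(0) nudges a value table V toward the bootstrapped target r + γ · V(s'). The signs flip because H estimates cost and V estimates value.

```python
def lrta_update(H, s, succ_costs):
    """succ_costs: list of (step_cost, successor) pairs for the neighbors of s."""
    H[s] = min(c + H.get(s2, 0) for c, s2 in succ_costs)

def td0_update(V, s, r, s2, alpha=0.5, gamma=1.0):
    """Tabular TD(0): move V(s) toward the bootstrapped target r + gamma*V(s2)."""
    V[s] = V.get(s, 0.0) + alpha * (r + gamma * V.get(s2, 0.0) - V.get(s, 0.0))

H = {"goal": 0}
lrta_update(H, "a", [(1, "goal")])   # cost 1 to reach the goal from "a"
print(H["a"])                        # 1

V = {"goal": 0.0}
td0_update(V, "a", r=-1.0, s2="goal", alpha=1.0)  # reward -1 per step
print(V["a"])                        # -1.0, i.e. the negated cost
```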

DynamICCL uses online RL: the agent discovers NCCL performance characteristics during actual training runs (no offline simulator) and adapts its policy continuously — a direct application of online search principles.