7. Heuristic Functions
Source: AIMA 4th Ed., Chapter 3 (Section 3.6) Status: Complete
Overview
The quality of the heuristic function h(n) is the single biggest factor in A* performance. A perfect heuristic (h = h*, the true cost to the goal) would cause A* to expand only the nodes on an optimal path; a zero heuristic (h = 0) reduces A* to uniform-cost search (UCS). This section covers how to measure heuristic quality and how to systematically construct good heuristics.
Two Standard Heuristics for the 8-Puzzle
The 8-puzzle (3×3 sliding tile) is the standard benchmark. The start configuration has all 8 tiles out of place.
h1 — Misplaced tiles:
- Count the number of tiles not in their goal position (blank not counted)
- Example: for the standard instance (Figure 3.25), h1 = 8 (all tiles misplaced)
- Admissible because each misplaced tile requires at least one move
h2 — Manhattan distance (city-block distance):
- For each tile, sum the horizontal + vertical distance from its current position to its goal position
- Example: h2 = 3+1+2+2+2+3+3+2 = 18
- Admissible because tiles cannot jump over each other; each unit of Manhattan distance requires at least one move
- True optimal solution length is 26 for this instance
h2 is never worse than h1: for every node n, h2(n) >= h1(n). We say h2 dominates h1.
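A minimal sketch of both heuristics in Python (the tuple encoding and the goal layout with the blank on square 0 are assumptions, not from the text): a state is a tuple of 9 ints in reading order, with 0 for the blank.

```python
# Minimal sketch of h1 and h2 for the 8-puzzle.
# Assumption: a state is a tuple of 9 ints in reading order, 0 = blank,
# and the goal places the blank first: (0, 1, 2, ..., 8).
GOAL = (0, 1, 2, 3, 4, 5, 6, 7, 8)

def h1(state):
    """Misplaced tiles: tiles (not the blank) off their goal square."""
    return sum(1 for i, tile in enumerate(state)
               if tile != 0 and tile != GOAL[i])

def h2(state):
    """Manhattan distance: sum of per-tile horizontal + vertical offsets."""
    total = 0
    for i, tile in enumerate(state):
        if tile == 0:
            continue
        g = GOAL.index(tile)
        total += abs(i // 3 - g // 3) + abs(i % 3 - g % 3)
    return total

# The Figure 3.25 start state under this encoding; the values match the text.
s = (7, 2, 4, 5, 0, 6, 8, 3, 1)
assert h1(s) == 8 and h2(s) == 18
assert h2(s) >= h1(s)   # dominance holds here, as it does for every state
```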
3.6.1 Heuristic Accuracy and the Effective Branching Factor
Effective Branching Factor (b*)
A quantitative measure of heuristic quality. If A* generates N nodes to find a solution at depth d, the effective branching factor b* satisfies:
N + 1 = 1 + b* + (b*)^2 + ... + (b*)^d
Interpretation: b* is the branching factor a uniform tree of depth d would need to contain N+1 nodes. A perfect heuristic gives b* = 1 (only the optimal path is expanded). b* close to 1 indicates an excellent heuristic.
Example: A* finds solution at depth 5 using 52 nodes → b* ≈ 1.92.
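The defining equation has no closed form for b*, but its right-hand side is monotone in b*, so bisection recovers it numerically. A sketch (function names are illustrative):

```python
def effective_branching_factor(n_generated, depth, tol=1e-6):
    """Solve N + 1 = 1 + b* + (b*)^2 + ... + (b*)^d for b* by bisection."""
    def tree_size(b):
        # Nodes in a uniform tree of branching factor b and depth d.
        return sum(b ** i for i in range(depth + 1))

    lo, hi = 1.0, float(n_generated)   # b* lies between 1 and N for N > d
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if tree_size(mid) < n_generated + 1:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(round(effective_branching_factor(52, 5), 2))  # ~1.92, as in the text
```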
Comparison Table (8-puzzle, Figure 3.26)
| d | BFS nodes | A*(h1) nodes | A*(h2) nodes | b*(BFS) | b*(h1) | b*(h2) |
|---|---|---|---|---|---|---|
| 6 | 128 | 24 | 19 | 2.01 | 1.42 | 1.34 |
| 10 | 1033 | 116 | 48 | 1.85 | 1.43 | 1.27 |
| 14 | 6783 | 678 | 174 | 1.77 | 1.47 | 1.31 |
| 20 | 91493 | 9905 | 1318 | 1.69 | 1.50 | 1.34 |
| 28 | 463234 | 202565 | 22055 | 1.53 | 1.49 | 1.36 |
At d=14: BFS generates 6783 nodes, A*(h1) 678, and A*(h2) only 174 — roughly 40x fewer than BFS.
Effective Depth Reduction (Korf and Reid, 1998)
A heuristic h reduces the effective depth by a constant k_h compared to the true depth. The total search cost becomes O(b^(d - k_h)) compared to O(b^d) for uninformed search. For h2, k_h ≈ 1.5 for the 8-puzzle domain.
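A rough illustration with an assumed branching factor (b = 2 is not from the text): at d = 14, uninformed search costs on the order of 2^14 = 16,384 expansions, while k_h = 1.5 gives 2^(14 - 1.5) ≈ 5,793, roughly a 2.8x saving that grows with depth.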
3.6.2 Generating Heuristics from Relaxed Problems
Key insight: h1 and h2 are not arbitrary inventions — they are exact solutions to simplified versions of the 8-puzzle.
Definition: A relaxed problem is one where some constraints on actions are removed. The state-space graph of the relaxed problem is a supergraph of the original (more edges, therefore shorter or equal optimal paths).
Fundamental theorem: The optimal solution cost to a relaxed problem is an admissible heuristic for the original problem.
Furthermore, since the derived heuristic is an exact cost for the relaxed problem, it obeys the triangle inequality and is therefore consistent.
Deriving h1 and h2 from 8-puzzle rules
Original rule: “A tile can move from square X to square Y if X is adjacent to Y and Y is blank.”
Three relaxations by removing conditions:
- (a) Remove “Y is blank”: a tile can move to any adjacent square (even occupied). Optimal solution = Manhattan distance = h2
- (b) Remove “X is adjacent to Y”: a tile can teleport to any blank square. Gives a different heuristic (discussed in Exercise 3.GASC)
- (c) Remove both conditions: a tile can move anywhere in one step. Optimal solution = number of misplaced tiles = h1
This shows the relaxed problem method can automatically generate admissible heuristics if the problem is described in a formal language. The program ABSOLVER (Prieditis, 1993) does this automatically and found the first useful heuristic for Rubik’s Cube.
Critical requirement: the relaxed problem must be solvable without search (or cheaply), otherwise computing h for each node is too expensive. Both h1 and h2 for the 8-puzzle decompose into independent per-tile subproblems, so each evaluation takes time linear in the number of tiles (effectively constant for a fixed puzzle size).
Combining Multiple Heuristics
If several admissible heuristics h1, …, hk are available and none dominates the others:
h(n) = max{h1(n), ..., hk(n)}
This composite heuristic:
- Is admissible (the max of admissible heuristics is admissible)
- Is consistent (if all h_i are consistent)
- Dominates all individual heuristics
Cost: each evaluation now takes O(k) heuristic computations instead of one. Alternative: use machine learning to predict which single heuristic will be best for a given state, though the resulting heuristic can be inconsistent.
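A one-function sketch of the max-combination, reusing the h1/h2 functions defined earlier (the name `combined` is illustrative):

```python
def combined(state, heuristics=(h1, h2)):
    """Max of admissible heuristics: still admissible, dominates each h_i,
    but costs one evaluation per heuristic (O(k) instead of O(1))."""
    return max(h(state) for h in heuristics)
```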
3.6.3 Pattern Databases
Motivation: subproblem solution costs can serve as admissible heuristics. For the 8-puzzle, instead of estimating per-tile costs independently, precompute exact costs for getting a subset of tiles into place.
Pattern database construction:
1. Choose a subset of tiles (e.g., tiles 1, 2, 3, 4, and the blank)
2. Do a backward search from the goal state, considering only moves of those tiles
3. For each reachable configuration of those tiles, store the exact number of moves needed
This yields h_DB(state) = exact cost to solve the subproblem, looked up in O(1).
Example for the 15-puzzle:
- Database for tiles {1,2,3,4,blank}: 16 × 15 × 14 × 13 × 12 = 524,160 entries (the corresponding 8-puzzle database has 9 × 8 × 7 × 6 × 5 = 15,120)
- This h_DB is more accurate than Manhattan distance for many states
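A compact sketch of the backward breadth-first search from step 2, for the smaller 8-puzzle database over tiles {1,2,3,4} plus the blank (the state encoding and goal are assumptions carried over from the earlier sketch):

```python
from collections import deque

# A pattern records only the positions of tiles 1-4 and the blank; the
# remaining tiles are indistinguishable. Assumption: goal as in the
# earlier sketch, blank on square 0, tiles 1-4 on squares 1-4.
GOAL_PATTERN = (1, 2, 3, 4, 0)   # (pos of tile 1, ..., pos of tile 4, blank)

def neighbors(square):
    """Squares adjacent to `square` on the 3x3 board."""
    r, c = divmod(square, 3)
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        if 0 <= r + dr < 3 and 0 <= c + dc < 3:
            yield (r + dr) * 3 + (c + dc)

def build_pattern_db():
    """Backward BFS from the goal pattern; every blank move costs 1."""
    db = {GOAL_PATTERN: 0}
    frontier = deque([GOAL_PATTERN])
    while frontier:
        pattern = frontier.popleft()
        tiles, blank = pattern[:-1], pattern[-1]
        for nb in neighbors(blank):
            # The blank swaps with whatever sits on nb: a tracked tile
            # moves to the blank's old square; untracked tiles are
            # indistinguishable, so only the blank position changes.
            succ = tuple(blank if p == nb else p for p in tiles) + (nb,)
            if succ not in db:
                db[succ] = db[pattern] + 1
                frontier.append(succ)
    return db

db = build_pattern_db()
print(len(db))   # 15120 patterns (9*8*7*6*5); all placements are reachable

def h_db(state, db=db):
    """Exact subproblem cost, looked up in O(1). Admissible: every move
    of the real puzzle is also a move in the pattern space."""
    key = tuple(state.index(t) for t in (1, 2, 3, 4)) + (state.index(0),)
    return db[key]
```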
Multiple Pattern Databases
Build separate databases for different tile subsets (e.g., {1,2,3,4} and {5,6,7,8}). Take the maximum — still admissible. Reduces nodes generated for random 15-puzzles by factor of ~1000 compared to Manhattan distance.
Disjoint Pattern Databases
Problem with summing databases: tiles 1-4 and 5-8 share moves, so adding their costs would double-count and exceed the true cost (inadmissible).
Solution — disjoint pattern databases: count only moves that involve the designated tiles. For tiles {1,2,3,4}, count only moves where tiles 1, 2, 3, or 4 are the tile being moved. Then the costs are disjoint and can be summed (not just max-ed) while remaining admissible.
h_disjoint = (cost of moves involving tiles 1-4) + (cost of moves involving tiles 5-8)
This is still a lower bound because the two counts partition the actual moves.
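A sketch of the disjoint lookup, assuming two databases built as above but with the modified move cost: a move costs 1 only when it relocates a tracked tile and 0 when the blank swaps with an untracked one, which requires a 0/1-cost search (Dijkstra or 0-1 BFS) instead of plain BFS. The key layout is an assumption:

```python
def h_disjoint(state, db_1to4, db_5to8):
    """Sum of disjoint pattern costs: admissible because every real move
    relocates exactly one tile, so it is counted in exactly one database."""
    key_a = tuple(state.index(t) for t in (1, 2, 3, 4))
    key_b = tuple(state.index(t) for t in (5, 6, 7, 8))
    return db_1to4[key_a] + db_5to8[key_b]
```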
Performance gain: disjoint pattern databases reduce nodes for random 15-puzzles by a factor of ~10,000 compared to Manhattan distance. For 24-puzzles, speedups of roughly 1,000,000x are possible.
Limitation: only applies when the problem decomposes so that each move affects only one subproblem (the 8/15-puzzle satisfies this because only one tile moves at a time).
3.6.4 Landmark-Based Heuristics
Motivation: large real-world graphs (e.g., road networks with millions of vertices) are too big for pattern databases. Precomputing all pairwise shortest paths requires O(|V|^2) space and prohibitive time (e.g., O(|V| |E| log |V|) for running Dijkstra from every vertex).
Landmark heuristic h_L: choose k landmark points L_1, …, L_k. Precompute C*(v, L) = exact shortest path cost from each vertex v to each landmark L. Then:
h_L(n) = min_{L in Landmarks} [C*(n, L) + C*(L, goal)]
Property: h_L is an efficient heuristic (O(k) lookup) but not admissible — if the optimal path does not pass through any landmark, h_L may overestimate.
Differential heuristic h_DH (admissible):
h_DH(n) = max_{L in Landmarks} |C*(n, L) - C*(goal, L)|
Intuition: if L is “beyond” the goal along the optimal path from n, then C*(n,L) = C*(n, goal) + C*(goal, L), so |C*(n,L) - C*(goal,L)| = C*(n, goal) exactly. The heuristic is admissible because the triangle inequality ensures it never overestimates.
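A sketch of both heuristics given precomputed landmark distances (here `dist_to[L][v]` is an assumed dict holding C*(v, L); an undirected graph is assumed, so C*(v, L) = C*(L, v)):

```python
def h_landmark(n, goal, landmarks, dist_to):
    """Route-through-landmark estimate: efficient but NOT admissible."""
    return min(dist_to[L][n] + dist_to[L][goal] for L in landmarks)

def h_differential(n, goal, landmarks, dist_to):
    """Differential heuristic: admissible, since the triangle inequality
    gives |C*(n,L) - C*(goal,L)| <= C*(n, goal)."""
    return max(abs(dist_to[L][n] - dist_to[L][goal]) for L in landmarks)
```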
Choosing landmarks:
- Random: fast, reasonable quality
- Greedy spread: pick the first landmark at random, then iteratively add the point furthest from all existing landmarks (see the sketch after this list)
- Perimeter landmarks: find the centroid, arrange k pie wedges around it, and pick the farthest vertex in each wedge
- Frequency-based: use vertices that appear frequently in past search requests
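A sketch of greedy spread on an unweighted, connected graph (the adjacency-dict representation and hop-count distances are assumptions):

```python
from collections import deque

def bfs_dists(graph, source):
    """Hop distances from `source` in an adjacency-dict graph."""
    dist = {source: 0}
    q = deque([source])
    while q:
        u = q.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def greedy_spread(graph, k, first):
    """Pick `first`, then repeatedly add the vertex farthest from all
    landmarks chosen so far. Assumes a connected graph."""
    landmarks = [first]
    nearest = bfs_dists(graph, first)     # distance to closest landmark
    for _ in range(k - 1):
        far = max(nearest, key=nearest.get)
        landmarks.append(far)
        for v, d in bfs_dists(graph, far).items():
            nearest[v] = min(nearest[v], d)
    return landmarks
```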
Shortcuts: some route-finding algorithms predefine artificial “shortcut” edges covering entire sub-paths (e.g., between major cities), reducing the effective graph depth.
Real-world impact: combining bidirectional A* with differential heuristics and landmarks, Microsoft’s online map service could find optimal routes in a 24-million-vertex graph of the US, searching less than 0.1% of the graph.
3.6.5 Learning to Search Better
Metalevel state space: the internal computational state of a search algorithm (e.g., the current search tree in A*) can itself be treated as a state space. Each computation step (expanding a node) is an action.
Metalevel learning: a metalevel learning algorithm learns from the experience of searching many problem instances to predict which computation steps are most fruitful, minimizing the total cost of problem solving (search time + solution cost).
This is where classical search connects to reinforcement learning — Chapter 22 covers techniques for learning to minimize the total cost of problem solving by treating search as a decision problem.
3.6.6 Learning Heuristics from Experience
Instead of deriving h analytically (relaxed problems) or precomputing it (pattern databases), learn h from data.
Approach:
1. Collect (state, optimal-cost-to-goal) pairs by solving many problem instances
2. Train a function approximator (e.g., a linear model or a neural network) on these pairs
3. Use the learned function as h during future searches
Feature-based linear heuristic: For each state n, extract features x_1(n), x_2(n), … (e.g., x_1 = number of misplaced tiles; x_2 = number of pairs of adjacent tiles that are not adjacent in the goal state). Fit:
h(n) = c_1 * x_1(n) + c_2 * x_2(n)
where c_i are learned to best fit the training data.
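A sketch of the fit as ordinary least squares with NumPy (the feature functions stand in for the x_i above; the training pairs are assumed given):

```python
import numpy as np

def fit_linear_heuristic(states, optimal_costs, features):
    """Least-squares fit of h(n) = c_1*x_1(n) + ... + c_m*x_m(n)."""
    X = np.array([[f(s) for f in features] for s in states])
    y = np.array(optimal_costs)
    c, *_ = np.linalg.lstsq(X, y, rcond=None)
    return lambda state: float(np.dot(c, [f(state) for f in features]))

# Usage sketch: with x_1 = h1 from earlier and a second feature function x2,
# h = fit_linear_heuristic(train_states, train_costs, [h1, x2])
```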
Tradeoff:
- May produce an inadmissible h (overestimating some states) → no optimality guarantee
- But can be more accurate on average than hand-crafted heuristics
- Leads to a three-way tradeoff between learning time, search runtime, and solution quality
Connection to RL: the reinforcement learning methods in Chapter 22 (particularly value function approximation) are directly applicable — the value function in RL is essentially a learned heuristic.
Heuristic Domination Summary
If h2(n) >= h1(n) for all nodes n, then h2 dominates h1:
- A* with h2 expands a subset of the nodes that A* with h1 expands (up to tie-breaking among equal f-values)
- Every node with h1(n) < C* - g(n) is surely expanded; since h2(n) >= h1(n), fewer nodes satisfy this condition with h2
- Generally: a higher (but still admissible/consistent) heuristic means better performance
Practical principle: among admissible heuristics with similar computation cost, always prefer the one with higher values.