Heuristic Evaluation Functions and Imperfect Real-Time Decisions
The Problem with Full Search
Even with alpha-beta, deep game trees are intractable. We must cut off search at a depth limit and evaluate the resulting non-terminal states.
function H-MINIMAX(state, depth) returns a utility value
    -- same as MINIMAX, but IS-TERMINAL → CUTOFF-TEST and UTILITY → EVAL
    if CUTOFF-TEST(state, depth) then return EVAL(state, player)
    if TO-MOVE(state) = MAX then
        return max over a in ACTIONS(state) of H-MINIMAX(RESULT(state, a), depth + 1)
    else return min over a in ACTIONS(state) of H-MINIMAX(RESULT(state, a), depth + 1)
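As a concrete illustration, a minimal Python sketch of depth-limited minimax; the `game` interface (`actions`, `result`, `is_terminal`, `utility`) and `eval_fn` are hypothetical stand-ins, not fixed by the text above.

```python
# Minimal depth-limited minimax sketch. The `game` interface (actions,
# result, is_terminal, utility) and `eval_fn` are hypothetical stand-ins.
def h_minimax(game, state, depth, limit, eval_fn, maximizing):
    if game.is_terminal(state):
        return game.utility(state)        # exact value at terminal states
    if depth >= limit:
        return eval_fn(state)             # heuristic estimate at the cutoff
    values = (h_minimax(game, game.result(state, a), depth + 1, limit,
                        eval_fn, not maximizing)
              for a in game.actions(state))
    return max(values) if maximizing else min(values)
```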
Evaluation Functions
An evaluation function maps a non-terminal game state to a numeric estimate of its utility.
Requirements:
1. Terminal states: must agree with UTILITY (exact values)
2. Non-terminal states: must approximate the true minimax value
3. Efficiency: must be computable quickly (it is called millions of times)
Linear Weighted Features (Classic Approach)
EVAL(s) = w₁·f₁(s) + w₂·f₂(s) + ... + wₙ·fₙ(s)
Where fᵢ(s) are features (numeric properties of the state) and wᵢ are weights.
Chess example (Shannon’s material evaluation):
EVAL(s) = 9·(♛MAX - ♛MIN) + 5·(♜MAX - ♜MIN) + 3·(♝MAX + ♞MAX - ♝MIN - ♞MIN) + 1·(♟MAX - ♟MIN) + ...
Piece values: Queen=9, Rook=5, Bishop=Knight=3, Pawn=1.
Other features: pawn structure (doubled/isolated pawns), king safety, bishop pairs, rook on open file, control of center squares.
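A minimal Python sketch of this material count; the per-side piece-count dictionaries are a hypothetical board representation.

```python
# Shannon-style linear material evaluation. The board representation
# (per-side piece counts) is a hypothetical stand-in.
PIECE_VALUES = {"Q": 9, "R": 5, "B": 3, "N": 3, "P": 1}

def material_eval(max_counts, min_counts):
    """EVAL(s) = sum_i w_i * (count_i(MAX) - count_i(MIN))."""
    return sum(w * (max_counts.get(p, 0) - min_counts.get(p, 0))
               for p, w in PIECE_VALUES.items())

# Example: MAX is up a knight, MIN is up a pawn -> 3 - 1 = +2
print(material_eval({"N": 2, "P": 8}, {"N": 1, "P": 9}))  # prints 2
```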
Non-Linear Evaluation
Modern programs use neural networks for evaluation:
- Input: board representation (piece positions, e.g. 12×8×8 feature planes)
- Output: win probability in [0, 1]
- Trained by: self-play (AlphaZero), tablebases, human games
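For flavor, a toy PyTorch sketch of such a value network over 12×8×8 planes; the layer sizes are illustrative and not any engine's actual architecture.

```python
import torch
import torch.nn as nn

# Illustrative value network over 12x8x8 piece planes (a toy sketch, not
# any real engine's architecture): conv trunk -> sigmoid win probability.
class ValueNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(12, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 8 * 8, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Sigmoid(),       # win probability in [0, 1]
        )

    def forward(self, planes):                    # planes: (batch, 12, 8, 8)
        return self.net(planes)

print(ValueNet()(torch.zeros(1, 12, 8, 8)).shape)  # torch.Size([1, 1])
```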
The Horizon Effect
A major failure mode of fixed-depth cutoff search.
Problem: catastrophic events happen just beyond the search horizon.
Example: an enemy pawn is about to promote on move d+1, just beyond the depth-d horizon. The program cannot see the promotion, evaluates the position as good for MAX, plays on, and is facing a new enemy queen on the very next move.
Mitigation:
1. Quiescence search: extend the search until a “quiet” position is reached (no pending captures or checks), so the cutoff never lands in the middle of a tactical sequence.
2. Singular extension: if one move is clearly better than all others, extend that branch.
Quiescence Search
function QUIESCENCE(state, α, β)
    stand_pat ← eval(state)                         -- static eval, from the side to move
    if stand_pat ≥ β then return β                  -- fail-hard β cutoff
    α ← max(α, stand_pat)                           -- “standing pat” is always an option
    for each capture/check move a do                -- tactical moves only
        score ← -QUIESCENCE(RESULT(state, a), -β, -α)   -- negamax recursion
        if score ≥ β then return β
        α ← max(α, score)
    return α
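The same routine as a Python sketch; `evaluate` (scored from the side to move), `tactical_moves` (captures and checks only), and `result` are hypothetical helpers a real engine would supply.

```python
# Negamax quiescence sketch. `evaluate`, `tactical_moves`, and `result`
# are hypothetical helpers supplied by the surrounding engine.
def quiescence(state, alpha, beta, evaluate, tactical_moves, result):
    stand_pat = evaluate(state)
    if stand_pat >= beta:
        return beta                      # fail-hard cutoff
    alpha = max(alpha, stand_pat)        # standing pat is always an option
    for move in tactical_moves(state):
        score = -quiescence(result(state, move), -beta, -alpha,
                            evaluate, tactical_moves, result)
        if score >= beta:
            return beta
        alpha = max(alpha, score)
    return alpha
```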
Cutoff Criteria
When to stop the search and apply the eval function?
- Fixed depth: simple, with predictable time per move
- Time budget: keep searching until the clock runs out (iterative deepening; see the sketch after this list)
- Quiescent position: only stop when the position is “stable”
- Beam thresholding: only explore nodes whose values fall within a margin of the current best
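A minimal sketch of the time-budget strategy, assuming a hypothetical depth-limited search routine `search_to_depth`:

```python
import time

# Iterative deepening under a time budget: repeatedly run a depth-limited
# search, keeping the best move from the deepest completed iteration.
# `search_to_depth` is a hypothetical depth-limited alpha-beta routine.
def timed_search(state, budget_seconds, search_to_depth):
    deadline = time.monotonic() + budget_seconds
    best_move, depth = None, 1
    while time.monotonic() < deadline:
        best_move = search_to_depth(state, depth)  # completed depth-d result
        depth += 1
    return best_move
```

Real engines also abort the in-progress iteration when the deadline hits and fall back to the last completed depth’s move.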
Learning Evaluation Functions
Instead of hand-crafting features and weights, learn them:
- Self-play: play many games, use outcomes (win/loss/draw) as training signal
- Temporal difference learning: update weights based on prediction error (Samuel’s checkers, 1959 — one of the first ML programs)
  - Target: the backed-up minimax value of s (the search result is a better estimate than the static EVAL(s))
  - Error = EVAL(s) − (backed-up minimax value of s)
  - Update the weights wᵢ in the direction that reduces the error
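For the linear EVAL above, one gradient-style update looks roughly like this; the learning rate and helper names are illustrative:

```python
# One TD-style weight update for a linear EVAL(s) = w . f(s).
# `features` and the backed-up `target` value would come from a
# hypothetical search routine; the learning rate is illustrative.
def td_update(weights, features, target, lr=0.01):
    prediction = sum(w * f for w, f in zip(weights, features))
    error = prediction - target            # EVAL(s) - backed-up value
    # Gradient step on squared error: d(error^2)/dw_i = 2 * error * f_i
    return [w - lr * error * f for w, f in zip(weights, features)]
```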
TD-Gammon (backgammon, 1992): learned to world-class level purely from self-play. AlphaZero: neural EVAL trained purely from self-play, no human features.
Properties Summary
| Component | Role |
|---|---|
| Depth limit | Controls time budget |
| Eval function | Approximates utility at cutoff |
| Quiescence search | Extends past tactical sequences |
| Move ordering | Improves α-β pruning effectiveness |
| Transposition table | Avoids re-evaluating seen positions |
Connection to RL
Heuristic evaluation = value function in RL. Learning eval functions by self-play = temporal difference learning (TD(λ)).
AlphaZero: policy + value networks trained from self-play (value targets come from game outcomes) → directly unifies game-playing and RL value estimation.
For DynamICCL: the RL value function approximating expected future NCCL throughput plays the same role as a heuristic eval — estimates long-term reward from current state.