Monte Carlo Tree Search (MCTS)
Motivation
Alpha-beta requires a good evaluation function — hard to design for many games.
MCTS replaces the evaluation function with random simulation (playouts): play random moves until the game ends, use the outcome to estimate state value. No handcrafted features needed.
Core MCTS Loop
MCTS builds a search tree incrementally. Each iteration has four phases:
repeat until time limit:
1. SELECTION
2. EXPANSION
3. SIMULATION (rollout/playout)
4. BACKPROPAGATION
return action with highest value from root
1. Selection
Starting from the root, descend the tree using a tree policy until reaching a node that is not fully expanded (i.e., has untried actions) or is terminal.
2. Expansion
Add one (or more) new child nodes to the tree from the selected node’s unvisited actions.
3. Simulation (Rollout)
From the new node, play out the game using a default policy (usually random moves) until a terminal state. Record the outcome (win/loss/draw).
4. Backpropagation
Update statistics for all nodes on the path from the new node to the root:
- N(v) ← N(v) + 1 (visit count)
- W(v) ← W(v) + result (total wins)
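In code, this update is a short walk up the tree. A minimal sketch, with illustrative names (`Node`, `visits`, `wins`); the opponent-perspective flip for the result is shown later in the full BACKUP pseudocode:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    """Per-node MCTS statistics (illustrative field names)."""
    parent: Optional["Node"] = None
    visits: int = 0    # N(v): playouts that passed through v
    wins: float = 0.0  # W(v): total reward accumulated at v

def backpropagate(node, result):
    # Walk from the newly expanded node up to the root.
    while node is not None:
        node.visits += 1     # N(v) <- N(v) + 1
        node.wins += result  # W(v) <- W(v) + result
        node = node.parent
```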
UCB1 Tree Policy (UCT)
UCT (Upper Confidence bounds applied to Trees) — the most common tree policy:
UCB1(v) = W(v)/N(v) + c · √(ln(N(parent)) / N(v))
- W(v)/N(v): exploitation term — average win rate from v
- c · √(ln(N(parent))/N(v)): exploration term — prefer less-visited nodes
- c ≈ √2: exploration constant (tunable)
Select child that maximizes UCB1 at each level during selection.
This balances:
- Exploitation: choosing nodes with high empirical win rate
- Exploration: visiting under-explored nodes (high uncertainty)
UCT is derived from the multi-armed bandit UCB1 algorithm applied to game trees.
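A direct transcription of the formula above (a sketch; the parameter names are illustrative, mirroring W(v), N(v), and N(parent)):

```python
import math

def ucb1(wins, visits, parent_visits, c=math.sqrt(2)):
    """UCB1(v) = W(v)/N(v) + c * sqrt(ln(N(parent)) / N(v))."""
    if visits == 0:
        return float("inf")  # force each child to be tried at least once
    exploitation = wins / visits                                  # average win rate
    exploration = c * math.sqrt(math.log(parent_visits) / visits)  # uncertainty bonus
    return exploitation + exploration
```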
Full UCT Algorithm
function UCT-SEARCH(state₀) returns action
tree ← new tree with root node v₀ (state₀)
while within computation budget do
vₗ ← TREE-POLICY(tree, v₀) -- selection + expansion
Δ ← DEFAULT-POLICY(vₗ.state) -- simulation
BACKUP(vₗ, Δ) -- backpropagation
return BEST-CHILD(v₀, 0).action -- exploitation only (c=0)
function TREE-POLICY(tree, v) returns node
while v is not terminal do
if v is not fully expanded then
return EXPAND(v)
else
v ← BEST-CHILD(v, c) -- UCB1 selection
return v
function DEFAULT-POLICY(state) returns reward
while state is not terminal do
action ← random action from ACTIONS(state)
state ← RESULT(state, action)
return REWARD(state)
function BACKUP(v, Δ)
while v is not null do
N(v) ← N(v) + 1
W(v) ← W(v) + Δ
Δ ← 1 - Δ -- flip reward for opponent
v ← v.parent
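Putting the four functions together, here is a compact, runnable Python translation of the pseudocode. This is a sketch, not a canonical implementation: the game interface (`actions`, `result`, `is_terminal`, `reward`) and the tiny Nim game are assumptions made so the example is self-contained. Rewards are in [0, 1], matching the Δ ← 1 − Δ flip in BACKUP.

```python
import math
import random

class Nim:
    """Tiny test game so the example runs end-to-end (hypothetical interface).
    Players alternately remove 1-3 stones; taking the last stone wins."""
    def __init__(self, stones=15, to_move=0):
        self.stones, self.to_move = stones, to_move
    def actions(self):
        return list(range(1, min(3, self.stones) + 1))
    def result(self, action):
        return Nim(self.stones - action, 1 - self.to_move)
    def is_terminal(self):
        return self.stones == 0
    def reward(self, player):
        winner = 1 - self.to_move  # the player who took the last stone
        return 1.0 if player == winner else 0.0

class Node:
    def __init__(self, state, parent=None, action=None):
        self.state, self.parent, self.action = state, parent, action
        self.children = []
        self.untried = list(state.actions())  # actions not yet expanded
        self.visits, self.wins = 0, 0.0       # N(v), W(v)

    def best_child(self, c):
        # BEST-CHILD: maximize UCB1; c = 0 gives pure exploitation.
        return max(self.children,
                   key=lambda ch: ch.wins / ch.visits
                   + c * math.sqrt(math.log(self.visits) / ch.visits))

def tree_policy(node, c):
    # Selection + expansion.
    while not node.state.is_terminal():
        if node.untried:  # not fully expanded: expand one child
            action = node.untried.pop()
            child = Node(node.state.result(action), parent=node, action=action)
            node.children.append(child)
            return child
        node = node.best_child(c)
    return node

def default_policy(state, player):
    # Simulation: uniformly random playout; returns reward for `player`.
    while not state.is_terminal():
        state = state.result(random.choice(state.actions()))
    return state.reward(player)

def backup(node, delta):
    # Backpropagation: delta is the reward for the player who moved
    # INTO the leaf; delta <- 1 - delta flips perspective at each level.
    while node is not None:
        node.visits += 1
        node.wins += delta
        delta = 1.0 - delta
        node = node.parent

def uct_search(root_state, iterations=20000, c=math.sqrt(2)):
    root = Node(root_state)
    for _ in range(iterations):
        leaf = tree_policy(root, c)
        # Reward from the perspective of the player who just moved into leaf,
        # so W(v)/N(v) is the win rate of the player who chooses v.
        delta = default_policy(leaf.state, player=1 - leaf.state.to_move)
        backup(leaf, delta)
    return root.best_child(0).action  # exploitation only (c=0)

if __name__ == "__main__":
    print(uct_search(Nim(15)))  # optimal move is 3 (leave a multiple of 4)
```

On Nim with 15 stones, optimal play leaves a multiple of 4, so with enough iterations the search settles on taking 3.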
Properties
| Property | Value |
|---|---|
| Complete | No (stochastic) |
| Anytime | Yes — improves with more simulations |
| Eval function required | No |
| Domain knowledge required | Minimal (just rules) |
| Best for | Large branching factor; no good eval function |
Convergence: as number of simulations → ∞, UCT converges to the minimax optimal action.
Why MCTS Revolutionized Go
Before MCTS (pre-2006): computer Go was at amateur level. Branching factor ~250, depth ~150 → alpha-beta completely intractable. No good eval function existed.
MCTS with random rollouts quickly lifted programs to strong amateur level (and near-professional on small boards) with minimal domain knowledge. Neural network enhancements (AlphaGo, 2016) then reached superhuman strength.
AlphaGo / AlphaZero: Neural MCTS
Replace random rollouts with neural network estimates:
- Policy network π(a|s): probability distribution over moves → used as a prior to bias SELECTION toward promising moves
- Value network v(s): estimated win probability → replaces the rollout (AlphaZero) or is mixed with it (AlphaGo)
UCT with neural guidance:
PUCT(s,a) = Q(s,a) + c · π(a|s) · √(N(s)) / (1 + N(s,a))
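In code, the selection score becomes the following (a sketch; `q`, `prior`, and the visit counts are illustrative names, and the value of c is tunable):

```python
import math

def puct(q, prior, parent_visits, child_visits, c=1.5):
    """PUCT(s,a) = Q(s,a) + c * pi(a|s) * sqrt(N(s)) / (1 + N(s,a)).
    Unlike UCB1, an unvisited child already gets a finite score from
    the prior term, so the policy network steers early exploration."""
    return q + c * prior * math.sqrt(parent_visits) / (1 + child_visits)
```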
AlphaZero: trained purely from self-play with no human knowledge → surpassed all human-designed chess/Go/shogi engines.
Connection to RL
MCTS = model-based RL (uses game simulator as a model).
UCT = UCB applied to tree search = upper confidence bound RL in disguise.
AlphaZero = MCTS (planning) + policy gradient RL (learning) — the clearest example of combining planning and learning.
For DynamICCL: MCTS could plan NCCL parameter adjustments over a short horizon using a learned simulator of network behavior.