Monte Carlo Tree Search (MCTS)
Motivation
Alpha-beta requires a good evaluation function — hard to design for many games.
MCTS replaces the evaluation function with random simulation (playouts): play random moves until the game ends, use the outcome to estimate state value. No handcrafted features needed.
Core MCTS Loop
MCTS builds a search tree incrementally. Each iteration has four phases:
repeat until time limit:
1. SELECTION
2. EXPANSION
3. SIMULATION (rollout/playout)
4. BACKPROPAGATION
return action with highest value from root
1. Selection
Starting from the root, descend the tree using a tree policy until reaching a node that is not fully expanded (i.e., has untried actions) or is terminal.
2. Expansion
Add one (or more) new child nodes to the tree from the selected node’s unvisited actions.
3. Simulation (Rollout)
From the new node, play out the game using a default policy (usually random moves) until a terminal state. Record the outcome (win/loss/draw).
4. Backpropagation
Update statistics for all nodes on the path from the new node to the root:
- N(v) ← N(v) + 1 (visit count)
- W(v) ← W(v) + result (total wins)
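In code, this update is a short walk up the tree. A minimal sketch, with illustrative names (`Node`, `visits`, `wins`); the opponent-perspective flip for the result is shown later in the full BACKUP pseudocode:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    """Per-node MCTS statistics (illustrative field names)."""
    parent: Optional["Node"] = None
    visits: int = 0    # N(v): playouts that passed through v
    wins: float = 0.0  # W(v): total reward accumulated at v

def backpropagate(node, result):
    # Walk from the newly expanded node up to the root.
    while node is not None:
        node.visits += 1     # N(v) <- N(v) + 1
        node.wins += result  # W(v) <- W(v) + result
        node = node.parent
```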
UCB1 Tree Policy (UCT)
UCT (Upper Confidence bounds applied to Trees) — the most common tree policy:
UCB1(v) = W(v)/N(v) + c · √(ln(N(parent)) / N(v))
- W(v)/N(v): exploitation term — average win rate from v
- c · √(ln(N(parent))/N(v)): exploration term — prefer less-visited nodes
- c ≈ √2: exploration constant (tunable)
Select child that maximizes UCB1 at each level during selection.
This balances:
- Exploitation: choosing nodes with high empirical win rate
- Exploration: visiting under-explored nodes (high uncertainty)
UCT is derived from the multi-armed bandit UCB1 algorithm applied to game trees.
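A direct transcription of the formula above (a sketch; the parameter names are illustrative, mirroring W(v), N(v), and N(parent)):

```python
import math

def ucb1(wins, visits, parent_visits, c=math.sqrt(2)):
    """UCB1(v) = W(v)/N(v) + c * sqrt(ln(N(parent)) / N(v))."""
    if visits == 0:
        return float("inf")  # force each child to be tried at least once
    exploitation = wins / visits                                  # average win rate
    exploration = c * math.sqrt(math.log(parent_visits) / visits)  # uncertainty bonus
    return exploitation + exploration
```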
Full UCT Algorithm
function UCT-SEARCH(state₀) returns action
tree ← new tree with root node v₀ (state₀)
while within computation budget do
vₗ ← TREE-POLICY(tree, v₀) -- selection + expansion
Δ ← DEFAULT-POLICY(vₗ.state) -- simulation
BACKUP(vₗ, Δ) -- backpropagation
return BEST-CHILD(v₀, 0).action -- exploitation only (c=0)
function TREE-POLICY(tree, v) returns node
while v is not terminal do
if v is not fully expanded then
return EXPAND(v)
else
v ← BEST-CHILD(v, c) -- UCB1 selection
return v
function DEFAULT-POLICY(state) returns reward
while state is not terminal do
action ← random action from ACTIONS(state)
state ← RESULT(state, action)
return REWARD(state)
function BACKUP(v, Δ)
while v is not null do
N(v) ← N(v) + 1
W(v) ← W(v) + Δ
Δ ← 1 - Δ -- flip reward for opponent
v ← v.parent
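Putting the four functions together, here is a compact, runnable Python translation of the pseudocode. This is a sketch, not a canonical implementation: the game interface (`actions`, `result`, `is_terminal`, `reward`) and the tiny Nim game are assumptions made so the example is self-contained. Rewards are in [0, 1], matching the Δ ← 1 − Δ flip in BACKUP.

```python
import math
import random

class Nim:
    """Tiny test game so the example runs end-to-end (hypothetical interface).
    Players alternately remove 1-3 stones; taking the last stone wins."""
    def __init__(self, stones=15, to_move=0):
        self.stones, self.to_move = stones, to_move
    def actions(self):
        return list(range(1, min(3, self.stones) + 1))
    def result(self, action):
        return Nim(self.stones - action, 1 - self.to_move)
    def is_terminal(self):
        return self.stones == 0
    def reward(self, player):
        winner = 1 - self.to_move  # the player who took the last stone
        return 1.0 if player == winner else 0.0

class Node:
    def __init__(self, state, parent=None, action=None):
        self.state, self.parent, self.action = state, parent, action
        self.children = []
        self.untried = list(state.actions())  # actions not yet expanded
        self.visits, self.wins = 0, 0.0       # N(v), W(v)

    def best_child(self, c):
        # BEST-CHILD: maximize UCB1; c = 0 gives pure exploitation.
        return max(self.children,
                   key=lambda ch: ch.wins / ch.visits
                   + c * math.sqrt(math.log(self.visits) / ch.visits))

def tree_policy(node, c):
    # Selection + expansion.
    while not node.state.is_terminal():
        if node.untried:  # not fully expanded: expand one child
            action = node.untried.pop()
            child = Node(node.state.result(action), parent=node, action=action)
            node.children.append(child)
            return child
        node = node.best_child(c)
    return node

def default_policy(state, player):
    # Simulation: uniformly random playout; returns reward for `player`.
    while not state.is_terminal():
        state = state.result(random.choice(state.actions()))
    return state.reward(player)

def backup(node, delta):
    # Backpropagation: delta is the reward for the player who moved
    # INTO the leaf; delta <- 1 - delta flips perspective at each level.
    while node is not None:
        node.visits += 1
        node.wins += delta
        delta = 1.0 - delta
        node = node.parent

def uct_search(root_state, iterations=20000, c=math.sqrt(2)):
    root = Node(root_state)
    for _ in range(iterations):
        leaf = tree_policy(root, c)
        # Reward from the perspective of the player who just moved into leaf,
        # so W(v)/N(v) is the win rate of the player who chooses v.
        delta = default_policy(leaf.state, player=1 - leaf.state.to_move)
        backup(leaf, delta)
    return root.best_child(0).action  # exploitation only (c=0)

if __name__ == "__main__":
    print(uct_search(Nim(15)))  # optimal move is 3 (leave a multiple of 4)
```

On Nim with 15 stones, optimal play leaves a multiple of 4, so with enough iterations the search settles on taking 3.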
Properties
| Property | Value |
|---|---|
| Complete | No (stochastic) |
| Anytime | Yes — improves with more simulations |
| Eval function required | No |
| Domain knowledge required | Minimal (just rules) |
| Best for | Large branching factor; no good eval function |
Convergence: as number of simulations → ∞, UCT converges to the minimax optimal action.
Why MCTS Revolutionized Go
Before MCTS (pre-2006): computer Go was at amateur level. Branching factor ~250, depth ~150 → alpha-beta completely intractable. No good eval function existed.
MCTS with random rollouts quickly lifted programs to strong amateur level (and near-professional on small boards) with minimal domain knowledge. Neural network enhancements (AlphaGo, 2016) then reached superhuman strength.
AlphaGo / AlphaZero: Neural MCTS
Replace random rollouts with neural network estimates:
- Policy network π(a|s): probability distribution over moves → used as a prior to bias SELECTION toward promising moves
- Value network v(s): estimated win probability → replaces the rollout (AlphaZero) or is mixed with it (AlphaGo)
UCT with neural guidance:
PUCT(s,a) = Q(s,a) + c · π(a|s) · √(N(s)) / (1 + N(s,a))
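In code, the selection score becomes the following (a sketch; `q`, `prior`, and the visit counts are illustrative names, and the value of c is tunable):

```python
import math

def puct(q, prior, parent_visits, child_visits, c=1.5):
    """PUCT(s,a) = Q(s,a) + c * pi(a|s) * sqrt(N(s)) / (1 + N(s,a)).
    Unlike UCB1, an unvisited child already gets a finite score from
    the prior term, so the policy network steers early exploration."""
    return q + c * prior * math.sqrt(parent_visits) / (1 + child_visits)
```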
AlphaZero: trained purely from self-play with no human knowledge → surpassed all human-designed chess/Go/shogi engines.
Connection to RL
MCTS = model-based RL (uses game simulator as a model).
UCT = UCB applied to tree search = upper confidence bound RL in disguise.
AlphaZero = MCTS (planning) + policy gradient RL (learning) — the clearest example of combining planning and learning.
For DynamICCL: MCTS could plan NCCL parameter adjustments over a short horizon using a learned simulator of network behavior.