Partially Observable Games and State of the Art
Partially Observable Games
In games with imperfect information, players cannot observe the full game state.
Examples:
- Poker: hidden hole cards
- Bridge: hidden hands of partner and opponents
- Battleship: unknown ship positions
- Kriegspiel: chess variant where you see only your own pieces
Belief States in Games
Like partial observability in search (Ch.4), we maintain a belief state — a probability distribution over possible world states consistent with the agent’s observation history.
In imperfect-information games:
- Information set: the set of game states that are consistent with what the current player has observed
- The player cannot distinguish between states in the same information set
Strategy: a mapping from information sets to actions (not from states to actions, as in perfect-information games).
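To make this concrete, here is a minimal Python sketch; the three-card deck and the `observe` function are illustrative assumptions, not taken from any particular game:

```python
from collections import defaultdict

# Toy setup (assumed for illustration): a three-card deck, each player holds
# one card, and a player observes only their own card.
deck = ["J", "Q", "K"]
states = [(mine, theirs) for mine in deck for theirs in deck if mine != theirs]

def information_sets(states, observe):
    """Group world states by observation: states that look identical to the
    player fall into the same information set."""
    infosets = defaultdict(list)
    for s in states:
        infosets[observe(s)].append(s)
    return dict(infosets)

infosets = information_sets(states, observe=lambda s: s[0])
# e.g. infosets["K"] == [("K", "J"), ("K", "Q")]: holding K, we can't tell which.

# A belief state assigns a probability to each state in the current information
# set (uniform here, since all deals are equally likely a priori).
belief = {s: 1 / len(infosets["K"]) for s in infosets["K"]}

# A strategy maps information sets to actions, never individual states.
strategy = {obs: "bet" if obs == "K" else "check" for obs in infosets}
```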
Approaches to Imperfect Information
1. Determinization (Sampling)
Sample a “determinization” — a specific assignment to hidden variables — and run standard minimax/MCTS on that determinization. Repeat for many samples and average.
Problem: “strategy fusion”: because each determinization is solved as a perfect-information game, the resulting strategy can act differently in states the player cannot actually distinguish, i.e., it exploits information the player does not actually have.
Advantage: simple, widely used for card games (e.g., Bridge solvers).
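A sketch of the sampling step, assuming a card-game setting where the hidden information is just the unseen cards (the function and its arguments are hypothetical):

```python
import random

def sample_determinization(unseen_cards, hidden_hand_sizes, rng=random):
    """Draw one determinization: deal the cards we cannot see uniformly at
    random into the hidden hands, consistent with the known hand sizes."""
    cards = list(unseen_cards)
    rng.shuffle(cards)
    deal, i = [], 0
    for size in hidden_hand_sizes:
        deal.append(cards[i:i + size])
        i += size
    return deal

# e.g. two hidden opponents holding two cards each:
# sample_determinization(["A", "K", "Q", "J"], hidden_hand_sizes=[2, 2])
# Each sampled deal is a fully observable game, so standard minimax/MCTS applies.
```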
2. Perfect Information Monte Carlo (PIMC)
- Sample N random deals (assignments of hidden cards)
- For each deal, run alpha-beta to find the best move
- Choose the move that was best in the most deals
Used by: some Bridge programs, Skat solvers.
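A compact sketch of the PIMC loop; `sample_deal` and `solve_deal` are assumed hooks, where in a real Bridge program `solve_deal` would be a double-dummy alpha-beta solver:

```python
from collections import Counter

def pimc_choose_move(sample_deal, solve_deal, n_samples=100):
    """Perfect Information Monte Carlo: solve many sampled deals as
    perfect-information games and vote for the most-often-best move."""
    votes = Counter()
    for _ in range(n_samples):
        deal = sample_deal()          # one determinization of the hidden cards
        votes[solve_deal(deal)] += 1  # best move for this deal, e.g. via alpha-beta
    move, count = votes.most_common(1)[0]
    return move
```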
3. Counterfactual Regret Minimization (CFR)
The dominant approach for poker and similar games. It minimizes regret: the gap between the payoff actually obtained and the best payoff that could have been achieved in hindsight.
At each information set, track:
- Regret for not having played each action
- Strategy ∝ positive regrets (regret-matching)
- Average strategy over time → Nash equilibrium
Used by: Libratus, Pluribus (superhuman poker AI).
Key result: In two-player zero-sum games, the average strategy produced by iterated CFR converges to a Nash equilibrium, a strategy pair where neither player can benefit from unilateral deviation.
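A minimal, self-contained demo of the regret-matching core, using rock-paper-scissors as a stand-in for a single information set (this shows regret matching only, not the full counterfactual recursion over a game tree; all names are illustrative):

```python
ACTIONS = 3  # rock, paper, scissors
# Payoff to player 0; player 1 gets the negation (zero-sum).
PAYOFF = [[0, -1, 1],
          [1, 0, -1],
          [-1, 1, 0]]

def regret_matching(regrets):
    """Current strategy is proportional to positive cumulative regrets."""
    pos = [max(r, 0.0) for r in regrets]
    total = sum(pos)
    return [p / total for p in pos] if total > 0 else [1.0 / ACTIONS] * ACTIONS

def train(iterations=50_000):
    regrets = [[0.0] * ACTIONS for _ in range(2)]
    strat_sums = [[0.0] * ACTIONS for _ in range(2)]
    for _ in range(iterations):
        strats = [regret_matching(regrets[p]) for p in range(2)]
        for p in range(2):
            me, opp = strats[p], strats[1 - p]
            # Expected payoff of each pure action against the opponent's mix.
            if p == 0:
                vals = [sum(opp[b] * PAYOFF[a][b] for b in range(ACTIONS))
                        for a in range(ACTIONS)]
            else:
                vals = [-sum(opp[a] * PAYOFF[a][b] for a in range(ACTIONS))
                        for b in range(ACTIONS)]
            node_val = sum(me[a] * vals[a] for a in range(ACTIONS))
            for a in range(ACTIONS):
                regrets[p][a] += vals[a] - node_val  # regret for not playing a
                strat_sums[p][a] += me[a]
    # The *average* strategy is what converges to the Nash equilibrium.
    return [[s / sum(sums) for s in sums] for sums in strat_sums]

print(train())  # both players approach the uniform (1/3, 1/3, 1/3) equilibrium
```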
4. Full Extensive-Form Game Tree
Treat imperfect information formally using the extensive form with information sets. For small games, a Nash equilibrium can be computed directly by linear programming (via the sequence form, for two-player zero-sum games).
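For intuition, here is the linear program for the simplest case, a two-player zero-sum matrix game (the extensive-form version replaces mixed strategies with sequence-form realization plans but has the same structure); `scipy` and the matching-pennies example are assumptions for illustration:

```python
import numpy as np
from scipy.optimize import linprog

def solve_zero_sum(A):
    """Maximin strategy and game value for the row player of payoff matrix A.
    Variables: x (row mixed strategy) and v (game value). Maximize v subject
    to x guaranteeing at least v against every opposing column."""
    m, n = A.shape
    c = np.zeros(m + 1)
    c[-1] = -1.0                                 # linprog minimizes, so minimize -v
    A_ub = np.hstack([-A.T, np.ones((n, 1))])    # v - (A^T x)_j <= 0 for every column j
    b_ub = np.zeros(n)
    A_eq = np.zeros((1, m + 1))
    A_eq[0, :m] = 1.0                            # probabilities sum to 1
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, 1)] * m + [(None, None)])
    return res.x[:m], res.x[-1]

# Matching pennies: equilibrium strategy (0.5, 0.5), game value 0.
x, v = solve_zero_sum(np.array([[1.0, -1.0], [-1.0, 1.0]]))
```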
Poker: The AI Benchmark
Poker (Texas Hold’em) is the main benchmark for imperfect-information game AI:
| System | Year | Achievement |
|---|---|---|
| Claudico | 2015 | Lost to professional players at heads-up no-limit Hold’em |
| Libratus | 2017 | Beat top professionals at heads-up (2-player) no-limit Hold’em |
| Pluribus | 2019 | Beat professionals at 6-player no-limit Hold’em |
Pluribus key insights:
- Uses CFR for self-play training
- Real-time search during play (not just a pre-computed strategy)
- Handles 6 players (not just 2) → a non-zero-sum, multi-player game
State-of-the-Art Game Programs
| Game | System | Key Technology |
|---|---|---|
| Chess | Stockfish / Leela Chess Zero | Classical alpha-beta (+ NNUE eval) / MCTS + neural eval |
| Go | AlphaGo/AlphaZero | MCTS + policy+value networks (self-play) |
| Poker | Pluribus | CFR + real-time search |
| Atari | DQN | Deep Q-learning (model-free RL) |
| StarCraft II | AlphaStar | Multi-agent RL + imitation learning |
| Dota 2 | OpenAI Five | Proximal Policy Optimization (PPO) |
Lessons for AI
- Self-play + RL = incredibly powerful for well-defined games (AlphaZero)
- MCTS + neural nets = handles large state spaces without eval function engineering
- CFR = principled solution to imperfect information
- Transfer: game-playing techniques (MCTS, TD learning, self-play) transfer to real-world sequential decision problems
Connection to RL and DynamICCL
- Game AI = sequential decision making under adversarial / stochastic conditions
- AlphaZero = clearest demonstration that RL (policy and value networks trained by self-play, with MCTS as the policy-improvement operator) can exceed human performance
- MCTS lookahead = decision-time, model-based planning, closely related to the planning updates in the Dyna architecture (Ch.8 of Sutton & Barto)
- DynamICCL: partial observability of network state maps directly to imperfect-information game setting — the RL agent knows its local state but not the global distributed training context