Partially Observable Games and State of the Art
Partially Observable Games
In games with imperfect information, players cannot observe the full game state.
Examples:
- Poker: hidden hole cards
- Bridge: hidden hands of partner and opponents
- Battleship: unknown ship positions
- Kriegspiel: chess variant where you see only your own pieces
Belief States in Games
Like partial observability in search (Ch.4), we maintain a belief state — a probability distribution over possible world states consistent with the agent’s observation history.
In imperfect-information games:
- Information set: the set of game states that are consistent with what the current player has observed
- The player cannot distinguish between states in the same information set
Strategy: a mapping from information sets to actions (not from states to actions, as in perfect-information games).
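To make this concrete, here is a minimal Python sketch; the three-card deck and the `observe` function are illustrative assumptions, not taken from any particular game:

```python
from collections import defaultdict

# Toy setup (assumed for illustration): a three-card deck, each player holds
# one card, and a player observes only their own card.
deck = ["J", "Q", "K"]
states = [(mine, theirs) for mine in deck for theirs in deck if mine != theirs]

def information_sets(states, observe):
    """Group world states by observation: states that look identical to the
    player fall into the same information set."""
    infosets = defaultdict(list)
    for s in states:
        infosets[observe(s)].append(s)
    return dict(infosets)

infosets = information_sets(states, observe=lambda s: s[0])
# e.g. infosets["K"] == [("K", "J"), ("K", "Q")]: holding K, we can't tell which.

# A belief state assigns a probability to each state in the current information
# set (uniform here, since all deals are equally likely a priori).
belief = {s: 1 / len(infosets["K"]) for s in infosets["K"]}

# A strategy maps information sets to actions, never individual states.
strategy = {obs: "bet" if obs == "K" else "check" for obs in infosets}
```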
Approaches to Imperfect Information
1. Determinization (Sampling)
Sample a “determinization” — a specific assignment to hidden variables — and run standard minimax/MCTS on that determinization. Repeat for many samples and average.
Problem: “strategy fusion”: because each determinization is solved as a perfect-information game, the resulting strategy can act differently in states the player cannot actually distinguish, i.e., it exploits information the player does not actually have.
Advantage: simple, widely used for card games (e.g., Bridge solvers).
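A sketch of the sampling step, assuming a card-game setting where the hidden information is just the unseen cards (the function and its arguments are hypothetical):

```python
import random

def sample_determinization(unseen_cards, hidden_hand_sizes, rng=random):
    """Draw one determinization: deal the cards we cannot see uniformly at
    random into the hidden hands, consistent with the known hand sizes."""
    cards = list(unseen_cards)
    rng.shuffle(cards)
    deal, i = [], 0
    for size in hidden_hand_sizes:
        deal.append(cards[i:i + size])
        i += size
    return deal

# e.g. two hidden opponents holding two cards each:
# sample_determinization(["A", "K", "Q", "J"], hidden_hand_sizes=[2, 2])
# Each sampled deal is a fully observable game, so standard minimax/MCTS applies.
```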
2. Perfect Information Monte Carlo (PIMC)
- Sample N random deals (assignments of hidden cards)
- For each deal, run alpha-beta to find the best move
- Choose the move that was best in the most deals
Used by: some Bridge programs, Skat solvers.
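A compact sketch of the PIMC loop; `sample_deal` and `solve_deal` are assumed hooks, where in a real Bridge program `solve_deal` would be a double-dummy alpha-beta solver:

```python
from collections import Counter

def pimc_choose_move(sample_deal, solve_deal, n_samples=100):
    """Perfect Information Monte Carlo: solve many sampled deals as
    perfect-information games and vote for the most-often-best move."""
    votes = Counter()
    for _ in range(n_samples):
        deal = sample_deal()          # one determinization of the hidden cards
        votes[solve_deal(deal)] += 1  # best move for this deal, e.g. via alpha-beta
    move, count = votes.most_common(1)[0]
    return move
```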
3. Counterfactual Regret Minimization (CFR)
The dominant approach for poker and similar games. It minimizes regret: the gap between the payoff actually obtained and the best payoff that could have been achieved in hindsight.
At each information set, track:
- Regret for not having played each action
- Strategy ∝ positive regrets (regret-matching)
- Average strategy over time → Nash equilibrium
Used by: Libratus, Pluribus (superhuman poker AI).
Key result: In two-player zero-sum games, the average strategy produced by iterated CFR converges to a Nash equilibrium, a strategy pair where neither player can benefit from unilateral deviation.
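A minimal, self-contained demo of the regret-matching core, using rock-paper-scissors as a stand-in for a single information set (this shows regret matching only, not the full counterfactual recursion over a game tree; all names are illustrative):

```python
ACTIONS = 3  # rock, paper, scissors
# Payoff to player 0; player 1 gets the negation (zero-sum).
PAYOFF = [[0, -1, 1],
          [1, 0, -1],
          [-1, 1, 0]]

def regret_matching(regrets):
    """Current strategy is proportional to positive cumulative regrets."""
    pos = [max(r, 0.0) for r in regrets]
    total = sum(pos)
    return [p / total for p in pos] if total > 0 else [1.0 / ACTIONS] * ACTIONS

def train(iterations=50_000):
    regrets = [[0.0] * ACTIONS for _ in range(2)]
    strat_sums = [[0.0] * ACTIONS for _ in range(2)]
    for _ in range(iterations):
        strats = [regret_matching(regrets[p]) for p in range(2)]
        for p in range(2):
            me, opp = strats[p], strats[1 - p]
            # Expected payoff of each pure action against the opponent's mix.
            if p == 0:
                vals = [sum(opp[b] * PAYOFF[a][b] for b in range(ACTIONS))
                        for a in range(ACTIONS)]
            else:
                vals = [-sum(opp[a] * PAYOFF[a][b] for a in range(ACTIONS))
                        for b in range(ACTIONS)]
            node_val = sum(me[a] * vals[a] for a in range(ACTIONS))
            for a in range(ACTIONS):
                regrets[p][a] += vals[a] - node_val  # regret for not playing a
                strat_sums[p][a] += me[a]
    # The *average* strategy is what converges to the Nash equilibrium.
    return [[s / sum(sums) for s in sums] for sums in strat_sums]

print(train())  # both players approach the uniform (1/3, 1/3, 1/3) equilibrium
```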
4. Full Extensive-Form Game Tree
Treat imperfect information formally using the extensive form with information sets. For small games, a Nash equilibrium can be computed directly by linear programming (via the sequence form, for two-player zero-sum games).
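For intuition, here is the linear program for the simplest case, a two-player zero-sum matrix game (the extensive-form version replaces mixed strategies with sequence-form realization plans but has the same structure); `scipy` and the matching-pennies example are assumptions for illustration:

```python
import numpy as np
from scipy.optimize import linprog

def solve_zero_sum(A):
    """Maximin strategy and game value for the row player of payoff matrix A.
    Variables: x (row mixed strategy) and v (game value). Maximize v subject
    to x guaranteeing at least v against every opposing column."""
    m, n = A.shape
    c = np.zeros(m + 1)
    c[-1] = -1.0                                 # linprog minimizes, so minimize -v
    A_ub = np.hstack([-A.T, np.ones((n, 1))])    # v - (A^T x)_j <= 0 for every column j
    b_ub = np.zeros(n)
    A_eq = np.zeros((1, m + 1))
    A_eq[0, :m] = 1.0                            # probabilities sum to 1
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, 1)] * m + [(None, None)])
    return res.x[:m], res.x[-1]

# Matching pennies: equilibrium strategy (0.5, 0.5), game value 0.
x, v = solve_zero_sum(np.array([[1.0, -1.0], [-1.0, 1.0]]))
```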
Poker: The AI Benchmark
Poker (Texas Hold’em) is the main benchmark for imperfect-information game AI:
| System | Year | Achievement |
|---|---|---|
| Claudico | 2015 | Lost to professional players at heads-up no-limit Hold’em |
| Libratus | 2017 | Beat top professionals at heads-up (2-player) no-limit Hold’em |
| Pluribus | 2019 | Beat professionals at 6-player no-limit Hold’em |
Pluribus key insights:
- Uses CFR for self-play training
- Real-time search during play (not just a pre-computed strategy)
- Handles 6 players (not just 2) → a non-zero-sum, multi-player game
State-of-the-Art Game Programs
| Game | System | Key Technology |
|---|---|---|
| Chess | Stockfish / Leela Chess Zero | Classical alpha-beta (+ NNUE eval) / MCTS + neural eval |
| Go | AlphaGo/AlphaZero | MCTS + policy+value networks (self-play) |
| Poker | Pluribus | CFR + real-time search |
| Atari | DQN | Deep Q-learning (model-free RL) |
| StarCraft II | AlphaStar | Multi-agent RL + imitation learning |
| Dota 2 | OpenAI Five | Proximal Policy Optimization (PPO) |
Lessons for AI
- Self-play + RL = incredibly powerful for well-defined games (AlphaZero)
- MCTS + neural nets = handles large state spaces without eval function engineering
- CFR = principled solution to imperfect information
- Transfer: game-playing techniques (MCTS, TD learning, self-play) transfer to real-world sequential decision problems
Connection to RL and DynamICCL
- Game AI = sequential decision making under adversarial / stochastic conditions
- AlphaZero = clearest demonstration that RL (policy and value networks trained by self-play, with MCTS as the policy-improvement operator) can exceed human performance
- MCTS lookahead = decision-time, model-based planning, closely related to the planning updates in the Dyna architecture (Ch.8 of Sutton & Barto)
- DynamICCL: partial observability of network state maps directly to imperfect-information game setting — the RL agent knows its local state but not the global distributed training context