Multiagent Decision Making
Chapter 18 — Multiagent Decision Making
Book: Artificial Intelligence: A Modern Approach (Russell & Norvig, 4th ed.)
Pages: 610–651
Why Multiple Agents?
Single-agent AI assumes one decision-maker controlling the world. In reality:
- Multiple autonomous agents interact (robots, economic agents, distributed AI)
- Some agents have aligned goals (cooperation)
- Some have opposing goals (competition)
- Most situations are mixed (partial conflict)
Game theory provides the mathematical framework.
Normal-Form Games
A normal-form game is defined by:
- N players
- Action sets A₁, …, Aₙ
- Utility functions U₁, …, Uₙ mapping joint actions → real numbers
Example: Prisoner’s Dilemma
| | Cooperate | Defect |
|---|---|---|
| Cooperate | (3, 3) | (0, 5) |
| Defect | (5, 0) | (1, 1) |
Row = Player 1, Column = Player 2, (Player 1 utility, Player 2 utility).
Each player wants to defect regardless of the other’s action → mutual defection (1,1) is the outcome, even though (3,3) is better for both.
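A minimal sketch of this best-response argument (the payoff table and helper names are illustrative, not from the book):

```python
# Prisoner's Dilemma payoffs: (row player utility, column player utility).
PAYOFFS = {
    ("C", "C"): (3, 3), ("C", "D"): (0, 5),
    ("D", "C"): (5, 0), ("D", "D"): (1, 1),
}
ACTIONS = ["C", "D"]

def best_response(player, opponent_action):
    """Return the action maximizing this player's utility, holding the opponent fixed."""
    def utility(action):
        joint = (action, opponent_action) if player == 0 else (opponent_action, action)
        return PAYOFFS[joint][player]
    return max(ACTIONS, key=utility)

# Whatever the opponent does, "D" is the best response for both players.
for opp in ACTIONS:
    print(f"vs {opp}: row best response = {best_response(0, opp)}, "
          f"column best response = {best_response(1, opp)}")
```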
Nash Equilibrium
A Nash equilibrium (NE) is a joint strategy (σ₁,…,σₙ) where no player can improve their utility by unilaterally changing their strategy:
∀i, ∀σᵢ': Uᵢ(σ₁,...,σᵢ,...,σₙ) ≥ Uᵢ(σ₁,...,σᵢ',...,σₙ)
Prisoner’s Dilemma: (Defect, Defect) is the unique NE — despite being suboptimal.
Nash existence theorem: every finite game has at least one Nash equilibrium (possibly in mixed strategies).
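A brute-force sketch of the NE condition over pure strategy profiles (two-player case, using the Prisoner’s Dilemma payoffs; note this only finds pure equilibria, while the existence theorem may require mixed strategies):

```python
from itertools import product

def pure_nash_equilibria(actions, payoffs):
    """Enumerate joint actions where neither player gains by deviating unilaterally."""
    equilibria = []
    for a1, a2 in product(actions, actions):
        u1, u2 = payoffs[(a1, a2)]
        row_ok = all(payoffs[(d, a2)][0] <= u1 for d in actions)  # no profitable row deviation
        col_ok = all(payoffs[(a1, d)][1] <= u2 for d in actions)  # no profitable column deviation
        if row_ok and col_ok:
            equilibria.append((a1, a2))
    return equilibria

print(pure_nash_equilibria(["C", "D"], {
    ("C", "C"): (3, 3), ("C", "D"): (0, 5),
    ("D", "C"): (5, 0), ("D", "D"): (1, 1),
}))  # [('D', 'D')]
```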
Mixed Strategy NE
Players randomize over their actions. Example: Rock-Paper-Scissors:
- Unique NE: play each action with probability 1/3
- Any pure strategy is exploitable
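A quick numeric check of both claims, using the standard +1/0/−1 payoff matrix for the row player (numpy assumed):

```python
import numpy as np

# Row player's payoff; rows/columns are Rock, Paper, Scissors.
A = np.array([[ 0, -1,  1],
              [ 1,  0, -1],
              [-1,  1,  0]])
uniform = np.array([1/3, 1/3, 1/3])

# Against the uniform mix, every pure strategy earns the same expected payoff (0),
# so no unilateral deviation helps: the uniform mix is an equilibrium strategy.
print(A @ uniform)  # [0. 0. 0.]

# Any pure strategy is exploitable: a best-responding opponent wins outright.
for i, name in enumerate(["Rock", "Paper", "Scissors"]):
    print(name, "earns", A[i].min(), "against a best-responding opponent")
```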
Dominant Strategies
Dominant strategy: action a strictly dominates a’ if it gives strictly higher utility regardless of the others’ actions; a dominant strategy dominates every other action.
If every player has a dominant strategy, the outcome is determined without any assumptions about the other players’ rationality or beliefs.
Prisoner’s Dilemma: “Defect” dominates “Cooperate” for both players → unique dominant strategy NE.
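A direct check of the dominance definition on the Prisoner’s Dilemma (two-player helper, illustrative only):

```python
def strictly_dominates(player, a, a_prime, actions, payoffs):
    """True if action a yields strictly higher utility than a_prime for `player`,
    whatever the single opponent plays."""
    def u(mine, theirs):
        joint = (mine, theirs) if player == 0 else (theirs, mine)
        return payoffs[joint][player]
    return all(u(a, other) > u(a_prime, other) for other in actions)

PD = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
      ("D", "C"): (5, 0), ("D", "D"): (1, 1)}
print(strictly_dominates(0, "D", "C", ["C", "D"], PD))  # True for the row player
print(strictly_dominates(1, "D", "C", ["C", "D"], PD))  # True for the column player
```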
Pareto Optimality
A joint outcome is Pareto optimal if no agent can be made better off without making another worse off.
In Prisoner’s Dilemma: (C,C) is Pareto optimal; (D,D) is NOT (both could switch to (C,C) to improve).
Social dilemma: individually rational behavior (NE) leads to Pareto-suboptimal outcome.
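A sketch that enumerates the Pareto-optimal outcomes of the Prisoner’s Dilemma by direct comparison:

```python
PAYOFFS = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
           ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def pareto_optimal(payoffs):
    """Keep outcomes for which no other outcome makes someone better off
    without making anyone worse off."""
    def dominated_by(u, v):
        return all(vi >= ui for ui, vi in zip(u, v)) and any(vi > ui for ui, vi in zip(u, v))
    return [joint for joint, u in payoffs.items()
            if not any(dominated_by(u, v) for v in payoffs.values())]

print(pareto_optimal(PAYOFFS))
# [('C', 'C'), ('C', 'D'), ('D', 'C')] -- (D, D) is the only outcome that is not Pareto optimal
```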
Mechanism Design (Reverse Game Theory)
Mechanism design: design the game rules so that selfish agents’ equilibrium behavior achieves a desired social outcome.
Goal: create incentives so that truth-telling or cooperation is the dominant strategy.
Example: VCG auction (Vickrey-Clarke-Groves)
- Each agent reports their value for the goods
- The allocation maximizes social welfare
- Each agent pays the externality they impose on others
- Dominant strategy: report true values (truthful mechanism)
Applications: auction design, voting rules, cost sharing.
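A minimal sketch of the single-item special case, the Vickrey (second-price) auction, with the payment framed as the VCG externality (bidder names and values are illustrative):

```python
def vickrey_auction(bids):
    """Single-item VCG: the highest bidder wins and pays the externality they impose,
    i.e. the welfare the others would have obtained without the winner present."""
    winner = max(bids, key=bids.get)
    others = {name: b for name, b in bids.items() if name != winner}
    welfare_without_winner = max(others.values())  # what the others would get if the winner left
    welfare_of_others = 0                          # what the others actually get (they lose the item)
    payment = welfare_without_winner - welfare_of_others
    return winner, payment

print(vickrey_auction({"alice": 10, "bob": 7, "carol": 4}))  # ('alice', 7)
# Truthfulness is dominant: the winner's payment does not depend on their own bid.
```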
Repeated Games
In repeated games, players interact over multiple rounds. Future punishments enable cooperation even in social dilemmas.
Folk theorem: in infinitely repeated games with sufficiently patient players (discount factor γ close to 1), any feasible outcome that gives every player more than their minimax value can be sustained as a Nash equilibrium; in particular, mutual cooperation in the Prisoner’s Dilemma.
Tit-for-tat: cooperate initially; mimic opponent’s last action. Very effective in repeated Prisoner’s Dilemma.
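A quick simulation sketch of tit-for-tat in the repeated Prisoner’s Dilemma (the opponent strategies and round count are illustrative):

```python
PAYOFFS = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
           ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def tit_for_tat(history):
    return "C" if not history else history[-1][1]  # copy the opponent's last move

def always_defect(history):
    return "D"

def always_cooperate(history):
    return "C"

def play(strategy1, strategy2, rounds=100):
    history1, history2, score = [], [], [0, 0]
    for _ in range(rounds):
        a1, a2 = strategy1(history1), strategy2(history2)
        u1, u2 = PAYOFFS[(a1, a2)]
        score[0] += u1; score[1] += u2
        history1.append((a1, a2)); history2.append((a2, a1))  # each history is (own, opponent)
    return score

print(play(tit_for_tat, always_cooperate))  # sustained cooperation: [300, 300]
print(play(tit_for_tat, always_defect))     # exploited once, then punishes: [99, 104]
```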
Multiagent RL (MARL)
Reinforcement learning with multiple interacting agents:
Cooperative MARL
All agents share a reward → maximize the joint reward.
- Centralized training, decentralized execution (CTDE): train with full information, deploy with only local information
- QMIX, MAPPO: modern cooperative MARL algorithms
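A toy value-decomposition sketch in the spirit of VDN (a simpler precursor of QMIX, not the book’s example): two agents with local Q-tables play a one-shot coordination game with a shared reward, are trained on the joint reward (centralized), and act greedily on their own tables (decentralized). The game, constants, and seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N_ACTIONS = 2

def shared_reward(a1, a2):
    # Shared reward: 1 only if the agents coordinate on the same action.
    return 1.0 if a1 == a2 else 0.0

# One local Q-table per agent; the joint value is their sum (VDN-style decomposition).
q1 = np.zeros(N_ACTIONS)
q2 = np.zeros(N_ACTIONS)
alpha, epsilon = 0.1, 0.2

for _ in range(2000):
    # Decentralized execution: each agent acts epsilon-greedily on its own table only.
    a1 = rng.integers(N_ACTIONS) if rng.random() < epsilon else int(np.argmax(q1))
    a2 = rng.integers(N_ACTIONS) if rng.random() < epsilon else int(np.argmax(q2))
    r = shared_reward(a1, a2)
    # Centralized training: the TD error of Q_tot = q1[a1] + q2[a2] against the joint
    # reward is split across both local tables.
    td_error = r - (q1[a1] + q2[a2])
    q1[a1] += alpha * td_error / 2
    q2[a2] += alpha * td_error / 2

print(int(np.argmax(q1)), int(np.argmax(q2)))  # the greedy policies should coordinate on one action
```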
Competitive MARL
Zero-sum: one agent’s gain is the other’s loss.
- Minimax Q-learning: the state value is V(s) = max_π min_b Σₐ π(a)·Q(s,a,b), where π is the agent’s mixed strategy and b the opponent’s action; Q is updated toward r + γ·V(s′)
- Self-play: both agents train against each other (AlphaZero)
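A sketch of the inner max/min computation for a single state: solving a zero-sum matrix game by linear programming (scipy assumed; in minimax Q-learning this LP would be solved with Q(s, ·, ·) as the payoff matrix at each value update):

```python
import numpy as np
from scipy.optimize import linprog

def solve_matrix_game(Q):
    """Return (game value, optimal mixed strategy) for the row player of zero-sum payoffs Q."""
    n_rows, n_cols = Q.shape
    # Variables: [pi_1, ..., pi_n, v]; maximizing v means minimizing -v.
    c = np.zeros(n_rows + 1)
    c[-1] = -1.0
    # For every opponent action b:  sum_a pi(a) * Q[a, b] >= v   (rewritten in <= 0 form).
    A_ub = np.hstack([-Q.T, np.ones((n_cols, 1))])
    b_ub = np.zeros(n_cols)
    # Probabilities sum to 1; v itself is unbounded.
    A_eq = np.concatenate([np.ones(n_rows), [0.0]]).reshape(1, -1)
    bounds = [(0, None)] * n_rows + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0], bounds=bounds)
    return res.x[-1], res.x[:-1]

# Rock-Paper-Scissors: the game value is 0, achieved by the uniform mixed strategy.
value, pi = solve_matrix_game(np.array([[0., -1., 1.], [1., 0., -1.], [-1., 1., 0.]]))
print(round(value, 3), pi.round(3))  # ~0.0 [0.333 0.333 0.333]
```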
Mixed (General-Sum)
Most practical settings. No single optimal solution — Nash equilibrium as the target.
- Nash Q-learning: compute a Nash equilibrium at each step (computationally expensive)
- Mean-field RL: approximate the effect of many agents by their mean-field (average) interaction
Connection to DynamICCL
- Multiple GPUs/nodes in distributed training = multiagent system
- Each node can independently observe local metrics and make NCCL parameter decisions
- Cooperative MARL: nodes share a reward (total throughput) → CTDE suitable
- Communication topology (ring, tree, mesh) = the “game” structure determining agent interactions
- Mean-field approximation: each node interacts with the average behavior of all others — reduces MAS complexity