2. Rational Agent, Standard Model, and Value Alignment
Source: AIMA 4th Ed, §1.1.3–1.1.5
The Rational Agent Framework
A rational agent is one that does the right thing — specifically, the action that maximizes its performance measure, given:

1. The performance measure (what counts as success)
2. The agent’s prior knowledge about the environment
3. The actions available to the agent
4. The agent’s percept sequence so far
Rationality ≠ omniscience. A rational agent does the best it can with the information available. It is not expected to be perfect — only to maximize expected performance.
Rational ≠ Successful
- An agent that crosses a street rationally (looks both ways, proceeds when clear) can still be hit by a car that ran a red light.
- Rationality is about the quality of the decision process, not the outcome.
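A minimal sketch of this decision rule in Python, using the street-crossing example above. Everything here (the function names, the outcome model, the probabilities, and the utilities) is invented for illustration; it is not code from AIMA.

```python
# Rational choice as expected-performance maximization (toy sketch).
def rational_action(percept, actions, model, performance):
    """Pick the action with the highest *expected* performance under the
    agent's (possibly imperfect) model of how outcomes follow actions."""
    def expected(action):
        return sum(prob * performance(outcome)
                   for outcome, prob in model(percept, action).items())
    return max(actions, key=expected)

# Hypothetical street-crossing model: the agent looked both ways and
# perceives "clear", but a red-light runner remains a rare possibility.
def model(percept, action):
    if percept == "clear" and action == "cross":
        return {"arrived": 0.999, "hit": 0.001}
    return {"still_waiting": 1.0}

performance = {"arrived": 100, "hit": -10_000, "still_waiting": -1}.get

print(rational_action("clear", ["cross", "wait"], model, performance))
# -> cross  (E[cross] = 89.9 > E[wait] = -1): the rational choice can
#    still end badly; rationality is judged before the outcome.
```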
The Standard Model
The standard model of AI assumes:

- We can fully specify what we want the machine to do via a well-defined objective (the performance measure / utility function).
- The agent then maximizes that objective.
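In symbols (a common formalization consistent with the text, not a quotation from AIMA): the standard model hands the agent a fixed, exogenously specified utility $U$ and asks only for the maximization.

$$
a^* = \arg\max_{a \in A} \; \mathbb{E}\big[\, U(\text{outcome}) \mid \text{percept sequence},\, a \,\big]
$$

The value alignment problem below is precisely the worry that this fixed $U$ may not capture what we actually want.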
This has been the dominant paradigm since the 1950s and underlies nearly all of AI/ML research.
Why it works (usually)
- For narrow, well-defined tasks (chess, Go, image recognition, protein folding), the standard model works extremely well.
- We can define the performance measure precisely: win the game, minimize error, maximize score.
The Value Alignment Problem
As AI systems become more powerful and are deployed in the real world, the standard model reveals a critical flaw:
We can’t always fully specify what we want.
If a machine optimizes a misspecified objective, it may achieve that objective while violating our actual values. Classic failure modes:
- King Midas problem: A machine given the objective “maximize gold in the room” might convert everything, including the humans, to gold. It achieves the literal objective, not the intended one.
- Goodhart’s Law: When a measure becomes a target, it ceases to be a good measure.
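A toy demonstration of both failure modes, with invented scores: the agent ranks actions by a proxy measure (“gold produced”) that diverges from the true, unstated objective.

```python
# Objective misspecification in miniature (all numbers invented):
# each action scores on the stated proxy objective and on the true,
# unstated human objective.
actions = {
    "mine_more_ore":      {"proxy": 10,  "true": 10},
    "transmute_the_room": {"proxy": 100, "true": -1000},  # the Midas move
}

best_by_proxy = max(actions, key=lambda a: actions[a]["proxy"])
best_by_true  = max(actions, key=lambda a: actions[a]["true"])

print(best_by_proxy)  # transmute_the_room: literal objective achieved
print(best_by_true)   # mine_more_ore: what was actually meant
# Once "gold in the room" becomes the target, it stops being a good
# measure of what we value.
```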
The Gorilla Problem
- Roughly ten million years ago, one primate lineage split into the ancestors of gorillas and the ancestors of humans.
- Gorillas today have no control over their future — their fate is entirely in human hands.
- If we create superintelligent AI (ASI), humans may be in the same position relative to that AI.
The Proposed Solution
Instead of putting a fixed objective into a machine, design beneficial machines that:

1. Know they don’t know what we want with certainty.
2. Actively try to learn human preferences.
3. Have an incentive to let humans correct them (being switched off is fine, because they are uncertain; see the sketch below).
This is the basis for:

- Assistance games (Ch. 18): formalize the human-machine interaction as a cooperative game.
- Inverse reinforcement learning (Ch. 22): infer human preferences from observed behavior.
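A toy expected-value calculation (all values invented) showing why uncertainty about the objective creates the off-switch incentive from item 3 above: the machine compares acting immediately, switching itself off, and deferring to a human who will veto the action if it is actually harmful.

```python
# Off-switch intuition in miniature (hypothetical numbers). The machine
# is unsure whether the human's utility u for its proposed action is
# positive or negative.
belief = {+10: 0.6, -40: 0.4}   # machine's belief over the human's u

act_now    = sum(u * p for u, p in belief.items())          # ignore the human
switch_off = 0.0                                            # never act
defer      = sum(max(u, 0) * p for u, p in belief.items())  # human vetoes u < 0

print(act_now, switch_off, defer)   # -10.0 0.0 6.0
# Deferring wins: precisely because the machine is uncertain about the
# objective, letting the human correct it (or switch it off) maximizes
# its own expected value. A machine certain of its objective has no
# such incentive.
```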
Summary Table
| Concept | Definition |
|---|---|
| Standard model | AI = agent maximizing a specified performance measure |
| Value alignment problem | Hard to specify what we actually want; misspecification leads to bad outcomes |
| King Midas problem | Agent achieves literal objective while violating intent |
| Gorilla problem | If ASI is created, humans may lose control of their future |
| Beneficial machine | Machine uncertain about objectives, learns from and defers to humans |
| Inverse RL | Learning a reward function from observed human behavior |