2. Rational Agent, Standard Model, and Value Alignment
Source: AIMA 4th Ed, §1.1.3–1.1.5
The Rational Agent Framework
A rational agent is one that does the right thing — specifically, the action that maximizes its performance measure, given:

1. The performance measure (what counts as success)
2. The agent’s prior knowledge about the environment
3. The actions available to the agent
4. The agent’s percept sequence so far
Rationality ≠ omniscience. A rational agent does the best it can with the information available. It is not expected to be perfect — only to maximize expected performance.
Rational ≠ Successful
- An agent that crosses a street rationally (looks both ways, proceeds when clear) can still be hit by a car that ran a red light.
- Rationality is about the quality of the decision process, not the outcome.
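A minimal sketch of this decision rule in Python, using the street-crossing example above. Everything here (the function names, the outcome model, the probabilities, and the utilities) is invented for illustration; it is not code from AIMA.

```python
# Rational choice as expected-performance maximization (toy sketch).
def rational_action(percept, actions, model, performance):
    """Pick the action with the highest *expected* performance under the
    agent's (possibly imperfect) model of how outcomes follow actions."""
    def expected(action):
        return sum(prob * performance(outcome)
                   for outcome, prob in model(percept, action).items())
    return max(actions, key=expected)

# Hypothetical street-crossing model: the agent looked both ways and
# perceives "clear", but a red-light runner remains a rare possibility.
def model(percept, action):
    if percept == "clear" and action == "cross":
        return {"arrived": 0.999, "hit": 0.001}
    return {"still_waiting": 1.0}

performance = {"arrived": 100, "hit": -10_000, "still_waiting": -1}.get

print(rational_action("clear", ["cross", "wait"], model, performance))
# -> cross  (E[cross] = 89.9 > E[wait] = -1): the rational choice can
#    still end badly; rationality is judged before the outcome.
```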
The Standard Model
The standard model of AI assumes:

- We can fully specify what we want the machine to do via a well-defined objective (the performance measure / utility function).
- The agent then maximizes that objective.
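In symbols (a common formalization consistent with the text, not a quotation from AIMA): the standard model hands the agent a fixed, exogenously specified utility $U$ and asks only for the maximization.

$$
a^* = \arg\max_{a \in A} \; \mathbb{E}\big[\, U(\text{outcome}) \mid \text{percept sequence},\, a \,\big]
$$

The value alignment problem below is precisely the worry that this fixed $U$ may not capture what we actually want.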
This has been the dominant paradigm since the 1950s and underlies nearly all of AI/ML research.
Why it works (usually)
- For narrow, well-defined tasks (chess, Go, image recognition, protein folding), the standard model works extremely well.
- We can define the performance measure precisely: win the game, minimize error, maximize score.
The Value Alignment Problem
As AI systems become more powerful and are deployed in the real world, the standard model reveals a critical flaw:
We can’t always fully specify what we want.
If a machine optimizes a misspecified objective, it may achieve that objective while violating our actual values. Classic failure modes:
- King Midas problem: A machine given the objective “maximize gold in the room” might convert everything, including the humans, to gold. It achieves the literal objective, not the intended one.
- Goodhart’s Law: When a measure becomes a target, it ceases to be a good measure.
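A toy demonstration of both failure modes, with invented scores: the agent ranks actions by a proxy measure (“gold produced”) that diverges from the true, unstated objective.

```python
# Objective misspecification in miniature (all numbers invented):
# each action scores on the stated proxy objective and on the true,
# unstated human objective.
actions = {
    "mine_more_ore":      {"proxy": 10,  "true": 10},
    "transmute_the_room": {"proxy": 100, "true": -1000},  # the Midas move
}

best_by_proxy = max(actions, key=lambda a: actions[a]["proxy"])
best_by_true  = max(actions, key=lambda a: actions[a]["true"])

print(best_by_proxy)  # transmute_the_room: literal objective achieved
print(best_by_true)   # mine_more_ore: what was actually meant
# Once "gold in the room" becomes the target, it stops being a good
# measure of what we value.
```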
The Gorilla Problem
- Roughly ten million years ago, one primate lineage split into the ancestors of gorillas and the ancestors of humans.
- Gorillas today have no control over their future — their fate is entirely in human hands.
- If we create superintelligent AI (ASI), humans may be in the same position relative to that AI.
The Proposed Solution
Instead of putting a fixed objective into a machine, design beneficial machines that:

1. Know they don’t know what we want with certainty.
2. Actively try to learn human preferences.
3. Have an incentive to let humans correct them (being switched off is fine, because they are uncertain; see the sketch below).
This is the basis for:

- Assistance games (Ch. 18): formalize the human-machine interaction as a cooperative game.
- Inverse reinforcement learning (Ch. 22): infer human preferences from observed behavior.
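A toy expected-value calculation (all values invented) showing why uncertainty about the objective creates the off-switch incentive from item 3 above: the machine compares acting immediately, switching itself off, and deferring to a human who will veto the action if it is actually harmful.

```python
# Off-switch intuition in miniature (hypothetical numbers). The machine
# is unsure whether the human's utility u for its proposed action is
# positive or negative.
belief = {+10: 0.6, -40: 0.4}   # machine's belief over the human's u

act_now    = sum(u * p for u, p in belief.items())          # ignore the human
switch_off = 0.0                                            # never act
defer      = sum(max(u, 0) * p for u, p in belief.items())  # human vetoes u < 0

print(act_now, switch_off, defer)   # -10.0 0.0 6.0
# Deferring wins: precisely because the machine is uncertain about the
# objective, letting the human correct it (or switch it off) maximizes
# its own expected value. A machine certain of its objective has no
# such incentive.
```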
Summary Table
| Concept | Definition |
|---|---|
| Standard model | AI = agent maximizing a specified performance measure |
| Value alignment problem | Hard to specify what we actually want; misspecification leads to bad outcomes |
| King Midas problem | Agent achieves literal objective while violating intent |
| Gorilla problem | If ASI is created, humans may lose control of their future |
| Beneficial machine | Machine uncertain about objectives, learns from and defers to humans |
| Inverse RL | Learning a reward function from observed human behavior |