2. Rational Agent, Standard Model, and Value Alignment

Source: AIMA 4th Ed, §1.1.3–1.1.5


The Rational Agent Framework

A rational agent is one that does the right thing. Specifically, for each possible percept sequence, it selects the action that is expected to maximize its performance measure, given:

1. The performance measure that defines the criterion of success
2. The agent's prior knowledge of the environment
3. The actions available to the agent
4. The agent's percept sequence so far
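One compact way to write this (my own shorthand, not notation from the chapter): the rational action maximizes expected performance, conditioned on everything the agent has to go on:

$$
a^{*} = \operatorname*{arg\,max}_{a \in A} \; \mathbb{E}\big[\, \mathrm{Perf} \mid e_{1:t},\ K,\ a \,\big]
$$

where $A$ is the set of available actions, $e_{1:t}$ the percept sequence so far, $K$ the prior knowledge, and $\mathrm{Perf}$ the performance measure, matching the four items above.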

Rationality ≠ omniscience. A rational agent does the best it can with the information available. It is not expected to be perfect — only to maximize expected performance.

Rational ≠ Successful

Rationality is judged by expected performance given the information available, not by actual outcomes: a rational decision can still lead to a bad result through plain bad luck, and a lucky outcome does not make an irrational choice rational.


The Standard Model

The standard model of AI assumes:

- We can fully specify what we want the machine to do via a well-defined objective (the performance measure / utility function).
- The agent then maximizes that objective.
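A minimal sketch of what a standard-model agent looks like in code (the actions, world model, and objective here are all made up for illustration):

```python
from typing import Callable, Iterable

def standard_model_agent(
    actions: Iterable[str],
    predict: Callable[[str], str],      # hypothetical world model: action -> predicted state
    objective: Callable[[str], float],  # the fixed, fully specified performance measure
) -> str:
    """Standard model in one line: pick the action whose predicted state scores highest."""
    return max(actions, key=lambda a: objective(predict(a)))

# Toy usage with made-up actions and states.
print(standard_model_agent(
    actions=["brew", "wait"],
    predict=lambda a: "coffee_ready" if a == "brew" else "no_coffee",
    objective=lambda s: 1.0 if s == "coffee_ready" else 0.0,
))  # -> "brew"
```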

The standard model has been the dominant paradigm since the 1950s and underlies nearly all of AI and ML research.

Why it works (usually)

For narrow, well-bounded tasks, the objective can be written down completely and correctly: win the chess game, find the shortest route, minimize test error. When the stated objective really is what we want, optimizing it is exactly the right thing to do.


The Value Alignment Problem

As AI systems become more powerful and are deployed in the real world, the standard model reveals a critical flaw:

We can’t always fully specify what we want.

If a machine optimizes a mis-specified objective, it may achieve that objective while violating our actual values. The classic failure mode is the King Midas problem: the agent delivers the literal objective rather than the intended one (Midas asked that everything he touched turn to gold, and got exactly that, including his food and his family). A toy version appears in the sketch below.
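A toy illustration of the misspecification failure (all values hypothetical): an agent maximizing a proxy objective that omits a side effect picks an action the true objective would reject.

```python
# Hypothetical actions with a measured proxy score and an unmeasured side cost.
actions = {
    "careful":  {"proxy": 8.0,  "side_damage": 0.0},
    "reckless": {"proxy": 10.0, "side_damage": 100.0},
}

proxy = lambda a: actions[a]["proxy"]                        # what we told the agent to maximize
true_value = lambda a: proxy(a) - actions[a]["side_damage"]  # what we actually care about

chosen = max(actions, key=proxy)       # -> "reckless": wins on the stated objective
best = max(actions, key=true_value)    # -> "careful": wins on our real values
print(chosen, best)
```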

The Gorilla Problem

The ancestors of modern gorillas accidentally gave rise to the lineage leading to humans, and gorillas' fate now depends on a more intelligent species. By analogy, if we create machines more intelligent than ourselves, humans may lose control over their own future.

The Proposed Solution

Instead of putting a fixed objective into a machine, design machines that:

1. Know they don't know what we want with certainty.
2. Actively try to learn human preferences.
3. Have an incentive to let humans correct or switch them off: because the machine is uncertain about the true objective, human intervention is informative rather than threatening (see the numeric sketch below).
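A minimal numeric sketch of why uncertainty creates the incentive to defer (the structure loosely follows the off-switch analysis from the assistance-games literature; all numbers are hypothetical):

```python
# The machine is unsure whether its proposed action helps (+10) or harms (-10) the human.
# Under its current beliefs, each is equally likely.
p_good = 0.5
u_good, u_bad = 10.0, -10.0

# Acting unilaterally: expected utility under the machine's own uncertainty.
ev_act = p_good * u_good + (1 - p_good) * u_bad    # = 0.0

# Deferring: the human, who knows the true value, permits the action only when it helps,
# and switches the machine off (utility 0) when it would harm.
ev_defer = p_good * u_good + (1 - p_good) * 0.0    # = 5.0

assert ev_defer >= ev_act  # uncertainty makes deference rational
print(ev_act, ev_defer)
```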

This design is the basis for:

- Assistance games (Ch. 18): formalize the human-machine interaction as a cooperative game in which the machine's payoff is the human's (initially unknown) utility.
- Inverse reinforcement learning (Ch. 22): infer human preferences from observed behavior (see the sketch below).
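A compact sketch of the preference-inference idea behind inverse RL, using a Boltzmann-rational choice model (one standard modeling choice, not the chapter's specific algorithm; the candidate rewards and observations are made up):

```python
import math

# Hypothetical candidate reward functions over three outcomes.
candidates = {
    "likes_coffee": {"coffee": 1.0, "tea": 0.0, "water": 0.0},
    "likes_tea":    {"coffee": 0.0, "tea": 1.0, "water": 0.0},
}

def choice_prob(reward, choice, options, beta=2.0):
    """P(choice | reward) under a Boltzmann-rational model: better options are exponentially likelier."""
    weights = {o: math.exp(beta * reward[o]) for o in options}
    return weights[choice] / sum(weights.values())

# Observed human choices: each is a (chosen option, options offered) pair.
observations = [("coffee", ["coffee", "tea"]), ("coffee", ["coffee", "water"])]

# Posterior over candidate rewards, starting from a uniform prior.
posterior = {name: 1.0 for name in candidates}
for choice, options in observations:
    for name, reward in candidates.items():
        posterior[name] *= choice_prob(reward, choice, options)
total = sum(posterior.values())
posterior = {name: p / total for name, p in posterior.items()}
print(posterior)  # probability mass shifts toward "likes_coffee"
```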


Summary Table

| Concept | Definition |
|---|---|
| Standard model | AI = agent maximizing a specified performance measure |
| Value alignment problem | It is hard to specify what we actually want; misspecification leads to bad outcomes |
| King Midas problem | Agent achieves the literal objective while violating the intent behind it |
| Gorilla problem | If superintelligent AI is created, humans may lose control of their future |
| Beneficial machine | Machine that is uncertain about its objectives, learns from and defers to humans |
| Inverse RL | Learning a reward function from observed human behavior |