6. Learning Agents
Source: AIMA 4th Ed, Chapter 2 (Section 2.4.6), physical PDF pp. 133–137
Introduction
The four agent architectures in Section 2.4 describe how agents act. But they do not explain how the agent comes into being — how it acquires its rules, models, goals, or utility function. The naive answer is: the designer programs everything in by hand. But Turing (1950) estimated how much work this would take and concluded: “Some more expeditious method seems desirable.”
The method he proposed: build learning machines and then teach them.
Key insight: Any type of agent — simple reflex, model-based, goal-based, utility-based — can be built as a learning agent. The two dimensions (architecture type and whether it learns) are orthogonal.
Why learning is necessary:
1. It allows agents to operate in initially unknown environments.
2. It allows agents to become more competent than their initial knowledge alone would permit.
3. It allows a single agent design to succeed across a vast variety of environments.
The Four Components of a Learning Agent (Figure 2.15)
A learning agent has four conceptual components:
Figure 2.15 as a block diagram:
```
                      Performance Standard
                               |
                               v
    Sensors --- percepts --> [ Critic ]
       |                        |
       |                     feedback
       |                        v
       |              [ Learning Element ] -- learning goals --> [ Problem Generator ]
       |                  |           ^                                   |
       |               changes    knowledge                      exploratory actions
       |                  v           |                                   |
       +-- percepts --> [ Performance Element ] <--------------------------+
                               |
                            actions
                               v
                           Actuators ---> Environment
```
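To make the division of labor concrete, here is a minimal Python sketch of the four components wired together. The class and method names are hypothetical illustrations, not code from AIMA:

```python
class PerformanceElement:
    """Maps percepts to actions using the agent's current knowledge."""
    def choose_action(self, percept):
        raise NotImplementedError

class Critic:
    """Scores behavior against a FIXED performance standard."""
    def __init__(self, performance_standard):
        self.standard = performance_standard    # fixed, outside the learning loop
    def feedback(self, percept):
        return self.standard(percept)           # e.g., a scalar reward

class LearningElement:
    """Uses the critic's feedback to improve the performance element."""
    def improve(self, performance_element, feedback):
        pass                                    # e.g., update rules, models, values

class ProblemGenerator:
    """Suggests informative, possibly suboptimal, exploratory actions."""
    def suggest(self, percept):
        return None                             # None: no exploration this step

class LearningAgent:
    """Wires the four components into a single percept-to-action loop."""
    def __init__(self, perf, critic, learner, explorer):
        self.perf, self.critic = perf, critic
        self.learner, self.explorer = learner, explorer
    def step(self, percept):
        self.learner.improve(self.perf, self.critic.feedback(percept))
        # The problem generator may override the exploitative choice.
        return self.explorer.suggest(percept) or self.perf.choose_action(percept)
```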
1. Performance Element
The performance element is what we have previously considered to be the “whole agent” — it takes percepts as input and decides on actions. In learning agent terms, it is the execution side: the currently-known rules, models, and policies that the agent uses to act.
The performance element can be any of the four architectures:
- A set of condition-action rules (simple reflex)
- A transition model + rules (model-based reflex)
- A search algorithm over a goal (goal-based)
- A utility function + maximization (utility-based)
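As a concrete instance, a simple-reflex performance element can be just an ordered list of condition-action rules. A minimal sketch, with percept fields loosely modeled on the AIMA vacuum world (the exact dictionary keys are invented for illustration):

```python
# Ordered condition-action rules: the first matching rule fires.
RULES = [
    (lambda p: p.get("status") == "dirty", "Suck"),
    (lambda p: p.get("location") == "A", "Right"),
    (lambda p: p.get("location") == "B", "Left"),
]

def reflex_performance_element(percept, rules=RULES):
    """Return the action of the first rule whose condition matches."""
    for condition, action in rules:
        if condition(percept):
            return action
    return "NoOp"

print(reflex_performance_element({"location": "A", "status": "dirty"}))  # Suck
```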
2. Learning Element
The learning element is responsible for making improvements to the performance element. It uses feedback from the critic and knowledge from the performance element to modify the performance element.
The design of the learning element depends entirely on the design of the performance element. When designing a learning agent, the first question is not “How am I going to make it learn?” but rather “What kind of performance element will it need to use once it has learned?”
The learning element can modify any part of the performance element:
- Update condition-action rules (supervised learning on demonstrated behavior)
- Update the transition model or sensor model (model learning in RL)
- Update the utility function / value function (value function learning in RL)
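A toy sketch of one such modification: the learning element keeps a weight per rule and nudges it with the critic's scalar feedback, so a rule that draws penalties loses priority. The rule names and update scheme are invented for illustration:

```python
def update_rule_weights(weights, fired_rule, feedback, lr=0.1):
    """Nudge the weight of the rule that just fired by the critic's feedback."""
    weights[fired_rule] += lr * feedback
    return weights

# Critic feedback: shaken passengers, lost tips -> penalize "swerve_hard".
weights = {"swerve_hard": 0.5, "brake_gently": 0.5}
update_rule_weights(weights, "swerve_hard", feedback=-1.0)
print(max(weights, key=weights.get))  # brake_gently now outranks swerve_hard
```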
3. Critic
The critic tells the learning element how well the agent is doing with respect to a fixed performance standard.
Why is the critic necessary? Because percepts alone provide no indication of the agent’s success. A chess program receives a percept indicating it has just been checkmated — but the percept itself does not say this is bad. The performance standard (winning chess) is external; the critic uses it to generate a feedback signal.
Crucial constraint: The performance standard must be fixed — it must be outside the learning loop. If the agent could modify its own performance standard, it could cheat by lowering the bar.
In RL, the critic’s feedback is the reward signal. The critic distinguishes part of the incoming percept as a reward (positive) or penalty (negative) that provides direct feedback on the quality of the agent’s behavior.
Example: The taxi receives no tips from passengers thoroughly shaken up during the ride. The critic informs the learning element that the loss of tips is a negative contribution to overall performance. The agent then learns that violent maneuvers are counterproductive.
Note: Hard-wired biological performance standards — pain and hunger — function exactly as critics, providing direct feedback on the quality of behavior.
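A sketch of a critic for the taxi example, assuming the percept carries hypothetical `tip` and `complaints` fields. Note that the performance standard is hard-coded, and therefore outside the agent's reach:

```python
def taxi_critic(percept):
    """Distill a percept into scalar feedback against a fixed standard."""
    reward = 0.0
    reward += percept.get("tip", 0.0)             # tips are positive feedback
    reward -= 5.0 * percept.get("complaints", 0)  # complaints are penalties
    return reward

print(taxi_critic({"tip": 0.0, "complaints": 2}))  # -10.0: violent maneuvers
```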
4. Problem Generator
The problem generator suggests actions that will lead to new and informative experiences. This is the exploration component.
The performance element, left to itself, will always choose the action that seems best given current knowledge. But if the agent is willing to explore — to take actions that are currently suboptimal — it might discover much better actions in the long run.
Example: The taxi’s problem generator might suggest trying a new route or a new braking technique to learn whether it performs better than the currently-known approach.
Galileo analogy from the text: Galileo did not drop rocks from the Tower of Pisa because it was useful in itself. His aim was to modify his own understanding of physics — his “internal model.” The problem generator does the same for an AI agent.
The problem generator enables the classic exploration-exploitation tradeoff in RL:
- Exploitation: use current best knowledge to act well now
- Exploration: try new actions to potentially improve future performance
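The canonical minimal implementation of this tradeoff is epsilon-greedy action selection, sketched below; the action names are placeholders:

```python
import random

def epsilon_greedy(best_action, all_actions, epsilon=0.1):
    """Explore with probability epsilon; otherwise exploit the current best."""
    if random.random() < epsilon:
        return random.choice(all_actions)  # exploration: informative experience
    return best_action                     # exploitation: act well now

print(epsilon_greedy("brake_gently", ["brake_gently", "swerve_hard", "new_route"]))
```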
What the Learning Element Can Modify
The learning element can, in principle, modify any of the knowledge components of the agent:
| Component | What is learned | Algorithm family |
|---|---|---|
| Condition-action rules | When to apply each rule; what rules exist | Supervised learning, imitation learning |
| Transition model | How actions change world state | Model learning; system identification |
| Sensor model | How world states map to percepts | Sensor calibration; perceptual learning |
| Utility function / value function | How desirable each state is | Value function approximation in RL |
| Goals | What states are desirable | Inverse RL; preference learning |
The simplest learning case is direct learning from the percept sequence: observing pairs of successive states allows the agent to learn “What my actions do” and “How the world evolves.”
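The simplest concrete version of this: maintain counts of observed (state, action) → next-state transitions and normalize them into an estimated transition model. A sketch with invented state and action names:

```python
from collections import Counter, defaultdict

counts = defaultdict(Counter)  # counts[(s, a)][s_next] = times observed

def observe(s, a, s_next):
    """Record one observed transition from the percept sequence."""
    counts[(s, a)][s_next] += 1

def transition_model(s, a):
    """Estimated P(s_next | s, a), i.e., 'what my actions do'."""
    c = counts[(s, a)]
    total = sum(c.values())
    return {s_next: n / total for s_next, n in c.items()}

observe("lane_1", "swerve", "lane_2")
observe("lane_1", "swerve", "lane_2")
observe("lane_1", "swerve", "spin_out")
print(transition_model("lane_1", "swerve"))  # {'lane_2': ~0.67, 'spin_out': ~0.33}
```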
Reward and Penalty
The performance standard is communicated to the learning element as reward and penalty signals embedded in the percept sequence:
- Reward: positive feedback indicating the agent's behavior contributed to the performance measure
- Penalty: negative feedback indicating behavior that detracted from performance
In RL, this is the reward function r(s, a, s'): a scalar feedback signal provided at each step (or at terminal states).
Human behavior as a signal: Human choices can also provide information about preferences. If the taxi blows its horn continuously and passengers cover their ears and complain, this behavior — captured as percepts — provides evidence that the behavior should be avoided. This is the basis of inverse RL and RLHF (Reinforcement Learning from Human Feedback).
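The core trick behind preference-based reward learning (and, at scale, RLHF) can be sketched in a few lines: fit a reward parameter so that preferred behavior scores higher, using the Bradley-Terry model P(A preferred over B) = sigmoid(r(A) - r(B)). The one-dimensional "horn-honking intensity" feature and the data below are invented for illustration:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# One invented feature per behavior: horn-honking intensity in [0, 1].
# Passengers always preferred the quieter option.
preferences = [(0.1, 0.9), (0.0, 0.8), (0.2, 1.0)]  # (preferred, dispreferred)

w = 0.0                                   # reward weight on honking intensity
for _ in range(200):                      # gradient ascent on log-likelihood
    for pref, dispref in preferences:
        p = sigmoid(w * pref - w * dispref)       # P(preferred wins)
        w += 0.5 * (1.0 - p) * (pref - dispref)   # Bradley-Terry gradient
print(w)  # negative: the learned reward says honking is undesirable
```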
The Learning Agent and RL — Direct Connection
The learning agent framework of Chapter 2 is essentially the conceptual precursor to the full RL framework introduced in Chapter 22:
| Learning Agent Component | RL Equivalent |
|---|---|
| Performance element | Policy π(a\|s) |
| Critic | Reward function r(s, a, s') |
| Learning element | Policy gradient / value update (e.g., PPO, Q-learning) |
| Problem generator | Exploration strategy (epsilon-greedy, UCB, entropy bonus) |
| Performance standard | The fixed reward function, defined outside the agent |
The key insight connecting them: in RL, the agent learns its performance element (policy or value function) from reward signals provided by the environment (the critic), guided by exploration strategies (the problem generator).
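Putting the whole mapping together, here is a tabular Q-learning sketch on an invented four-state chain MDP, with each Chapter 2 role marked in comments; the environment and hyperparameters are illustrative only:

```python
import random
from collections import defaultdict

ACTIONS = ["left", "right"]
Q = defaultdict(float)                    # the performance element's knowledge

def env_step(s, a):
    """Toy chain MDP: states 0..3, reward for reaching or staying at state 3."""
    s2 = min(s + 1, 3) if a == "right" else max(s - 1, 0)
    return s2, (1.0 if s2 == 3 else 0.0)  # critic: fixed reward signal

def policy(s, epsilon=0.1):
    if random.random() < epsilon:         # problem generator: explore
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])  # performance element: exploit

for _ in range(500):                      # learning loop
    s = 0
    for _ in range(10):
        a = policy(s)
        s2, r = env_step(s, a)
        best_next = max(Q[(s2, b)] for b in ACTIONS)
        Q[(s, a)] += 0.1 * (r + 0.9 * best_next - Q[(s, a)])  # learning element
        s = s2

print(policy(0, epsilon=0.0))  # "right": the learned policy heads for the goal
```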
Summary
| Component | Role | What it modifies |
|---|---|---|
| Performance element | Act in the world using current knowledge | Nothing; it is the component the others modify |
| Learning element | Improve performance element from feedback | The performance element itself |
| Critic | Evaluate current behavior against fixed standard | Generates feedback for learning element |
| Problem generator | Suggest exploratory actions | Guides what experiences are collected |
Key Design Principle
“When trying to design an agent that learns a certain capability, the first question is not ‘How am I going to get it to learn this?’ but ‘What kind of performance element will my agent use to do this once it has learned how?’ Given a design for the performance element, learning mechanisms can be constructed to improve every part of the agent.”
This is the AIMA design philosophy — start with the target behavior, then design the learning mechanism around it.
Cross-References
- Section 2.4.7 → Representations used inside agents (atomic, factored, structured)
- Chapter 19 → Learning in general (supervised, unsupervised)
- Chapters 19–22 → Full learning algorithms in depth
- Chapter 22 → Reinforcement learning — the full elaboration of the learning agent framework
- Chapter 22 → Inverse RL — learning the performance measure from observed behavior
- DynamICCL → The RL agent is precisely a learning agent: performance element = NCCL parameter selection policy; critic = throughput/latency reward signal; learning element = PPO/DQN update; problem generator = epsilon-greedy or entropy-regularized exploration over the NCCL parameter space