6. Learning Agents
Source: AIMA 4th Ed, Chapter 2 (Section 2.4.6), physical PDF pp. 133–137
Introduction
The four agent architectures in Section 2.4 describe how agents act. But they do not explain how the agent comes into being — how it acquires its rules, models, goals, or utility function. The naive answer is: the designer programs everything in by hand. But Turing (1950) estimated how much work this would take and concluded: “Some more expeditious method seems desirable.”
The method he proposed: build learning machines and then teach them.
Key insight: Any type of agent — simple reflex, model-based, goal-based, utility-based — can be built as a learning agent. The two dimensions (architecture type and whether it learns) are orthogonal.
Why learning is necessary:
1. It allows agents to operate in initially unknown environments.
2. It allows agents to become more competent than their initial knowledge alone would permit.
3. It allows a single agent design to succeed across a vast variety of environments.
The Four Components of a Learning Agent (Figure 2.15)
A learning agent has four conceptual components:
Figure 2.15 as a block diagram:
```
                      Performance Standard
                               |
                               v
    Sensors --- percepts --> [ Critic ]
       |                        |
       |                     feedback
       |                        v
       |              [ Learning Element ] -- learning goals --> [ Problem Generator ]
       |                  |           ^                                   |
       |               changes    knowledge                      exploratory actions
       |                  v           |                                   |
       +-- percepts --> [ Performance Element ] <--------------------------+
                               |
                            actions
                               v
                           Actuators ---> Environment
```
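To make the division of labor concrete, here is a minimal Python sketch of the four components wired together. The class and method names are hypothetical illustrations, not code from AIMA:

```python
class PerformanceElement:
    """Maps percepts to actions using the agent's current knowledge."""
    def choose_action(self, percept):
        raise NotImplementedError

class Critic:
    """Scores behavior against a FIXED performance standard."""
    def __init__(self, performance_standard):
        self.standard = performance_standard    # fixed, outside the learning loop
    def feedback(self, percept):
        return self.standard(percept)           # e.g., a scalar reward

class LearningElement:
    """Uses the critic's feedback to improve the performance element."""
    def improve(self, performance_element, feedback):
        pass                                    # e.g., update rules, models, values

class ProblemGenerator:
    """Suggests informative, possibly suboptimal, exploratory actions."""
    def suggest(self, percept):
        return None                             # None: no exploration this step

class LearningAgent:
    """Wires the four components into a single percept-to-action loop."""
    def __init__(self, perf, critic, learner, explorer):
        self.perf, self.critic = perf, critic
        self.learner, self.explorer = learner, explorer
    def step(self, percept):
        self.learner.improve(self.perf, self.critic.feedback(percept))
        # The problem generator may override the exploitative choice.
        return self.explorer.suggest(percept) or self.perf.choose_action(percept)
```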
1. Performance Element
The performance element is what we have previously considered to be the “whole agent” — it takes percepts as input and decides on actions. In learning agent terms, it is the execution side: the currently-known rules, models, and policies that the agent uses to act.
The performance element can be any of the four architectures:
- A set of condition-action rules (simple reflex)
- A transition model + rules (model-based reflex)
- A search algorithm over a goal (goal-based)
- A utility function + maximization (utility-based)
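As a concrete instance, a simple-reflex performance element can be just an ordered list of condition-action rules. A minimal sketch, with percept fields loosely modeled on the AIMA vacuum world (the exact dictionary keys are invented for illustration):

```python
# Ordered condition-action rules: the first matching rule fires.
RULES = [
    (lambda p: p.get("status") == "dirty", "Suck"),
    (lambda p: p.get("location") == "A", "Right"),
    (lambda p: p.get("location") == "B", "Left"),
]

def reflex_performance_element(percept, rules=RULES):
    """Return the action of the first rule whose condition matches."""
    for condition, action in rules:
        if condition(percept):
            return action
    return "NoOp"

print(reflex_performance_element({"location": "A", "status": "dirty"}))  # Suck
```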
2. Learning Element
The learning element is responsible for making improvements to the performance element. It uses feedback from the critic and knowledge from the performance element to modify the performance element.
The design of the learning element depends entirely on the design of the performance element. When designing a learning agent, the first question is not “How am I going to make it learn?” but rather “What kind of performance element will it need to use once it has learned?”
The learning element can modify any part of the performance element:
- Update condition-action rules (supervised learning on demonstrated behavior)
- Update the transition model or sensor model (model learning in RL)
- Update the utility function / value function (value function learning in RL)
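A toy sketch of one such modification: the learning element keeps a weight per rule and nudges it with the critic's scalar feedback, so a rule that draws penalties loses priority. The rule names and update scheme are invented for illustration:

```python
def update_rule_weights(weights, fired_rule, feedback, lr=0.1):
    """Nudge the weight of the rule that just fired by the critic's feedback."""
    weights[fired_rule] += lr * feedback
    return weights

# Critic feedback: shaken passengers, lost tips -> penalize "swerve_hard".
weights = {"swerve_hard": 0.5, "brake_gently": 0.5}
update_rule_weights(weights, "swerve_hard", feedback=-1.0)
print(max(weights, key=weights.get))  # brake_gently now outranks swerve_hard
```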
3. Critic
The critic tells the learning element how well the agent is doing with respect to a fixed performance standard.
Why is the critic necessary? Because percepts alone provide no indication of the agent’s success. A chess program receives a percept indicating it has just been checkmated — but the percept itself does not say this is bad. The performance standard (winning chess) is external; the critic uses it to generate a feedback signal.
Crucial constraint: The performance standard must be fixed — it must be outside the learning loop. If the agent could modify its own performance standard, it could cheat by lowering the bar.
In RL, the critic’s feedback is the reward signal. The critic distinguishes part of the incoming percept as a reward (positive) or penalty (negative) that provides direct feedback on the quality of the agent’s behavior.
Example: The taxi receives no tips from passengers thoroughly shaken up during the ride. The critic informs the learning element that the loss of tips is a negative contribution to overall performance. The agent then learns that violent maneuvers are counterproductive.
Note: Hard-wired biological performance standards — pain and hunger — function exactly as critics, providing direct feedback on the quality of behavior.
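A sketch of a critic for the taxi example, assuming the percept carries hypothetical `tip` and `complaints` fields. Note that the performance standard is hard-coded, and therefore outside the agent's reach:

```python
def taxi_critic(percept):
    """Distill a percept into scalar feedback against a fixed standard."""
    reward = 0.0
    reward += percept.get("tip", 0.0)             # tips are positive feedback
    reward -= 5.0 * percept.get("complaints", 0)  # complaints are penalties
    return reward

print(taxi_critic({"tip": 0.0, "complaints": 2}))  # -10.0: violent maneuvers
```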
4. Problem Generator
The problem generator suggests actions that will lead to new and informative experiences. This is the exploration component.
The performance element, left to itself, will always choose the action that seems best given current knowledge. But if the agent is willing to explore — to take actions that are currently suboptimal — it might discover much better actions in the long run.
Example: The taxi’s problem generator might suggest trying a new route or a new braking technique to learn whether it performs better than the currently-known approach.
Galileo analogy from the text: Galileo did not drop rocks from the Tower of Pisa because it was useful in itself. His aim was to modify his own understanding of physics — his “internal model.” The problem generator does the same for an AI agent.
The problem generator enables the classic exploration-exploitation tradeoff in RL:
- Exploitation: use current best knowledge to act well now
- Exploration: try new actions to potentially improve future performance
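The canonical minimal implementation of this tradeoff is epsilon-greedy action selection, sketched below; the action names are placeholders:

```python
import random

def epsilon_greedy(best_action, all_actions, epsilon=0.1):
    """Explore with probability epsilon; otherwise exploit the current best."""
    if random.random() < epsilon:
        return random.choice(all_actions)  # exploration: informative experience
    return best_action                     # exploitation: act well now

print(epsilon_greedy("brake_gently", ["brake_gently", "swerve_hard", "new_route"]))
```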
What the Learning Element Can Modify
The learning element can, in principle, modify any of the knowledge components of the agent:
| Component | What is learned | Algorithm family |
|---|---|---|
| Condition-action rules | When to apply each rule; what rules exist | Supervised learning, imitation learning |
| Transition model | How actions change world state | Model learning; system identification |
| Sensor model | How world states map to percepts | Sensor calibration; perceptual learning |
| Utility function / value function | How desirable each state is | Value function approximation in RL |
| Goals | What states are desirable | Inverse RL; preference learning |
The simplest learning case is direct learning from the percept sequence: observing pairs of successive states allows the agent to learn “What my actions do” and “How the world evolves.”
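The simplest concrete version of this: maintain counts of observed (state, action) → next-state transitions and normalize them into an estimated transition model. A sketch with invented state and action names:

```python
from collections import Counter, defaultdict

counts = defaultdict(Counter)  # counts[(s, a)][s_next] = times observed

def observe(s, a, s_next):
    """Record one observed transition from the percept sequence."""
    counts[(s, a)][s_next] += 1

def transition_model(s, a):
    """Estimated P(s_next | s, a), i.e., 'what my actions do'."""
    c = counts[(s, a)]
    total = sum(c.values())
    return {s_next: n / total for s_next, n in c.items()}

observe("lane_1", "swerve", "lane_2")
observe("lane_1", "swerve", "lane_2")
observe("lane_1", "swerve", "spin_out")
print(transition_model("lane_1", "swerve"))  # {'lane_2': ~0.67, 'spin_out': ~0.33}
```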
Reward and Penalty
The performance standard is communicated to the learning element as reward and penalty signals embedded in the percept sequence:
- Reward: positive feedback indicating the agent's behavior contributed to the performance measure
- Penalty: negative feedback indicating behavior that detracted from performance
In RL, this is the reward function r(s, a, s'): a scalar feedback signal provided at each step (or at terminal states).
Human behavior as a signal: Human choices can also provide information about preferences. If the taxi blows its horn continuously and passengers cover their ears and complain, this behavior — captured as percepts — provides evidence that the behavior should be avoided. This is the basis of inverse RL and RLHF (Reinforcement Learning from Human Feedback).
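The core trick behind preference-based reward learning (and, at scale, RLHF) can be sketched in a few lines: fit a reward parameter so that preferred behavior scores higher, using the Bradley-Terry model P(A preferred over B) = sigmoid(r(A) - r(B)). The one-dimensional "horn-honking intensity" feature and the data below are invented for illustration:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# One invented feature per behavior: horn-honking intensity in [0, 1].
# Passengers always preferred the quieter option.
preferences = [(0.1, 0.9), (0.0, 0.8), (0.2, 1.0)]  # (preferred, dispreferred)

w = 0.0                                   # reward weight on honking intensity
for _ in range(200):                      # gradient ascent on log-likelihood
    for pref, dispref in preferences:
        p = sigmoid(w * pref - w * dispref)       # P(preferred wins)
        w += 0.5 * (1.0 - p) * (pref - dispref)   # Bradley-Terry gradient
print(w)  # negative: the learned reward says honking is undesirable
```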
The Learning Agent and RL — Direct Connection
The learning agent framework of Chapter 2 is essentially the conceptual precursor to the full RL framework introduced in Chapter 22:
| Learning Agent Component | RL Equivalent |
|---|---|
| Performance element | Policy π(a\|s) |
| Critic | Reward function r(s, a, s') |
| Learning element | Policy gradient / value update (e.g., PPO, Q-learning) |
| Problem generator | Exploration strategy (epsilon-greedy, UCB, entropy bonus) |
| Performance standard | The fixed reward function, defined outside the agent |
The key insight connecting them: in RL, the agent learns its performance element (policy or value function) from reward signals provided by the environment (the critic), guided by exploration strategies (the problem generator).
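Putting the whole mapping together, here is a tabular Q-learning sketch on an invented four-state chain MDP, with each Chapter 2 role marked in comments; the environment and hyperparameters are illustrative only:

```python
import random
from collections import defaultdict

ACTIONS = ["left", "right"]
Q = defaultdict(float)                    # the performance element's knowledge

def env_step(s, a):
    """Toy chain MDP: states 0..3, reward for reaching or staying at state 3."""
    s2 = min(s + 1, 3) if a == "right" else max(s - 1, 0)
    return s2, (1.0 if s2 == 3 else 0.0)  # critic: fixed reward signal

def policy(s, epsilon=0.1):
    if random.random() < epsilon:         # problem generator: explore
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])  # performance element: exploit

for _ in range(500):                      # learning loop
    s = 0
    for _ in range(10):
        a = policy(s)
        s2, r = env_step(s, a)
        best_next = max(Q[(s2, b)] for b in ACTIONS)
        Q[(s, a)] += 0.1 * (r + 0.9 * best_next - Q[(s, a)])  # learning element
        s = s2

print(policy(0, epsilon=0.0))  # "right": the learned policy heads for the goal
```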
Summary
| Component | Role | What it modifies |
|---|---|---|
| Performance element | Act in the world using current knowledge | Nothing; it is the component the others modify |
| Learning element | Improve performance element from feedback | The performance element itself |
| Critic | Evaluate current behavior against fixed standard | Generates feedback for learning element |
| Problem generator | Suggest exploratory actions | Guides what experiences are collected |
Key Design Principle
“When trying to design an agent that learns a certain capability, the first question is not ‘How am I going to get it to learn this?’ but ‘What kind of performance element will my agent use to do this once it has learned how?’ Given a design for the performance element, learning mechanisms can be constructed to improve every part of the agent.”
This is the AIMA design philosophy — start with the target behavior, then design the learning mechanism around it.
Cross-References
- Section 2.4.7 → Representations used inside agents (atomic, factored, structured)
- Chapter 19 → Learning in general (supervised, unsupervised)
- Chapters 19–22 → Full learning algorithms in depth
- Chapter 22 → Reinforcement learning — the full elaboration of the learning agent framework
- Chapter 22 → Inverse RL — learning the performance measure from observed behavior
- DynamICCL → The RL agent is precisely a learning agent: performance element = NCCL parameter selection policy; critic = throughput/latency reward signal; learning element = PPO/DQN update; problem generator = epsilon-greedy or entropy-regularized exploration over the NCCL parameter space