Psychology and RL: Animal Learning Connections
Chapter 14 — Psychology
Book: Reinforcement Learning: An Introduction (Sutton & Barto, 2nd ed) Pages: 349–368
Overview
Chapter 14 draws formal connections between RL algorithms and animal/human learning theory. The correspondence is not merely metaphorical: many RL algorithms were inspired by, or independently rediscovered, learning rules first described in behavioral psychology.
Classical Conditioning: The Rescorla-Wagner Rule
Pavlovian conditioning: animal learns to associate a neutral stimulus (CS: conditioned stimulus) with a reward (US: unconditioned stimulus).
Rescorla-Wagner rule (1972): prediction error drives learning:
V_{t+1}(CS_i) = V_t(CS_i) + α (λ_US - Σ_j V_t(CS_j))
where λ_US = reward value of the US, V(CS_i) = predicted reward for stimulus CS_i, α = learning rate, and the sum runs over all CSs present on the trial.
This has the same form as the TD(0) update:
V(s) ← V(s) + α [R + γ V(s') - V(s)]
Rescorla-Wagner is the special case where each trial is a single step, so the bootstrap term γ V(s') vanishes. The “prediction error” (λ - Σ V) plays the role of the TD error δ.
Phenomena the Rescorla-Wagner rule explains:
1. Acquisition: repeated CS-US pairings → V(CS) → λ_US
2. Extinction: CS presented without the US → negative prediction error → V(CS) decreases
3. Blocking: if CS1 already predicts the US fully, adding CS2 → no learning for CS2 (already V(CS1) + V(CS2) = λ_US → error ≈ 0)
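A minimal sketch of the update above (my own illustration; the stimulus count, α, and trial counts are arbitrary choices), reproducing acquisition and blocking:

```python
import numpy as np

def rescorla_wagner(trials, n_stimuli, alpha=0.1, lam=1.0):
    """Each trial is (present, us): `present` is a 0/1 vector of CSs, `us` is 0 or 1."""
    V = np.zeros(n_stimuli)
    for present, us in trials:
        pred = V @ present             # summed prediction over the CSs present
        delta = lam * us - pred        # prediction error
        V += alpha * delta * present   # only present stimuli are updated
    return V

# Phase 1: CS1 alone paired with the US -> V(CS1) approaches lambda (acquisition).
phase1 = [(np.array([1, 0]), 1)] * 100
# Phase 2: CS1+CS2 compound paired with the US -> CS2 learns ~nothing (blocking).
phase2 = [(np.array([1, 1]), 1)] * 100

V = rescorla_wagner(phase1 + phase2, n_stimuli=2)
print(V)  # V[0] ~ 1.0, V[1] ~ 0.0: CS1 blocks learning about CS2
```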
What Rescorla-Wagner doesn’t explain (requires TD):
- Temporal credit assignment: the CS and US occur at different times within a trial
- Prediction of predictions: higher-order conditioning
- Timing: the animal learns the delay between CS and US
Temporal Credit Assignment: TD vs Rescorla-Wagner
Second-order conditioning: CS2 → CS1 → US. Animal learns CS2 → reward even without direct CS2-US pairing.
Rescorla-Wagner fails: CS2-US association never directly experienced.
TD explains it: CS1 comes to predict the US, so V(CS1) rises toward the reward value. When CS2 precedes CS1, the onset of CS1 (now a high-value state) generates a positive TD error, which updates V(CS2).
This is the bootstrap nature of TD: predictions of predictions are chained backward through time.
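A minimal tabular TD(0) sketch of this chaining (my own toy; the two-state layout, α, and γ are arbitrary):

```python
import numpy as np

def td0_episode(V, alpha=0.1, gamma=0.9):
    # Step CS2 -> CS1: no reward yet, but V(CS1) bootstraps into V(CS2).
    V[0] += alpha * (0.0 + gamma * V[1] - V[0])
    # Step CS1 -> US: the reward arrives; the terminal state has value 0.
    V[1] += alpha * (1.0 + gamma * 0.0 - V[1])
    return V

V = np.zeros(2)  # V[0] = V(CS2), V[1] = V(CS1)
for _ in range(200):
    td0_episode(V)
print(V)  # V[1] -> 1.0 first, then V[0] -> gamma * V[1]: a prediction of a prediction
```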
Operant Conditioning: Reinforcement Schedules
Operant conditioning: animal learns to take actions to receive rewards (Skinner, 1930s).
| Schedule | Description | RL analog |
|---|---|---|
| Fixed ratio (FR-n) | Reward after n responses | Dense reward every n steps |
| Variable ratio (VR-n) | Reward after ~n responses (random) | Stochastic reward |
| Fixed interval (FI-T) | Reward for the first response after T seconds | Time-based reward |
| Variable interval (VI-T) | Reward for the first response after ~T seconds (random) | Random time reward |
VR schedules produce the highest and steadiest response rates (the principle behind slot machines). In RL terms: under a variable ratio, every response has the same probability of reward, so expected reward per unit time scales with response rate and there is no post-reinforcement pause.
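A tiny sketch (my own illustration; n and the response count are arbitrary) contrasting FR and VR delivery for the same response stream:

```python
import random

def fixed_ratio(n, responses):
    # FR-n: reward on exactly every n-th response.
    return [1 if (i + 1) % n == 0 else 0 for i in range(responses)]

def variable_ratio(n, responses):
    # VR-n: each response pays off with probability 1/n (~n responses per reward).
    return [1 if random.random() < 1 / n else 0 for _ in range(responses)]

random.seed(0)
print(sum(fixed_ratio(5, 1000)))     # exactly 200 rewards, predictably spaced
print(sum(variable_ratio(5, 1000)))  # ~200 rewards, unpredictably spaced
```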
TD and the Actor-Critic: Dual-Process Theory
Animal learning has two systems:
1. Habitual system: stimulus → action (fast, automatic, model-free)
2. Goal-directed system: considers the consequences of actions (slow, deliberate, model-based)
RL analogy:
- Model-free RL (Q-learning, SARSA) = habitual system: stimulus → learned response
- Model-based RL (Dyna, planning) = goal-directed system: explicit consequence evaluation
Devaluation experiments distinguish the two:
- After learning, devalue the reward (e.g., make the food aversive).
- A goal-directed agent immediately stops seeking food.
- A habitual agent continues the old behavior (no model → can’t infer the effect of devaluation).
This directly motivates the Dyna architecture: combining model-free (fast) and model-based (flexible) components.
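A minimal toy (my own, not the book’s; the lever-press framing and values are illustrative) showing why devaluation separates the two systems:

```python
alpha = 0.5
q_cached = 0.0             # model-free (habitual): cached value of "press lever"
model = {"press": "food"}  # model-based (goal-directed): learned outcome model
utility = {"food": 1.0}    # current utility of outcomes

# Training: repeated rewarded lever presses update the cached value.
for _ in range(20):
    q_cached += alpha * (utility[model["press"]] - q_cached)

# Devaluation: the outcome's utility changes with no new lever-press experience.
utility["food"] = -1.0

q_habitual = q_cached                      # stale: still ~ +1.0, keeps pressing
q_goal_directed = utility[model["press"]]  # replanned with the model: -1.0, stops
print(q_habitual, q_goal_directed)
```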
Attention and Latent Inhibition
Latent inhibition: pre-exposing CS without US (before conditioning) slows subsequent learning.
Explanation: the attention (associability) α(CS) falls when the CS is repeatedly followed by no consequence. A Pearce-Hall-style update (one common formulation; η is a smoothing rate):
α_t(CS) = η |δ_t| + (1 - η) α_{t-1}(CS)
High |δ| (surprise) → high attention → fast learning. Low |δ| (predictable) → low attention → slow learning.
RL connection: this is an adaptive learning-rate mechanism:
- α(s) is proportional to how surprising state s is
- Implements a form of “active attention”: focus on informative experiences
Similar to intrinsic motivation / curiosity-driven exploration in modern RL (count-based exploration bonuses, ICM).
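A minimal sketch of the surprise-modulated learning rate above (my own formulation; η, the initial attention, and the trial count are arbitrary), reproducing latent inhibition:

```python
def attention_update(attn, delta, eta=0.3):
    # Attention drifts toward the latest surprise magnitude: high |delta|
    # raises attention; predictable outcomes (|delta| ~ 0) lower it.
    return eta * abs(delta) + (1 - eta) * attn

# Latent inhibition: pre-exposure (CS alone, nothing to predict) keeps delta
# at 0, so attention decays; later CS-US pairings start with a tiny alpha.
attn = 0.5
for _ in range(50):
    attn = attention_update(attn, delta=0.0)
print(attn)  # ~0: subsequent conditioning will be slow
```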
Learned Helplessness
Experimental finding: an animal exposed to unavoidable shocks later fails to escape even when escape becomes possible (“gives up”).
RL explanation: Q(escape) is estimated as very low because past escape attempts never worked. Crucially, even off-policy methods only correct Q(escape) when the behavior policy actually tries escaping again; once the policy stops selecting escape, Q(escape) stays low → never tried → never updated.
Computational model: this is an exploration failure — the agent gets stuck in a local optimum because it never explores enough.
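A toy sketch of this failure mode (my own; the reward values, step counts, and the -1.5 “shock plus wasted effort” penalty are illustrative assumptions):

```python
import random

def run(epsilon, alpha=0.2, seed=0):
    random.seed(seed)
    Q = [0.0, 0.0]  # Q[0] = stay, Q[1] = try to escape
    for t in range(4000):
        escapable = t >= 1000  # shocks become avoidable after pre-training
        greedy = 0 if Q[0] >= Q[1] else 1
        a = random.randrange(2) if random.random() < epsilon else greedy
        if a == 1:
            r = 0.0 if escapable else -1.5  # failed escape: shock + wasted effort
        else:
            r = -1.0                        # passive shock
        Q[a] += alpha * (r - Q[a])
    return Q

print(run(epsilon=0.0))  # Q[1] stays pessimistic: the agent "gives up"
print(run(epsilon=0.1))  # exploration re-tries escape; Q[1] recovers toward 0
```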
Implications for DynamICCL: if exploration over NCCL configurations is too limited, the agent may conclude that “no configuration matters” (poor exploration → all configs look equally bad) and never find the true optimum.
Chapter 14 Summary
- TD learning = formal model of classical conditioning (Rescorla-Wagner as special case)
- TD error = dopamine prediction error signal (next chapter)
- Actor-critic = dual-process theory (habitual vs goal-directed)
- Attention mechanisms = adaptive learning rate (latent inhibition, surprise-based)
- Exploration failure = learned helplessness in psychology
- RL provides mechanistic explanations for phenomena that behavioral psychology only described
Connection to DynamICCL
- TD error in DynamICCL ↔︎ dopamine-like “surprise signal” about NCCL throughput
- Model-free PPO ↔︎ habitual system (fast, automatic config selection)
- Model-based extension (Dyna-PPO) ↔︎ goal-directed system (simulated planning)
- Adaptive exploration ↔︎ UCB/entropy bonus to avoid “learned helplessness” in NCCL space