Psychology and RL: Animal Learning Connections
Chapter 14 — Psychology
Book: Reinforcement Learning: An Introduction (Sutton & Barto, 2nd ed) Pages: 349–368
Overview
Chapter 14 draws formal connections between RL algorithms and animal/human learning theory. The correspondence is not merely metaphorical: many RL algorithms were inspired by, or independently rediscovered, learning rules first described in behavioral psychology.
Classical Conditioning: The Rescorla-Wagner Rule
Pavlovian conditioning: animal learns to associate a neutral stimulus (CS: conditioned stimulus) with a reward (US: unconditioned stimulus).
Rescorla-Wagner rule (1972): prediction error drives learning:
V_{t+1}(CS_i) = V_t(CS_i) + α (λ_US - Σ_j V_t(CS_j))
where λ_US = reward value of the US, V(CS_i) = predicted reward for stimulus CS_i, α = learning rate, and the sum runs over all CSs present on the trial.
This has the same form as the TD(0) update:
V(s) ← V(s) + α [R + γ V(s') - V(s)]
Rescorla-Wagner is the special case where each trial is a single step, so the bootstrap term γ V(s') vanishes. The “prediction error” (λ - Σ V) plays the role of the TD error δ.
Phenomena the Rescorla-Wagner rule explains:
1. Acquisition: repeated CS-US pairings → V(CS) → λ_US
2. Extinction: CS presented without the US → negative prediction error → V(CS) decreases
3. Blocking: if CS1 already predicts the US fully, adding CS2 → no learning for CS2 (already V(CS1) + V(CS2) = λ_US → error ≈ 0)
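A minimal sketch of the update above (my own illustration; the stimulus count, α, and trial counts are arbitrary choices), reproducing acquisition and blocking:

```python
import numpy as np

def rescorla_wagner(trials, n_stimuli, alpha=0.1, lam=1.0):
    """Each trial is (present, us): `present` is a 0/1 vector of CSs, `us` is 0 or 1."""
    V = np.zeros(n_stimuli)
    for present, us in trials:
        pred = V @ present             # summed prediction over the CSs present
        delta = lam * us - pred        # prediction error
        V += alpha * delta * present   # only present stimuli are updated
    return V

# Phase 1: CS1 alone paired with the US -> V(CS1) approaches lambda (acquisition).
phase1 = [(np.array([1, 0]), 1)] * 100
# Phase 2: CS1+CS2 compound paired with the US -> CS2 learns ~nothing (blocking).
phase2 = [(np.array([1, 1]), 1)] * 100

V = rescorla_wagner(phase1 + phase2, n_stimuli=2)
print(V)  # V[0] ~ 1.0, V[1] ~ 0.0: CS1 blocks learning about CS2
```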
What Rescorla-Wagner doesn’t explain (requires TD):
- Temporal credit assignment: the CS and US occur at different times within a trial
- Prediction of predictions: higher-order conditioning
- Timing: the animal learns the delay between CS and US
Temporal Credit Assignment: TD vs Rescorla-Wagner
Second-order conditioning: CS2 → CS1 → US. Animal learns CS2 → reward even without direct CS2-US pairing.
Rescorla-Wagner fails: CS2-US association never directly experienced.
TD explains it: CS1 comes to predict the US, so V(CS1) rises toward the reward value. When CS2 precedes CS1, the onset of CS1 (now a high-value state) generates a positive TD error, which updates V(CS2).
This is the bootstrap nature of TD: predictions of predictions are chained backward through time.
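A minimal tabular TD(0) sketch of this chaining (my own toy; the two-state layout, α, and γ are arbitrary):

```python
import numpy as np

def td0_episode(V, alpha=0.1, gamma=0.9):
    # Step CS2 -> CS1: no reward yet, but V(CS1) bootstraps into V(CS2).
    V[0] += alpha * (0.0 + gamma * V[1] - V[0])
    # Step CS1 -> US: the reward arrives; the terminal state has value 0.
    V[1] += alpha * (1.0 + gamma * 0.0 - V[1])
    return V

V = np.zeros(2)  # V[0] = V(CS2), V[1] = V(CS1)
for _ in range(200):
    td0_episode(V)
print(V)  # V[1] -> 1.0 first, then V[0] -> gamma * V[1]: a prediction of a prediction
```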
Operant Conditioning: Reinforcement Schedules
Operant conditioning: animal learns to take actions to receive rewards (Skinner, 1930s).
| Schedule | Description | RL analog |
|---|---|---|
| Fixed ratio (FR-n) | Reward after n responses | Dense reward every n steps |
| Variable ratio (VR-n) | Reward after ~n responses (random) | Stochastic reward |
| Fixed interval (FI-T) | Reward for the first response after T seconds | Time-based reward |
| Variable interval (VI-T) | Reward for the first response after ~T seconds (random) | Random time reward |
VR schedules produce the highest and steadiest response rates (the principle behind slot machines). In RL terms: under a variable ratio, every response has the same probability of reward, so expected reward per unit time scales with response rate and there is no post-reinforcement pause.
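A tiny sketch (my own illustration; n and the response count are arbitrary) contrasting FR and VR delivery for the same response stream:

```python
import random

def fixed_ratio(n, responses):
    # FR-n: reward on exactly every n-th response.
    return [1 if (i + 1) % n == 0 else 0 for i in range(responses)]

def variable_ratio(n, responses):
    # VR-n: each response pays off with probability 1/n (~n responses per reward).
    return [1 if random.random() < 1 / n else 0 for _ in range(responses)]

random.seed(0)
print(sum(fixed_ratio(5, 1000)))     # exactly 200 rewards, predictably spaced
print(sum(variable_ratio(5, 1000)))  # ~200 rewards, unpredictably spaced
```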
TD and the Actor-Critic: Dual-Process Theory
Animal learning has two systems:
1. Habitual system: stimulus → action (fast, automatic, model-free)
2. Goal-directed system: considers the consequences of actions (slow, deliberate, model-based)
RL analogy:
- Model-free RL (Q-learning, SARSA) = habitual system: stimulus → learned response
- Model-based RL (Dyna, planning) = goal-directed system: explicit consequence evaluation
Devaluation experiments distinguish the two:
- After learning, devalue the reward (e.g., make the food aversive).
- A goal-directed agent immediately stops seeking food.
- A habitual agent continues the old behavior (no model → can’t infer the effect of devaluation).
This directly motivates the Dyna architecture: combining model-free (fast) and model-based (flexible) components.
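A minimal toy (my own, not the book’s; the lever-press framing and values are illustrative) showing why devaluation separates the two systems:

```python
alpha = 0.5
q_cached = 0.0             # model-free (habitual): cached value of "press lever"
model = {"press": "food"}  # model-based (goal-directed): learned outcome model
utility = {"food": 1.0}    # current utility of outcomes

# Training: repeated rewarded lever presses update the cached value.
for _ in range(20):
    q_cached += alpha * (utility[model["press"]] - q_cached)

# Devaluation: the outcome's utility changes with no new lever-press experience.
utility["food"] = -1.0

q_habitual = q_cached                      # stale: still ~ +1.0, keeps pressing
q_goal_directed = utility[model["press"]]  # replanned with the model: -1.0, stops
print(q_habitual, q_goal_directed)
```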
Attention and Latent Inhibition
Latent inhibition: pre-exposing CS without US (before conditioning) slows subsequent learning.
Explanation: the attention (associability) α(CS) falls when the CS is repeatedly followed by no consequence. A Pearce-Hall-style update (one common formulation; η is a smoothing rate):
α_t(CS) = η |δ_t| + (1 - η) α_{t-1}(CS)
High |δ| (surprise) → high attention → fast learning. Low |δ| (predictable) → low attention → slow learning.
RL connection: this is an adaptive learning-rate mechanism:
- α(s) is proportional to how surprising state s is
- Implements a form of “active attention”: focus on informative experiences
Similar to intrinsic motivation / curiosity-driven exploration in modern RL (count-based exploration bonuses, ICM).
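A minimal sketch of the surprise-modulated learning rate above (my own formulation; η, the initial attention, and the trial count are arbitrary), reproducing latent inhibition:

```python
def attention_update(attn, delta, eta=0.3):
    # Attention drifts toward the latest surprise magnitude: high |delta|
    # raises attention; predictable outcomes (|delta| ~ 0) lower it.
    return eta * abs(delta) + (1 - eta) * attn

# Latent inhibition: pre-exposure (CS alone, nothing to predict) keeps delta
# at 0, so attention decays; later CS-US pairings start with a tiny alpha.
attn = 0.5
for _ in range(50):
    attn = attention_update(attn, delta=0.0)
print(attn)  # ~0: subsequent conditioning will be slow
```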
Learned Helplessness
Experimental finding: an animal exposed to unavoidable shocks later fails to escape even when escape becomes possible (“gives up”).
RL explanation: Q(escape) is estimated as very low because past escape attempts never worked. Crucially, even off-policy methods only correct Q(escape) when the behavior policy actually tries escaping again; once the policy stops selecting escape, Q(escape) stays low → never tried → never updated.
Computational model: this is an exploration failure — the agent gets stuck in a local optimum because it never explores enough.
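A toy sketch of this failure mode (my own; the reward values, step counts, and the -1.5 “shock plus wasted effort” penalty are illustrative assumptions):

```python
import random

def run(epsilon, alpha=0.2, seed=0):
    random.seed(seed)
    Q = [0.0, 0.0]  # Q[0] = stay, Q[1] = try to escape
    for t in range(4000):
        escapable = t >= 1000  # shocks become avoidable after pre-training
        greedy = 0 if Q[0] >= Q[1] else 1
        a = random.randrange(2) if random.random() < epsilon else greedy
        if a == 1:
            r = 0.0 if escapable else -1.5  # failed escape: shock + wasted effort
        else:
            r = -1.0                        # passive shock
        Q[a] += alpha * (r - Q[a])
    return Q

print(run(epsilon=0.0))  # Q[1] stays pessimistic: the agent "gives up"
print(run(epsilon=0.1))  # exploration re-tries escape; Q[1] recovers toward 0
```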
Implications for DynamICCL: if exploration over NCCL configurations is too limited, the agent may conclude that “no configuration matters” (poor exploration → all configs look equally bad) and never find the true optimum.
Chapter 14 Summary
- TD learning = formal model of classical conditioning (Rescorla-Wagner as special case)
- TD error = dopamine prediction error signal (next chapter)
- Actor-critic = dual-process theory (habitual vs goal-directed)
- Attention mechanisms = adaptive learning rate (latent inhibition, surprise-based)
- Exploration failure = learned helplessness in psychology
- RL provides mechanistic explanations for phenomena that behavioral psychology only described
Connection to DynamICCL
- TD error in DynamICCL ↔︎ dopamine-like “surprise signal” about NCCL throughput
- Model-free PPO ↔︎ habitual system (fast, automatic config selection)
- Model-based extension (Dyna-PPO) ↔︎ goal-directed system (simulated planning)
- Adaptive exploration ↔︎ UCB/entropy bonus to avoid “learned helplessness” in NCCL space