Neuroscience and RL: Neural Correlates
Chapter 15 — Neuroscience
Book: Reinforcement Learning: An Introduction (Sutton & Barto, 2nd ed) Pages: 369–394
Overview
Chapter 15 connects RL to neuroscience — the neural mechanisms underlying reward learning, prediction, and decision-making. The correspondence between RL algorithms and neural circuits, especially between TD errors and dopamine signaling, is remarkably close.
Dopamine as TD Error
Key finding (Montague, Dayan & Sejnowski, 1996; Schultz et al., 1997):
Dopaminergic neurons in the ventral tegmental area (VTA) and substantia nigra encode the temporal difference error δ_t:
Dopamine response:
Before conditioning: fires when US (reward) occurs
After conditioning: fires when CS occurs (predicts reward)
After conditioning: no response when reward occurs (V(CS) ≈ reward)
After conditioning: firing pauses (drops below baseline) if the CS occurs but the reward is omitted
→ This is exactly: δ_t = R_{t+1} + γV(S_{t+1}) - V(S_t)
Three key observations:
1. Positive δ: dopamine burst (better than expected)
2. δ ≈ 0: no change from baseline firing (exactly as expected)
3. Negative δ: dopamine dip/suppression (worse than expected)
This is not a reward signal — it’s a prediction error signal. The brain implements TD learning.
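A minimal TD(0) sketch of this pattern (not from the book; trial timing, learning rate, and trial counts are illustrative assumptions):

```python
# Minimal TD(0) sketch of the Schultz et al. pattern: before learning, the TD error
# peaks when the reward (US) arrives; after conditioning it moves to CS onset,
# vanishes at the now-predicted reward, and dips below zero when the reward is omitted.
import numpy as np

T, cs_time, reward_time = 10, 2, 6   # within-trial timesteps; CS at t=2, reward at t=6
alpha, gamma = 0.1, 1.0
V = np.zeros(T + 1)                  # value of each within-trial timestep

def run_trial(omit_reward=False):
    """Run one conditioning trial and return the TD error at every step."""
    deltas = np.zeros(T)
    for t in range(T):
        r = 1.0 if (t + 1 == reward_time and not omit_reward) else 0.0
        delta = r + gamma * V[t + 1] - V[t]
        if t >= cs_time:             # pre-CS states keep V = 0 (CS onset is unpredictable)
            V[t] += alpha * delta
        deltas[t] = delta
    return deltas

first = run_trial()
for _ in range(500):                 # conditioning trials
    run_trial()
trained = run_trial()
omitted = run_trial(omit_reward=True)

print("first trial:    ", np.round(first, 2))    # burst at the step where the reward arrives
print("after training: ", np.round(trained, 2))  # burst moved to CS onset, ~0 at reward time
print("reward omitted: ", np.round(omitted, 2))  # dip below zero at the expected reward step
```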
Reward Circuits
Nucleus Accumbens (NAcc): receives dopamine from the VTA; involved in reward processing and motivation. Corresponds to the value function V(s).
Prefrontal Cortex (PFC): goal-directed planning, working memory. Corresponds to model-based planning.
Basal Ganglia: involved in action selection, habit learning. Corresponds to actor (policy).
Amygdala: emotional valence, fear conditioning. Corresponds to reward signal R(s) and emotional salience.
The Actor-Critic in the Brain:
Critic: Nucleus Accumbens + VTA dopamine
→ computes V(s) and δ_t
Actor: Dorsal Striatum (Caudate/Putamen), part of the Basal Ganglia
→ stores and executes Q(s,a) → action selection
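A toy actor-critic sketch annotated with this mapping (the 2-state environment and all parameters are illustrative assumptions, not anything from the chapter):

```python
# Minimal actor-critic: the critic's value table and TD error stand in for NAcc + VTA
# dopamine; the actor's preferences stand in for dorsal-striatal action values.
import numpy as np

n_states, n_actions = 2, 2
V = np.zeros(n_states)                    # critic ("NAcc"): state values
H = np.zeros((n_states, n_actions))       # actor ("dorsal striatum"): action preferences
alpha_v, alpha_h, gamma = 0.1, 0.1, 0.9
rng = np.random.default_rng(0)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def step(s, a):
    """Toy dynamics: action 1 in state 0 is rewarded; the state alternates 0 -> 1 -> 0."""
    r = 1.0 if (s == 0 and a == 1) else 0.0
    return (s + 1) % n_states, r

s = 0
for _ in range(2000):
    pi = softmax(H[s])
    a = rng.choice(n_actions, p=pi)
    s_next, r = step(s, a)
    delta = r + gamma * V[s_next] - V[s]      # "dopamine": TD error broadcast to both structures
    V[s] += alpha_v * delta                   # critic update
    H[s, a] += alpha_h * delta * (1 - pi[a])  # actor update (policy-gradient style)
    s = s_next

print("policy in state 0:", softmax(H[0]))    # should strongly favor the rewarded action 1
```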
Reward Prediction and Temporal Discounting
Humans discount rewards hyperbolically (not exponentially):
Hyperbolic: V(reward at delay d) = R / (1 + kd)
Exponential (standard RL): V(reward at delay d) = R · γ^d
Hyperbolic discounting produces present bias: people prefer $100 now over $110 in a week, but prefer $110 in 52 weeks over $100 in 51 weeks — inconsistent time preferences.
RL models: standard RL uses exponential discounting (γ^d); hyperbolic is more psychologically accurate but harder to optimize (time-inconsistent preferences).
Quasi-hyperbolic discounting (Laibson): V = R for d = 0, and V = β·γ^d·R for d ≥ 1, with β < 1 a one-time penalty on any delay:
- Computationally tractable approximation to hyperbolic discounting
- More realistic for human behavior modeling
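A short sketch comparing the three discount functions on the $100/$110 example above (the values of k, γ, and β are illustrative assumptions):

```python
# Exponential discounting is time-consistent (the choice does not flip with delay);
# hyperbolic and quasi-hyperbolic discounting produce the present-bias reversal.
def exponential(R, d, gamma=0.97):           # per-day gamma
    return R * gamma ** d

def hyperbolic(R, d, k=0.1):                 # per-day k
    return R / (1 + k * d)

def quasi_hyperbolic(R, d, beta=0.7, gamma=0.99):
    return R if d == 0 else beta * R * gamma ** d

for name, f in [("exponential", exponential), ("hyperbolic", hyperbolic),
                ("quasi-hyperbolic", quasi_hyperbolic)]:
    now = f(100, 0) > f(110, 7)              # $100 now vs $110 in a week
    later = f(100, 7 * 51) > f(110, 7 * 52)  # same choice pushed 51 weeks into the future
    print(f"{name}: prefers $100 now: {now}; still prefers $100 at week 51: {later}")
```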
Prediction and Planning in Hippocampus
Place cells (O’Keefe & Dostrovsky, 1971): fire when animal is at specific location. The hippocampus contains a “cognitive map” of the environment.
Successor representations (Dayan, 1993):
M(s, s') = E[Σ_{t=0}^∞ γ^t 1[S_t = s'] | S_0 = s]
M(s, s') is the expected discounted number of future visits to s' starting from s.
TD learning of M: treat "visiting s'" as a reward → M(s, s') learned by TD!
Value function via SR: V^π(s) = Σ_{s'} M(s, s') r(s')
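A sketch of the SR on a small chain, computed in closed form as M = (I - γP)^{-1} and used to recover V = M r (the 4-state chain and γ are illustrative assumptions):

```python
# Successor representation of a deterministic 4-state chain 0 -> 1 -> 2 -> 3 -> 3,
# with reward only in state 3; M holds expected discounted future state occupancies.
import numpy as np

gamma = 0.9
P = np.array([[0, 1, 0, 0],     # policy-induced transition matrix P[s, s']
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [0, 0, 0, 1]], dtype=float)
r = np.array([0.0, 0.0, 0.0, 1.0])

M = np.linalg.inv(np.eye(4) - gamma * P)   # M = sum_t gamma^t P^t
V = M @ r                                  # value recovered from SR + one-step rewards

print(np.round(M, 2))
print("V from SR:", np.round(V, 3))        # e.g. V(0) = gamma^3 / (1 - gamma) here
```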
Relationship to hippocampal replay: animals replay place-cell sequences during sleep and rest (forward and backward replay). This mirrors experience replay in DQN: offline consolidation of value updates from stored experience.
Effort and Fatigue
Effort discounting: organisms discount rewards by the effort required to obtain them:
V(reward requiring effort e) = R · D(e)
where D(e) < 1 for e > 0. High effort → discounted value.
In RL: equivalent to adding a “fatigue cost” to the reward signal:
r_effective(s, a) = r(s, a) - c(effort(s, a))
This shapes the policy to prefer efficient actions — important for motor control and robot locomotion.
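A tiny sketch of effort discounting as reward shaping (the candidate actions and the linear effort cost c are illustrative assumptions):

```python
# Effort-shaped reward r_effective = r - c * effort: with equal raw reward,
# the lower-effort action wins.
def effective_reward(reward, effort, c=0.5):
    return reward - c * effort

actions = {"walk around the obstacle": (1.0, 0.2),   # (reward, effort)
           "climb over the obstacle":  (1.0, 1.5)}

best = max(actions, key=lambda a: effective_reward(*actions[a]))
print("preferred action:", best)                     # the low-effort route
```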
Intrinsic Motivation and Curiosity
Intrinsic reward: reward that doesn’t come from the environment’s extrinsic reward signal, but from the agent’s internal state.
Types:
1. Novelty: reward for visiting new states
2. Surprise: reward for observing unexpected events (high δ)
3. Competence/mastery: reward for improving prediction accuracy
4. Information gain: reward for reducing uncertainty
Computational models:
Novelty: r_intrinsic = 1 / √(count(s)) ← inverse square root of visit count
Curiosity (ICM): r_intrinsic = ||f̂(s') - f(s')||² ← error of a learned forward model predicting the next state's features
RND (Random Network Distillation): r_intrinsic = ||ê(s) - e(s)||² ← predictor network's error against a fixed, randomly initialized target network e
These augment the extrinsic reward r_t → r_t + β r_intrinsic_t.
Application: when extrinsic reward is sparse (e.g., exploring a maze with goal at the end), intrinsic motivation enables exploration.
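A sketch of the count-based novelty bonus and how it augments a sparse extrinsic reward (β and the state naming are illustrative assumptions):

```python
# Count-based novelty bonus r_int = 1 / sqrt(count(s)), added to the extrinsic
# reward as r -> r + beta * r_int; the bonus decays as a state becomes familiar.
from collections import defaultdict
from math import sqrt

counts = defaultdict(int)
beta = 0.1

def augmented_reward(state, extrinsic_reward):
    counts[state] += 1
    r_intrinsic = 1.0 / sqrt(counts[state])
    return extrinsic_reward + beta * r_intrinsic

print(augmented_reward("maze_cell_7", 0.0))  # first visit: full novelty bonus
print(augmented_reward("maze_cell_7", 0.0))  # repeated visit: smaller bonus
```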
Neuromodulators Beyond Dopamine
| Neuromodulator | RL Correspondence | Function |
|---|---|---|
| Dopamine | TD error δ_t | Reward prediction error |
| Serotonin | Discount factor γ or patience | Regulates temporal discounting |
| Norepinephrine | Exploration ε | Attention, uncertainty, arousal |
| Acetylcholine | Learning rate α | Promotes plasticity when novel |
The exploration-exploitation tradeoff is thought to be regulated by norepinephrine, loosely corresponding to ε in ε-greedy.
Chapter 15 Summary
- Dopamine = TD error: one of the strongest connections between RL and neuroscience
- Brain implements actor-critic: basal ganglia (actor) + dopaminergic VTA/NAcc (critic)
- Hippocampus = cognitive map: successor representation + offline replay
- Hyperbolic discounting: more psychologically accurate than exponential
- Intrinsic motivation: biologically plausible, computationally important for sparse rewards
- Neuromodulators: each maps to an RL hyperparameter (α, γ, ε)
Connection to DynamICCL
- DynamICCL’s TD error δ_t (throughput surprise) ↔︎ dopamine signal in reward circuits
- Exploration in NCCL config space ↔︎ norepinephrine-regulated arousal
- Adaptive learning rate (cosine annealing in PPO) ↔︎ acetylcholine-modulated plasticity
- Intrinsic curiosity for NCCL: reward for trying novel configurations (count-based UCB) → helps escape local optima
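A hypothetical sketch of the count-based UCB idea over a discrete config space (the config names, throughput numbers, and the select_config/update functions are assumptions for illustration, not DynamICCL's actual interface):

```python
# Count-based UCB over a small set of hypothetical configs: try each config once,
# then pick by mean reward plus an exploration bonus that shrinks with visit count.
from math import log, sqrt

configs = ["ring", "tree", "collnet"]            # hypothetical algorithm choices
counts = {c: 0 for c in configs}
mean_reward = {c: 0.0 for c in configs}
t = 0

def select_config(c_explore=1.0):
    for c in configs:                            # untried configs first
        if counts[c] == 0:
            return c
    return max(configs, key=lambda c: mean_reward[c] + c_explore * sqrt(log(t) / counts[c]))

def update(config, reward):
    global t
    t += 1
    counts[config] += 1
    mean_reward[config] += (reward - mean_reward[config]) / counts[config]

for throughput in [1.0, 1.2, 0.9, 1.3, 1.25]:    # hypothetical measured throughputs
    chosen = select_config()
    update(chosen, throughput)

print(counts, {k: round(v, 2) for k, v in mean_reward.items()})
```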