Neuroscience and RL: Neural Correlates
Chapter 15 — Neuroscience
Book: Reinforcement Learning: An Introduction (Sutton & Barto, 2nd ed) Pages: 369–394
Overview
Chapter 15 connects RL to neuroscience — the neural mechanisms underlying reward learning, prediction, and decision-making. The correspondence between RL algorithms and neural circuits, especially between TD errors and dopamine signaling, is remarkably close.
Dopamine as TD Error
Key finding (Montague, Dayan & Sejnowski, 1996; Schultz et al., 1997):
Dopaminergic neurons in the ventral tegmental area (VTA) and substantia nigra encode the temporal difference error δ_t:
Dopamine response:
Before conditioning: fires when US (reward) occurs
After conditioning: fires when CS occurs (predicts reward)
After conditioning: no response when reward occurs (V(CS) ≈ reward)
After conditioning: firing pauses (drops below baseline) if the CS occurs but the reward is omitted
→ This is exactly: δ_t = R_{t+1} + γV(S_{t+1}) - V(S_t)
Three key observations:
1. Positive δ: dopamine burst (better than expected)
2. δ ≈ 0: no change from baseline firing (exactly as expected)
3. Negative δ: dopamine dip/suppression (worse than expected)
This is not a reward signal — it’s a prediction error signal. The brain implements TD learning.
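A minimal TD(0) sketch of this pattern (not from the book; trial timing, learning rate, and trial counts are illustrative assumptions):

```python
# Minimal TD(0) sketch of the Schultz et al. pattern: before learning, the TD error
# peaks when the reward (US) arrives; after conditioning it moves to CS onset,
# vanishes at the now-predicted reward, and dips below zero when the reward is omitted.
import numpy as np

T, cs_time, reward_time = 10, 2, 6   # within-trial timesteps; CS at t=2, reward at t=6
alpha, gamma = 0.1, 1.0
V = np.zeros(T + 1)                  # value of each within-trial timestep

def run_trial(omit_reward=False):
    """Run one conditioning trial and return the TD error at every step."""
    deltas = np.zeros(T)
    for t in range(T):
        r = 1.0 if (t + 1 == reward_time and not omit_reward) else 0.0
        delta = r + gamma * V[t + 1] - V[t]
        if t >= cs_time:             # pre-CS states keep V = 0 (CS onset is unpredictable)
            V[t] += alpha * delta
        deltas[t] = delta
    return deltas

first = run_trial()
for _ in range(500):                 # conditioning trials
    run_trial()
trained = run_trial()
omitted = run_trial(omit_reward=True)

print("first trial:    ", np.round(first, 2))    # burst at the step where the reward arrives
print("after training: ", np.round(trained, 2))  # burst moved to CS onset, ~0 at reward time
print("reward omitted: ", np.round(omitted, 2))  # dip below zero at the expected reward step
```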
Reward Circuits
Nucleus Accumbens (NAcc): receives dopamine from the VTA; involved in reward processing and motivation. Corresponds to the value function V(s).
Prefrontal Cortex (PFC): goal-directed planning, working memory. Corresponds to model-based planning.
Basal Ganglia: involved in action selection, habit learning. Corresponds to actor (policy).
Amygdala: emotional valence, fear conditioning. Corresponds to reward signal R(s) and emotional salience.
The Actor-Critic in the Brain:
Critic: Nucleus Accumbens + VTA dopamine
→ computes V(s) and δ_t
Actor: Dorsal Striatum (Caudate/Putamen), part of the Basal Ganglia
→ stores and executes Q(s,a) → action selection
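A toy actor-critic sketch annotated with this mapping (the 2-state environment and all parameters are illustrative assumptions, not anything from the chapter):

```python
# Minimal actor-critic: the critic's value table and TD error stand in for NAcc + VTA
# dopamine; the actor's preferences stand in for dorsal-striatal action values.
import numpy as np

n_states, n_actions = 2, 2
V = np.zeros(n_states)                    # critic ("NAcc"): state values
H = np.zeros((n_states, n_actions))       # actor ("dorsal striatum"): action preferences
alpha_v, alpha_h, gamma = 0.1, 0.1, 0.9
rng = np.random.default_rng(0)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def step(s, a):
    """Toy dynamics: action 1 in state 0 is rewarded; the state alternates 0 -> 1 -> 0."""
    r = 1.0 if (s == 0 and a == 1) else 0.0
    return (s + 1) % n_states, r

s = 0
for _ in range(2000):
    pi = softmax(H[s])
    a = rng.choice(n_actions, p=pi)
    s_next, r = step(s, a)
    delta = r + gamma * V[s_next] - V[s]      # "dopamine": TD error broadcast to both structures
    V[s] += alpha_v * delta                   # critic update
    H[s, a] += alpha_h * delta * (1 - pi[a])  # actor update (policy-gradient style)
    s = s_next

print("policy in state 0:", softmax(H[0]))    # should strongly favor the rewarded action 1
```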
Reward Prediction and Temporal Discounting
Humans discount rewards hyperbolically (not exponentially):
Hyperbolic: V(reward at delay d) = R / (1 + kd)
Exponential (standard RL): V(reward at delay d) = R · γ^d
Hyperbolic discounting produces present bias: people prefer $100 now over $110 in a week, but prefer $110 in 52 weeks over $100 in 51 weeks — inconsistent time preferences.
RL models: standard RL uses exponential discounting (γ^d); hyperbolic is more psychologically accurate but harder to optimize (time-inconsistent preferences).
Quasi-hyperbolic discounting (Laibson): V = R for d = 0, and V = β·γ^d·R for d ≥ 1, with β < 1 a one-time penalty on any delay:
- Computationally tractable approximation to hyperbolic discounting
- More realistic for human behavior modeling
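A short sketch comparing the three discount functions on the $100/$110 example above (the values of k, γ, and β are illustrative assumptions):

```python
# Exponential discounting is time-consistent (the choice does not flip with delay);
# hyperbolic and quasi-hyperbolic discounting produce the present-bias reversal.
def exponential(R, d, gamma=0.97):           # per-day gamma
    return R * gamma ** d

def hyperbolic(R, d, k=0.1):                 # per-day k
    return R / (1 + k * d)

def quasi_hyperbolic(R, d, beta=0.7, gamma=0.99):
    return R if d == 0 else beta * R * gamma ** d

for name, f in [("exponential", exponential), ("hyperbolic", hyperbolic),
                ("quasi-hyperbolic", quasi_hyperbolic)]:
    now = f(100, 0) > f(110, 7)              # $100 now vs $110 in a week
    later = f(100, 7 * 51) > f(110, 7 * 52)  # same choice pushed 51 weeks into the future
    print(f"{name}: prefers $100 now: {now}; still prefers $100 at week 51: {later}")
```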
Prediction and Planning in Hippocampus
Place cells (O’Keefe & Dostrovsky, 1971): fire when animal is at specific location. The hippocampus contains a “cognitive map” of the environment.
Successor representations (Dayan, 1993):
M(s, s') = E[Σ_{t=0}^∞ γ^t 1[S_t = s'] | S_0 = s]
M(s, s') is the expected discounted number of future visits to s' starting from s.
TD learning of M: treat "visiting s'" as a reward → M(s, s') learned by TD!
Value function via SR: V^π(s) = Σ_{s'} M(s, s') r(s')
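A sketch of the SR on a small chain, computed in closed form as M = (I - γP)^{-1} and used to recover V = M r (the 4-state chain and γ are illustrative assumptions):

```python
# Successor representation of a deterministic 4-state chain 0 -> 1 -> 2 -> 3 -> 3,
# with reward only in state 3; M holds expected discounted future state occupancies.
import numpy as np

gamma = 0.9
P = np.array([[0, 1, 0, 0],     # policy-induced transition matrix P[s, s']
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [0, 0, 0, 1]], dtype=float)
r = np.array([0.0, 0.0, 0.0, 1.0])

M = np.linalg.inv(np.eye(4) - gamma * P)   # M = sum_t gamma^t P^t
V = M @ r                                  # value recovered from SR + one-step rewards

print(np.round(M, 2))
print("V from SR:", np.round(V, 3))        # e.g. V(0) = gamma^3 / (1 - gamma) here
```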
Relationship to hippocampal replay: animals replay place-cell sequences during sleep and rest (forward and backward replay). This mirrors experience replay in DQN: offline consolidation of value updates from stored experience.
Effort and Fatigue
Effort discounting: organisms discount rewards by the effort required to obtain them:
V(reward requiring effort e) = R · D(e)
where D(e) < 1 for e > 0. High effort → discounted value.
In RL: equivalent to adding a “fatigue cost” to the reward signal:
r_effective(s, a) = r(s, a) - c(effort(s, a))
This shapes the policy to prefer efficient actions — important for motor control and robot locomotion.
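A tiny sketch of effort discounting as reward shaping (the candidate actions and the linear effort cost c are illustrative assumptions):

```python
# Effort-shaped reward r_effective = r - c * effort: with equal raw reward,
# the lower-effort action wins.
def effective_reward(reward, effort, c=0.5):
    return reward - c * effort

actions = {"walk around the obstacle": (1.0, 0.2),   # (reward, effort)
           "climb over the obstacle":  (1.0, 1.5)}

best = max(actions, key=lambda a: effective_reward(*actions[a]))
print("preferred action:", best)                     # the low-effort route
```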
Intrinsic Motivation and Curiosity
Intrinsic reward: reward that doesn’t come from the environment’s extrinsic reward signal, but from the agent’s internal state.
Types:
1. Novelty: reward for visiting new states
2. Surprise: reward for observing unexpected events (high δ)
3. Competence/mastery: reward for improving prediction accuracy
4. Information gain: reward for reducing uncertainty
Computational models:
Novelty: r_intrinsic = 1 / √(count(s)) ← inverse square root of visit count
Curiosity (ICM): r_intrinsic = ||f̂(s') - f(s')||² ← error of a learned forward model predicting the next state's features
RND (Random Network Distillation): r_intrinsic = ||ê(s) - e(s)||² ← predictor network's error against a fixed, randomly initialized target network e
These augment the extrinsic reward r_t → r_t + β r_intrinsic_t.
Application: when extrinsic reward is sparse (e.g., exploring a maze with goal at the end), intrinsic motivation enables exploration.
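A sketch of the count-based novelty bonus and how it augments a sparse extrinsic reward (β and the state naming are illustrative assumptions):

```python
# Count-based novelty bonus r_int = 1 / sqrt(count(s)), added to the extrinsic
# reward as r -> r + beta * r_int; the bonus decays as a state becomes familiar.
from collections import defaultdict
from math import sqrt

counts = defaultdict(int)
beta = 0.1

def augmented_reward(state, extrinsic_reward):
    counts[state] += 1
    r_intrinsic = 1.0 / sqrt(counts[state])
    return extrinsic_reward + beta * r_intrinsic

print(augmented_reward("maze_cell_7", 0.0))  # first visit: full novelty bonus
print(augmented_reward("maze_cell_7", 0.0))  # repeated visit: smaller bonus
```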
Neuromodulators Beyond Dopamine
| Neuromodulator | RL Correspondence | Function |
|---|---|---|
| Dopamine | TD error δ_t | Reward prediction error |
| Serotonin | Discount factor γ or patience | Regulates temporal discounting |
| Norepinephrine | Exploration ε | Attention, uncertainty, arousal |
| Acetylcholine | Learning rate α | Promotes plasticity when novel |
The exploration-exploitation tradeoff is thought to be regulated by norepinephrine, loosely corresponding to ε in ε-greedy.
Chapter 15 Summary
- Dopamine = TD error: one of the strongest connections between RL and neuroscience
- Brain implements actor-critic: basal ganglia (actor) + dopaminergic VTA/NAcc (critic)
- Hippocampus = cognitive map: successor representation + offline replay
- Hyperbolic discounting: more psychologically accurate than exponential
- Intrinsic motivation: biologically plausible, computationally important for sparse rewards
- Neuromodulators: each maps to an RL hyperparameter (α, γ, ε)
Connection to DynamICCL
- DynamICCL’s TD error δ_t (throughput surprise) ↔︎ dopamine signal in reward circuits
- Exploration in NCCL config space ↔︎ norepinephrine-regulated arousal
- Adaptive learning rate (cosine annealing in PPO) ↔︎ acetylcholine-modulated plasticity
- Intrinsic curiosity for NCCL: reward for trying novel configurations (count-based UCB) → helps escape local optima
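A hypothetical sketch of the count-based UCB idea over a discrete config space (the config names, throughput numbers, and the select_config/update functions are assumptions for illustration, not DynamICCL's actual interface):

```python
# Count-based UCB over a small set of hypothetical configs: try each config once,
# then pick by mean reward plus an exploration bonus that shrinks with visit count.
from math import log, sqrt

configs = ["ring", "tree", "collnet"]            # hypothetical algorithm choices
counts = {c: 0 for c in configs}
mean_reward = {c: 0.0 for c in configs}
t = 0

def select_config(c_explore=1.0):
    for c in configs:                            # untried configs first
        if counts[c] == 0:
            return c
    return max(configs, key=lambda c: mean_reward[c] + c_explore * sqrt(log(t) / counts[c]))

def update(config, reward):
    global t
    t += 1
    counts[config] += 1
    mean_reward[config] += (reward - mean_reward[config]) / counts[config]

for throughput in [1.0, 1.2, 0.9, 1.3, 1.25]:    # hypothetical measured throughputs
    chosen = select_config()
    update(chosen, throughput)

print(counts, {k: round(v, 2) for k, v in mean_reward.items()})
```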