What is Reinforcement Learning?
Chapter 1 — Introduction. Book: Reinforcement Learning: An Introduction (Sutton & Barto, 2nd ed.), pp. 1–24
The Core Idea
Reinforcement learning is learning what to do — how to map situations to actions — so as to maximize a numerical reward signal.
Three distinguishing features:
- No supervisor: the learner is not told which actions to take; it must discover which actions yield the most reward by trying them
- Delayed reward: actions may affect not just the immediate reward but also subsequent rewards (credit assignment problem)
- Sequential decisions: the agent operates over time, in a closed feedback loop with the environment
The Agent-Environment Interface
                  ┌─────────────────────┐
                  │                     │
      ┌──────────►│     Environment     │
      │           │                     │
      │           └─────────┬─────┬─────┘
      │  Action A_t         │     │
      │    State S_{t+1}    │     │  Reward R_{t+1}
      │           ┌─────────▼─────▼─────┐
      │           │        Agent        │
      └───────────┤      π: S → A       │
                  └─────────────────────┘
At each discrete time step t = 0, 1, 2, 3, …:
- Agent observes state S_t ∈ S
- Agent selects action A_t ∈ A(S_t) according to policy π
- Environment transitions to next state S_{t+1}
- Environment emits reward R_{t+1} ∈ R ⊂ ℝ
The interaction generates a trajectory:
S_0, A_0, R_1, S_1, A_1, R_2, S_2, …
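A minimal sketch of this loop in Python, assuming a hypothetical Gym-style environment object with reset()/step() methods and an agent with a select_action() method (these names are illustrative, not from the book):

```python
# Minimal agent-environment interaction loop (illustrative sketch).
# `env` and `agent` are hypothetical: env.reset() returns the initial
# state, env.step(action) returns (next_state, reward, done).
def run_episode(env, agent):
    trajectory = []              # collects S_0, A_0, R_1, S_1, A_1, R_2, ...
    state = env.reset()          # observe S_0
    done = False
    while not done:
        action = agent.select_action(state)          # A_t ~ π(·|S_t)
        next_state, reward, done = env.step(action)  # environment emits S_{t+1}, R_{t+1}
        trajectory.append((state, action, reward))
        state = next_state
    return trajectory
```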
The Four Main Subelements
1. Policy (π)
The agent’s behavior function: π(a|s) = P(A_t=a | S_t=s) for stochastic policies.
Deterministic special case: π(s) = a.
The policy is the core of what the agent learns. All other subelements serve to find a good policy.
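As a toy illustration (the states and actions below are made up, not from the book), a tabular policy can be stored as a mapping from states to action probabilities, with the deterministic case as a special read-out:

```python
import random

# Toy tabular policy: state -> {action: probability}.
policy = {
    "s0": {"left": 0.7, "right": 0.3},
    "s1": {"left": 0.1, "right": 0.9},
}

def sample_action(state):
    """Stochastic case: draw A_t ~ π(·|S_t)."""
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

def greedy_action(state):
    """One deterministic special case: always pick the most probable action."""
    return max(policy[state], key=policy[state].get)
```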
2. Reward Signal
The environment sends the agent a single scalar reward at each time step (R_{t+1} in the trajectory above).
Reward hypothesis (S&B): All goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called reward).
Key principle: the reward defines what to achieve, not how to achieve it. The agent must discover the how through learning.
3. Value Function
V^π(s) = expected cumulative reward starting from state s and following policy π.
- Reward = immediate desirability of a state
- Value = long-term desirability of a state
Critical distinction: a state may give low immediate reward but high value (stepping stone to good states), or high immediate reward but low value (leads to bad outcomes).
Actions are chosen based on value judgments, not immediate rewards.
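A toy sketch of this distinction on a made-up three-state chain (not from the book): the state with zero immediate reward has the higher value because it leads to a large later reward.

```python
# Toy deterministic chain, invented for illustration.
# Each entry: state -> (immediate reward, next state).
transitions = {
    "low_now":  (0,  "jackpot"),    # no immediate reward, but leads somewhere good
    "high_now": (1,  "terminal"),   # small immediate reward, then nothing
    "jackpot":  (10, "terminal"),
}

def value(state):
    """V(s): total reward accumulated from s onward (no discounting here)."""
    total = 0
    while state != "terminal":
        reward, state = transitions[state]
        total += reward
    return total

# value("low_now") == 10 > value("high_now") == 1, even though
# "high_now" gives the larger immediate reward.
```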
4. Model (optional)
A model of the environment: something that mimics the environment’s behavior.
Model-based methods: use a model for planning before acting (e.g., Dynamic Programming, Dyna).
Model-free methods: learn directly from experience without a model (e.g., Q-learning, SARSA, Monte Carlo).
Model-free methods are typically less sample-efficient but simpler, more flexible, and applicable when the environment's dynamics are unknown.
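For flavor only, here is the tabular Q-learning update as the canonical model-free rule: it uses a single observed transition and consults no model of the environment. Q-learning and the discount factor γ are developed later in the book (Ch. 6); this is a sketch, not Chapter 1 material.

```python
# Tabular Q-learning update (model-free): one observed transition
# (s, a, r, s_next) is enough; no model of the environment is used.
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    td_target = r + gamma * best_next
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (td_target - Q.get((s, a), 0.0))
```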
RL vs. Supervised and Unsupervised Learning
| Property | Supervised Learning | Unsupervised Learning | Reinforcement Learning |
|---|---|---|---|
| Training signal | Labeled examples | No labels | Scalar rewards |
| Feedback timing | Immediate | N/A | Delayed |
| Data generation | Fixed dataset | Fixed dataset | Online, agent-generated |
| Goal | Generalize from examples | Find structure | Maximize cumulative reward |
| Key challenge | Generalization | Finding latent structure | Exploration + credit assignment |
RL is online learning: the agent generates its own training data through interaction with the environment. Past actions influence future training examples.
Why RL is Hard: The Two Core Challenges
1. The Credit Assignment Problem
Which of the past actions caused the eventual reward?
Example: a chess program wins after 50 moves. Which moves were good? The reward at the end provides no direct signal about intermediate decisions.
Temporal credit assignment: attributing reward backward through time to the actions that caused it.
2. Exploration-Exploitation Tradeoff
- Exploit: choose the best action known so far → maximize immediate expected reward
- Explore: try unknown actions → might discover better long-term strategies
Both are necessary:
- Pure exploitation: gets stuck at local optima
- Pure exploration: never leverages accumulated knowledge
No universal solution — must balance based on problem structure.
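One simple and widely used way to strike this balance is ε-greedy action selection, which the book develops in the bandit setting (Ch. 2); a minimal sketch:

```python
import random

def epsilon_greedy(action_values, epsilon=0.1):
    """action_values: dict mapping each action to its current value estimate."""
    if random.random() < epsilon:
        return random.choice(list(action_values))     # explore: random action
    return max(action_values, key=action_values.get)  # exploit: best-known action
```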
Tic-Tac-Toe: A Worked Example
S&B’s motivating example demonstrates RL without any special domain knowledge:
Setup:
- State = board configuration (3 possible values per square, 9 squares → 3^9 = 19,683 configurations, many of them unreachable in play)
- Actions = where to place X (open squares)
- Reward = +1 for win, 0 for draw/loss
RL approach (value function learning):
Initialize V(s) = 0.5 for all non-terminal states; V(win) = 1; V(loss/draw) = 0
At each move: select greedy action (move to state with highest V) most of the time; occasionally explore randomly
After each greedy move: update value of previous state toward value of current state:
V(s) ← V(s) + α [V(s') − V(s)]   (a temporal-difference update; previews TD learning in Ch. 6)
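A minimal sketch of this learning step; the book gives no code, and the board representation and the helper that enumerates candidate successor states are assumed here:

```python
import random

V = {}           # state -> estimated probability of winning (defaults to 0.5)
ALPHA = 0.1      # step-size parameter α
EPSILON = 0.1    # fraction of exploratory (random) moves

def value(state):
    return V.get(state, 0.5)

def choose_state(candidate_states):
    """Mostly greedy: move to the successor state with the highest V; sometimes explore."""
    if random.random() < EPSILON:
        return random.choice(candidate_states)
    return max(candidate_states, key=value)

def td_update(prev_state, new_state):
    """V(s) ← V(s) + α [V(s') − V(s)], applied after each greedy move."""
    V[prev_state] = value(prev_state) + ALPHA * (value(new_state) - value(prev_state))
```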
Key observations:
- Agent learns from self-play — no supervisor says “this was a good move”
- V values reflect long-term win probability from each state
- After enough play, values converge to approximate probability of winning from each position
- Same algorithm generalizes to any 2-player game without domain modification
Contrast with minimax: minimax requires a complete model of the game and assumes an opponent who plays optimally, so it never exploits an imperfect opponent's weaknesses; the RL player learns from experience and adapts to the actual opponent it faces.
Summary
RL is defined by:
- An agent interacting with an environment in a closed loop
- The goal of maximizing cumulative scalar reward
- No direct instruction signal — only reward feedback
- The core challenges of credit assignment and exploration-exploitation
The field spans psychology (operant conditioning), economics (decision theory), control theory, and computer science.