What is Reinforcement Learning?
Chapter 1 — Introduction. Book: Reinforcement Learning: An Introduction (Sutton & Barto, 2nd ed.), pp. 1–24
The Core Idea
Reinforcement learning is learning what to do — how to map situations to actions — so as to maximize a numerical reward signal.
Three distinguishing features:
- No supervisor: the learner is not told which actions to take; it must discover which actions yield the most reward by trying them
- Delayed reward: actions may affect not just the immediate reward but also subsequent rewards (credit assignment problem)
- Sequential decisions: the agent operates over time, in a closed feedback loop with the environment
The Agent-Environment Interface
                  ┌─────────────────────┐
                  │                     │
      ┌──────────►│     Environment     │
      │           │                     │
      │           └─────────┬─────┬─────┘
      │  Action A_t         │     │
      │    State S_{t+1}    │     │  Reward R_{t+1}
      │           ┌─────────▼─────▼─────┐
      │           │        Agent        │
      └───────────┤      π: S → A       │
                  └─────────────────────┘
At each discrete time step t = 0, 1, 2, 3, …:
- Agent observes state S_t ∈ S
- Agent selects action A_t ∈ A(S_t) according to policy π
- Environment transitions to next state S_{t+1}
- Environment emits reward R_{t+1} ∈ R ⊂ ℝ
The interaction generates a trajectory:
S_0, A_0, R_1, S_1, A_1, R_2, S_2, …
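A minimal sketch of this loop in Python, assuming a hypothetical Gym-style environment object with reset()/step() methods and an agent with a select_action() method (these names are illustrative, not from the book):

```python
# Minimal agent-environment interaction loop (illustrative sketch).
# `env` and `agent` are hypothetical: env.reset() returns the initial
# state, env.step(action) returns (next_state, reward, done).
def run_episode(env, agent):
    trajectory = []              # collects S_0, A_0, R_1, S_1, A_1, R_2, ...
    state = env.reset()          # observe S_0
    done = False
    while not done:
        action = agent.select_action(state)          # A_t ~ π(·|S_t)
        next_state, reward, done = env.step(action)  # environment emits S_{t+1}, R_{t+1}
        trajectory.append((state, action, reward))
        state = next_state
    return trajectory
```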
The Four Main Subelements
1. Policy (π)
The agent’s behavior function: π(a|s) = P(A_t=a | S_t=s) for stochastic policies.
Deterministic special case: π(s) = a.
The policy is the core of what the agent learns. All other subelements serve to find a good policy.
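As a toy illustration (the states and actions below are made up, not from the book), a tabular policy can be stored as a mapping from states to action probabilities, with the deterministic case as a special read-out:

```python
import random

# Toy tabular policy: state -> {action: probability}.
policy = {
    "s0": {"left": 0.7, "right": 0.3},
    "s1": {"left": 0.1, "right": 0.9},
}

def sample_action(state):
    """Stochastic case: draw A_t ~ π(·|S_t)."""
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

def greedy_action(state):
    """One deterministic special case: always pick the most probable action."""
    return max(policy[state], key=policy[state].get)
```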
2. Reward Signal
The environment sends the agent a single scalar reward at each time step (R_{t+1} in the trajectory above).
Reward hypothesis (S&B): All goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called reward).
Key principle: the reward defines what to achieve, not how to achieve it. The agent must discover the how through learning.
3. Value Function
V^π(s) = expected cumulative reward starting from state s and following policy π.
- Reward = immediate desirability of a state
- Value = long-term desirability of a state
Critical distinction: a state may give low immediate reward but high value (stepping stone to good states), or high immediate reward but low value (leads to bad outcomes).
Actions are chosen based on value judgments, not immediate rewards.
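A toy sketch of this distinction on a made-up three-state chain (not from the book): the state with zero immediate reward has the higher value because it leads to a large later reward.

```python
# Toy deterministic chain, invented for illustration.
# Each entry: state -> (immediate reward, next state).
transitions = {
    "low_now":  (0,  "jackpot"),    # no immediate reward, but leads somewhere good
    "high_now": (1,  "terminal"),   # small immediate reward, then nothing
    "jackpot":  (10, "terminal"),
}

def value(state):
    """V(s): total reward accumulated from s onward (no discounting here)."""
    total = 0
    while state != "terminal":
        reward, state = transitions[state]
        total += reward
    return total

# value("low_now") == 10 > value("high_now") == 1, even though
# "high_now" gives the larger immediate reward.
```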
4. Model (optional)
A model of the environment: something that mimics the environment’s behavior.
Model-based methods: use a model for planning before acting (e.g., Dynamic Programming, Dyna).
Model-free methods: learn directly from experience without a model (e.g., Q-learning, SARSA, Monte Carlo).
Model-free methods are typically less sample-efficient but simpler, more flexible, and applicable when the environment's dynamics are unknown.
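For flavor only, here is the tabular Q-learning update as the canonical model-free rule: it uses a single observed transition and consults no model of the environment. Q-learning and the discount factor γ are developed later in the book (Ch. 6); this is a sketch, not Chapter 1 material.

```python
# Tabular Q-learning update (model-free): one observed transition
# (s, a, r, s_next) is enough; no model of the environment is used.
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    td_target = r + gamma * best_next
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (td_target - Q.get((s, a), 0.0))
```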
RL vs. Supervised and Unsupervised Learning
| Property | Supervised Learning | Unsupervised Learning | Reinforcement Learning |
|---|---|---|---|
| Training signal | Labeled examples | No labels | Scalar rewards |
| Feedback timing | Immediate | N/A | Delayed |
| Data generation | Fixed dataset | Fixed dataset | Online, agent-generated |
| Goal | Generalize from examples | Find structure | Maximize cumulative reward |
| Key challenge | Generalization | Finding latent structure | Exploration + credit assignment |
RL is online learning: the agent generates its own training data through interaction with the environment. Past actions influence future training examples.
Why RL is Hard: The Two Core Challenges
1. The Credit Assignment Problem
Which of the past actions caused the eventual reward?
Example: a chess program wins after 50 moves. Which moves were good? The reward at the end provides no direct signal about intermediate decisions.
Temporal credit assignment: attributing reward backward through time to the actions that caused it.
2. Exploration-Exploitation Tradeoff
- Exploit: choose the best action known so far → maximize immediate expected reward
- Explore: try unknown actions → might discover better long-term strategies
Both are necessary:
- Pure exploitation: gets stuck at local optima
- Pure exploration: never leverages accumulated knowledge
No universal solution — must balance based on problem structure.
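One simple and widely used way to strike this balance is ε-greedy action selection, which the book develops in the bandit setting (Ch. 2); a minimal sketch:

```python
import random

def epsilon_greedy(action_values, epsilon=0.1):
    """action_values: dict mapping each action to its current value estimate."""
    if random.random() < epsilon:
        return random.choice(list(action_values))     # explore: random action
    return max(action_values, key=action_values.get)  # exploit: best-known action
```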
Tic-Tac-Toe: A Worked Example
S&B’s motivating example demonstrates RL without any special domain knowledge:
Setup:
- State = board configuration (3 possible values per square, 9 squares → 3^9 = 19,683 configurations, many of them unreachable in play)
- Actions = where to place X (open squares)
- Reward = +1 for win, 0 for draw/loss
RL approach (value function learning):
Initialize V(s) = 0.5 for all non-terminal states; V(win) = 1; V(loss/draw) = 0
At each move: select greedy action (move to state with highest V) most of the time; occasionally explore randomly
After each greedy move: update value of previous state toward value of current state:
V(s) ← V(s) + α [V(s') − V(s)]   (a temporal-difference update; previews TD learning in Ch. 6)
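A minimal sketch of this learning step; the book gives no code, and the board representation and the helper that enumerates candidate successor states are assumed here:

```python
import random

V = {}           # state -> estimated probability of winning (defaults to 0.5)
ALPHA = 0.1      # step-size parameter α
EPSILON = 0.1    # fraction of exploratory (random) moves

def value(state):
    return V.get(state, 0.5)

def choose_state(candidate_states):
    """Mostly greedy: move to the successor state with the highest V; sometimes explore."""
    if random.random() < EPSILON:
        return random.choice(candidate_states)
    return max(candidate_states, key=value)

def td_update(prev_state, new_state):
    """V(s) ← V(s) + α [V(s') − V(s)], applied after each greedy move."""
    V[prev_state] = value(prev_state) + ALPHA * (value(new_state) - value(prev_state))
```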
Key observations:
- Agent learns from self-play — no supervisor says “this was a good move”
- V values reflect long-term win probability from each state
- After enough play, values converge to approximate probability of winning from each position
- Same algorithm generalizes to any 2-player game without domain modification
Contrast with minimax: minimax requires a complete model of the game and assumes an opponent who plays optimally, so it never exploits an imperfect opponent's weaknesses; the RL player learns from experience and adapts to the actual opponent it faces.
Summary
RL is defined by:
- An agent interacting with an environment in a closed loop
- The goal of maximizing cumulative scalar reward
- No direct instruction signal — only reward feedback
- The core challenges of credit assignment and exploration-exploitation
The field spans psychology (operant conditioning), economics (decision theory), control theory, and computer science.