What is Reinforcement Learning?

Chapter 1: Introduction
Book: Reinforcement Learning: An Introduction (Sutton & Barto, 2nd ed.)
Pages: 1–24


The Core Idea

Reinforcement learning is learning what to do — how to map situations to actions — so as to maximize a numerical reward signal.

Three distinguishing features:

  1. No supervisor: the learner is not told which actions to take; must discover which actions yield most reward by trying them
  2. Delayed reward: actions may affect not just the immediate reward but also subsequent rewards (credit assignment problem)
  3. Sequential decisions: the agent operates over time, in a closed feedback loop with the environment

The Agent-Environment Interface

                         ┌────────────────────┐
          Action A_t     │                    │
    ┌────────────────────►   Environment      │
    │                    │                    │
    │                    └──────┬─────┬───────┘
    │                           │     │
    │             State S_{t+1} │     │ Reward R_{t+1}
    │                           │     │                
    │                           │     │                
    │                           │     │
    │                           │     │
  ┌─┴─────────────────────┐     │     │
  │                       │ ◄───┘     │
  │        Agent          │           │
  │    π: S → A           │ ◄─────────┘
  └───────────────────────┘

At each discrete time step t = 0, 1, 2, 3, …, the agent observes the current state S_t, selects an action A_t, and one step later receives a numerical reward R_{t+1} and finds itself in a new state S_{t+1}.

The interaction generates a trajectory:

S_0, A_0, R_1, S_1, A_1, R_2, S_2, …
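
A minimal sketch of this loop in Python. The environment interface (reset() and step() returning next state, reward, and a done flag) is an assumption chosen to mirror common RL libraries; it is not defined in the chapter:

    # Minimal agent-environment interaction loop (hypothetical interface).
    def run_episode(env, policy):
        """Generate one trajectory S_0, A_0, R_1, S_1, A_1, R_2, ..."""
        trajectory = []
        state = env.reset()                              # S_0
        done = False
        while not done:
            action = policy(state)                       # A_t chosen from pi(.|S_t)
            next_state, reward, done = env.step(action)  # environment returns S_{t+1}, R_{t+1}
            trajectory.append((state, action, reward))   # record (S_t, A_t, R_{t+1})
            state = next_state
        return trajectory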


The Four Main Subelements

1. Policy (π)

The agent’s behavior function: π(a|s) = P(A_t=a | S_t=s) for stochastic policies.

Deterministic special case: π(s) = a.

The policy is the core of what the agent learns. All other subelements serve to find a good policy.
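
For concreteness, a tabular stochastic policy can be written down explicitly and sampled from; the states and actions below are illustrative placeholders, not from the book:

    import random

    # pi(a|s) stored as a table of action probabilities per state (illustrative).
    policy = {
        "s1": {"left": 0.8, "right": 0.2},   # pi(left|s1) = 0.8, pi(right|s1) = 0.2
        "s2": {"left": 0.1, "right": 0.9},
    }

    def sample_action(policy, state):
        """Draw A_t from pi(.|S_t = state)."""
        actions, probs = zip(*policy[state].items())
        return random.choices(actions, weights=probs, k=1)[0]

    # Deterministic special case pi(s) = a: all probability mass on one action.
    deterministic = {"s1": {"left": 1.0}, "s2": {"right": 1.0}}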

2. Reward Signal

The environment sends a single scalar reward R_t at each time step.

Reward hypothesis (S&B): All goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called reward).

Key principle: the reward defines what to achieve, not how to achieve it. The agent must discover the how through learning.

3. Value Function

V^π(s) = expected cumulative reward starting from state s and following policy π.
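
Written out in the chapter's notation (undiscounted episodic form, consistent with the tic-tac-toe example below; later chapters generalize this with a discount factor):

    V^π(s) = E_π[ R_{t+1} + R_{t+2} + … + R_T | S_t = s ]

where T is the final time step of the episode.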

Critical distinction: a state may give low immediate reward but high value (stepping stone to good states), or high immediate reward but low value (leads to bad outcomes).

Actions are chosen based on value judgments, not immediate rewards.

4. Model (optional)

A model of the environment: something that mimics the environment’s behavior.

Model-based methods: use a model for planning before acting (e.g., Dynamic Programming, Dyna).

Model-free methods: learn directly from experience without a model (e.g., Q-learning, SARSA, Monte Carlo).

Model-free methods are typically less sample-efficient, but they are more flexible and remain applicable when the environment's dynamics are unknown or hard to model.
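
To make the distinction concrete, the sketch below contrasts a single model-free update (the Q-learning rule, covered in Ch. 6) with a single model-based backup that uses a known transition model (as in Dynamic Programming, Ch. 4). The table layouts, the model format, and the discount factor gamma are assumptions for illustration:

    # Model-free: update Q(s, a) from one sampled transition (s, a, r, s'),
    # with no model of the environment (Q-learning, Ch. 6).
    def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=1.0):
        best_next = max(Q[(s_next, a2)] for a2 in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

    # Model-based: back up an expected value using a known model, where
    # model[(s, a)] lists (next_state, reward, probability) triples.
    def planning_backup(Q, s, a, model, actions, gamma=1.0):
        Q[(s, a)] = sum(
            p * (r + gamma * max(Q[(s2, a2)] for a2 in actions))
            for (s2, r, p) in model[(s, a)]
        )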


RL vs. Supervised and Unsupervised Learning

Property           Supervised Learning         Unsupervised Learning       Reinforcement Learning
Training signal    Labeled examples            No labels                   Scalar rewards
Feedback timing    Immediate                   N/A                         Delayed
Data generation    Fixed dataset               Fixed dataset               Online, agent-generated
Goal               Generalize from examples    Find structure              Maximize cumulative reward
Key challenge      Generalization              Finding latent structure    Exploration + credit assignment

RL is online learning: the agent generates its own training data through interaction with the environment. Past actions influence future training examples.


Why RL is Hard: The Two Core Challenges

1. The Credit Assignment Problem

Which of the past actions caused the eventual reward?

Example: a chess program wins after 50 moves. Which moves were good? The reward at the end provides no direct signal about intermediate decisions.

Temporal credit assignment: attributing reward backward through time to the actions that caused it.

2. Exploration-Exploitation Tradeoff

Both are necessary:

  1. Exploitation: prefer the actions that have yielded the most reward so far
  2. Exploration: try other actions to discover whether they yield even more reward

There is no universal solution; the right balance depends on the structure of the problem.
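
A common heuristic (developed further in Ch. 2) is epsilon-greedy action selection: exploit the current estimates most of the time, explore at random with small probability epsilon. A minimal sketch, assuming a table Q of action-value estimates:

    import random

    def epsilon_greedy(Q, state, actions, epsilon=0.1):
        """Explore with probability epsilon, otherwise exploit current estimates."""
        if random.random() < epsilon:
            return random.choice(actions)                 # explore
        return max(actions, key=lambda a: Q[(state, a)])  # exploit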


Tic-Tac-Toe: A Worked Example

S&B’s motivating example demonstrates RL without any special domain knowledge:

Setup:

  1. Play many games of tic-tac-toe against an imperfect opponent whose play sometimes lets us win
  2. Winning is the only goal: a win counts as success, while draws and losses count equally as failures
  3. States are board positions; the only prior knowledge is the set of legal moves, not a model of the opponent

RL approach (value function learning):

  1. Initialize V(s) = 0.5 for all non-terminal states; V(win) = 1; V(loss/draw) = 0

  2. At each move: select greedy action (move to state with highest V) most of the time; occasionally explore randomly

  3. After each greedy move: update value of previous state toward value of current state:

    V(s) ← V(s) + α[V(s') - V(s)]

    (TD update — previews TD learning in Ch.6)
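
A compact sketch of steps 2 and 3 in Python. The board-state representation, the reachable_states list, and the 0.5 default lookup are hypothetical scaffolding; only the update rule itself is the one given above:

    import random

    def move_and_learn(V, state, reachable_states, alpha=0.1, epsilon=0.1):
        """Choose the next board state and, on greedy moves, apply the TD update.
        V maps states to estimated probability of winning (default 0.5);
        reachable_states are the positions reachable by one legal move."""
        if random.random() < epsilon:
            return random.choice(reachable_states)        # exploratory move: no update
        next_state = max(reachable_states, key=lambda s: V.get(s, 0.5))
        # TD update after a greedy move: V(s) <- V(s) + alpha * [V(s') - V(s)]
        V[state] = V.get(state, 0.5) + alpha * (V.get(next_state, 0.5) - V.get(state, 0.5))
        return next_state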

Key observations:

  1. Learning happens during play, from direct interaction with the (imperfect) opponent
  2. With exploratory moves and a properly decreasing step size α, the value estimates converge toward the true probabilities of winning from each state against that opponent
  3. The learned policy adapts to the actual opponent's weaknesses rather than assuming worst-case play

Contrast with minimax: minimax needs a complete model of the game and assumes an optimal (worst-case) opponent, so it never exploits an opponent's mistakes; the RL player needs no such model and, by learning from experience, adapts its play to the actual opponent it faces.


Summary

RL is defined by:

  1. An agent interacting with an environment in a closed loop
  2. The goal of maximizing cumulative scalar reward
  3. No direct instruction signal — only reward feedback
  4. The core challenges of credit assignment and exploration-exploitation

The field spans psychology (operant conditioning), economics (decision theory), control theory, and computer science.