Finite Markov Decision Processes — Formal Framework
Chapter 3 — Finite MDPs
Book: Reinforcement Learning: An Introduction (Sutton & Barto, 2nd ed) Pages: 47–60
Why MDPs?
Multi-armed bandits (Ch.2) have no state — the optimal action is the same every step. Real problems have state: the optimal action depends on the current situation, and actions affect future states.
The Markov Decision Process is the formal framework for RL. It captures:
1. The agent's situation at each time step (state S_t)
2. The agent's choices (action A_t)
3. The environment's response (next state S_{t+1}, reward R_{t+1})
4. The dependence structure among these quantities
Formal Definition of a Finite MDP
A finite MDP is a tuple (S, A, P, R, γ) where:
S = finite set of states
A = finite set of actions (or A(s) = actions available from state s)
P: S × A × S → [0, 1] transition probability function
R: S × A × S → ℝ reward function (or just S × A → ℝ)
γ ∈ [0, 1] discount factor
Joint dynamics function (S&B preferred notation):
p(s', r | s, a) = P(S_{t+1}=s', R_{t+1}=r | S_t=s, A_t=a)
This is a probability distribution over (next state, reward) pairs given current (state, action).
Normalization: Σ_{s',r} p(s', r | s, a) = 1 for all s ∈ S, a ∈ A(s)
Derived Quantities
From p(s', r | s, a) we can compute:
State transition probabilities: p(s' | s, a) = Σ_r p(s', r | s, a)
Expected rewards (state-action): r(s, a) = Σ_{s',r} r · p(s', r | s, a)
Expected rewards (triple): r(s, a, s') = Σ_r r · p(s', r | s, a) / p(s' | s, a)
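A minimal sketch of these derived quantities for a tabular MDP, assuming the dynamics are stored as a dict mapping (s, a) to a list of (s', r, prob) triples (this layout and the function names are illustrative, not from the book):

```python
# dynamics[(s, a)] = list of (s_next, reward, prob), with the probs summing to 1

def state_transition_prob(dynamics, s, a, s_next):
    """p(s' | s, a) = sum_r p(s', r | s, a)"""
    return sum(p for (sn, r, p) in dynamics[(s, a)] if sn == s_next)

def expected_reward(dynamics, s, a):
    """r(s, a) = sum_{s', r} r * p(s', r | s, a)"""
    return sum(r * p for (sn, r, p) in dynamics[(s, a)])

def expected_reward_triple(dynamics, s, a, s_next):
    """r(s, a, s') = sum_r r * p(s', r | s, a) / p(s' | s, a)
    (only defined when p(s' | s, a) > 0)."""
    denom = state_transition_prob(dynamics, s, a, s_next)
    return sum(r * p for (sn, r, p) in dynamics[(s, a)] if sn == s_next) / denom
```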
The Markov Property
Definition: The Markov property holds if:
P(S_{t+1}=s', R_{t+1}=r | S_t, A_t, R_t, S_{t-1}, A_{t-1}, ..., S_0, A_0)
= P(S_{t+1}=s', R_{t+1}=r | S_t, A_t)
The future is conditionally independent of the past given the present state.
Why this matters: enables Bellman equations. The value of a state V(s) depends only on what happens from s forward, not on how we arrived at s.
What counts as “state”?
S_t must be a sufficient statistic of the history — contain all information relevant to future outcomes. If S_t is not Markov, expand it:
| Situation | Non-Markov state | Markov state |
|---|---|---|
| Cart-pole balancing | position (x) | (x, ẋ) — position + velocity |
| Partially observable | raw sensor reading | history of observations (→ POMDP) |
| Trading with momentum | current price | (price, recent price history) |
| NCCL tuning | current bandwidth | (bandwidth, algorithm, buffer state, recent history) |
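When the raw observation is not Markov, a common fix (a sketch of the idea, not something from the book) is to stack a short window of recent observations into the state, as in the last two table rows:

```python
from collections import deque

class HistoryState:
    """Turns a non-Markov observation stream into an (approximately) Markov
    state by keeping the k most recent observations."""
    def __init__(self, k):
        self.window = deque(maxlen=k)

    def update(self, observation):
        self.window.append(observation)
        # The state handed to the agent is the whole recent window.
        return tuple(self.window)
```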
The Agent-Environment Boundary
What goes in the state vs. in the environment?
The boundary is not physical but informational:
- Everything the agent has complete control over → agent side
- Everything the agent doesn't have complete control over → environment side
Examples:
- A thermostat: setpoint (agent), temperature (environment)
- A chess engine: chosen move (agent), board position (environment)
- DynamICCL: NCCL parameters chosen (agent), resulting throughput and network state (environment)
The agent may not know the complete environment state (→ partial observability, POMDP). But in a fully observable MDP, S_t is everything the agent needs to observe.
Worked Example: Recycling Robot (S&B Example 3.3)
A mobile robot collects empty cans. Battery state, action space, and rewards:
States: S = {high, low} (battery charge)
Actions: A(high) = {search, wait}
A(low) = {search, wait, recharge}
Transitions and rewards:
p(high | high, search) = α, r = r_search
p(low | high, search) = 1-α, r = r_search
p(high | high, wait) = 1, r = r_wait
p(high | low, search) = 1-β, r = -3 (battery dies, gets rescued)
p(low | low, search) = β, r = r_search
p(low | low, wait) = 1, r = r_wait
p(high | low, recharge)= 1, r = 0
Parameters: α = 0.9, β = 0.1, r_search = 1, r_wait = 0.5, γ = 0.9
Transition diagram (compact form; each edge labeled action (prob, reward)):
  HIGH --search (α=0.9, r=1)-------> HIGH
  HIGH --search (1-α=0.1, r=1)-----> LOW
  HIGH --wait (1, r=0.5)-----------> HIGH
  LOW  --search (β=0.1, r=1)-------> LOW
  LOW  --search (1-β=0.9, r=-3)----> HIGH   (battery depleted, robot rescued)
  LOW  --wait (1, r=0.5)-----------> LOW
  LOW  --recharge (1, r=0)---------> HIGH
This small example captures the core MDP concepts: stochastic transitions, the trade-off between collecting reward now (search) and managing risk (recharge), and discounting.
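A minimal tabular encoding of these dynamics, reusing the dict layout and the expected_reward helper from the earlier sketch (parameter values as above):

```python
ALPHA, BETA = 0.9, 0.1
R_SEARCH, R_WAIT = 1.0, 0.5

# dynamics[(s, a)] = list of (s_next, reward, prob)
dynamics = {
    ("high", "search"):  [("high", R_SEARCH, ALPHA), ("low", R_SEARCH, 1 - ALPHA)],
    ("high", "wait"):    [("high", R_WAIT, 1.0)],
    ("low", "search"):   [("low", R_SEARCH, BETA), ("high", -3.0, 1 - BETA)],
    ("low", "wait"):     [("low", R_WAIT, 1.0)],
    ("low", "recharge"): [("high", 0.0, 1.0)],
}

# Sanity check: p(., . | s, a) must normalize to 1 for every (s, a).
for (s, a), outcomes in dynamics.items():
    assert abs(sum(p for (_, _, p) in outcomes) - 1.0) < 1e-12

# Expected one-step reward, e.g. r(low, search) = 0.1*1 + 0.9*(-3) = -2.6
print(expected_reward(dynamics, "low", "search"))
```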
Episodic vs. Continuing MDPs
Episodic MDPs
An episode ends when the agent reaches the terminal state at time T; S⁺ denotes the state set including the terminal state. The agent then resets to a (possibly random) starting state.
Episode: S_0, A_0, R_1, S_1, ..., R_T, S_T
Return: G_t = R_{t+1} + R_{t+2} + ... + R_T (γ can be 1)
Examples: games (win/lose), navigation tasks (reach goal), trial-based experiments.
Continuing MDPs
No terminal state; interaction continues indefinitely.
Return: G_t = Σ_{k=0}^{∞} γ^k R_{t+k+1} (γ < 1 required)
Examples: process control, autonomous driving, DynamICCL (training jobs don’t “end”).
Unified Notation
Treat terminal state as absorbing with zero reward:
p(S_T | S_T, a) = 1 for all a
R(S_T, a) = 0
Then the discounted return formula G_t = Σ_{k=t+1}^{T} γ^{k-t-1} R_k covers both cases, allowing T = ∞ or γ = 1 (but not both).
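A small sketch of the return computation under this unified convention (γ = 1 only makes sense for a finite episode):

```python
def discounted_return(rewards, gamma):
    """G_t for a finite reward list [R_{t+1}, R_{t+2}, ...]; works for episodic
    tasks (gamma may be 1) and truncated continuing sequences (gamma < 1)."""
    g = 0.0
    for r in reversed(rewards):  # accumulate backwards: G_t = R_{t+1} + gamma * G_{t+1}
        g = r + gamma * g
    return g

# Example episode from the recycling robot with gamma = 0.9:
# G_0 = 1 + 0.9*1 + 0.81*(-3) = -0.53
print(discounted_return([1.0, 1.0, -3.0], 0.9))
```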
Policy Definitions
A policy π specifies how the agent selects actions given the current state:
Stochastic policy: π(a | s) = P(A_t = a | S_t = s), with π(a | s) ≥ 0 and Σ_a π(a | s) = 1 for all s
Deterministic policy: π(s) = a (special case)
Stationary policy: same action distribution regardless of time t (most policies in RL are stationary)
Non-stationary policy: π_t(a | s) — depends on time (useful theoretically but rarely needed in practice)
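A tabular stochastic policy can be stored as a dict of action distributions (a sketch; the action names follow the recycling robot above, the probabilities are arbitrary):

```python
import random

# policy[s][a] = probability of taking action a in state s
policy = {
    "high": {"search": 0.8, "wait": 0.2},
    "low":  {"search": 0.1, "wait": 0.3, "recharge": 0.6},
}

def sample_action(policy, state):
    """Draw A_t ~ pi(. | S_t = state)."""
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]
```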
Connection to DynamICCL
DynamICCL’s MDP formalization:
State S_t = (bandwidth utilization, message sizes, topology graph,
current NCCL algorithm, pending ops queue, GPU memory pressure)
Action A_t = (NCCL algorithm ∈ {ring, tree, double-binary-tree},
chunk size ∈ discrete set,
pipeline depth ∈ {1,...,8})
Reward R_{t+1} = (achieved throughput - baseline throughput) / baseline
OR job completion time improvement
γ = 0.99 (long-horizon: NCCL decisions have lasting impact on job time)
The Markov property holds approximately: given the full current system state, future throughput is approximately independent of the history of past configurations. In practice, GPU warm-up effects introduce non-Markovian dependencies, which are handled by including recent history in the state.
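A hypothetical sketch of how this state/action/reward tuple might look in code; every class and field name here is an illustrative assumption, not DynamICCL's actual interface:

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class CollectiveState:                    # hypothetical S_t
    bandwidth_util: float                 # fraction of peak link bandwidth in use
    msg_size_bucket: int                  # discretized message size
    nccl_algorithm: str                   # currently active algorithm
    queue_depth: int                      # pending collective ops
    recent_bandwidth: Tuple[float, ...]   # short history, to restore the Markov property

@dataclass(frozen=True)
class TuningAction:                       # hypothetical A_t
    nccl_algorithm: str                   # one of {"ring", "tree", "double-binary-tree"}
    chunk_size: int                       # from a discrete candidate set
    pipeline_depth: int                   # 1..8

def reward(achieved_throughput: float, baseline_throughput: float) -> float:
    """R_{t+1} = relative throughput improvement over the static baseline."""
    return (achieved_throughput - baseline_throughput) / baseline_throughput
```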