Finite Markov Decision Processes — Formal Framework

Chapter 3 — Finite MDPs

Book: Reinforcement Learning: An Introduction (Sutton & Barto, 2nd ed.), pages 47–60


Why MDPs?

Multi-armed bandits (Ch.2) have no state — the optimal action is the same every step. Real problems have state: the optimal action depends on the current situation, and actions affect future states.

The Markov Decision Process is the formal framework for RL. It captures:

1. The agent's situation at each time step (state S_t)
2. The agent's choices (action A_t)
3. The environment's response (next state S_{t+1}, reward R_{t+1})
4. The dependence structure among these quantities


Formal Definition of a Finite MDP

A finite MDP is a tuple (S, A, P, R, γ) where:

S = finite set of states
A = finite set of actions (or A(s) = actions available from state s)
P: S × A × S → [0, 1]    transition probability function
R: S × A × S → ℝ          reward function (or just S × A → ℝ)
γ ∈ [0, 1]                discount factor

Joint dynamics function (S&B preferred notation):

p(s', r | s, a) = P(S_{t+1}=s', R_{t+1}=r | S_t=s, A_t=a)

This is a probability distribution over (next state, reward) pairs given current (state, action).

Normalization: Σ_{s',r} p(s', r | s, a) = 1 for all s ∈ S, a ∈ A(s)
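
As a concrete reference, here is one way the tuple and its joint dynamics could be encoded in Python (a minimal sketch; the field names and dict layout are illustrative, not from the book):

from dataclasses import dataclass
from typing import Dict, Hashable, List, Tuple

State = Hashable
Action = Hashable

@dataclass
class FiniteMDP:
    states: List[State]                 # S: finite set of states
    actions: Dict[State, List[Action]]  # A(s): actions available in each state
    # p(s', r | s, a): maps (s, a) to a distribution over (next state, reward) pairs
    dynamics: Dict[Tuple[State, Action], Dict[Tuple[State, float], float]]
    gamma: float                        # discount factor in [0, 1]

def check_normalization(mdp: FiniteMDP) -> bool:
    """Verify Σ_{s',r} p(s', r | s, a) = 1 for every (s, a) pair."""
    return all(abs(sum(dist.values()) - 1.0) < 1e-9
               for dist in mdp.dynamics.values())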

Derived Quantities

From p(s', r | s, a) we can compute:

State transition probabilities:   p(s' | s, a)  = Σ_r p(s', r | s, a)
Expected rewards (state-action):  r(s, a)       = Σ_{s',r} r · p(s', r | s, a)
Expected rewards (triple):        r(s, a, s')   = Σ_r r · p(s', r | s, a) / p(s' | s, a)
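
A minimal sketch of these derived quantities, assuming the same {(s', r): probability} dict layout as in the sketch above:

def next_state_prob(dynamics, s, a, s_next):
    """p(s' | s, a) = Σ_r p(s', r | s, a)."""
    return sum(p for (sp, r), p in dynamics[(s, a)].items() if sp == s_next)

def expected_reward(dynamics, s, a):
    """r(s, a) = Σ_{s',r} r · p(s', r | s, a)."""
    return sum(r * p for (sp, r), p in dynamics[(s, a)].items())

def expected_reward_given_next(dynamics, s, a, s_next):
    """r(s, a, s') = Σ_r r · p(s', r | s, a) / p(s' | s, a)."""
    numer = sum(r * p for (sp, r), p in dynamics[(s, a)].items() if sp == s_next)
    return numer / next_state_prob(dynamics, s, a, s_next)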

The Markov Property

Definition: The Markov property holds if:

P(S_{t+1}=s', R_{t+1}=r | S_t, A_t, R_t, S_{t-1}, A_{t-1}, ..., S_0, A_0)
= P(S_{t+1}=s', R_{t+1}=r | S_t, A_t)

The future is conditionally independent of the past given the present state.

Why this matters: enables Bellman equations. The value of a state V(s) depends only on what happens from s forward, not on how we arrived at s.

What counts as “state”?

S_t must be a sufficient statistic of the history — contain all information relevant to future outcomes. If S_t is not Markov, expand it:

Situation                 | Non-Markov state    | Markov state
--------------------------|---------------------|------------------------------------------------------
Pole balancing (cart-pole)| position (x)        | (x, ẋ): position + velocity
Partially observable      | raw sensor reading  | history of observations (→ POMDP)
Trading with momentum     | current price       | (price, recent price history)
NCCL tuning               | current bandwidth   | (bandwidth, algorithm, buffer state, recent history)
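
A common fix when the raw observation is not Markov is to fold recent history into the state, e.g. by stacking the last k observations (a minimal sketch; the class name is hypothetical):

from collections import deque

class HistoryState:
    """Builds an (approximately) Markov state by keeping the last k observations."""
    def __init__(self, k=4):
        self.window = deque(maxlen=k)

    def update(self, observation):
        self.window.append(observation)
        return tuple(self.window)   # the augmented state S_t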

The Agent-Environment Boundary

What goes in the state vs. in the environment?

The boundary is not physical but informational:
- Everything the agent has complete control over → agent side
- Everything the agent doesn't have complete control over → environment side

Examples:
- A thermostat: setpoint (agent), temperature (environment)
- A chess engine: chosen move (agent), board position (environment)
- DynamICCL: NCCL parameters chosen (agent), resulting throughput and network state (environment)

The agent may not know the complete environment state (→ partial observability, POMDP). In a fully observable MDP, however, S_t contains everything the agent needs.


Worked Example: Recycling Robot (S&B Example 3.3)

A mobile robot collects empty cans. Battery state, action space, and rewards:

States: S = {high, low}   (battery charge)
Actions: A(high) = {search, wait}
         A(low)  = {search, wait, recharge}

Transitions and rewards:
  p(high | high, search) = α,  r = r_search
  p(low  | high, search) = 1-α, r = r_search
  p(high | high, wait)   = 1,  r = r_wait
  p(high | low, search)  = 1-β, r = -3  (battery dies, gets rescued)
  p(low  | low, search)  = β,  r = r_search
  p(low  | low, wait)    = 1,  r = r_wait
  p(high | low, recharge)= 1,  r = 0

Parameters: α = 0.9, β = 0.1, r_search = 1, r_wait = 0.5, γ = 0.9

Transition diagram (text form):

  Self-loops:
    HIGH ──search (α=0.9, r=1)──► HIGH        HIGH ──wait (r=0.5)──► HIGH
    LOW  ──search (β=0.1, r=1)──► LOW         LOW  ──wait (r=0.5)──► LOW

  Cross edges:
    HIGH ──search (1-α=0.1, r=1)───────────────────────► LOW
    LOW  ──search (1-β=0.9, r=-3, dies and is rescued)──► HIGH
    LOW  ──recharge (r=0)──────────────────────────────► HIGH
This small example captures all the core MDP concepts: stochastic transitions, the trade-off between collecting reward now (search) and managing risk (recharge), and discounting.
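
A minimal sketch (using the parameter values above and the same {(s', r): probability} dict layout as earlier) that encodes the robot's dynamics, checks normalization, and computes r(s, a):

ALPHA, BETA = 0.9, 0.1
R_SEARCH, R_WAIT = 1.0, 0.5

dynamics = {
    ('high', 'search'):   {('high', R_SEARCH): ALPHA, ('low', R_SEARCH): 1 - ALPHA},
    ('high', 'wait'):     {('high', R_WAIT): 1.0},
    ('low',  'search'):   {('low', R_SEARCH): BETA, ('high', -3.0): 1 - BETA},
    ('low',  'wait'):     {('low', R_WAIT): 1.0},
    ('low',  'recharge'): {('high', 0.0): 1.0},
}

# Each (s, a) entry must be a proper distribution over (s', r).
assert all(abs(sum(dist.values()) - 1.0) < 1e-9 for dist in dynamics.values())

# Expected one-step reward r(s, a) for each state-action pair.
for (s, a), dist in dynamics.items():
    r_sa = sum(r * p for (s_next, r), p in dist.items())
    print(f"r({s}, {a}) = {r_sa:+.2f}")

With these parameters r(low, search) = 0.1·1 + 0.9·(-3) = -2.6, so searching on a low battery is a poor bet compared with recharging.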


Episodic vs. Continuing MDPs

Episodic MDPs

An episode ends when the agent reaches a terminal state (S⁺ denotes S plus the terminal state); the agent then resets to a (possibly random) starting state.

Episode: S_0, A_0, R_1, S_1, ..., R_T, S_T
Return:  G_t = R_{t+1} + R_{t+2} + ... + R_T  (γ can be 1)

Examples: games (win/lose), navigation tasks (reach goal), trial-based experiments.

Continuing MDPs

No terminal state; interaction continues indefinitely.

Return: G_t = Σ_{k=0}^{∞} γ^k R_{t+k+1}    (γ < 1 required)

Examples: process control, autonomous driving, DynamICCL (training jobs don’t “end”).

Unified Notation

Treat terminal state as absorbing with zero reward:

p(S_T | S_T, a) = 1 for all a
R(S_T, a) = 0

Then the discounted return formula works for both.
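
A small sketch of the return computation, illustrating why the absorbing-terminal-state convention lets one formula cover both cases (the reward values are illustrative):

def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + γ·R_{t+2} + γ²·R_{t+3} + ...  (accumulated back-to-front)."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

episode = [1.0, 0.5, -3.0]                                  # R_1, R_2, R_3 up to termination
print(discounted_return(episode, gamma=0.9))
# Zero rewards from the absorbing terminal state change nothing:
print(discounted_return(episode + [0.0, 0.0], gamma=0.9))   # same value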


Policy Definitions

A policy π specifies how the agent selects actions given the current state:

Stochastic policy: π(a | s) = P(A_t = a | S_t = s)
  - π(a | s) ≥ 0 and Σ_a π(a | s) = 1 for all s

Deterministic policy: π(s) = a (special case)

Stationary policy: same action distribution regardless of time t (most policies in RL are stationary)

Non-stationary policy: π_t(a | s) — depends on time (useful theoretically but rarely needed in practice)
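
A minimal sketch of both policy forms, reusing the recycling robot's state and action names (the probabilities are arbitrary illustrative values):

import random

# Stochastic policy: π(a | s) stored as an action-probability dict per state.
policy = {
    'high': {'search': 0.8, 'wait': 0.2},
    'low':  {'search': 0.1, 'wait': 0.3, 'recharge': 0.6},
}

def sample_action(policy, s):
    """Draw A_t ~ π(· | S_t = s)."""
    actions = list(policy[s])
    weights = [policy[s][a] for a in actions]
    return random.choices(actions, weights=weights, k=1)[0]

# Deterministic policy as a special case: all probability mass on one action.
greedy = {'high': {'search': 1.0}, 'low': {'recharge': 1.0}}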


Connection to DynamICCL

DynamICCL’s MDP formalization:

State S_t  = (bandwidth utilization, message sizes, topology graph,
              current NCCL algorithm, pending ops queue, GPU memory pressure)

Action A_t = (NCCL algorithm ∈ {ring, tree, double-binary-tree},
              chunk size ∈ discrete set,
              pipeline depth ∈ {1,...,8})

Reward R_{t+1} = (achieved throughput - baseline throughput) / baseline
              OR job completion time improvement

γ = 0.99  (long-horizon: NCCL decisions have lasting impact on job time)

The Markov property holds approximately: given full current system state, future throughput is approximately independent of the history of past configurations. In practice, GPU warm-up effects create non-Markovian dependencies, handled by including recent history in the state.
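
A hypothetical sketch of how the action and reward above might look in code (the class and function names are illustrative assumptions, not DynamICCL's actual implementation):

from dataclasses import dataclass

@dataclass
class NCCLAction:
    algorithm: str       # one of {"ring", "tree", "double-binary-tree"}
    chunk_size: int      # element of a discrete set of candidate chunk sizes
    pipeline_depth: int  # 1..8

def reward(achieved_throughput: float, baseline_throughput: float) -> float:
    """R_{t+1} = (achieved - baseline) / baseline, per the formalization above."""
    return (achieved_throughput - baseline_throughput) / baseline_throughput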