POMDPs and Bandit Problems

Partially Observable MDPs (POMDPs)

In a POMDP, the agent cannot observe the state directly; instead it receives percepts (observations) generated stochastically from the underlying state.

Formal definition:
- States S, Actions A, Observations O
- Transition: P(s' | s, a)
- Observation model: P(o | s', a)
- Reward: R(s, a)

Key insight: the agent maintains a belief state b — a probability distribution over states.

b(s) = P(current state = s | history of actions and observations)

POMDP as a Belief-State MDP

The belief state b is sufficient for optimal decision-making (it’s a sufficient statistic for history).

Define a belief-state MDP:
- States: belief states b ∈ Δ(S) (the probability simplex over S)
- Actions: the same set A
- Transition: b' = UPDATE(b, a, o), where observation o occurs with probability P(o | b, a)
- Reward: R(b, a) = Σ_s b(s)·R(s, a)

Belief update (Bayesian filtering):

b'(s') = α · P(o | s', a) · Σ_s P(s' | s, a) · b(s)
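The belief update can be sketched in a few lines of NumPy. This is a minimal sketch; representing T and Z as dicts of matrices indexed by action is an assumption for illustration, not from the notes.

```python
import numpy as np

def belief_update(b, T, Z, a, o):
    """One step of Bayesian filtering: b'(s') ∝ P(o | s', a) · Σ_s P(s' | s, a) · b(s).

    Assumed representation: T[a] is an |S|×|S| transition matrix (rows: s, cols: s'),
    Z[a] is an |S|×|O| observation matrix (rows: s', cols: o).
    """
    predicted = T[a].T @ b                   # Σ_s P(s' | s, a) · b(s), for each s'
    unnormalized = Z[a][:, o] * predicted    # weight by likelihood P(o | s', a)
    return unnormalized / unnormalized.sum() # α is the normalizing constant
```

Note that α here is exactly the normalizer that makes b' sum to 1 over states.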

Optimal POMDP policy: maps belief state → action.


POMDP Solution: Value Function Over Beliefs

The optimal value function V*(b) over belief states is piecewise linear and convex (PWLC) for any finite horizon (and convex in general):

V*(b) = max_a [R(b,a) + γ · Σ_o P(o|b,a) · V*(b')]

Represent V* as a set of α-vectors (hyperplanes), one per conditional plan:

V*(b) = max_α (α · b)
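Evaluating a PWLC value function is just a max over dot products. The α-vectors below are hypothetical, chosen only to illustrate the upper-surface shape for a 2-state POMDP.

```python
import numpy as np

# Hypothetical α-vectors for a 2-state POMDP; each row is one hyperplane.
alphas = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [0.6, 0.6]])

def value(b):
    """V*(b) = max_α (α · b): the upper surface of the hyperplanes."""
    return float(np.max(alphas @ np.asarray(b)))
```

At the corners of the simplex the first two vectors dominate; near the uniform belief the third does, which is why the surface is convex.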

POMDP algorithms: PBVI (Point-Based Value Iteration), SARSOP, Perseus — sample a finite set of reachable belief points, compute α-vectors at each.

Exact solution is intractable (the finite-horizon problem is PSPACE-hard, and the number of α-vectors can grow exponentially with the horizon); approximate point-based methods are practical for moderate-size problems.


Bandit Problems

A multi-armed bandit is a simplified MDP with:
- k “arms” (actions), each with an unknown reward distribution
- No state transitions: just immediate reward signals
- Goal: maximize total reward over T trials

Exploration-exploitation tradeoff:
- Explore: try less-known arms → gather information
- Exploit: choose the best known arm → earn rewards now


Bandit Algorithms

ε-Greedy

With probability ε: choose random arm (explore)
With probability 1-ε: choose arm with highest Q̂(a) (exploit)

Simple to implement. A fixed ε incurs linear regret; with a decaying schedule (e.g., εₜ = 1/t), ε-greedy is greedy in the limit and converges to the optimal arm.
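The two-branch rule above is a few lines of Python. A minimal sketch; Q is the list of empirical mean estimates Q̂(a).

```python
import random

def eps_greedy(Q, eps, rng=random):
    """With probability eps pick a uniformly random arm; otherwise the greedy arm."""
    if rng.random() < eps:
        return rng.randrange(len(Q))        # explore
    return max(range(len(Q)), key=lambda a: Q[a])  # exploit
```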

UCB1 (Upper Confidence Bound)

a* = argmax_a [Q̂(a) + c · √(ln N / N(a))]

Where N = total trials, N(a) = trials of arm a, Q̂(a) = empirical mean reward.

UCB1 achieves O(log T) regret, matching the Lai–Robbins lower bound up to constant factors.

Regret: Σₜ [μ* − μ_{aₜ}], the cumulative gap between the mean of the optimal arm and the means of the arms actually chosen.
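A minimal simulation of UCB1 follows. The Gaussian reward model, the fixed seed, and c = 2 are illustrative assumptions; the index computed is exactly Q̂(a) + √(c·ln N / N(a)).

```python
import math
import random

def ucb1(means, T, c=2.0, seed=0):
    """Run UCB1 for T rounds on arms with (assumed) unit-variance Gaussian rewards."""
    rng = random.Random(seed)
    k = len(means)
    N = [0] * k    # pull counts per arm
    Q = [0.0] * k  # empirical mean reward per arm
    for t in range(T):
        if t < k:
            a = t  # initialization: pull each arm once
        else:
            a = max(range(k),
                    key=lambda i: Q[i] + math.sqrt(c * math.log(t) / N[i]))
        r = rng.gauss(means[a], 1.0)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]  # incremental mean update
    return N, Q
```

The confidence term shrinks as an arm is pulled, so under-sampled arms are periodically revisited while the empirically best arm dominates the pull counts.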

Thompson Sampling (Bayesian)

For each arm a: sample θ_a ~ P(θ_a | history)
Choose arm a* = argmax_a θ_a

For Bernoulli rewards with Beta(α,β) priors: Thompson sampling achieves optimal O(log T) regret.
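For the Bernoulli/Beta case this is a short loop: sample from each posterior, play the argmax, update the winner's counts. The arm success probabilities and the uniform Beta(1, 1) priors below are illustrative assumptions.

```python
import random

def thompson(probs, T, seed=0):
    """Thompson sampling for Bernoulli arms with Beta(1, 1) priors."""
    rng = random.Random(seed)
    k = len(probs)
    alpha = [1] * k   # successes + 1 (Beta posterior parameters)
    beta = [1] * k    # failures + 1
    pulls = [0] * k
    for _ in range(T):
        # Sample θ_a from each arm's posterior, then play the best sample.
        samples = [rng.betavariate(alpha[i], beta[i]) for i in range(k)]
        a = max(range(k), key=lambda i: samples[i])
        r = 1 if rng.random() < probs[a] else 0  # simulated Bernoulli reward
        alpha[a] += r
        beta[a] += 1 - r
        pulls[a] += 1
    return pulls
```

Sampling (rather than maximizing posterior means) is what drives exploration: an uncertain arm occasionally draws a high θ and gets tried.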


Gittins Index

For discounted bandits, the Gittins index provides the optimal solution:
- Compute a single index for each arm based on its current posterior
- Play the arm with the highest index

The Gittins index decouples the problem: each arm can be analyzed independently. Remarkably, the optimal policy for the discounted multi-armed bandit separates into independent per-arm index computations.


Connection to DynamICCL / RL