POMDPs and Bandit Problems
Partially Observable MDPs (POMDPs)
In a POMDP, the agent cannot observe the state directly; instead, it receives percepts (observations) generated stochastically from the underlying state.
Formal definition:
- States S, Actions A, Observations O
- Transition model: P(s' | s, a)
- Observation model: P(o | s', a)
- Reward: R(s, a)
Key insight: the agent maintains a belief state b — a probability distribution over states.
b(s) = P(current state = s | history of actions and observations)
POMDP as a Belief-State MDP
The belief state b is sufficient for optimal decision-making (it is a sufficient statistic for the entire action-observation history).
Define a belief-state MDP:
- States: belief states b ∈ Δ(S) (the probability simplex over S)
- Actions: the same set A
- Transition: b' = UPDATE(b, a, o), where observation o arrives with probability P(o | b, a)
- Reward: R(b, a) = Σ_s b(s)·R(s, a)
Belief update (Bayesian filtering):
b'(s') = α · P(o | s', a) · Σ_s P(s' | s, a) · b(s), where α is a normalizing constant
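A minimal NumPy sketch of this filtering step. The array layouts (T[a, s, s'] for transitions, Z[a, s', o] for observations) and the function name are illustrative assumptions, not any library's API:

```python
import numpy as np

def belief_update(b, a, o, T, Z):
    """One step of Bayesian filtering over POMDP states.

    b : (|S|,)           current belief, b[s] = P(state = s | history)
    T : (|A|, |S|, |S|)  transition model, T[a, s, s2] = P(s2 | s, a)
    Z : (|A|, |S|, |O|)  observation model, Z[a, s2, o] = P(o | s2, a)
    """
    predicted = b @ T[a]                     # Σ_s b(s) · P(s' | s, a)
    unnormalized = Z[a][:, o] * predicted    # weight by P(o | s', a)
    return unnormalized / unnormalized.sum()  # α is this normalizing constant

# Tiny 2-state, 1-action, 2-observation example (made-up numbers):
T = np.array([[[0.9, 0.1], [0.1, 0.9]]])
Z = np.array([[[0.8, 0.2], [0.3, 0.7]]])
print(belief_update(np.array([0.5, 0.5]), a=0, o=0, T=T, Z=Z))  # → [0.727 0.273]
```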
Optimal POMDP policy: maps belief state → action.
POMDP Solution: Value Function Over Beliefs
The optimal value function V*(b) over belief states is piecewise linear and convex (PWLC) for any finite horizon (the discounted infinite-horizon V* is convex and approximated arbitrarily well by PWLC functions):
V*(b) = max_a [R(b,a) + γ · Σ_o P(o|b,a) · V*(b')], where b' = UPDATE(b, a, o)
Represent V* as a finite set Γ of α-vectors (hyperplanes over the belief simplex):
V*(b) = max_{α ∈ Γ} (α · b)
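Evaluating a value function stored this way is just a max over dot products; a small sketch under the same assumed array conventions:

```python
import numpy as np

def value(b, Gamma):
    """V(b) = max_{α ∈ Γ} α · b for a PWLC value function.

    b     : (|S|,)    belief
    Gamma : (n, |S|)  one α-vector (hyperplane) per row
    """
    return float(np.max(Gamma @ b))

def best_alpha(b, Gamma):
    """Maximizing α-vector; tagging each row with an action gives the greedy policy."""
    return Gamma[int(np.argmax(Gamma @ b))]
```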
POMDP algorithms: PBVI (Point-Based Value Iteration), SARSOP, Perseus — sample a finite set of reachable belief points, compute α-vectors at each.
Exact value iteration is intractable in general: the α-vector set can grow exponentially with the planning horizon, and even finite-horizon POMDP solving is PSPACE-hard. Point-based approximations are practical for moderately sized problems.
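A sketch of the point-based backup these methods share: one Bellman backup at a single belief point b, producing the α-vector that is optimal there. Array layouts as in the earlier sketches; this is an illustration, not any solver's actual API:

```python
import numpy as np

def pbvi_backup(b, Gamma, R, T, Z, gamma):
    """Point-based Bellman backup at belief b.

    Gamma : (n, |S|)   current α-vector set
    R     : (|A|, |S|) rewards R[a, s]
    Returns the α-vector maximizing one-step lookahead value at b.
    """
    nA, nO = R.shape[0], Z.shape[2]
    best_val, best = -np.inf, None
    for a in range(nA):
        alpha_a = R[a].astype(float)
        for o in range(nO):
            # g[s, i] = γ · Σ_{s'} P(o | s', a) · P(s' | s, a) · Γ[i, s']
            g = gamma * (T[a] * Z[a][:, o]) @ Gamma.T
            alpha_a += g[:, np.argmax(b @ g)]   # keep the candidate best at b
        if alpha_a @ b > best_val:
            best_val, best = alpha_a @ b, alpha_a
    return best
```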
Bandit Problems
A multi-armed bandit is a simplified MDP with:
- k “arms” (actions), each with an unknown reward distribution
- no state transitions, just immediate reward signals
- goal: maximize total reward over T trials
Exploration-exploitation tradeoff:
- Explore: try less-known arms → gather information
- Exploit: choose the best-known arm → earn rewards now
Bandit Algorithms
ε-Greedy
With probability ε: choose random arm (explore)
With probability 1-ε: choose arm with highest Q̂(a) (exploit)
Simple and effective; a fixed ε keeps exploring forever (linear regret), but with a decaying schedule ε_t → 0 (e.g., ε_t ∝ 1/t), ε-greedy converges to the optimal arm.
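A self-contained sketch on a Bernoulli bandit; the arm probabilities, horizon, and ε below are illustrative:

```python
import random

def epsilon_greedy(arm_means, T=10_000, eps=0.1, seed=0):
    """ε-greedy on Bernoulli arms; returns total reward earned."""
    rng = random.Random(seed)
    k = len(arm_means)
    counts = [0] * k     # N(a)
    q_hat = [0.0] * k    # empirical mean Q̂(a)
    total = 0.0
    for _ in range(T):
        if rng.random() < eps:
            a = rng.randrange(k)                       # explore
        else:
            a = max(range(k), key=lambda i: q_hat[i])  # exploit
        r = 1.0 if rng.random() < arm_means[a] else 0.0
        counts[a] += 1
        q_hat[a] += (r - q_hat[a]) / counts[a]         # incremental mean
        total += r
    return total

print(epsilon_greedy([0.2, 0.5, 0.7]))  # close to 0.7·T minus exploration cost
```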
UCB1 (Upper Confidence Bound)
a* = argmax_a [Q̂(a) + c · √(ln N / N(a))]
Where N = total trials, N(a) = trials of arm a, Q̂(a) = empirical mean reward.
Regret after T pulls: Σ_{t=1}^{T} (μ* − μ_{aₜ}), the cumulative gap between the optimal arm's mean and the chosen arm's mean.
UCB1 achieves O(log T) expected regret, matching the Ω(log T) Lai-Robbins lower bound up to constant factors.
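The same bandit loop with the UCB1 rule and regret tracking (a sketch; c = √2 is the classic choice):

```python
import math
import random

def ucb1(arm_means, T=10_000, c=math.sqrt(2), seed=0):
    """UCB1 on Bernoulli arms; returns cumulative regret Σ_t (μ* − μ_{aₜ})."""
    rng = random.Random(seed)
    k = len(arm_means)
    counts, q_hat = [0] * k, [0.0] * k
    mu_star, regret = max(arm_means), 0.0
    for t in range(1, T + 1):
        if t <= k:
            a = t - 1   # initialize: pull each arm once
        else:
            a = max(range(k),
                    key=lambda i: q_hat[i] + c * math.sqrt(math.log(t) / counts[i]))
        r = 1.0 if rng.random() < arm_means[a] else 0.0
        counts[a] += 1
        q_hat[a] += (r - q_hat[a]) / counts[a]
        regret += mu_star - arm_means[a]
    return regret   # grows like O(log T)
```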
Thompson Sampling (Bayesian)
For each arm a: sample θ_a ~ P(θ_a | history)
Choose arm a* = argmax_a θ_a
For Bernoulli rewards with Beta(α, β) priors, Thompson sampling also achieves the optimal O(log T) regret.
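A Beta-Bernoulli Thompson sampling sketch (Beta(1, 1), i.e. uniform, priors assumed):

```python
import random

def thompson_bernoulli(arm_means, T=10_000, seed=0):
    """Thompson sampling with conjugate Beta posteriors over Bernoulli arms."""
    rng = random.Random(seed)
    k = len(arm_means)
    alpha = [1] * k   # 1 + number of successes on each arm
    beta = [1] * k    # 1 + number of failures on each arm
    total = 0.0
    for _ in range(T):
        theta = [rng.betavariate(alpha[i], beta[i]) for i in range(k)]  # θ_a ~ posterior
        a = max(range(k), key=lambda i: theta[i])                       # play argmax θ_a
        r = 1 if rng.random() < arm_means[a] else 0
        alpha[a] += r        # conjugate update: Beta(α + r, β + 1 − r)
        beta[a] += 1 - r
        total += r
    return total
```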
Gittins Index
For discounted bandits, the Gittins index provides the optimal solution:
- Compute a single index for each arm based on its current posterior
- Play the arm with the highest index
The Gittins index decouples the problem: each arm can be analyzed independently, so the optimal policy for the multi-armed bandit remarkably separates into independent per-arm decisions.
Connection to DynamICCL / RL
- Bandit: choosing among NCCL algorithm variants = multi-armed bandit (if no state)
- Contextual bandit: choosing NCCL parameters given network context = bandit with side information
- POMDP: full DynamICCL setting — hidden network state, partial observations, sequential decisions
- UCB1 → UCT (Ch.5 MCTS) → UCB-based RL exploration
- Thompson sampling → posterior-based RL exploration (Bayesian RL)