Optimistic Initial Values and Upper Confidence Bound
Chapter 2 — Multi-Armed Bandits
Book: Reinforcement Learning: An Introduction (Sutton & Barto, 2nd ed) Pages: 34–39
Optimistic Initial Values
Idea: initialize Q_1(a) to a value higher than any expected reward. Forces exploration: every action will initially disappoint (return less than optimistic estimate), driving the agent to try all actions.
How It Works
Suppose q*(a) ∈ [0, 1] for all a, but we initialize Q_1(a) = 5 for all a.
Step 1:
- A_1 = argmax Q(a) → any action (all tied at 5)
- Observe R_1 ≈ q*(A_1), some value ≤ 1
- Update with constant step size α = 0.1 and, say, R_1 = 0.5: Q(A_1) = 5 + α(R_1 - 5) = 5 + 0.1(0.5 - 5) = 4.55

Step 2:
- argmax Q(a) = any untried action (still at 5.0, above 4.55)
- The agent naturally explores all actions before settling
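A minimal sketch of this mechanism in Python (the arm count, reward noise, and α = 0.1 are illustrative assumptions, not from the book):

```python
import numpy as np

rng = np.random.default_rng(0)

k = 5                                  # number of arms (illustrative)
q_star = rng.uniform(0, 1, k)          # hypothetical true values in [0, 1]
Q = np.full(k, 5.0)                    # optimistic initial estimates, above any true value
alpha = 0.1                            # constant step size

for t in range(20):
    a = int(np.argmax(Q))              # purely greedy selection, no ε
    r = rng.normal(q_star[a], 0.1)     # noisy reward around q*(a)
    Q[a] += alpha * (r - Q[a])         # estimate drops below 5 after one pull
    print(t, a, np.round(Q, 2))        # greedy choice rotates through all arms early on
```

Every pull drags the chosen estimate below the untouched estimates at 5.0, so the greedy rule itself cycles through all arms before settling.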
This is “optimism in the face of uncertainty”, a fundamental principle:

> Principle: when uncertain, be optimistic. Optimism drives exploration automatically.
Properties
Advantages:
- Zero code complexity: just set Q_1(a) to a high value
- Forces systematic early exploration
- No explicit ε parameter needed
- Works well for stationary problems

Disadvantages:
- Exploration eventually stops (no exploration once estimates settle)
- Not robust to nonstationary problems: once the initial exploration bonus is exhausted, behavior is purely greedy
- Requires knowing a good upper bound on rewards
Comparison on 10-Armed Testbed
| Method | Early performance | Asymptotic |
|---|---|---|
| Greedy, Q_1=0 | Peaks early (lucky), often gets stuck | Suboptimal |
| Greedy, Q_1=5 (optimistic) | Initially bad (all exploring) | Near-optimal |
| ε-greedy, Q_1=0, ε=0.1 | Medium | Suboptimal (always 10% random) |
Optimistic initialization works better than ε-greedy on stationary problems but worse on nonstationary ones.
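A rough way to reproduce this comparison, as a sketch assuming the book's 10-armed testbed setup (true values drawn from N(0, 1), rewards ~ N(q*(a), 1)); the run count, horizon, and step size below are illustrative choices:

```python
import numpy as np

def run(q_star, steps, q_init, eps, alpha=0.1, rng=None):
    """One bandit run with constant step size; returns an indicator of optimal-action choices."""
    k = len(q_star)
    Q = np.full(k, q_init, dtype=float)
    best = int(np.argmax(q_star))
    optimal = np.zeros(steps)
    for t in range(steps):
        if rng.random() < eps:
            a = int(rng.integers(k))            # random exploration step
        else:
            a = int(np.argmax(Q))               # greedy w.r.t. current estimates
        r = rng.normal(q_star[a], 1.0)          # reward ~ N(q*(a), 1)
        Q[a] += alpha * (r - Q[a])              # constant step-size update
        optimal[t] = (a == best)
    return optimal

rng = np.random.default_rng(0)
runs, steps, k = 500, 1000, 10
opt_optimistic = np.zeros(steps)
opt_eps_greedy = np.zeros(steps)
for _ in range(runs):
    q_star = rng.normal(0, 1, k)                                        # fresh testbed each run
    opt_optimistic += run(q_star, steps, q_init=5.0, eps=0.0, rng=rng)  # greedy, Q_1 = 5
    opt_eps_greedy += run(q_star, steps, q_init=0.0, eps=0.1, rng=rng)  # ε-greedy, Q_1 = 0

print(f"optimistic greedy, % optimal (last 100 steps): {opt_optimistic[-100:].mean() / runs:.2f}")
print(f"ε-greedy,          % optimal (last 100 steps): {opt_eps_greedy[-100:].mean() / runs:.2f}")
```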
Upper Confidence Bound (UCB) Action Selection
Problem with ε-greedy: exploration is undirected. When it explores, it chooses uniformly among all actions, taking no account of how uncertain each action's estimate is or how suboptimal the action appears.
Better idea: explore based on confidence intervals. Actions with high uncertainty deserve more exploration.
UCB Formula
A_t = argmax_a [Q_t(a) + c · √(ln t / N_t(a))]
where:
- Q_t(a) = current value estimate for action a
- N_t(a) = number of times action a has been selected prior to time t
- ln t = natural log of the current timestep (grows slowly)
- c > 0 = confidence level (exploration constant)
If N_t(a) = 0: treat a as maximizing (select it)
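A minimal sketch of this selection rule, assuming Q and N are NumPy arrays maintained by the caller (the function name is mine):

```python
import numpy as np

def ucb_select(Q, N, t, c=2.0):
    """Pick argmax_a [Q(a) + c*sqrt(ln t / N(a))], treating untried actions as maximal."""
    untried = np.flatnonzero(N == 0)
    if untried.size > 0:
        return int(untried[0])               # N_t(a) = 0 → select it first
    bonus = c * np.sqrt(np.log(t) / N)       # per-action confidence radius
    return int(np.argmax(Q + bonus))
```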
Interpretation of the Bound
The term c√(ln t / N_t(a)) is a confidence radius for Q_t(a):
Q_t(a) + c√(ln t / N_t(a)) = upper confidence bound on q*(a)
- Large N_t(a): bound is tight → exploit (small bonus for well-tested actions)
- Small N_t(a): bound is wide → explore (large bonus for under-tested actions)
- ln t in numerator: bound grows slowly over time, ensuring all actions explored infinitely often
This implements “optimism in the face of uncertainty” precisely: act as if each action’s true value equals the upper end of its confidence interval.
Theoretical Basis: Hoeffding’s Inequality
For rewards in [0, 1] and N_t(a) samples:
P(q*(a) > Q_t(a) + ε) ≤ exp(-2N_t(a)ε²)
Setting this probability equal to 1/t and solving exp(-2N_t(a)ε²) = 1/t for ε:

ε = √(ln t / (2N_t(a)))

This gives the UCB bonus with c = 1/√2, and the probability that q*(a) exceeds the upper bound shrinks like 1/t. (The standard UCB1 analysis instead sets the failure probability to 1/t⁴, which yields c = √2, the constant listed in the variants table below.)
UCB Regret Analysis
UCB achieves O(log T) regret — optimal up to constants:
E[Regret_T] = Σ_a Δ_a · E[N_T(a)] ≤ Σ_{a: Δ_a > 0} [8 ln T / Δ_a + (1 + π²/3) Δ_a]
This matches the Lai-Robbins lower bound (up to constants).
Compare:
- ε-greedy with fixed ε: linear (Θ(T)) regret, since a constant ε fraction of pulls is always spent on uniform random exploration
- UCB: O(log T) regret, sublinear and slowly growing
Pseudocode: UCB
Initialize:
Q(a) = 0, N(a) = 0 for all a
For t = 1, 2, ...:
If any a has N(a) = 0:
A_t = that action ← pull every arm once first
Else:
A_t = argmax_a [Q(a) + c · √(ln(t) / N(a))]
Observe R_t
N(A_t) ← N(A_t) + 1
Q(A_t) ← Q(A_t) + (1/N(A_t)) · (R_t - Q(A_t))
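A runnable version of this loop, as a sketch: the Gaussian-reward bandit, c = 2, and the horizon are illustrative assumptions layered on top of the pseudocode above.

```python
import numpy as np

def ucb_bandit(q_star, steps=1000, c=2.0, seed=0):
    """Sample-average UCB on a bandit with true values q_star; returns the reward history."""
    rng = np.random.default_rng(seed)
    k = len(q_star)
    Q = np.zeros(k)                 # value estimates
    N = np.zeros(k, dtype=int)      # pull counts
    rewards = np.zeros(steps)
    for t in range(1, steps + 1):
        if np.any(N == 0):
            a = int(np.argmax(N == 0))                           # pull every arm once first
        else:
            a = int(np.argmax(Q + c * np.sqrt(np.log(t) / N)))   # UCB selection
        r = rng.normal(q_star[a], 1.0)                           # illustrative reward model
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]                                # incremental sample average
        rewards[t - 1] = r
    return rewards

q_star = np.random.default_rng(1).normal(0, 1, 10)               # a 10-armed test problem
print("average reward over last 100 steps:", round(ucb_bandit(q_star)[-100:].mean(), 3))
```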
Performance on 10-Armed Testbed
UCB with c = 2 outperforms ε-greedy (ε = 0.1) after the initial exploration phase. The difference is more pronounced with:
- Many arms k: ε-greedy wastes pulls on clearly inferior arms, while UCB concentrates on uncertain ones
- A limited time horizon T: ε-greedy’s constant exploration is relatively more wasteful
UCB Variants
| Variant | Modification | When to use |
|---|---|---|
| UCB1 | Original, c = √2 | Rewards in [0,1] |
| UCB2 | Epoch-based exploration | Sometimes faster in practice |
| KL-UCB | Uses a KL-divergence bound instead of Hoeffding’s | Known reward distribution family (e.g., Bernoulli) |
| LinUCB | Linear bandit (contextual) | When state features available |
| Thompson Sampling | Bayesian posterior sampling | When well-calibrated priors are available |
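For the contextual/linear case, a minimal sketch of disjoint LinUCB: one ridge-regression model per arm, with feature vectors x supplied by the caller. The class and function names, the identity prior, and α are illustrative assumptions:

```python
import numpy as np

class LinUCBArm:
    """Per-arm statistics for disjoint LinUCB with d-dimensional context features."""
    def __init__(self, d, alpha=1.0):
        self.A = np.eye(d)              # ridge prior + sum of x x^T
        self.b = np.zeros(d)            # sum of r * x
        self.alpha = alpha              # exploration strength

    def ucb(self, x):
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b                                   # ridge-regression coefficient estimate
        return theta @ x + self.alpha * np.sqrt(x @ A_inv @ x)   # predicted reward + confidence width

    def update(self, x, r):
        self.A += np.outer(x, x)
        self.b += r * x

def linucb_select(arms, x):
    """Choose the arm with the largest upper confidence bound for context x."""
    return int(np.argmax([arm.ucb(x) for arm in arms]))
```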
Limitations of UCB for Full RL
UCB works well for stateless bandits. Difficulties in full RL:
1. Nonstationarity: the ln t / N_t(a) bound assumes a stationary reward distribution, which is violated if q*(a) changes
2. Large action spaces: computing UCB for continuous actions requires approximation
3. State-conditioning: in MDPs we need a UCB per (state, action) pair, so the table grows with |S| × |A|
In full RL, modified versions (bonus-based exploration, count-based exploration with pseudo-counts) extend the UCB idea to MDPs; a tabular sketch follows.
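A sketch of the count-based idea in a tabular setting: the observed reward is augmented with a bonus β/√N(s, a) inside an otherwise standard Q-learning update. The data structures, β, and the other hyperparameters are illustrative assumptions:

```python
import numpy as np
from collections import defaultdict

# Q[s][a] value estimates and N[(s, a)] visit counts for a discrete MDP
Q = defaultdict(lambda: defaultdict(float))
N = defaultdict(int)

def count_bonus_update(s, a, r, s_next, alpha=0.1, gamma=0.99, beta=0.1):
    """Tabular Q-learning update with a UCB-style count-based exploration bonus."""
    N[(s, a)] += 1
    bonus = beta / np.sqrt(N[(s, a)])                      # decays as (s, a) is visited more often
    target = (r + bonus) + gamma * max(Q[s_next].values(), default=0.0)
    Q[s][a] += alpha * (target - Q[s][a])
```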
Connection to DynamICCL
UCB’s rule of exploring the actions we are most uncertain about is more principled than ε-greedy’s undirected random exploration. For DynamICCL, UCB logic can guide exploration of the NCCL configuration space: configurations tried few times get an exploration bonus. However, the continuous, high-dimensional NCCL state space requires generalization (LinUCB or neural UCB variants rather than tabular UCB).