Optimistic Initial Values and Upper Confidence Bound

Chapter 2 — Multi-Armed Bandits

Book: Reinforcement Learning: An Introduction (Sutton & Barto, 2nd ed.), pp. 34–39


Optimistic Initial Values

Idea: initialize Q_1(a) to a value higher than any expected reward. This forces exploration: every action will initially disappoint (return less than the optimistic estimate), driving the agent to try all actions.

How It Works

Suppose q*(a) ∈ [0, 1] for all a, but we initialize Q_1(a) = 5 for all a.

Step 1: A_1 = argmax_a Q(a) → any action (all tied)
- Observe R_1 ≈ q*(A_1), some value ≤ 1 plus noise
- Update with constant step size α = 0.1: Q(A_1) ← 5 + α(R_1 − 5) ≈ 5 + 0.1(0.5 − 5) = 4.55

Step 2: argmax_a Q(a) = any untried action (still at 5.0)
- The agent naturally explores all actions before settling on any

This is “optimism in the face of uncertainty” — a fundamental principle: > Principle: when uncertain, be optimistic. Optimism drives exploration automatically.
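A minimal simulation of this effect (the true values, noise model, and α here are illustrative assumptions, not from the book):

```python
import random

def optimistic_greedy(q_star, q_init=5.0, alpha=0.1, steps=1000):
    """Pure greedy agent with optimistic initial estimates and constant step size."""
    k = len(q_star)
    Q = [q_init] * k                           # optimistic estimates
    picks = [0] * k
    for _ in range(steps):
        a = max(range(k), key=lambda i: Q[i])  # greedy selection, no epsilon
        r = random.gauss(q_star[a], 1.0)       # noisy reward around q*(a)
        Q[a] += alpha * (r - Q[a])             # each pull drags the estimate down
        picks[a] += 1
    return Q, picks

random.seed(0)
q_star = [0.2, 0.5, 0.8]                       # hypothetical true values in [0, 1]
Q, picks = optimistic_greedy(q_star)
# Every arm is tried early: each pull disappoints relative to 5.0,
# so the greedy argmax cycles through the still-optimistic arms.
```

Because ties break toward untried arms (still at 5.0), every arm is pulled at least once before the estimates settle.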

Properties

Advantages:
- Zero code complexity: just set Q_1(a) = high value
- Forces systematic early exploration
- No explicit ε parameter needed
- Works well for stationary problems

Disadvantages:
- Exploration eventually stops (none once estimates settle)
- Not robust to nonstationary problems: once the initial exploration bonus is exhausted, behavior is purely greedy
- Requires knowing a good upper bound on rewards

Comparison on 10-Armed Testbed

| Method | Early performance | Asymptotic |
| --- | --- | --- |
| Greedy, Q_1 = 0 | Peaks early (lucky), often gets stuck | Suboptimal |
| Greedy, Q_1 = 5 (optimistic) | Initially bad (all actions explored) | Near-optimal |
| ε-greedy, Q_1 = 0, ε = 0.1 | Medium | Suboptimal (always 10% random) |

Optimistic initialization works better than ε-greedy on stationary problems but worse on nonstationary ones.


Upper Confidence Bound (UCB) Action Selection

Problem with ε-greedy: exploration is undirected — selects randomly among all suboptimal actions with equal probability. Does not account for how uncertain each action’s estimate is or how suboptimal it might be.

Better idea: explore based on confidence intervals. Actions with high uncertainty deserve more exploration.

UCB Formula

A_t = argmax_a [Q_t(a) + c · √(ln t / N_t(a))]

where:
- Q_t(a) = current value estimate for action a
- N_t(a) = number of times action a has been selected before time t
- ln t = natural log of the current timestep (grows slowly)
- c > 0 = confidence level (exploration constant)

If N_t(a) = 0: treat a as maximizing (select it)
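The selection rule can be sketched as a small Python function (the function name and default c are illustrative):

```python
import math

def ucb_action(Q, N, t, c=2.0):
    """Select argmax_a [Q(a) + c * sqrt(ln t / N(a))], untried arms first."""
    for a, n in enumerate(N):
        if n == 0:
            return a                  # N_t(a) = 0: treated as maximizing
    return max(range(len(Q)),
               key=lambda a: Q[a] + c * math.sqrt(math.log(t) / N[a]))

# A rarely tried arm can win on its bonus despite a lower estimate:
# arm 2 has Q = 0.4 but only 2 pulls, so its radius dominates.
print(ucb_action([0.5, 0.45, 0.4], [100, 50, 2], t=153))  # selects arm 2
```

Note how the bonus, not the point estimate, decides the choice once an arm's count falls far behind.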

Interpretation of the Bound

The term c√(ln t / N_t(a)) is a confidence radius for Q_t(a):

Q_t(a) + c√(ln t / N_t(a)) = upper confidence bound on q*(a)

This implements “optimism in the face of uncertainty” precisely: act as if each action’s true value equals the upper end of its confidence interval.

Theoretical Basis: Hoeffding’s Inequality

For rewards in [0, 1] and N_t(a) samples:

P(q*(a) > Q_t(a) + ε) ≤ exp(-2N_t(a)ε²)

Setting this probability equal to 1/t:

ε = √(ln t / (2N_t(a)))

This gives the UCB bonus with c = 1/√2. The probability that q*(a) exceeds the upper bound decreases polynomially in t.
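Filling in the algebra: solve the Hoeffding tail for ε at a target failure probability δ, then take δ = 1/t:

```latex
\exp\!\bigl(-2 N_t(a)\,\varepsilon^2\bigr) = \delta
\quad\Longrightarrow\quad
\varepsilon = \sqrt{\frac{\ln(1/\delta)}{2 N_t(a)}}
\quad\xrightarrow{\;\delta = 1/t\;}\quad
\varepsilon = \sqrt{\frac{\ln t}{2 N_t(a)}}
```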

UCB Regret Analysis

UCB achieves O(log T) regret — optimal up to constants:

E[Regret_T] = Σ_a Δ_a · E[N_T(a)] ≤ Σ_{a: Δ_a > 0} [8 ln T / Δ_a + (1 + π²/3) Δ_a]

This matches the Lai-Robbins lower bound (up to constants).

Compare:
- ε-greedy: O(T) regret asymptotically (always wastes an ε fraction of pulls on random exploration)
- UCB: O(log T) regret — sublinear, grows slowly
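A worked instance of the bound (numbers assumed for illustration): one suboptimal arm with gap Δ = 0.2 over T = 10⁶ steps contributes at most

```latex
\frac{8\ln T}{\Delta} + \Bigl(1 + \frac{\pi^2}{3}\Bigr)\Delta
\;=\; \frac{8 \cdot 13.8}{0.2} + 4.29 \cdot 0.2
\;\approx\; 553
```

regret under UCB, while ε-greedy with ε = 0.1 and k = 10 arms pulls that arm about εT/k = 10⁴ times at random, for roughly 10⁴ · 0.2 = 2000 regret from exploration alone.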

Pseudocode: UCB

Initialize:
  Q(a) = 0, N(a) = 0 for all a

For t = 1, 2, ...:
  If any a has N(a) = 0:
    A_t = that action    ← pull every arm once first
  Else:
    A_t = argmax_a [Q(a) + c · √(ln(t) / N(a))]

  Observe R_t
  N(A_t) ← N(A_t) + 1
  Q(A_t) ← Q(A_t) + (1/N(A_t)) · (R_t - Q(A_t))
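The pseudocode made runnable in Python (the Gaussian testbed rewards, seed, and c value are assumptions for the sketch):

```python
import math
import random

def run_ucb(q_star, steps=1000, c=2.0, seed=0):
    """UCB bandit loop: sample-average Q estimates plus confidence bonus."""
    rng = random.Random(seed)
    k = len(q_star)
    Q = [0.0] * k
    N = [0] * k
    for t in range(1, steps + 1):
        untried = [a for a in range(k) if N[a] == 0]
        if untried:
            a = untried[0]                       # pull every arm once first
        else:
            a = max(range(k),
                    key=lambda i: Q[i] + c * math.sqrt(math.log(t) / N[i]))
        r = rng.gauss(q_star[a], 1.0)            # observe R_t
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]                # incremental sample average
    return Q, N

q_star = [0.1, 0.9, 0.5]                         # hypothetical true values
Q, N = run_ucb(q_star, steps=2000)
# Over time, pulls concentrate on the best arm while the bonus keeps
# occasionally revisiting the others.
```

The sample-average update matches the pseudocode; a constant α would be the nonstationary variant.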

Performance on 10-Armed Testbed

UCB with c = 2 outperforms ε-greedy (ε = 0.1) after the initial exploration phase. The difference is more pronounced with:
- Many arms k (ε-greedy wastes pulls on clearly inferior arms; UCB concentrates on uncertain ones)
- A limited time horizon T (ε-greedy’s constant exploration is relatively more wasteful)


UCB Variants

| Variant | Modification | When to use |
| --- | --- | --- |
| UCB1 | Original, c = √2 | Rewards in [0, 1] |
| UCB2 | Epoch-based exploration | Sometimes faster in practice |
| KL-UCB | Uses KL divergence instead of a Gaussian bound | Known reward distribution family |
| LinUCB | Linear bandit (contextual) | When state features are available |
| Thompson Sampling | Bayesian posterior sampling | Nonstationary problems, well-calibrated priors |

Limitations of UCB for Full RL

UCB works well for stateless bandits. Difficulties in full RL:
1. Nonstationarity: the ln t / N_t(a) bound assumes a stationary distribution — violated if q*(a) changes
2. Large action spaces: computing UCB over continuous actions requires approximation
3. State conditioning: MDPs need a UCB per (state, action) pair — an exponentially large table

In RL, modified versions (bonus-based exploration, count-based exploration with pseudo-counts) extend UCB to MDPs.
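A sketch of the count-based idea (the tabular count table and β are illustrative assumptions; real pseudo-count methods replace the table with a density model):

```python
from collections import defaultdict

def count_bonus_reward(r, state, action, counts, beta=0.1):
    """Count-based exploration: augment the reward with beta / sqrt(N(s, a)),
    a UCB-style bonus that shrinks as the pair is visited more often."""
    counts[(state, action)] += 1
    n = counts[(state, action)]
    return r + beta / n ** 0.5

counts = defaultdict(int)
r1 = count_bonus_reward(1.0, "s0", "a0", counts)  # first visit: full bonus
r2 = count_bonus_reward(1.0, "s0", "a0", counts)  # bonus shrinks with visits
```

The shaped reward is then fed to any standard RL update, so novelty-seeking emerges without an explicit ε.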


Connection to DynamICCL

UCB’s principle — “explore actions we’re uncertain about” — is more principled than ε-greedy’s random exploration. For DynamICCL, UCB logic can guide exploration of NCCL configuration space: configurations tried few times get an exploration bonus. However, the continuous, high-dimensional NCCL state space requires generalization (LinUCB or neural UCB variants rather than tabular UCB).