1 Introduction
The purpose of this article is to analyse an anytime version of the Optimally Confident UCB algorithm for finitearmed subgaussian bandits (Lattimore, 2015). For the sake of brevity I will give neither a detailed introduction nor an exhaustive survey of the literature. Readers looking for a gentle primer on multiarmed bandits might enjoy the monograph by Bubeck and CesaBianchi (2012) from which I borrow notation. Let be the number of arms and be the arm chosen in round . The reward is where
is the unknown vector of means and the noise term
is assumed to be subgaussian (therefore zeromean). The step pseudoregret of strategy given mean vector with maximum mean iswhere the expectation is taken with respect to uncertainty in both the rewards and actions. In all analysis I make the standard notational assumption that . The new algorithm is called OCUCB and depends on two parameters and . The algorithm chooses in rounds and subsequently with
(1) 
where is the number of times arm has been chosen after round and
is its empirical estimate and
Besides the algorithm, the contribution of this article is a proof that OCUCB satisfies a nearly optimal regret bound.
Theorem 1.
If and , then
where and and is a constant that depends only on . Furthermore, for all it holds that .
Asymptotically the upper bound matches lower bound given by Lai and Robbins (1985) except for a factor of . In the nonasymptotic regime the additional terms inside the logarithm significantly improves on UCB. The bound in Theorem 1 corresponds to a worstcase regret that is suboptimal by a factor of just . Algorithms achieving the minimax rate are MOSS (Audibert and Bubeck, 2009) and OCUCB, but both require advance knowledge of the horizon. The quantity may be interpreted as the number of “effective” arms with larger values leading to improved regret. A simple observation is that is always nonincreasing in , which makes the canonical choice. In the special case that all suboptimal arms have the same expected payoff, then for all . Interestingly I could not find a regime for which the algorithm is empirically sensitive to . If , then except for additive terms the problem dependent regret enjoyed by OCUCB is equivalent to OCUCB. Finally, if , then the asymptotic result above applies, but the algorithm in that case essentially reduces to MOSS, which is known to suffer suboptimal finitetime regret in certain regimes (Lattimore, 2015).
Intuition for regret bound. Let us fix a strategy and mean vector and suboptimal arm . Suppose that for some . Now consider the alternative mean reward with for and , which means that is the optimal action for mean vector . Standard informationtheoretic analysis shows that and are not statistically separable at confidence level and in particular, if is large enough, then . For mean we have and for any reasonable algorithm we would like
But this implies that should be chosen such that
which up to terms justifies the nearoptimality of the regret guarantee given in Theorem 1 for close to . Of course is not known in advance, so no algorithm can choose this confidence level. The trick is to notice that arms with should be played about as often as arm and arms with should be played about as much as arm until . This means that as approaches the critical number of samples we can approximate
Then the index used by OCUCB is justified by ignoring terms and the usual used by UCB and other algorithms. Theorem 1 is proven by making the above approximation rigorous. The argument for this choice of confidence level is made concrete in Appendix A where I present a lower bound that matches the upper bound except for additive terms.
2 Concentration
The regret guarantees rely on a number of concentration inequalities. For this section only let be i.i.d. 1subgaussian and and . The first lemma below is well known and follows trivially from the maximal inequality and the fact that the rewards are subguassian.
Important remark. For brevity I use to indicate a constant that depends on but not other variables such as and . The dependence is never worse than polynomial in .
Lemma 2.
If , then .
The following lemma analyses the likelihood that ever exceeds where . By the law of the iterated logarithm a.s. and for small it has been shown by Garivier (2013) that
The case where seems not to have been analysed and relies on the usual peeling trick, but without the union bound.
Lemma 3.
There exists a monotone nondecreasing function such that for all it holds that .
Lemma 4.
Let and and , then
The final concentration lemma is quite powerful and forms the lynchpin of the following analysis.
Lemma 5.
Let and and and be constants. Furthermore, let
be the random variable given by
Finally let . Then

If , then

If , then
3 Analysis of the KLUCB+ Algorithm
Let us warm up by analysing a simpler algorithm, which chooses the arm that maximises the following index.
(2) 
Strategies similar to this have been called KLUCB+ and suggested as a heuristic by
Garivier and Cappé (2011) (this version is specified to the subgaussian noise model). Recently Kaufmann (2016) has established the asymptotic optimality of strategies with approximately this form, but finitetime analysis has not been available until now. Bounding the regret will follow the standard path of bounding for each suboptimal arm . Let be the empirical estimate of the mean of the th arm having observed samples. Define and byIf and , then by the definition of we have and by the definition of
which means that . Therefore may be bounded in terms of and as follows:
It remains to bound the expectations of and . By Lemma 5a with and and it follows that and by Lemma 4
Therefore the strategy in Eq. 2 satisfies:
Remark 6.
Without changing the algorithm and by optimising the constants in the proof it is possible to show that , which is just a factor of away from the asymptotic lower bound of Lai and Robbins (1985).
4 Proof of Theorem 1
The proof follows along similar lines as the warmup, but each step becomes more challenging, especially controlling .
Step 1: Setup and preliminary lemmas
Define to be the random set of arms for which the empirical estimate never drops below the critical boundary given by the law of iterated logarithm.
(3) 
where . By Lemma 3, . It will be important that only includes arms and that the events are independent for . From the definition of the index and for it holds that for all . The following lemma shows that the pull counts for optimistic arms “chase” those of other arms up the point that they become clearly suboptimal.
Lemma 7.
There exists a constant depending only on such that if (a) and (b) and (c) , then .
Proof.
First note that implies that . Comparing the indices:
On the other hand, by choosing small enough and by the definition of :
which implies that . ∎
Let be the optimistic arm with the largest return where if we define and . By Lemma 3,
with constant probability, which means that
is subexponentially distributed with rate dependent on
only. Define by(4) 
where is as chosen in Lemma 7. Since we will have with high probability (this will be made formal later). Let
(5) 
The following lemma essentially follows from Lemma 4 and the fact that is subexponentially distributed. Care must be taken because and are not independent. The proof is found in Appendix E.
Lemma 8.
.
The last lemma in this section shows that if , then either is not chosen or the index of the th arm is not too large.
Lemma 9.
If , then or .
Proof.
By the definition of we have and . By Lemma 7, if and , then . Now suppose that for all . Then
Therefore from the definition of we have that . ∎
Step 2: Regret decomposition
By Lemma 9, if , then or . Now we must show there exists a for which . This is true for arms with since by definition for all . For the remaining arms we follow the idea used in Section 3 and define a random time for each .
(6) 
Then the regret is decomposed as follows
(7) 
The next step is to show that the first sum is dominant in the above decomposition, which will lead to the result via Lemma 8 to bound .
Step 3: Bounding
This step is broken into two quite technical parts as summarised in the following lemma. The proofs of both results are quite similar, but the second is more intricate and is given in Appendix G.
Lemma 10.
The following hold:
Proof of Lemma 10a.
Preparing to use Lemma 5, let be given by for with and otherwise. Now define random variable by
and . Then for and abbreviating we have
where the second last inequality follows since for arms with we have and for other arms by definition. The last inequality follows from the definition of . Therefore and so , which by Lemma 5b is bounded by
(8) 
where the last line follows since and
The resulting is completed substituting into Eq. 8 and applying Lemma 8 to show that . ∎
Step 4: Putting it together
By substituting the bounds given in Lemma 10 into Eq. 7 and applying Lemma 8 we obtain
which completes the proof of the finitetime bound.
Asymptotic analysis. Lemma 5 makes this straightforward. Let and
Then by Lemma 5a with and we have . Then we modify the definition of by
which is chosen such that if , then . Therefore
Classical analysis shows that and , which implies the asymptotic claim in Theorem 1.
This naive calculation demonstrates a serious weakness of asymptotic results. The term in the regret will typically dominate the higherorder terms except when is outrageously large. A more careful argument (similar to the derivation of the finitetime bound) would lead to the same asymptotic bound via a nicer finitetime bound, but the details are omitted for readability. Interestingly the result is not dependent on and so applies also to the MOSStype algorithm that is recovered by choosing .
5 Discussion
The UCB family has a new member. This one is tuned for subgaussian noise and roughly mimics the OCUCB algorithm, but without needing advance knowledge of the horizon. The introduction of is a minor refinement on previous measures of difficulty, with the main advantage being that it is very intuitive. The resulting algorithm is efficient and close to optimal theoretically. Of course there are open questions, some of which are detailed below.
Shrinking the confidence level. Empirically the algorithm improves significantly when the logarithmic terms in the definition of are dropped. There are several arguments that theoretically justify this decision. First of all if , then it is possible to replace the term in the definition of with just and use part (a) of Lemma 5 instead of part (b). The price is that the regret guarantee explodes as tends to (also not observed in practice). The second improvement is to replace in the definition of with
which boosts empirical performance and rough sketches suggest minimax optimality is achieved. I leave details for a longer article.
Improving analysis and constants. Despite its simplicity relative to OCUCB, the current analysis is still significantly more involved than for other variants of UCB. A cleaner proof would obviously be desirable. In an ideal world we could choose or (slightly worse) allow it to converge to as grows, which is the technique used in the KLUCB algorithm (Cappé et al., 2013, and others). I anticipate this would lead to an asymptotically optimal algorithm.
Informational confidence bounds. Speaking of KLUCB, if the noise model is known more precisely (for example, it is bounded), then it is beneficial to use confidence bounds based on the KL divergence. Such bounds are available and could be substituted directly to improve performance without loss (Garivier, 2013, and others)
. Repeating the above analysis, but exploiting the benefits of tighter confidence intervals would be an interesting (nontrivial) problem due to the need to exploit the nonsymmetric KL divergences. It is worth remarking that confidence bounds based on the KL divergence are also
not tight. For example, for Gaussian random variables they lead to the right exponential rate, but with the wrong leading factor, which in practice can improve performance as evidenced by the confidence bounds used by (near) Bayesian algorithms that exactly exploit the noise model (eg., Kaufmann et al. (2012); Lattimore (2016); Kaufmann (2016)). This is related to the “missing factor” in Hoeffding’s bound studied by Talagrand (1995).Precise lower bounds. Perhaps the most important remaining problem for the subgaussian noise model is the question of lower bounds. Besides the asymptotic results by Lai and Robbins (1985) and Burnetas and Katehakis (1997) there has been some recent progress on finitetime lower bounds, both in the OCUCB paper and a recent article by Garivier et al. (2016). Some further progress is made in Appendix A, but still there are regimes where the bounds are not very precise.
References
 Audibert and Bubeck [2009] JeanYves Audibert and Sébastien Bubeck. Minimax policies for adversarial and stochastic bandits. In Proceedings of Conference on Learning Theory (COLT), pages 217–226, 2009.

Bubeck and CesaBianchi [2012]
Sébastien Bubeck and Nicolò CesaBianchi.
Regret Analysis of Stochastic and Nonstochastic Multiarmed
Bandit Problems.
Foundations and Trends in Machine Learning. Now Publishers Incorporated, 2012.
ISBN 9781601986269. 
Burnetas and Katehakis [1997]
Apostolos N Burnetas and Michael N Katehakis.
Optimal adaptive policies for markov decision processes.
Mathematics of Operations Research, 22(1):222–255, 1997.  Cappé et al. [2013] Olivier Cappé, Aurélien Garivier, OdalricAmbrym Maillard, Rémi Munos, and Gilles Stoltz. Kullback–Leibler upper confidence bounds for optimal sequential allocation. The Annals of Statistics, 41(3):1516–1541, 2013.
 Garivier [2013] Aurélien Garivier. Informational confidence bounds for selfnormalized averages and applications. arXiv preprint arXiv:1309.3376, 2013.
 Garivier and Cappé [2011] Aurélien Garivier and Olivier Cappé. The KLUCB algorithm for bounded stochastic bandits and beyond. In Proceedings of Conference on Learning Theory (COLT), 2011.
 Garivier et al. [2016] Aurélien Garivier, Pierre Ménard, and Gilles Stoltz. Explore first, exploit next: The true shape of regret in bandit problems. arXiv preprint arXiv:1602.07182, 2016.
 Kaufmann [2016] Emilie Kaufmann. On Bayesian index policies for sequential resource allocation. arXiv preprint arXiv:1601.01190, 2016.

Kaufmann et al. [2012]
Emilie Kaufmann, Olivier Cappé, and Aurélien Garivier.
On Bayesian upper confidence bounds for bandit problems.
In
International Conference on Artificial Intelligence and Statistics
, pages 592–600, 2012.  Lai and Robbins [1985] Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics, 6(1):4–22, 1985.
 Lattimore [2015] Tor Lattimore. Optimally confident UCB: Improved regret for finitearmed bandits. Technical report, 2015. URL http://arxiv.org/abs/1507.07880.
 Lattimore [2016] Tor Lattimore. Regret analysis of the finitehorizon Gittins index strategy for multiarmed bandits. In Proceedings of Conference on Learning Theory (COLT), 2016.
 Talagrand [1995] Michel Talagrand. The missing factor in Hoeffding’s inequalities. In Annales de l’IHP Probabilités et statistiques, volume 31, pages 689–702, 1995.
 Tsybakov [2008] Alexandre B Tsybakov. Introduction to nonparametric estimation. Springer Science & Business Media, 2008.
Appendix A Lower Bounds
I now prove a kind of lower bound showing that the form of the regret in Theorem 1 is approximately correct for close to . The result contains a lower order term, which for large dominates the improvements, but is meaningful in many regimes.
Theorem 11.
Assume a standard Gaussian noise model and let be any strategy and be such that for all . Then one of the following holds:

.

There exists an with such that
where and for and and are defined as and but using .
Proof.
On our way to a contradiction, assume that neither of the items hold. Let be a suboptimal arm and be as in the second item above. I write and for expectation when when rewards are sampled from . Suppose
(9) 
Then Lemma 2.6 in the book by Tsybakov [2008] and the same argument as used by Lattimore [2015] gives
By Markov’s inequality
Therefore , which implies that
which is a contradiction. Therefore Eq. 9 does not hold for all with , but this also leads immediately to a contradiction, since then
Comments
There are no comments yet.