3 Lower bounds on the difficulty ratio
Most of the bounds on the difficulty ratio we derive are consequences of the following theorem.
Theorem 2. Let \(H : \mathcal D \to \mathbb{R}^+\) be an arbitrary difficulty function. Let \(\mu , \lambda \in \mathcal D\) be such that \(i^*(\lambda ) \ne i^*(\mu )\) and \(H(\lambda ) \le \sqrt{T}\). Then for any algorithm \(\mathcal A\),
\[
R_{H, T}(\mathcal A, \lambda ) \ge \frac{1 - p_{\mu , T}(\mathcal A)}{H(\lambda ) \sum_{k=1}^K \mathbb{E}_{\mu , \mathcal A}[N_{k, T}/T] \, \mathrm{KL}(\mu _k, \lambda _k) + \log (2)/\sqrt{T}} \: .
\]
The proof of this inequality follows the standard bandit lower bound argument, appealing to the data processing inequality for the KL divergence, which can be found for example in [GMS19]; it is given in Appendix B. The only mildly original step is to put \(H(\lambda )\) on the right-hand side of the inequality instead of writing a lower bound on \(p_{\lambda , T}(\mathcal A)\) (which would give a bound akin to Lemma 6 of [BGS22] when taking the limit as \(T \to + \infty \)).
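In outline, and up to the bookkeeping of Appendix B, the argument combines two standard facts. First, for any \([0,1]\)-valued, \(\mathcal F_T\)-measurable random variable \(Z\), the data processing inequality gives
\[
\sum_{k=1}^K \mathbb{E}_{\mu , \mathcal A}[N_{k, T}] \, \mathrm{KL}(\mu _k, \lambda _k) \ge \mathrm{kl}\left( \mathbb{E}_{\mu , \mathcal A}[Z], \mathbb{E}_{\lambda , \mathcal A}[Z] \right) ,
\]
where \(\mathrm{kl}\) is the relative entropy between Bernoulli distributions. Second, \(\mathrm{kl}(p, q) \ge p \log (1/q) - \log 2\). Taking \(Z\) to be the indicator that the algorithm returns \(i^*(\mu )\), so that \(\mathbb{E}_{\mu , \mathcal A}[Z] = 1 - p_{\mu , T}(\mathcal A)\) and \(\mathbb{E}_{\lambda , \mathcal A}[Z] \le p_{\lambda , T}(\mathcal A)\) (returning \(i^*(\mu )\) is an error under \(\lambda \) since \(i^*(\lambda ) \ne i^*(\mu )\)), we obtain
\[
\log \frac{1}{p_{\lambda , T}(\mathcal A)} \le \frac{\sum_{k=1}^K \mathbb{E}_{\mu , \mathcal A}[N_{k, T}] \, \mathrm{KL}(\mu _k, \lambda _k) + \log 2}{1 - p_{\mu , T}(\mathcal A)} \: .
\]
Rearranging, then using \(H(\lambda ) \le \sqrt{T}\) to bound the \(\log 2\) term, gives the inequality of Theorem 2.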
Theorem 3. For any consistent algorithm family \(\mathcal A\), for all \(\mu \in \mathcal D\) and all sets \(D(\mu ) \subseteq \mathrm{Alt}(\mu )\),
\[
\sup_{\lambda \in D(\mu )} \limsup_{T \to +\infty } R_{H, T}(\mathcal A, \lambda ) \ge \left( \max_{\omega \in \triangle _K} \inf_{\lambda \in D(\mu )} H(\lambda ) \sum_{k=1}^K \omega _k \, \mathrm{KL}(\mu _k, \lambda _k) \right)^{-1} \: .
\]
In particular,
\[
\sup_{\lambda \in \mathcal D} \limsup_{T \to +\infty } R_{H, T}(\mathcal A, \lambda ) \ge \left( \inf_{\mu \in \mathcal D} \max_{\omega \in \triangle _K} \inf_{\lambda \in \mathrm{Alt}(\mu )} H(\lambda ) \sum_{k=1}^K \omega _k \, \mathrm{KL}(\mu _k, \lambda _k) \right)^{-1} \: .
\]
Proof. Let \(\mu \in \mathcal D\). Since \(\triangle _K\) is compact, the sequence \((\mathbb{E}_{\mu , \mathcal A}[N_{T}/T])_{T \in \mathbb{N}}\) has a subsequence, indexed by some \((T_n)_{n \in \mathbb{N}}\), which converges to a vector \(\omega _{\mu } \in \triangle _K\). Let \(\lambda \in \mathrm{Alt}(\mu )\). Since \(H(\lambda ) \le \sqrt{T_n}\) for \(n\) large enough, Theorem 2 gives
\[
\left( R_{H, T_n}(\mathcal A, \lambda ) \right)^{-1} \le \frac{H(\lambda ) \sum_{k=1}^K \mathbb{E}_{\mu , \mathcal A}[N_{k, T_n}/T_n] \, \mathrm{KL}(\mu _k, \lambda _k) + \log (2)/\sqrt{T_n}}{1 - p_{\mu , T_n}(\mathcal A)} \: .
\]
Since \(\mathcal A\) is consistent, \(1 - p_{\mu , T_n}(\mathcal A) \to 1\). Taking a limit as \(n \to +\infty \), we have
\[
\left( \liminf_{n \to +\infty } R_{H, T_n}(\mathcal A, \lambda ) \right)^{-1} \le H(\lambda ) \sum_{k=1}^K \omega _{\mu , k} \, \mathrm{KL}(\mu _k, \lambda _k) \: .
\]
Since the liminf along a subsequence is at most the limsup of the whole sequence, the same bound holds for \(\left( \limsup_{T \to +\infty } R_{H, T}(\mathcal A, \lambda ) \right)^{-1}\). We finally take an infimum over \(\lambda \in D(\mu )\) on both sides of the inequality, and replace \(\omega _{\mu }\) by a maximum over the simplex: since \(\omega _{\mu } \in \triangle _K\), \(\inf_{\lambda \in D(\mu )} H(\lambda ) \sum_k \omega _{\mu , k} \mathrm{KL}(\mu _k, \lambda _k) \le \max_{\omega \in \triangle _K} \inf_{\lambda \in D(\mu )} H(\lambda ) \sum_k \omega _k \mathrm{KL}(\mu _k, \lambda _k)\). This proves the first statement. The second inequality is obtained by choosing \(D(\mu ) = \mathrm{Alt}(\mu )\) and taking an infimum over \(\mu \in \mathcal D\).
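As a simple illustration, for a singleton set \(D(\mu ) = \{ \lambda \} \) the maximum over the simplex puts all the weight on the coordinate with the largest divergence, and the first statement reads
\[
\limsup_{T \to +\infty } R_{H, T}(\mathcal A, \lambda ) \ge \left( H(\lambda ) \max_{k \in [K]} \mathrm{KL}(\mu _k, \lambda _k) \right)^{-1} \: .
\]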
The second inequality of Theorem 3 recovers Theorem 1 of [KTH22], which they prove differently: they introduce typical concentration events, reduce the study to those events and use a change of measure. Their proof does not give an explicit non-asymptotic version of the bound, unlike Theorem 2. In contrast, our short proof is a direct application of the data processing inequality for the KL divergence.
Instead of an inequality on the supremum of the limsup of \(R_{H, T}(\mathcal A, \mu )\) as in Theorem 3, we can also get a bound on the liminf of the supremum of \(R_{H, T}(\mathcal A, \mu )\) over sets with bounded \(H\): see Theorem 10 in Appendix B. We will use Theorem 3 to describe the asymptotic difficulty of fixed budget identification. We could derive bounds for a fixed \(T\) by using Theorem 2 instead, at the cost of second order terms and of restricting the alternative to problems with \(H\) bounded by \(\sqrt{T}\), that is, to problems which are not too hard at time \(T\).
Corollary 1. Let \(\mu , \lambda ^{(1)}, \ldots , \lambda ^{(K)} \in \mathcal D\) be such that for all \(j \in [K]\), \(i^*(\lambda ^{(j)}) \ne i^*(\mu )\), \(H(\lambda ^{(j)}) > 0\), and \(\lambda ^{(j)}\) differs from \(\mu \) only along coordinate \(j\). Then for all algorithms \(\mathcal A\) such that \(\lim _{T \to +\infty } p_{\mu , T}(\mathcal A) = 0\),
\[
\max_{j \in [K]} \limsup_{T \to +\infty } R_{H, T}(\mathcal A, \lambda ^{(j)}) \ge \sum_{j=1}^K \frac{1}{H(\lambda ^{(j)}) \, \mathrm{KL}(\mu _j, \lambda ^{(j)}_j)} \: .
\]
Proof. We apply the first inequality of Theorem 3 with \(D(\mu ) = \{ \lambda ^{(1)}, \ldots , \lambda ^{(K)}\} \) (its proof only uses \(p_{\mu , T_n}(\mathcal A) \to 0\) at \(\mu \), which is the assumption of the corollary). Since \(\lambda ^{(j)}\) differs from \(\mu \) only along coordinate \(j\), we have \(\sum_k \omega _k \mathrm{KL}(\mu _k, \lambda ^{(j)}_k) = \omega _j \mathrm{KL}(\mu _j, \lambda ^{(j)}_j)\) for all \(\omega \in \triangle _K\). The optimal \(\omega \) equalizes \(H(\lambda ^{(j)}) \omega _j \mathrm{KL}(\mu _j, \lambda _j^{(j)})\) over all \(j\), which gives the result.
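Explicitly, writing \(c_j = H(\lambda ^{(j)}) \mathrm{KL}(\mu _j, \lambda ^{(j)}_j)\), the equalizing allocation and the resulting maximin value are
\[
\omega _j = \frac{c_j^{-1}}{\sum_{m=1}^K c_m^{-1}} \: , \qquad \max_{\omega \in \triangle _K} \min_{j \in [K]} c_j \omega _j = \left( \sum_{j=1}^K c_j^{-1} \right)^{-1} \: ,
\]
since any \(\omega \) with \(\min_j c_j \omega _j\) above that value would satisfy \(\omega _j > c_j^{-1} \big( \sum_m c_m^{-1} \big)^{-1}\) for all \(j\) and hence \(\sum_j \omega _j > 1\). Inverting the maximin value yields the sum in the corollary.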
The sum on the right-hand side of Corollary 1 is very close to the quantity \(h^*\) defined in [CL16] in the setting of Bernoulli bandits, with \(H\) the sum of inverse squared gaps. This is due to the similar construction of a set of points in the alternative that each differ from a given \(\mu \in \mathcal D\) in one coordinate only. That construction was reused by [AKK\(^{+}\)21] to bound a quantity called expected policy regret, and by [YT22] to prove a lower bound for fixed budget BAI in linear bandits.
The main advantage of Corollary 1 is that it is simpler to use than Theorem 10, but it can lead to worse bounds. For example, for BAI in two-arm Gaussian bandits with known variance 1 and \(H = H_{\mathcal C^{sp}}\), Theorem 3 gives \(\sup _{\lambda \in \mathcal D} R_{H, \infty }(\mathcal A, \lambda ) \ge 1\), while the best bound that can be achieved with Corollary 1 is \(1/2\). That task is very simple, as remarked by [KCG16]: the oracle fixed proportions are independent of the means (both arms are played equally), so that the algorithm playing those proportions has \(\sup _{\lambda \in \mathcal D} R_{H, \infty }(\mathcal A, \lambda ) \le 1\). Theorem 3 shows that this is tight: no adaptive algorithm can beat it everywhere. We could not reach that conclusion with the weaker Corollary 1, since it only proves a \(1/2\) lower bound.
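To see why the oracle proportions do not depend on the means in that example, recall that \(\mathrm{KL}(a, b) = (a - b)^2/2\) for unit-variance Gaussians and write \(\Delta = |\mu _1 - \mu _2|\). For an allocation \(\omega \in \triangle _2\), the closest alternatives merge the two means into a common value \(x\), so that
\[
\inf_{\lambda \in \mathrm{Alt}(\mu )} \sum_{k=1}^2 \omega _k \, \mathrm{KL}(\mu _k, \lambda _k) = \inf_{x \in \mathbb{R}} \left( \omega _1 \frac{(\mu _1 - x)^2}{2} + \omega _2 \frac{(\mu _2 - x)^2}{2} \right) = \frac{\omega _1 \omega _2 \Delta ^2}{2} \: ,
\]
with minimizer \(x = \omega _1 \mu _1 + \omega _2 \mu _2\). The maximizer over \(\triangle _2\) is \(\omega = (1/2, 1/2)\) whatever the value of \(\Delta \), with maximal value \(\Delta ^2/8\), which is the observation of [KCG16] quoted above.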