D Proofs of results from Section 5
D.1 Gaussian bandits
First, since \(\mathcal C^{sp} \subseteq \mathcal C\), for any algorithm \(\mathcal A\) and \(\mu \in \mathcal D\), \(R_{H_{\mathcal C}, T}(\mathcal A, \mu ) \ge R_{H_{\mathcal C^{sp}}, T}(\mathcal A, \mu )\). It thus suffices to prove a lower bound on \(R_{H_{\mathcal C^{sp}}, T}\).
Let \(H_{\Delta }(\mu ) = \frac{2}{\min _{k : \Delta _k > 0} \Delta _k^2} + \sum _{k:\Delta _k > 0} \frac{2}{\Delta _k^2}\). It was shown in [GK16] that for all \(\mu \in \mathcal D\), this function satisfies the inequalities \(H_\Delta (\mu ) \le H_{\mathcal C^{sp}}(\mu ) \le 2 H_\Delta (\mu )\). Thus \(R_{H_{\mathcal C}, T}(\mathcal A, \mu ) \ge R_{H_\Delta , T}(\mathcal A, \mu ) / 2\). From this point on, we use a construction similar to the one used in [CL16] to prove a lower bound on the ratio to \(H_\Delta \) for Bernoulli bandits. We define a Gaussian problem \(\mu \) by \(\mu _1 = 0\) (or any arbitrary value) and \(\mu _k = \mu _1 - k \Delta \) for all \(k \in \{ 2, \ldots , K\} \), for some arbitrary \(\Delta > 0\). We apply Corollary 1 to \(\mu \) and \(\lambda ^{(2)}, \ldots , \lambda ^{(K)}\), where each \(\lambda ^{(j)}\) is identical to \(\mu \) except that \(\lambda ^{(j)}_j = \mu _1 + (\mu _1 - \mu _j)\).
For our specific choice of \(\lambda ^{(j)}\),
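assuming unit-variance Gaussian arms (a normalization consistent with the constants in \(H_\Delta \)), only coordinate \(j\) contributes to the divergence between \(\mu \) and \(\lambda ^{(j)}\), and
\[
\mathrm{KL}\big(\mu _j, \lambda ^{(j)}_j\big) = \frac{1}{2} \big( \lambda ^{(j)}_j - \mu _j \big)^2 = 2 (\mu _1 - \mu _j)^2 \,.
\]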
We now use that \(\mu _k = \mu _1 - k \Delta \).
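For each \(j \ge 2\) this gives \(\mu _1 - \mu _j = j \Delta \), hence
\[
\mathrm{KL}\big(\mu _j, \lambda ^{(j)}_j\big) = 2 j^2 \Delta ^2 \,,
\qquad
H_\Delta (\mu ) = \frac{2}{(2\Delta )^2} + \sum _{k=2}^{K} \frac{2}{k^2 \Delta ^2}
= \frac{1}{\Delta ^2} \Big( \frac{1}{2} + 2 \sum _{k=2}^{K} \frac{1}{k^2} \Big) \,.
\]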
We finally have the lower bound
D.2 Bernoulli bandits
We consider the best arm identification task in bandits with two arms, both belonging to the same one-parameter exponential family. Two distributions in that family with means \(\mu _1, \mu _2\) correspond to natural parameters \(\xi _1, \xi _2\), and the Kullback-Leibler divergence between them can be written
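\[
\mathrm{KL}(\mu _1, \mu _2) = d(\xi _2, \xi _1) = \phi (\xi _2) - \phi (\xi _1) - (\xi _2 - \xi _1) \, \phi '(\xi _1) \,,
\]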
where \(\phi : \mathbb {R} \to \mathbb {R}\) is a convex function specific to the exponential family and \(d\) is its Bregman divergence. The mean parameter \(\mu _1\) and the corresponding natural parameter \(\xi _1\) are related by the equation \(\phi '(\xi _1) = \mu _1\) (or \(\xi _1 = \phi '^{-1}(\mu _1)\) since \(\phi '\) is invertible). In that setting, we want to compute
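\[
\max _{\omega \, : \, \omega _1 + \omega _2 = 1, \ \omega \ge 0} \ \inf _{x} \ \big( \omega _1 \, \mathrm{KL}(x, \mu _1) + \omega _2 \, \mathrm{KL}(x, \mu _2) \big) \,,
\]
whose inverse is the static proportions difficulty \(H_{\mathcal C^{sp}}(\mu )\).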
In the one-parameter exponential family setting described above,
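the value of this max-min problem is
\[
\mathrm{KL}(x^\star , \mu _1) = \mathrm{KL}(x^\star , \mu _2) \,,
\]
for a point \(x^\star \) with an explicit expression, derived below.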
The infimum in the definition of the difficulty is attained for any \(\omega \) at \(x(\omega ) = \phi '(\omega _1 \xi _1 + \omega _2 \xi _2)\). The maximum over the simplex is attained at \(\omega ^\star \) such that \(\omega ^\star _1 = \frac{\phi '^{-1}(x^\star ) - \xi _2}{\xi _1 - \xi _2}\), with \(x^\star = x(\omega ^\star ) = \frac{\phi (\xi _1) - \phi (\xi _2)}{\xi _1 - \xi _2}\).
We parametrize by the natural parameters:
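by the duality \(d(\xi _i, y) = \mathrm{KL}(\phi '(y), \mu _i)\), with \(x = \phi '(y)\),
\[
\omega _1 \, \mathrm{KL}(x, \mu _1) + \omega _2 \, \mathrm{KL}(x, \mu _2) = \omega _1 \, d(\xi _1, y) + \omega _2 \, d(\xi _2, y) \,.
\]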
The optimality condition for \(y\) is \(\omega _1 \frac{\partial }{\partial y}d(\xi _1, y) + \omega _2 \frac{\partial }{\partial y}d(\xi _2, y) = 0\). That derivative is \(\frac{\partial }{\partial y}d(x, y) = - (x - y)\phi ''(y)\). We obtain
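\[
\big( \omega _1 (\xi _1 - y) + \omega _2 (\xi _2 - y) \big) \, \phi ''(y) = 0 \,,
\quad \text{hence} \quad y = \omega _1 \xi _1 + \omega _2 \xi _2 \,,
\]
since \(\phi ''\) is positive by strict convexity of \(\phi \).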
We note for later the property
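\[
\omega _1 \frac{\partial }{\partial y} d\big(\xi _1, \omega _1 \xi _1 + \omega _2 \xi _2\big) + \omega _2 \frac{\partial }{\partial y} d\big(\xi _2, \omega _1 \xi _1 + \omega _2 \xi _2\big) = 0 \,.
\]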
We now want to compute
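\[
\max _{\omega } \Big( \omega _1 \, d\big(\xi _1, \omega _1 \xi _1 + \omega _2 \xi _2\big) + \omega _2 \, d\big(\xi _2, \omega _1 \xi _1 + \omega _2 \xi _2\big) \Big) \,.
\]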
At the optimal value for \(\omega \) the gradient is zero:
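with \(y = \omega _1 \xi _1 + (1 - \omega _1) \xi _2\) and \(\frac{\mathrm{d} y}{\mathrm{d} \omega _1} = \xi _1 - \xi _2\),
\[
0 = d(\xi _1, y) - d(\xi _2, y) + (\xi _1 - \xi _2) \Big( \omega _1 \frac{\partial }{\partial y} d(\xi _1, y) + \omega _2 \frac{\partial }{\partial y} d(\xi _2, y) \Big) \,.
\]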
We use the property noted above to get that \(\omega _2 \frac{\partial }{\partial y} d(\xi _2, \omega _1 \xi _1 + (1 - \omega _1) \xi _2) = - \omega _1 \frac{\partial }{\partial y} d(\xi _1, \omega _1 \xi _1 + (1 - \omega _1) \xi _2)\). We simplify the equation to
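\[
d\big(\xi _1, \omega _1 \xi _1 + \omega _2 \xi _2\big) = d\big(\xi _2, \omega _1 \xi _1 + \omega _2 \xi _2\big) \,.
\]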
We expand the Bregman divergence.
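With \(y = \omega _1 \xi _1 + \omega _2 \xi _2\),
\[
\phi (\xi _1) - \phi (y) - (\xi _1 - y) \, \phi '(y) = \phi (\xi _2) - \phi (y) - (\xi _2 - y) \, \phi '(y) \,,
\quad \text{that is,} \quad \phi '(y) = \frac{\phi (\xi _1) - \phi (\xi _2)}{\xi _1 - \xi _2} \,.
\]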
Solving this equation for \(y\) also gives the value of \(\omega \) thanks to \(y = \omega _1 \xi _1 + (1 - \omega _1) \xi _2\). We get \(\omega _1 = \frac{y - \xi _2}{\xi _1 - \xi _2}\), and \(y\) is given by the equation above. The value of the objective is then
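\[
\omega _1 \, d(\xi _1, y) + \omega _2 \, d(\xi _2, y) = d(\xi _1, y) = d(\xi _2, y) \,,
\]
using \(\omega _1 + \omega _2 = 1\) and the equality of the two divergences at the optimum.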
But we can simplify this further since \(d(\xi _1, y) = \mathrm{KL}(\phi '(y), \mu _1)\) (also equal to \(\mathrm{KL}(\phi '(y), \mu _2)\)).
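The value of the max-min problem is therefore \(\mathrm{KL}(x^\star , \mu _1) = \mathrm{KL}(x^\star , \mu _2)\) with \(x^\star = \phi '(y) = \frac{\phi (\xi _1) - \phi (\xi _2)}{\xi _1 - \xi _2}\), which proves the characterization stated above.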
If the distributions with parameters \(\mu _1\) and \(\mu _2\) are \(\sigma ^2\)-sub-Gaussian, then
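\[
\mathrm{KL}(x^\star , \mu _1) = \mathrm{KL}(x^\star , \mu _2)
\ge \frac{1}{2\sigma ^2} \max \big( (x^\star - \mu _1)^2, \, (x^\star - \mu _2)^2 \big)
\ge \frac{(\mu _1 - \mu _2)^2}{8 \sigma ^2} \,.
\]
Indeed \(\mathrm{KL}(x, \mu _i) \ge (x - \mu _i)^2 / (2 \sigma ^2)\) for \(\sigma ^2\)-sub-Gaussian distributions, and since \(x^\star \) lies between \(\mu _2\) and \(\mu _1\), the larger of the two squared distances is at least \((\mu _1 - \mu _2)^2 / 4\).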
For an exponential family of Gaussians with the same variance \(\sigma ^2\), there is equality, and the two terms of the maximum are equal.
Gaussian case
For Gaussian distributions, the functions used above are
\(\phi (a) = \frac{1}{2} \sigma ^2 a^2\) with \(\phi '(a) = a \sigma ^2\), \(\phi '^{-1}(x) = x / \sigma ^2\), \(\phi (\phi '^{-1}(x)) = \frac{1}{2 \sigma ^2} x^2\)
\(d(a, b) = \sigma ^2(\frac{1}{2} a^2 - \frac{1}{2} b^2 - (a - b) b) = \frac{1}{2}\sigma ^2(a - b)^2\)
\(\mathrm{KL}(x, y) = \frac{1}{2 \sigma ^2}(x - y)^2\).
Using these values in Lemma 7 gives a static proportions difficulty equal to the inverse of \(\frac{1}{8 \sigma ^2}(\mu _1 - \mu _2)^2\).
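Indeed, in that case \(x^\star = \frac{\phi (\xi _1) - \phi (\xi _2)}{\xi _1 - \xi _2} = \frac{\sigma ^2 (\xi _1 + \xi _2)}{2} = \frac{\mu _1 + \mu _2}{2}\), and \(\mathrm{KL}(x^\star , \mu _1) = \frac{1}{2 \sigma ^2} \big( \frac{\mu _1 - \mu _2}{2} \big)^2 = \frac{(\mu _1 - \mu _2)^2}{8 \sigma ^2}\).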
Bernoulli case
For Bernoulli distributions, the functions used above are
\(\phi (a) = \log (1 + e^a)\) with \(\phi '(a) = \frac{e^a}{1 + e^a}\), \(\phi '^{-1}(x) = \log \frac{x}{1 - x}\), \(\phi (\phi '^{-1}(x)) = - \log (1 - x)\)
\(d(a, b) = \log (1 + e^a) - \log (1 + e^b) - (a - b) \frac{e^b}{1 + e^b}\)
\(\mathrm{KL}(x, y) = x \log \frac{x}{y} + (1-x) \log \frac{1-x}{1-y}\).
Using these values in Lemma 7 proves Lemma 1.
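Explicitly, for Bernoulli arms, \(x^\star = \frac{\phi (\xi _1) - \phi (\xi _2)}{\xi _1 - \xi _2} = \log \frac{1 - \mu _2}{1 - \mu _1} \Big/ \log \frac{\mu _1 (1 - \mu _2)}{\mu _2 (1 - \mu _1)}\), and the static proportions difficulty is \(1 / \mathrm{KL}(x^\star , \mu _1)\).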
We now gather a few limits that will be useful in the proof of Theorem 8. These results use the explicit formulas for \(H_{\mathcal C^{sp}}\) derived above.
For \(x \in (0,1/2)\), let \(\mu (x) = (x(1+x), x)\), \(\lambda ^{(1)}(x) = (x/2, x)\), \(\lambda ^{(2)}(x) = (x(1+x), 1/2)\). Then Corollary 1 gives
As \(x \to 0\), the quantity on the right converges to a limit strictly greater than 1 (approximately 1.22).