Lower bounds for hypothesis testing based on information theory

4 The risk of simple binary hypothesis testing

The goal of this section is to obtain bounds on the risk of simple binary hypothesis testing \(B_\pi (\mu ^{\otimes n}, \nu ^{\otimes n})\), or on \(\inf _{E} \max \{ \mu ^{\otimes n}(E), \nu ^{\otimes n}(E^c)\} \), that show how those quantities depend on the number of samples \(n \in \mathbb {N}\).

The main idea is to relate these quantities to divergences that tensorize, or for which we have tensorization inequalities. Those divergences are \(\operatorname{KL}\), \(R_\alpha \), \(C_\alpha \) and \(\operatorname{JS}_\alpha \) (and, more generally, the mutual information if we want to consider more than two hypotheses).

4.1 Bounding the Bayes risk with tensorizing divergences

4.1.1 Upper bound with the Chernoff information

Lemma 4.1.1

Let \(\xi = (\xi _0, \xi _1)\) be a finite measure on \(\{ 0,1\} \). Let \(\zeta \) be a measure such that \(\mu \ll \zeta \) and \(\nu \ll \zeta \), and let \(p = \frac{d \mu }{d\zeta }\) and \(q = \frac{d \nu }{d\zeta }\). For \(\alpha \in (0,1)\) and \(g_\alpha (x) = \min \{ (\alpha -1)x, \alpha x\} \),

\begin{align*} \mathcal B_\xi (\mu , \nu ) = e^{-(1 - \alpha ) R_\alpha (\xi _0\mu , \xi _1\nu )} \int _x \exp \left(g_\alpha \left( \log \frac{\xi _1 q(x)}{\xi _0 p(x)} \right)\right) \partial (\xi _0\mu )^{(\alpha , \xi _1\nu )} \: . \end{align*}
Proof
\begin{align*} \mathcal B_\xi (\mu , \nu ) & = \int _x \min \left\{ \xi _0 p(x), \xi _1 q(x)\right\} \partial \zeta \\ & = \int _x (\xi _0 p(x))^\alpha (\xi _1 q(x))^{1-\alpha } \min \left\{ \left(\frac{\xi _0p(x)}{\xi _1q(x)}\right)^{1 - \alpha }, \left(\frac{\xi _1q(x)}{\xi _0p(x)}\right)^\alpha \right\} \partial \zeta \\ & = e^{-(1 - \alpha ) R_\alpha (\xi _0\mu , \xi _1\nu )} \int _x \min \left\{ \left(\frac{\xi _1q(x)}{\xi _0p(x)}\right)^{\alpha - 1}, \left(\frac{\xi _1q(x)}{\xi _0p(x)}\right)^\alpha \right\} \partial (\xi _0\mu )^{(\alpha , \xi _1\nu )} \: . \end{align*}

Corollary 4.1.2

For \(\alpha \in (0,1)\),

\begin{align*} \mathcal B_\xi (\mu , \nu ) \le e^{-(1 - \alpha ) R_\alpha (\xi _0\mu , \xi _1\nu )} = \xi _0^\alpha \xi _1^{1-\alpha } e^{-(1 - \alpha ) R_\alpha (\mu , \nu )} \: . \end{align*}
Proof

Use \(g_\alpha (x) \le 0\) in Lemma 4.1.1.
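
For instance, taking \(\alpha = 1/2\) in Corollary 4.1.2 and using the equality \(e^{-\frac{1}{2} R_{1/2}(\mu , \nu )} = 1 - \operatorname{H^2}(\mu , \nu )\) for probability measures (Lemma 2.7.1) gives a Bhattacharyya-type bound,

\begin{align*} \mathcal B_\xi (\mu , \nu ) \le \sqrt{\xi _0 \xi _1} \, e^{-\frac{1}{2} R_{1/2}(\mu , \nu )} = \sqrt{\xi _0 \xi _1} \left(1 - \operatorname{H^2}(\mu , \nu )\right) \: .  \end{align*}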

Lemma 4.1.3

\begin{align*} \mathcal B_\xi (\mu , \nu ) \le e^{- C_1(\xi _0\mu , \xi _1\nu )} \: . \end{align*}
Proof

Optimize over \(\alpha \) in Corollary 4.1.2: the optimal value gives the Chernoff information thanks to Lemma 2.8.4.

For probability measures,

\begin{align*} 1 - \operatorname{TV}(\mu , \nu ) \le e^{- C_1(\mu , \nu )} \: . \end{align*}
Proof

By definition, since \(\mu \) and \(\nu \) are probability measures, \(1 - \operatorname{TV}(\mu , \nu ) = \mathcal B_{(1,1)}(\mu , \nu )\). Then apply Lemma 4.1.3.
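
As an illustration, take two Gaussian measures with the same variance, \(\mu = \mathcal N(0, \sigma ^2)\) and \(\nu = \mathcal N(\theta , \sigma ^2)\) (here \(\theta \) and \(\sigma \) are illustration parameters). A standard computation gives \(R_\alpha (\mu , \nu ) = \frac{\alpha \theta ^2}{2 \sigma ^2}\) for \(\alpha \in (0,1)\), hence \(C_1(\mu , \nu ) = \sup _{\alpha \in (0,1)} (1 - \alpha ) R_\alpha (\mu , \nu ) = \frac{\theta ^2}{8 \sigma ^2}\) (Lemma 2.8.4, with the supremum attained at \(\alpha = 1/2\)). The previous bound then reads

\begin{align*} 1 - \operatorname{TV}(\mathcal N(0, \sigma ^2), \mathcal N(\theta , \sigma ^2)) \le e^{- \theta ^2 / (8 \sigma ^2)} \: , \end{align*}

so a single sample already distinguishes the two hypotheses well as soon as \(\theta \gg \sigma \).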

4.1.2 Lower bounds using the data-processing inequality

The ideas behind the next two theorems are the same, up to the order of the arguments of a Kullback-Leibler divergence. The hidden parameter and the data in simple binary hypothesis testing are generated from a distribution \(\pi \otimes P \in \mathcal P(\Theta \times \mathcal X)\). The reason an estimator can achieve low risk is that \(P\) depends on the parameter: if it did not, the Bayes risk with prior \((\alpha , 1 - \alpha )\) would be \(\min \{ \alpha , 1 - \alpha \} \). We will compare \(\pi \otimes P\) to a product distribution \(\pi ' \times \xi \) for \(\pi ' \in \mathcal P(\Theta )\) and \(\xi \in \mathcal P(\mathcal X)\). To do so, we use the data-processing inequality for \(\operatorname{KL}\), starting either from \(\operatorname{KL}(\pi \otimes P, \pi ' \times \xi )\) or from \(\operatorname{KL}(\pi ' \times \xi , \pi \otimes P)\) (the two theorems correspond to the two choices). The data-processing inequality shows that those divergences are at least as large as the binary divergences \(\operatorname{kl}\) between the Bayes risks of the two corresponding estimation tasks.

We start with some preparatory lemmas.

Lemma 4.1.5

Let \(\pi , \xi \in \mathcal P(\Theta )\) and \(P, Q : \Theta \rightsquigarrow \mathcal X\). Suppose that the loss \(\ell '\) takes values in \([0,1]\). Then

\begin{align*} \operatorname{KL}(\pi \otimes Q, \xi \otimes P) & \ge \operatorname{kl}(\mathcal R_\pi ^Q, \mathcal R_\xi ^P) \: . \end{align*}
Proof

This is Lemma 2.3.62 specialized to the Kullback-Leibler divergence.

Corollary 4.1.6

Let \(\pi , \xi \in \mathcal P(\Theta )\) and \(P, Q : \Theta \rightsquigarrow \mathcal X\). Suppose that the loss \(\ell '\) takes values in \([0,1]\). Then

\begin{align*} \operatorname{KL}(\pi \otimes Q, \pi \otimes P) & \ge \operatorname{kl}(\mathcal R_\pi ^Q, \mathcal R_\xi ^P) - \operatorname{KL}(\pi , \xi ) \: . \end{align*}
Proof

Use \(\operatorname{KL}(\pi \otimes Q, \xi \otimes P) = \operatorname{KL}(\pi , \xi ) + \operatorname{KL}(\pi \otimes Q, \pi \otimes P)\) in Lemma 4.1.5.

Let \(\alpha , \beta \in (0, 1)\). Let \(P, Q : \{ 0,1\} \rightsquigarrow \mathcal X\). We write \(\pi _\alpha \) for the probability measure on \(\{ 0,1\} \) with \(\pi _\alpha (\{ 0\} ) = \alpha \). Then

\begin{align*} \operatorname{KL}(\pi _\beta \otimes Q, \pi _\alpha \otimes P) & \ge \operatorname{kl}(B_\beta (Q(0), Q(1)), B_\alpha (P(0), P(1))) \: . \end{align*}
Proof

Apply Lemma 4.1.5.

Corollary 4.1.8

Let \(\mu , \nu \in \mathcal P(\mathcal X)\) and let \(\alpha , \beta \in (0, 1)\). Let \(P, Q : \{ 0,1\} \rightsquigarrow \mathcal X\). We write \(\pi _\alpha \) for the probability measure on \(\{ 0,1\} \) with \(\pi _\alpha (\{ 0\} ) = \alpha \). Then

\begin{align*} \operatorname{KL}(\pi _\beta \otimes Q, \pi _\beta \otimes P) & \ge \operatorname{kl}(B_\beta (Q(0), Q(1)), B_\alpha (P(0), P(1))) - \operatorname{kl}(\beta , \alpha ) \: . \end{align*}
Proof

Apply Corollary 4.1.6.

Lemma 4.1.9

Let \(\mu , \nu , \xi \in \mathcal P(\mathcal X)\) and let \(\alpha , \beta \in (0, 1)\). Let \(P : \{ 0,1\} \rightsquigarrow \mathcal X\) be the kernel with \(P(0) = \mu \) and \(P(1) = \nu \). We write \(\pi _\alpha \) for the probability measure on \(\{ 0,1\} \) with \(\pi _\alpha (\{ 0\} ) = \alpha \). Let \(\bar{\beta } = \min \{ \beta , 1 - \beta \} \). Then

\begin{align*} \operatorname{KL}(\pi _\alpha \otimes P, \pi _\alpha \times \xi ) & \ge \operatorname{kl}(B_\alpha (\mu , \nu ), \bar{\beta }) - \operatorname{kl}(\alpha , \bar{\beta }) \: , \\ \operatorname{KL}(\pi _\beta \times \xi , \pi _\beta \otimes P) & \ge \operatorname{kl}(\bar{\beta }, B_\alpha (\mu , \nu )) - \operatorname{kl}(\bar{\beta }, \alpha ) \: . \end{align*}
Proof

We apply Corollary 4.1.8 for the constant kernel with value \(\xi \) and use Lemma 1.2.6.

Theorem 4.1.10 Fano’s inequality, binary case

Let \(\mu , \nu \in \mathcal P(\mathcal X)\) and let \(\alpha \in (0, 1)\).

\begin{align*} h_2(\alpha ) - h_2(B_\alpha (\mu , \nu )) \le \operatorname{JS}_\alpha (\mu , \nu ) \: , \end{align*}

in which \(h_2: x \mapsto x\log \frac{1}{x} + (1 - x)\log \frac{1}{1 - x}\) is the binary entropy function.

Proof

Apply Lemma 2.9.5 and then Lemma 4.1.9 with \(\beta = 1/2\) to get

\begin{align*} \operatorname{JS}_\alpha (\mu , \nu ) & = \inf _{\xi \in \mathcal P(\mathcal X)}\operatorname{KL}(\pi _\alpha \otimes P, \pi _\alpha \times \xi ) \\ & \ge \operatorname{kl}(B_\alpha (\mu , \nu ), 1/2) - \operatorname{kl}(\alpha , 1/2) \\ & = h_2(\alpha ) - h_2(B_\alpha (\mu , \nu )) \: . \end{align*}

Note that \(\beta = 1/2\) is the value that results in the best bound: the lower bound on \(\operatorname{JS}_\alpha (\mu , \nu )\) from some \(\beta \le 1/2\) would be

\begin{align*} \operatorname{kl}(B_\alpha (\mu , \nu ), \beta ) - \operatorname{kl}(\alpha , \beta ) & = h_2(\alpha ) - h_2(B_\alpha (\mu , \nu )) - (\alpha - B_\alpha (\mu , \nu )) \log \frac{1 - \beta }{\beta } \\ & \le h_2(\alpha ) - h_2(B_\alpha (\mu , \nu )) \: . \end{align*}
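
For example, with a uniform prior, i.e. \(\alpha = 1/2\) (for which \(h_2(1/2) = \log 2\)), Theorem 4.1.10 reads

\begin{align*} h_2(B_{1/2}(\mu , \nu )) \ge \log 2 - \operatorname{JS}_{1/2}(\mu , \nu ) \: . \end{align*}

Since \(B_{1/2}(\mu , \nu ) \le 1/2\) and \(h_2\) is increasing on \([0, 1/2]\), this can be inverted into an explicit lower bound on the Bayes risk \(B_{1/2}(\mu , \nu )\) in terms of \(\operatorname{JS}_{1/2}(\mu , \nu )\).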

Theorem 4.1.11

Let \(\mu , \nu \in \mathcal P(\mathcal X)\). For \(\alpha , \beta \in (0, 1/2]\),

\begin{align*} \beta \log \frac{\alpha }{B_\alpha (\mu , \nu )} + (1 - \beta ) \log \frac{1 - \alpha }{1 - B_\alpha (\mu , \nu )} & \le (1 - \beta ) R_{\beta }(\mu , \nu ) \: . \end{align*}

As a consequence,

\begin{align*} \beta \log \frac{\alpha }{B_\alpha (\mu , \nu )} & \le (1 - \beta ) (R_{\beta }(\mu , \nu ) + \log 2) \: . \end{align*}

In particular, \(\log \frac{\alpha }{B_\alpha (\mu , \nu )} \le R_{1/2}(\mu , \nu ) + \log 2\) .

Proof

By Lemma 2.6.20 and then Lemma 4.1.9,

\begin{align*} (1 - \beta ) R_{\beta }(\mu , \nu ) & = \inf _{\xi \in \mathcal P(\mathcal X)} \operatorname{KL}(\pi _\beta \times \xi , \pi _\beta \otimes P) \\ & \ge \operatorname{kl}(\beta , B_\alpha (\mu , \nu )) - \operatorname{kl}(\beta , \alpha ) \\ & = \beta \log \frac{\alpha }{B_\alpha (\mu , \nu )} + (1 - \beta )\log \frac{1 - \alpha }{1 - B_\alpha (\mu , \nu )} \: . \end{align*}

The first particular case is obtained by using \(\alpha \le 1/2\) and \(B_\alpha (\mu , \nu ) \ge 0\). The second one further sets \(\beta = 1/2\).

TODO: move or remove

Lemma 4.1.12

Let \(\mu , \nu \) be two probability measures on \(\mathcal X\) and \(E\) an event. Let \(\alpha \in (0,1)\). Then

\begin{align*} \mu (E)^\alpha + \nu (E^c)^{1 - \alpha } \ge \exp \left(-(1 - \alpha ) R_{\alpha }(\mu , \nu )\right) \: . \end{align*}
Proof

Let \(\mu _E\) and \(\nu _E\) be the two Bernoulli distributions with respective means \(\mu (E)\) and \(\nu (E)\). By Lemma 2.6.9, \(R_\alpha (\mu , \nu ) \ge R_\alpha (\mu _E, \nu _E)\). That divergence is

\begin{align*} R_\alpha (\mu _E, \nu _E) & = \frac{1}{\alpha - 1}\log \left(\mu (E)^\alpha \nu (E)^{1 - \alpha } + \mu (E^c)^\alpha \nu (E^c)^{1 - \alpha }\right) \\ & \ge \frac{1}{\alpha - 1}\log \left(\mu (E)^\alpha + \nu (E^c)^{1 - \alpha }\right) \: . \end{align*}

Let \(\mu , \nu \) be two probability measures on \(\mathcal X\) and \(E\) an event. Then

\begin{align*} \sqrt{\mu (E)} + \sqrt{\nu (E^c)} \ge \exp \left(-\frac{1}{2} R_{1/2}(\mu , \nu )\right) = 1 - \operatorname{H^2}(\mu , \nu ) \: . \end{align*}
Proof

The inequality is an application of Lemma 4.1.12 for \(\alpha = 1/2\). The equality is Lemma 2.7.1.
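
A typical use of this bound: if a test \(E\) has small error under \(\mu \), then its error under \(\nu \) cannot be too small either. For example, if \(\mu (E) \le \delta \) for some \(\delta \) with \(\sqrt{\delta } \le 1 - \operatorname{H^2}(\mu , \nu )\), then

\begin{align*} \nu (E^c) \ge \left(1 - \operatorname{H^2}(\mu , \nu ) - \sqrt{\delta }\right)^2 \: . \end{align*}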

4.1.3 Lower bounds using the change of measure lemma

The main tool of this section is the next lemma, which relates the probabilities of an event under two measures to the deviations of the log-likelihood ratio between them.

Lemma 4.1.14 Change of measure lemma

Let \(\mu , \nu \) be two measures on \(\mathcal X\) with \(\mu \ll \nu \) and let \(E\) be an event on \(\mathcal X\). Let \(\beta \in \mathbb {R}\). Then

\begin{align*} \nu (E) e^{\beta } \ge \mu (E) - \mu \left\{ \log \frac{d \mu }{d \nu } {\gt} \beta \right\} \: . \end{align*}
Proof
\begin{align*} \nu (E) \ge \mu \left[\mathbb {I}(E) e^{- \log \frac{d \mu }{d \nu } }\right] & \ge \mu \left[\mathbb {I}\left(E \cap \left\{ \log \frac{d \mu }{d \nu } \le \beta \right\} \right) e^{- \log \frac{d \mu }{d \nu } }\right] \\ & \ge e^{- \beta }\mu \left(E \cap \left\{ \log \frac{d \mu }{d \nu } \le \beta \right\} \right) \\ & \ge e^{- \beta }\left( \mu (E) - \mu \left\{ \log \frac{d \mu }{d \nu } {\gt} \beta \right\} \right) \: . \end{align*}
Lemma 4.1.15 Change of measure - functions

Let \(\mu , \nu \) be two measures on \(\mathcal X\) with \(\mu \ll \nu \) and let \(f : \mathcal X \to [0,1]\) be a measurable function. Let \(\beta \in \mathbb {R}\). Then

\begin{align*} \nu [f] e^{\beta } \ge \mu [f] - \mu \left\{ \log \frac{d \mu }{d \nu } {\gt} \beta \right\} \: . \end{align*}
Proof
\begin{align*} \nu [f] \ge \mu \left[f e^{- \log \frac{d \mu }{d \nu } }\right] & \ge \mu \left[f \mathbb {I}\left\{ \log \frac{d \mu }{d \nu } \le \beta \right\} e^{- \log \frac{d \mu }{d \nu } }\right] \\ & \ge e^{- \beta }\mu \left[f \mathbb {I}\left\{ \log \frac{d \mu }{d \nu } \le \beta \right\} \right] \\ & \ge e^{- \beta }\left( \mu [f] - \mu \left[f \mathbb {I}\left\{ \log \frac{d \mu }{d \nu } {\gt} \beta \right\} \right] \right) \\ & \ge e^{- \beta }\left( \mu [f] - \mu \left\{ \log \frac{d \mu }{d \nu } {\gt} \beta \right\} \right) \: . \end{align*}

The first and second inequalities use \(f \ge 0\) and the last inequality uses \(f \le 1\) .

Lemma 4.1.16

Consider an estimation problem with loss \(\ell ' : \mathcal Y \times \mathcal Z \to [0,1]\). Let \(\pi , \zeta \in \mathcal P(\Theta )\) and \(P, Q : \Theta \rightsquigarrow \mathcal X\) be such that \(\zeta \otimes Q \ll \pi \otimes P\). Then for all \(\beta \in \mathbb {R}\),

\begin{align*} \mathcal R_\pi ^P e^{\beta } \ge \mathcal R_\zeta ^Q - (\zeta \otimes Q)\left\{ \log \frac{d (\zeta \otimes Q)}{d (\pi \otimes P)} {\gt} \beta \right\} \: . \end{align*}
Proof

Let \(\hat{y}_B\) be a Bayes estimator for \((P, y, \ell ')\). If no such estimator exists, the proof can be adapted by taking an estimator with risk \(\varepsilon \)-close to the Bayes risk and then taking the infimum over \(\varepsilon {\gt} 0\). Apply Lemma 4.1.15 to the function \((\theta , x) \mapsto \hat{y}_B(x)\left[ z \mapsto \ell '(y(\theta ), z) \right]\), which takes values in \([0,1]\), and to the measures \(\pi \otimes P\) and \(\zeta \otimes Q\).

\begin{align*} \mathcal R_\pi ^P e^{\beta } & \ge (\zeta \otimes Q)\left[ (\theta , x) \mapsto \hat{y}_B(x)\left[ z \mapsto \ell '(y(\theta ), z) \right] \right] - (\zeta \otimes Q)\left\{ \log \frac{d (\zeta \otimes Q)}{d (\pi \otimes P)} {\gt} \beta \right\} \\ & \ge \mathcal R_\zeta ^Q - (\zeta \otimes Q)\left\{ \log \frac{d (\zeta \otimes Q)}{d (\pi \otimes P)} {\gt} \beta \right\} \: . \end{align*}

Lemma 4.1.17

Consider an estimation problem with loss \(\ell ' : \mathcal Y \times \mathcal Z \to [0,1]\). Let \(\pi , \zeta \in \mathcal P(\Theta )\) and \(P : \Theta \rightsquigarrow \mathcal X\). Then for all \(\beta \in \mathbb {R}\),

\begin{align*} \mathcal R_\pi ^P e^{\beta } \ge \mathcal R_\zeta ^{d_{\mathcal X}} - \inf _{\xi \in \mathcal P(\mathcal X)}(\zeta \times \xi )\left\{ \log \frac{d (\zeta \times \xi )}{d (\pi \otimes P)} {\gt} \beta \right\} \: , \end{align*}

in which the infimum over \(\xi \) is restricted to probability measures such that \(\zeta \times \xi \ll \pi \otimes P\) and \(d_{\mathcal X} : \Theta \rightsquigarrow *\) is the discard kernel.

Proof

For any \(\xi \), we apply Lemma 4.1.16 to \(\zeta \times \xi \) and \(\pi \otimes P\) and remark that by Lemma 1.1.3 the risk for a constant kernel with value \(\xi \) is independent of \(\xi \) and equal to \(\mathcal R_\zeta ^{d_{\mathcal X}}\).

Lemma 4.1.18 3 points change of measure

Let \(\mu , \nu , \xi \in \mathcal P(\mathcal X)\) and let \(E\) be an event on \(\mathcal X\). Let \(\beta _1, \beta _2 \in \mathbb {R}\). Then

\begin{align*} \mu (E) e^{\beta _1} + \nu (E^c) e^{\beta _2} \ge 1 - \xi \left\{ \log \frac{d \xi }{d \mu } {\gt} \beta _1 \right\} - \xi \left\{ \log \frac{d \xi }{d \nu } {\gt} \beta _2 \right\} \: . \end{align*}
Proof

Apply Lemma 4.1.14 twice, once to the pair \((\xi , \mu )\) with the event \(E\) and once to \((\xi , \nu )\) with \(E^c\), then sum the two inequalities and use \(\xi (E)+\xi (E^c) = 1\).

Change of measure and the moments of the log-likelihood ratio

Corollary 4.1.19 Change of measure - mean

Let \(\mu , \nu \) be two measures on \(\mathcal X\) and let \(E\) be an event on \(\mathcal X\). Let \(\beta \in \mathbb {R}\). Then

\begin{align*} \nu (E) e^{\operatorname{KL}(\mu , \nu ) + \beta } \ge \mu (E) - \mu \left\{ \log \frac{d \mu }{d \nu } - \operatorname{KL}(\mu , \nu ) {\gt} \beta \right\} \: . \end{align*}
Proof

Use Lemma 4.1.14 with the choice \(\operatorname{KL}(\mu , \nu ) + \beta \) for \(\beta \).

For \(\alpha \in (0,1)\), let \(\pi _\alpha \in \mathcal P(\{ 0,1\} )\) be the measure \((\alpha , 1 - \alpha )\). Let \(\alpha , \gamma \in (0,1)\), \(\mu , \nu \in \mathcal P(\mathcal X)\) and let \(P : \{ 0,1\} \rightsquigarrow \mathcal X\) be the kernel with \(P(0) = \mu \) and \(P(1) = \nu \) . Then for all \(\beta \in \mathbb {R}\) ,

\begin{align*} & \mathcal B_{\pi _\alpha }(\mu , \nu ) e^{\operatorname{kl}(\gamma , \alpha ) + (1 - \gamma ) R_\gamma (\mu , \nu ) + \beta } \\ & \ge \min \{ \gamma , 1 - \gamma \} - (\pi _\gamma \times \mu ^{(\gamma , \nu )})\left\{ \log \frac{d (\pi _\gamma \times \mu ^{(\gamma , \nu )})}{d (\pi _\alpha \otimes P)} {\gt} \operatorname{KL}(\pi _\gamma \times \mu ^{(\gamma , \nu )}, \pi _\alpha \otimes P) + \beta \right\} \: . \end{align*}
Proof

An application of Lemma 4.1.17 gives

\begin{align*} \mathcal B_{\pi _\alpha }(\mu , \nu ) e^{\beta } \ge \min \{ \gamma , 1 - \gamma \} - \inf _{\xi \in \mathcal P(\mathcal X)}(\pi _\gamma \times \xi )\left\{ \log \frac{d (\pi _\gamma \times \xi )}{d (\pi _\alpha \otimes P)} {\gt} \beta \right\} \: . \end{align*}

Take \(\xi = \mu ^{(\gamma , \nu )}\) and

\begin{align*} \beta & = \operatorname{KL}(\pi _\gamma \times \xi , \pi _\alpha \otimes P) + \beta ' \\ & = \operatorname{kl}(\gamma , \alpha ) + \operatorname{KL}(\pi _\gamma \times \xi , \pi _\gamma \otimes P) + \beta ' \\ & = \operatorname{kl}(\gamma , \alpha ) + (1 - \gamma ) R_\gamma (\mu , \nu ) + \beta ' \: . \end{align*}

We obtain

\begin{align*} & \mathcal B_{\pi _\alpha }(\mu , \nu ) e^{\operatorname{kl}(\gamma , \alpha ) + (1 - \gamma ) R_\gamma (\mu , \nu ) + \beta '} \\ & \ge \min \{ \gamma , 1 - \gamma \} - (\pi _\gamma \times \xi )\left\{ \log \frac{d (\pi _\gamma \times \xi )}{d (\pi _\alpha \otimes P)} {\gt} \operatorname{KL}(\pi _\gamma \times \xi , \pi _\alpha \otimes P) + \beta ' \right\} \: . \end{align*}
Lemma 4.1.21 Change of measure - variance

Let \(\mu , \nu \) be two measures on \(\mathcal X\) such that \(\mu \left[\left(\log \frac{d \mu }{d \nu }\right)^2\right] {\lt} \infty \). Let \(E\) be an event on \(\mathcal X\) and let \(\beta {\gt} 0\). Then

\begin{align*} \nu (E) e^{\operatorname{KL}(\mu , \nu ) + \sqrt{\operatorname{Var}_\mu [\log \frac{d \mu }{d \nu }]\beta }} \ge \mu (E) - \frac{1}{\beta } \: . \end{align*}
Proof

Use Lemma 4.1.14 with the choice \(\operatorname{KL}(\mu , \nu ) + \sqrt{\operatorname{Var}_\mu [\log \frac{d \mu }{d \nu }]\beta }\) for \(\beta \) and bound the probability of deviation of the log-likelihood ratio with Chebyshev’s inequality.

Lemma 4.1.22 Change of measure - c.g.f.

For \(\mu , \nu \) finite measures and \(\alpha , \beta {\gt} 0\),

\begin{align*} \mu \left\{ \log \frac{d \mu }{d \nu } {\gt} R_{1+\alpha }(\mu , \nu ) + \beta \right\} \le e^{- \alpha \beta } \: . \end{align*}
Proof

This is a Chernoff bound, using that the cumulant generating function of \(\log \frac{d\mu }{d\nu }\) under \(\mu \) has value \(\alpha R_{1+\alpha }(\mu , \nu )\) at \(\alpha \) by Lemma 2.6.7.

\begin{align*} \mu \left\{ \log \frac{d \mu }{d \nu } {\gt} R_{1+\alpha }(\mu , \nu ) + \beta \right\} & = \mu \left\{ \exp \left(\alpha \log \frac{d \mu }{d \nu }\right) {\gt} \exp \left(\alpha R_{1+\alpha }(\mu , \nu ) + \alpha \beta \right) \right\} \\ & \le e^{-\alpha R_{1+\alpha }(\mu , \nu ) - \alpha \beta } \mu \left[\left(\frac{d \mu }{d \nu }\right)^\alpha \right] \end{align*}

Then \(\mu \left[\left(\frac{d \mu }{d \nu }\right)^\alpha \right] = \nu \left[\left(\frac{d \mu }{d \nu }\right)^{1+\alpha } \right] = e^{\alpha R_{1+\alpha }(\mu , \nu )}\).

Let \(\mu , \nu \) be two finite measures on \(\mathcal X\) and let \(E\) be an event on \(\mathcal X\). Let \(\alpha ,\beta {\gt} 0\). Then

\begin{align*} \nu (E) e^{R_{1+\alpha }(\mu , \nu ) + \beta } \ge \mu (E) - e^{-\alpha \beta } \: . \end{align*}
Proof

Use Lemma 4.1.14 with the choice \(R_{1+\alpha }(\mu , \nu ) + \beta \) for \(\beta \). Then apply Lemma 4.1.22.

For \(\alpha \in (0,1)\), let \(\pi _\alpha \in \mathcal P(\{ 0,1\} )\) be the measure \((\alpha , 1 - \alpha )\). Let \(\alpha , \gamma \in (0,1)\), \(\mu , \nu \in \mathcal P(\mathcal X)\) and let \(P : \{ 0,1\} \rightsquigarrow \mathcal X\) be the kernel with \(P(0) = \mu \) and \(P(1) = \nu \) . Then for all \(\beta {\gt} 0\) and \(\varepsilon {\gt}0\) ,

\begin{align*} \mathcal B_{\pi _\alpha }(\mu , \nu ) e^{\inf _{\xi \in \mathcal P(\mathcal X)} R_{1 + \varepsilon }(\pi _\gamma \times \xi , \pi _\alpha \otimes P) + \beta } & \ge \min \{ \gamma , 1 - \gamma \} - e^{-\varepsilon \beta } \: . \end{align*}
Proof

For any \(\xi \in \mathcal P(\mathcal X)\) such that \(\pi _\gamma \times \xi \ll \pi _\alpha \otimes P\), apply Lemma 4.1.16 to the measures \(\pi _\gamma \times \xi \) and \(\pi _\alpha \otimes P\) with the choice \(R_{1 + \varepsilon }(\pi _\gamma \times \xi , \pi _\alpha \otimes P) + \beta \) for \(\beta \): the Bayes risk of the constant kernel with value \(\xi \) is \(\min \{ \gamma , 1 - \gamma \} \) (as in the proof of Lemma 4.1.17) and the deviation probability is at most \(e^{-\varepsilon \beta }\) by Lemma 4.1.22. Taking the infimum over \(\xi \) in the exponent gives the result.

Applications of the change of measure with 3 points

Lemma 4.1.25

Let \(\mu , \nu , \xi \) be three probability measures on \(\mathcal X\) and let \(E\) be an event on \(\mathcal X\). For \(\beta {\gt} 0\) ,

\begin{align*} \mu (E) e^{\operatorname{KL}(\xi , \mu ) + \sqrt{\beta \operatorname{Var}_{\xi }\left[\log \frac{d\xi }{d\mu }\right]}} + \nu (E^c) e^{\operatorname{KL}(\xi , \nu ) + \sqrt{\beta \operatorname{Var}_{\xi }\left[\log \frac{d\xi }{d\nu }\right]}} \ge 1 - \frac{2}{\beta } \: . \end{align*}
Proof

Use Lemma 4.1.18 with the choices \(\operatorname{KL}(\xi , \mu ) + \sqrt{\beta \operatorname{Var}_{\xi }\left[\log \frac{d\xi }{d\mu }\right]}\) and \(\operatorname{KL}(\xi , \nu ) + \sqrt{\beta \operatorname{Var}_{\xi }\left[\log \frac{d\xi }{d\nu }\right]}\) for \(\beta _1\) and \(\beta _2\). Then use Chebyshev’s inequality to bound the probabilities of deviation of the log-likelihood ratios.

Lemma 4.1.26

Let \(\mu , \nu , \xi \) be three probability measures on \(\mathcal X\) and let \(E\) be an event on \(\mathcal X\). Let \(\alpha , \beta \ge 0\). Then

\begin{align*} \mu (E) e^{R_{1+\alpha }(\xi , \mu ) + \beta } + \nu (E^c) e^{R_{1+\alpha }(\xi , \nu ) + \beta } \ge 1 - 2 e^{-\alpha \beta } \: . \end{align*}
Proof

Use Lemma 4.1.18 with the choices \(R_{1+\alpha }(\xi , \mu ) + \beta \) and \(R_{1+\alpha }(\xi , \nu ) + \beta \) for \(\beta _1\) and \(\beta _2\). Then apply Lemma 4.1.22.

Lemma 4.1.27

Let \(\mu , \nu \) be two probability measures on \(\mathcal X\) and let \(E\) be an event on \(\mathcal X\). Let \(\alpha {\gt} 0\). Then

\begin{align*} \mu (E) + \nu (E^c) \ge \frac{1}{2}\exp \left( - C_{1+\alpha }(\mu , \nu ) - \frac{\log 4}{\alpha }\right) \: . \end{align*}
Proof

Use Lemma 4.1.26 with \(\beta = \log (4)/\alpha \) and use that

\begin{align*} \mu (E) e^{R_{1+\alpha }(\xi , \mu ) + \beta } + \nu (E^c) e^{R_{1+\alpha }(\xi , \nu ) + \beta } \le (\mu (E) + \nu (E^c)) e^{\max \{ R_{1+\alpha }(\xi , \mu ), R_{1+\alpha }(\xi , \nu )\} + \beta } \: . \end{align*}

This gives \(\mu (E) + \nu (E^c) \ge \frac{1}{2}\exp \left(-\max \{ R_{1+\alpha }(\xi , \mu ), R_{1+\alpha }(\xi , \nu )\} - \frac{\log 4}{\alpha }\right)\) for every \(\xi \in \mathcal P(\mathcal X)\), and taking the infimum of \(\max \{ R_{1+\alpha }(\xi , \mu ), R_{1+\alpha }(\xi , \nu )\} \) over \(\xi \) yields \(C_{1+\alpha }(\mu , \nu )\).

Product spaces

Let \(\mu , \nu \) be two probability measures on \(\mathcal X\), let \(n \in \mathbb {N}\) and let \(E\) be an event on \(\mathcal X^n\). For all \(\alpha {\gt} 0\),

\begin{align*} \mu ^{\otimes n}(E) + \nu ^{\otimes n}(E^c) \ge \frac{1}{2}\exp \left( - n C_{1+\frac{\alpha }{\sqrt{n}}}(\mu , \nu ) - \frac{\log 4}{\alpha }\sqrt{n}\right) \: . \end{align*}
Proof

Use Lemma 2.6.25 in Lemma 4.1.27. TODO: add a lemma for tensorization of the Chernoff divergence.

Let \(\mu , \nu \) be two probability measures on \(\mathcal X\) and let \((E_n)_{n \in \mathbb {N}}\) be events on \(\mathcal X^n\). For all \(\gamma \in (0,1)\),

\begin{align*} \limsup _{n \to +\infty } \frac{1}{n}\log \frac{1}{\gamma \mu ^{\otimes n}(E_n) + (1 - \gamma )\nu ^{\otimes n}(E_n^c)} \le C_1(\mu , \nu ) \: . \end{align*}
Proof

Let \(\xi \) be a probability measure on \(\mathcal X\) and \(\beta {\gt} 0\). By Corollary 4.1.19,

\begin{align*} \mu ^{\otimes n}(E_n) e^{\operatorname{KL}(\xi ^{\otimes n}, \mu ^{\otimes n}) + n\beta } & \ge \xi ^{\otimes n}(E_n) - \xi ^{\otimes n}\left\{ \log \frac{d \xi ^{\otimes n}}{d \mu ^{\otimes n}} - \operatorname{KL}(\xi ^{\otimes n}, \mu ^{\otimes n}) {\gt} n\beta \right\} \: , \\ \nu ^{\otimes n}(E_n^c) e^{\operatorname{KL}(\xi ^{\otimes n}, \nu ^{\otimes n}) + n\beta } & \ge \xi ^{\otimes n}(E_n^c) - \xi ^{\otimes n}\left\{ \log \frac{d \xi ^{\otimes n}}{d \nu ^{\otimes n}} - \operatorname{KL}(\xi ^{\otimes n}, \nu ^{\otimes n}) {\gt} n\beta \right\} \: . \end{align*}

We sum both inequalities with weights \(\gamma \) and \(1-\gamma \) respectively and use that each \(\operatorname{KL}\) on the left is less than their max, as well as \(\xi ^{\otimes n}(E_n) + \xi ^{\otimes n}(E_n^c) = 1\).

\begin{align*} & e^{n\beta } (\gamma \mu ^{\otimes n}(E_n) + (1-\gamma )\nu ^{\otimes n}(E_n^c)) e^{\max \{ \operatorname{KL}(\xi ^{\otimes n}, \mu ^{\otimes n}), \operatorname{KL}(\xi ^{\otimes n}, \nu ^{\otimes n})\} } \\ & \ge \min \{ \gamma , 1-\gamma \} - \gamma \xi ^{\otimes n}\left\{ \log \frac{d \xi ^{\otimes n}}{d \mu ^{\otimes n}} - \operatorname{KL}(\xi ^{\otimes n}, \mu ^{\otimes n}) {\gt} n\beta \right\} - (1 - \gamma )\xi ^{\otimes n}\left\{ \log \frac{d \xi ^{\otimes n}}{d \nu ^{\otimes n}} - \operatorname{KL}(\xi ^{\otimes n}, \nu ^{\otimes n}) {\gt} n\beta \right\} \: . \end{align*}

Let \(p_{n,\mu }(\beta )\) and \(p_{n, \nu }(\beta )\) be the two probabilities on the right hand side. By the law of large numbers, both tend to 0 when \(n\) tends to \(+\infty \). In particular, for \(n\) large enough, the right hand side is positive and we can take logarithms on both sides. We also use the tensorization of \(\operatorname{KL}\) (Theorem 2.4.17).

\begin{align*} & n \max \{ \operatorname{KL}(\xi , \mu ), \operatorname{KL}(\xi , \nu )\} \\ & \ge \log \frac{1}{\gamma \mu ^{\otimes n}(E_n) + (1 - \gamma )\nu ^{\otimes n}(E_n^c)} + \log (\min \{ \gamma , 1-\gamma \} - \gamma p_{n, \mu }(\beta ) - (1 - \gamma ) p_{n, \nu }(\beta )) - n\beta \end{align*}

For \(n \to +\infty \),

\begin{align*} \max \{ \operatorname{KL}(\xi , \mu ), \operatorname{KL}(\xi , \nu )\} \ge \limsup _{n \to + \infty }\frac{1}{n}\log \frac{1}{\gamma \mu ^{\otimes n}(E_n) + (1 - \gamma )\nu ^{\otimes n}(E_n^c)} - \beta \end{align*}

Since \(\beta {\gt} 0\) is arbitrary, we can take a supremum over \(\beta \) on the right. Taking finally the infimum over \(\xi \in \mathcal P(\mathcal X)\) on the left gives \(C_1(\mu , \nu )\) and concludes the proof.

4.2 Sample complexity

We study the sample complexity of binary hypothesis testing.

The sample complexity of simple binary hypothesis testing with prior \((\pi , 1 - \pi ) \in \mathcal P(\{ 0, 1\} )\) at risk level \(\delta \in \mathbb {R}_{+, \infty }\) is

\begin{align*} n(\mu , \nu , \pi , \delta ) = \min \{ n \in \mathbb {N} \mid B_\pi (\mu ^{\otimes n}, \nu ^{\otimes n}) \le \delta \} \: . \end{align*}

This is the sample complexity \(n_\xi ^P(\delta )\) of Definition 9 specialized to simple binary hypothesis testing.

Lemma 4.2.1

For \(\delta \ge \min \{ \pi , 1 - \pi \} \), the sample complexity of simple binary hypothesis testing is \(n(\mu , \nu , \pi , \delta ) = 0\) .

Proof

Without any sample, the two hypotheses generate the same (trivial) data, so the Bayes risk equals the risk of guessing the hypothesis with the larger prior weight: \(B_\pi (\mu ^{\otimes 0}, \nu ^{\otimes 0}) = \min \{ \pi , 1 - \pi \} \le \delta \), hence \(n(\mu , \nu , \pi , \delta ) = 0\).

The sample complexity of simple binary hypothesis testing satisfies \(n(\mu , \nu , \pi , \delta ) \le n_0\) , with \(n_0\) the smallest natural number such that

\begin{align*} \log \frac{1}{\delta } \le \sup _{\alpha \in (0,1)} \left( n_0 (1 - \alpha )R_\alpha (\mu , \nu ) + \alpha \log \frac{1}{\pi } + (1 - \alpha )\log \frac{1}{1 - \pi } \right) \end{align*}
Proof

It suffices to show that \(B_\pi (\mu ^{\otimes n_0}, \nu ^{\otimes n_0}) \le \delta \) . By Corollary 4.1.2, for all \(\alpha \in (0,1)\),

\begin{align*} B_\pi (\mu ^{\otimes n_0}, \nu ^{\otimes n_0}) & \le \pi ^{\alpha } (1 - \pi )^{1 - \alpha }e^{- (1 - \alpha )R_\alpha (\mu ^{\otimes n_0}, \nu ^{\otimes n_0})} \: . \end{align*}

The Rényi divergence tensorizes: \(R_\alpha (\mu ^{\otimes n_0}, \nu ^{\otimes n_0}) = n_0 R_\alpha (\mu , \nu )\) .

\begin{align*} B_\pi (\mu ^{\otimes n_0}, \nu ^{\otimes n_0}) & \le \exp \left(- \left( n_0 (1 - \alpha )R_\alpha (\mu , \nu ) + \alpha \log \frac{1}{\pi } + (1 - \alpha )\log \frac{1}{1 - \pi } \right) \right) \: . \end{align*}

By definition of \(n_0\), the infimum over \(\alpha \) of the right-hand side is at most \(\delta \).
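
In particular, with the uniform prior \(\pi = 1/2\) the two prior terms sum to \(\log 2\) for every \(\alpha \), and \(\sup _{\alpha \in (0,1)} (1 - \alpha ) R_\alpha (\mu , \nu ) = C_1(\mu , \nu )\) (Lemma 2.8.4, as in the proof of Lemma 4.1.3). If \(\delta {\lt} 1/2\) and \(C_1(\mu , \nu ) \in (0, +\infty )\), the condition on \(n_0\) becomes \(\log \frac{1}{2\delta } \le n_0 C_1(\mu , \nu )\) and the lemma above gives

\begin{align*} n(\mu , \nu , 1/2, \delta ) \le \left\lceil \frac{\log \frac{1}{2\delta }}{C_1(\mu , \nu )} \right\rceil \: . \end{align*}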

For \(\delta \le \pi \le 1/2\), the sample complexity of simple binary hypothesis testing satisfies

\begin{align*} n(\mu , \nu , \pi , \delta ) \ge \frac{\log \frac{\pi }{2\delta }}{R_{1/2}(\mu , \nu )} \: . \end{align*}
Proof

By Theorem 4.1.11,

\begin{align*} \log \frac{\pi }{B_\pi (\mu ^{\otimes n}, \nu ^{\otimes n})} \le R_{1/2}(\mu ^{\otimes n}, \nu ^{\otimes n}) + \log 2 \: . \end{align*}

The Rényi divergence tensorizes (Theorem 2.6.25): \(R_{1/2}(\mu ^{\otimes n}, \nu ^{\otimes n}) = n R_{1/2}(\mu , \nu )\) . We finally use that \(B_\pi (\mu ^{\otimes n}, \nu ^{\otimes n}) \le \delta \) for \(n = n(\mu , \nu , \pi , \delta )\) .
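
To make the rates concrete, consider again two Gaussian measures with equal variance, \(\mu = \mathcal N(0, \sigma ^2)\) and \(\nu = \mathcal N(\theta , \sigma ^2)\) with \(\theta \ne 0\), for which \(R_\alpha (\mu , \nu ) = \frac{\alpha \theta ^2}{2\sigma ^2}\) (a standard computation), hence \(R_{1/2}(\mu , \nu ) = \frac{\theta ^2}{4\sigma ^2}\) and \(C_1(\mu , \nu ) = \frac{\theta ^2}{8\sigma ^2}\). With the uniform prior \(\pi = 1/2\) and \(\delta {\lt} 1/2\), the lower bound just proved and the uniform-prior form of the upper bound noted above give

\begin{align*} \frac{4 \sigma ^2}{\theta ^2} \log \frac{1}{4 \delta } \le n(\mu , \nu , 1/2, \delta ) \le \left\lceil \frac{8 \sigma ^2}{\theta ^2} \log \frac{1}{2 \delta } \right\rceil \: , \end{align*}

so the two bounds match up to a constant factor: \(n(\mu , \nu , 1/2, \delta )\) grows like \(\frac{\sigma ^2}{\theta ^2} \log \frac{1}{\delta }\) as \(\delta \to 0\).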

For \(\delta \le \min \{ \pi , 1 - \pi \} \), the sample complexity of simple binary hypothesis testing satisfies

\begin{align*} n(\mu , \nu , \pi , \delta ) \ge \frac{h_2(\pi ) - h_2(\delta )}{\operatorname{JS}_\pi (\mu , \nu )} \: , \end{align*}

in which \(h_2: x \mapsto x\log \frac{1}{x} + (1 - x)\log \frac{1}{1 - x}\) is the binary entropy function.

Proof

We start from Theorem 4.1.10.

\begin{align*} \operatorname{JS}_\pi (\mu ^{\otimes n}, \nu ^{\otimes n}) \ge h_2(\pi ) - h_2(B_\pi (\mu ^{\otimes n}, \nu ^{\otimes n})) \: . \end{align*}

By Lemma 2.9.6, \(\operatorname{JS}_\pi (\mu ^{\otimes n}, \nu ^{\otimes n}) \le n \operatorname{JS}_\pi (\mu , \nu )\). The final result is obtained by using that \(B_\pi (\mu ^{\otimes n}, \nu ^{\otimes n}) \le \delta \) for \(n = n(\mu , \nu , \pi , \delta )\).
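
For instance, with the uniform prior \(\pi = 1/2\) (so that \(h_2(\pi ) = \log 2\)) and \(\delta \le 1/2\), this lower bound reads

\begin{align*} n(\mu , \nu , 1/2, \delta ) \ge \frac{\log 2 - h_2(\delta )}{\operatorname{JS}_{1/2}(\mu , \nu )} \: . \end{align*}

Unlike the bound of the previous lemma, which grows like \(\log \frac{1}{\delta }\) as \(\delta \to 0\), this one remains bounded: it is at most \(\log 2 / \operatorname{JS}_{1/2}(\mu , \nu )\).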

4.3 Sequential testing

For \(\mu \in \mathcal M(\mathcal X)\), let \(\mu ^{\otimes \mathbb {N}}\) be the corresponding product measure on \(\mathcal X^{\mathbb {N}}\). For \(n \in \mathbb {N}\), let \(\mathcal F_n\) be the \(\sigma \)-algebra of \(\mathcal X^{\mathbb {N}}\) generated by the projection to the first \(n\) coordinates, and denote that projection by \(\pi _{[n]}\). Then \(\mathcal F = (\mathcal F_n)_{n \in \mathbb {N}}\) is a filtration. Let \(\tau \) be a stopping time with respect to \(\mathcal F\). We write \(\mu _\tau \) for the restriction of the measure \(\mu ^{\otimes \mathbb {N}}\) to \(\mathcal F_\tau \), the \(\sigma \)-algebra generated by the stopping time \(\tau \).

In sequential testing, two components need to be designed for estimation: the stopping time \(\tau \) and the \(\mathcal F_\tau \)-measurable estimator that returns an answer given the data available at \(\tau \). We will be interested in two performance metrics:

  • The risk, where the data-generating kernel is \(P_\tau \), which takes the values \(\mu _\tau \) and \(\nu _\tau \).

  • The stopping time \(\tau \) (through its mean, other moments, or high probability bounds).

The idea is that a good estimator stops quickly with low risk.

Lemma 4.3.1

For \(\mu , \nu \in \mathcal P(\mathcal X)\) with \(\mu \ll \nu \) and \(n \in \mathbb {N}\), \(\nu ^{\otimes \mathbb {N}}_{| \mathcal F_n}\)-almost surely, \(\frac{d \mu ^{\otimes \mathbb {N}}_{| \mathcal F_n}}{d \nu ^{\otimes \mathbb {N}}_{| \mathcal F_n}}(x) = \prod _{m=1}^n \frac{d \mu }{d \nu }(x_m)\).

Proof

By Lemma B.3.2, \(\frac{d \mu ^{\otimes \mathbb {N}}_{| \mathcal F_n}}{d \nu ^{\otimes \mathbb {N}}_{| \mathcal F_n}}(x) = \frac{d \pi _{[n]*}\mu ^{\otimes \mathbb {N}}}{d \pi _{[n]*}\nu ^{\otimes \mathbb {N}}}(\pi _{[n]}(x))\) a.s. The pushforwards simplify: \(\pi _{[n]*}\mu ^{\otimes \mathbb {N}} = \mu ^{\otimes n}\) and \(\pi _{[n]*}\nu ^{\otimes \mathbb {N}} = \nu ^{\otimes n}\), and \(\frac{d \mu ^{\otimes n}}{d \nu ^{\otimes n}}(x_1, \ldots , x_n) = \prod _{m=1}^n \frac{d \mu }{d \nu }(x_m)\) since the Radon-Nikodym derivative of a product measure is the product of the derivatives.

Lemma 4.3.2

For \(\mu , \nu \in \mathcal P(\mathcal X)\) with \(\mu \ll \nu \), \(\nu _\tau \)-almost surely, \(\frac{d \mu _\tau }{d \nu _\tau }(x) = \prod _{n=1}^\tau \frac{d \mu }{d \nu }(x_n)\).

Proof

It suffices to show the equality of their integrals on \(\mathcal F_\tau \)-measurable sets. Let \(E\) be such a set. Then \(E = \sum _{n=1}^\infty E \cap \{ \tau = n\} \) . For \(n \in \mathbb {N}\), let \(E_n = E \cap \{ \tau = n\} \).

We show that \(\nu _\tau \left[\mathbb {I}(E_n) \frac{d \mu _\tau }{d \nu _\tau }\right] = \nu _\tau \left[x \mapsto \mathbb {I}(E_n)\prod _{m=1}^n \frac{d \mu }{d \nu }(x_m)\right]\). The result for \(E\) then follows from the monotone convergence theorem. The product \(\prod _{m=1}^n \frac{d \mu }{d \nu }(x_m)\) is equal to \(\frac{d \mu ^{\otimes \mathbb {N}}_{| \mathcal F_n}}{d \nu ^{\otimes \mathbb {N}}_{| \mathcal F_n}}\) by Lemma 4.3.1.

\(E_n\) is \(\mathcal F_n\)-measurable, hence (using Lemma B.3.1) TODO

Theorem 4.3.3

For \(\mu , \nu \in \mathcal P(\mathcal X)\), \(\operatorname{KL}(\mu _\tau , \nu _\tau ) = \mu [\tau ] \operatorname{KL}(\mu , \nu )\).

Proof

TODO: need Wald’s first identity.
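
As a sanity check, for a constant stopping time \(\tau = n\) we have \(\mathcal F_\tau = \mathcal F_n\) and \(\mu [\tau ] = n\), so (identifying the restricted measures with \(\mu ^{\otimes n}\) and \(\nu ^{\otimes n}\) as in Lemma 4.3.1) the statement reduces to the tensorization of \(\operatorname{KL}\) (Theorem 2.4.17):

\begin{align*} \operatorname{KL}(\mu ^{\otimes \mathbb {N}}_{| \mathcal F_n}, \nu ^{\otimes \mathbb {N}}_{| \mathcal F_n}) = \operatorname{KL}(\mu ^{\otimes n}, \nu ^{\otimes n}) = n \operatorname{KL}(\mu , \nu ) \: . \end{align*}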

Lemma 4.3.4

For \(\mu , \nu \) two probability measures on \(\mathcal X\) and \(\alpha \in (0,1)\), \((\mu ^{(\alpha , \nu )})_\tau = (\mu _\tau )^{(\alpha , \nu _\tau )}\).

Proof

This is a guess, I did not check it. TODO

For \(\mu , \nu \) two probability measures on \(\mathcal X\) and \(\alpha \in (0,1)\), \(R_\alpha (\mu _\tau , \nu _\tau ) = \mu ^{(\alpha , \nu )}[\tau ] R_\alpha (\mu , \nu )\).

Proof

From Corollary 2.6.18, \((1 - \alpha ) R_\alpha (\mu , \nu ) = \alpha \operatorname{KL}(\mu ^{(\alpha , \nu )}, \mu ) + (1 - \alpha )\operatorname{KL}(\mu ^{(\alpha , \nu )}, \nu )\), hence we can use Theorem 4.3.3 to write

\begin{align*} \mu ^{(\alpha , \nu )}[\tau ] R_\alpha (\mu , \nu ) & = \frac{1}{1 - \alpha }\left(\alpha \mu ^{(\alpha , \nu )}[\tau ]\operatorname{KL}(\mu ^{(\alpha , \nu )}, \mu ) + (1 - \alpha )\mu ^{(\alpha , \nu )}[\tau ]\operatorname{KL}(\mu ^{(\alpha , \nu )}, \nu ) \right) \\ & = \frac{1}{1 - \alpha }\left(\alpha \operatorname{KL}((\mu ^{(\alpha , \nu )})_\tau , \mu _\tau ) + (1 - \alpha )\operatorname{KL}((\mu ^{(\alpha , \nu )})_\tau , \nu _\tau ) \right) \end{align*}

It then suffices to remark that \((\mu ^{(\alpha , \nu )})_\tau = (\mu _\tau )^{(\alpha , \nu _\tau )}\), thanks to Lemma 4.3.4.

TODO: if that lemma does not hold, we still have the inequality \(R_\alpha (\mu _\tau , \nu _\tau ) \le \mu ^{(\alpha , \nu )}[\tau ] R_\alpha (\mu , \nu )\) by Lemma 2.6.19.