4 The risk of simple binary hypothesis testing
The goal of this section is to obtain bounds on the risk of simple binary hypothesis testing \(B_\pi (\mu ^{\otimes n}, \nu ^{\otimes n})\), or on \(\inf _{E} \max \{ \mu ^{\otimes n}(E), \nu ^{\otimes n}(E^c)\} \), that show how those quantities depend on the number of samples \(n \in \mathbb {N}\).
The main idea is to relate these to divergences that tensorize, or for which we have tensorization inequalities. Those divergences are \(\operatorname{KL}\), \(R_\alpha \), \(C_\alpha \), \(\operatorname{JS}_\alpha \) (and, more generally, the mutual information if we want to consider more than two hypotheses).
4.1 Bounding the Bayes risk with tensorizing divergences
4.1.1 Upper bound with the Chernoff information
Let \(\zeta \) be a measure such that \(\mu \ll \zeta \) and \(\nu \ll \zeta \). Let \(p = \frac{d \mu }{d\zeta }\) and \(q = \frac{d \nu }{d\zeta }\). For \(\alpha \in (0,1)\), for \(g_\alpha (x) = \min \{ (\alpha -1)x, \alpha x\} \),
For \(\alpha \in (0,1)\),
Use \(g_\alpha (x) \le 0\) in Lemma 4.1.1.
For probability measures,
By definition, since \(\mu \) and \(\nu \) are probability measures, \(1 - \operatorname{TV}(\mu , \nu ) = \mathcal B_{(1,1)}(\mu , \nu )\). Then apply Lemma 4.1.3.
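As an illustration of the quantities at play (this example is not part of the development above, and uses the standard closed form for Rényi divergences between Gaussians with equal variance), take \(\mu = \mathcal N(0, \sigma ^2)\) and \(\nu = \mathcal N(\theta , \sigma ^2)\). Then
\[
R_\alpha (\mu , \nu ) = \frac{\alpha \theta ^2}{2\sigma ^2} \quad \text{for } \alpha \in (0,1),
\qquad
\sup _{\alpha \in (0,1)} (1-\alpha ) R_\alpha (\mu , \nu ) = \frac{\theta ^2}{8\sigma ^2},
\]
with the supremum attained at \(\alpha = 1/2\); the right hand side is the usual Chernoff information between these two Gaussians. Combined with the tensorization \(R_\alpha (\mu ^{\otimes n}, \nu ^{\otimes n}) = n R_\alpha (\mu , \nu )\) used in Section 4.2, this is the exponential rate in \(n\) one expects from the upper bounds of this subsection.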
4.1.2 Lower bounds using the data-processing inequality
The ideas behind the next two theorems are the same, up to the order of the arguments of a Kullback-Leibler divergence. In simple binary hypothesis testing, the hidden parameter and the data are generated from a distribution \(\pi \otimes P \in \mathcal P(\Theta \times \mathcal X)\). The reason an estimator can achieve low risk is that \(P\) depends on the parameter: if it did not, the Bayes risk with prior \((\alpha , 1 - \alpha )\) would be \(\min \{ \alpha , 1 - \alpha \} \). We therefore compare \(\pi \otimes P\) to a product distribution \(\pi ' \times \xi \) for \(\pi ' \in \mathcal P(\Theta )\) and \(\xi \in \mathcal P(\mathcal X)\). To do so, we use the data-processing inequality for \(\operatorname{KL}\), starting either from \(\operatorname{KL}(\pi \otimes P, \pi ' \times \xi )\) or from \(\operatorname{KL}(\pi ' \times \xi , \pi \otimes P)\) (the two theorems correspond to the two choices). The data-processing inequality shows that these divergences dominate the divergences between the laws of the loss in the two corresponding estimation tasks.
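For the special case of two hypotheses, the comparison to product distributions can be made explicit. With prior \(\pi = (\alpha , 1 - \alpha )\) on \(\Theta = \{ 0,1\} \) and \(P(0) = \mu \), \(P(1) = \nu \), a standard computation (stated here for orientation, assuming \(\operatorname{JS}_\alpha \) denotes the skewed Jensen-Shannon divergence \(\alpha \operatorname{KL}(\mu , \alpha \mu + (1-\alpha )\nu ) + (1-\alpha ) \operatorname{KL}(\nu , \alpha \mu + (1-\alpha )\nu )\)) gives
\[
\inf _{\xi \in \mathcal P(\mathcal X)} \operatorname{KL}(\pi \otimes P, \pi \times \xi )
= \inf _{\xi \in \mathcal P(\mathcal X)} \big( \alpha \operatorname{KL}(\mu , \xi ) + (1-\alpha ) \operatorname{KL}(\nu , \xi ) \big)
= \operatorname{JS}_\alpha (\mu , \nu ) ,
\]
with the infimum attained at the mixture \(\xi = \alpha \mu + (1-\alpha )\nu \). This is the kind of quantity that appears in the lower bounds below.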
We start with some preparatory lemmas.
Let \(\pi , \xi \in \mathcal P(\Theta )\) and \(P, Q : \Theta \rightsquigarrow \mathcal X\). Suppose that the loss \(\ell '\) takes values in \([0,1]\). Then
This is Lemma 2.3.63 specialized to the Kullback-Leibler divergence.
Let \(\pi , \xi \in \mathcal P(\Theta )\) and \(P, Q : \Theta \rightsquigarrow \mathcal X\). Suppose that the loss \(\ell '\) takes values in \([0,1]\). Then
Use \(\operatorname{KL}(\pi \otimes Q, \xi \otimes P) = \operatorname{KL}(\pi , \xi ) + \operatorname{KL}(\pi \otimes Q, \pi \otimes P)\) in Lemma 4.1.5.
Let \(\alpha , \beta \in (0, 1)\). Let \(P, Q : \{ 0,1\} \rightsquigarrow \mathcal X\). We write \(\pi _\alpha \) for the probability measure on \(\{ 0,1\} \) with \(\pi _\alpha (\{ 0\} ) = \alpha \). Then
Apply Lemma 4.1.5.
Let \(\mu , \nu \in \mathcal P(\mathcal X)\) and let \(\alpha , \beta \in (0, 1)\). Let \(P, Q : \{ 0,1\} \rightsquigarrow \mathcal X\). We write \(\pi _\alpha \) for the probability measure on \(\{ 0,1\} \) with \(\pi _\alpha (\{ 0\} ) = \alpha \). Then
Apply Corollary 4.1.6.
Let \(\mu , \nu , \xi \in \mathcal P(\mathcal X)\) and let \(\alpha , \beta \in (0, 1)\). Let \(P : \{ 0,1\} \rightsquigarrow \mathcal X\) be the kernel with \(P(0) = \mu \) and \(P(1) = \nu \). We write \(\pi _\alpha \) for the probability measure on \(\{ 0,1\} \) with \(\pi _\alpha (\{ 0\} ) = \alpha \). Let \(\bar{\beta } = \min \{ \beta , 1 - \beta \} \). Then
Let \(\mu , \nu \in \mathcal P(\mathcal X)\) and let \(\alpha \in (0, 1)\).
in which \(h_2: x \mapsto x\log \frac{1}{x} + (1 - x)\log \frac{1}{1 - x}\) is the binary entropy function.
Apply Lemma 2.9.5 and then Lemma 4.1.9 with \(\beta = 1/2\) to get
Note that \(\beta = 1/2\) is the value that results in the best bound: the lower bound on \(\operatorname{JS}_\alpha (\mu , \nu )\) from some \(\beta \le 1/2\) would be
For \(\alpha , \beta \in (0, 1/2)\),
As a consequence,
In particular, \(\log \frac{\alpha }{B_\alpha (\mu , \nu )} \le R_{1/2}(\mu , \nu ) + \log 2\).
By Lemma 2.6.20 and then Lemma 4.1.9,
The first particular case is obtained by using \(\alpha \le 1/2\) and \(B_\pi (\mu , \nu ) \ge 0\). The second one further sets \(\beta = 1/2\).
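To make the use of this bound concrete: applying \(\log \frac{\alpha }{B_\alpha (\mu , \nu )} \le R_{1/2}(\mu , \nu ) + \log 2\) to \(\mu ^{\otimes n}\) and \(\nu ^{\otimes n}\) and using the tensorization \(R_{1/2}(\mu ^{\otimes n}, \nu ^{\otimes n}) = n R_{1/2}(\mu , \nu )\) (Theorem 2.6.25, as in Section 4.2) gives, for \(\alpha \in (0, 1/2)\),
\[
B_\alpha (\mu ^{\otimes n}, \nu ^{\otimes n}) \ge \frac{\alpha }{2}\, e^{- R_{1/2}(\mu ^{\otimes n}, \nu ^{\otimes n})} = \frac{\alpha }{2}\, e^{- n R_{1/2}(\mu , \nu )} ,
\]
that is, the Bayes risk can decay at most exponentially in \(n\), at a rate controlled by \(R_{1/2}(\mu , \nu )\).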
TODO: move or remove
Let \(\mu , \nu \) be two probability measures on \(\mathcal X\) and \(E\) an event. Let \(\alpha \in (0,1)\). Then
Let \(\mu _E\) and \(\nu _E\) be the two Bernoulli distributions with respective means \(\mu (E)\) and \(\nu (E)\). By Lemma 2.6.9, \(R_\alpha (\mu , \nu ) \ge R_\alpha (\mu _E, \nu _E)\). That divergence is
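Spelled out with the usual closed form for Rényi divergences between measures on a two-point space (a reconstruction under the standard definition of \(R_\alpha \)), this is
\[
R_\alpha (\mu _E, \nu _E) = \frac{1}{\alpha - 1} \log \Big( \mu (E)^\alpha \, \nu (E)^{1-\alpha } + \mu (E^c)^\alpha \, \nu (E^c)^{1-\alpha } \Big) .
\]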
Let \(\mu , \nu \) be two probability measures on \(\mathcal X\) and \(E\) an event. Then
4.1.3 Lower bounds using the change of measure lemma
The main tool of this section is the next lemma, which relates the probabilities of an event under two measures to the deviations of the log-likelihood ratio between them.
Let \(\mu , \nu \) be two measures on \(\mathcal X\) with \(\mu \ll \nu \) and let \(E\) be an event on \(\mathcal X\). Let \(\beta \in \mathbb {R}\). Then
Let \(\mu , \nu \) be two measures on \(\mathcal X\) with \(\mu \ll \nu \) and let \(f : \mathcal X \to [0,1]\) be a measurable function. Let \(\beta \in \mathbb {R}\). Then
The first and second inequalities use \(f \ge 0\) and the last inequality uses \(f \le 1\).
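For orientation, the event version (Lemma 4.1.14) can be recovered in one line, assuming it is stated in the usual way: splitting \(E\) according to whether the likelihood ratio exceeds \(e^\beta \),
\[
\mu (E) = \int _{E \cap \{ \frac{d\mu }{d\nu } \le e^\beta \} } \frac{d\mu }{d\nu } \, d\nu + \mu \Big(E \cap \Big\{ \log \frac{d\mu }{d\nu } > \beta \Big\} \Big)
\le e^\beta \, \nu (E) + \mu \Big(\log \frac{d\mu }{d\nu } > \beta \Big) .
\]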
Consider an estimation problem with loss \(\ell ' : \mathcal Y \times \mathcal Z \to [0,1]\). Let \(\pi , \zeta \in \mathcal P(\Theta )\) and \(P, Q : \Theta \rightsquigarrow \mathcal X\) be such that \(\zeta \otimes Q \ll \pi \otimes P\). Then for all \(\beta \in \mathbb {R}\),
Let \(\hat{y}_B\) be a Bayes estimator for \((P, y, \ell ')\). If no such estimator exists, the proof can be adapted by taking an estimator with risk \(\varepsilon \)-close to the Bayes risk and then minimizing over \(\varepsilon > 0\). Apply Lemma 4.1.15 to the function \((\theta , x) \mapsto \hat{y}_B(x)\left[ z \mapsto \ell '(y(\theta ), z) \right]\), which takes values in \([0,1]\), and to the measures \(\pi \otimes P\) and \(\zeta \otimes Q\).
Consider an estimation problem with loss \(\ell ' : \mathcal Y \times \mathcal Z \to [0,1]\). Let \(\pi , \zeta \in \mathcal P(\Theta )\) and \(P : \Theta \rightsquigarrow \mathcal X\). Then for all \(\beta \in \mathbb {R}\),
in which the infimum over \(\xi \) is restricted to probability measures such that \(\zeta \times \xi \ll \pi \otimes P\) and \(d_{\mathcal X} : \Theta \rightsquigarrow *\) is the discard kernel.
Let \(\mu , \nu , \xi \in \mathcal P(\mathcal X)\) and let \(E\) be an event on \(\mathcal X\). Let \(\beta _1, \beta _2 \in \mathbb {R}\). Then
Two applications of Lemma 4.1.14, then sum them and use \(\xi (E)+\xi (E^c) = 1\).
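Spelled out (a sketch of the computation, assuming the usual form of Lemma 4.1.14 and the relevant absolute continuity, or the conventions used for \(\frac{d\xi }{d\mu }\) and \(\frac{d\xi }{d\nu }\)): the two applications give
\[
\xi (E) \le e^{\beta _1} \mu (E) + \xi \Big(\log \frac{d\xi }{d\mu } > \beta _1\Big),
\qquad
\xi (E^c) \le e^{\beta _2} \nu (E^c) + \xi \Big(\log \frac{d\xi }{d\nu } > \beta _2\Big),
\]
and summing them, the left-hand sides add up to \(1\), so \(e^{\beta _1} \mu (E) + e^{\beta _2} \nu (E^c)\) plus the two deviation probabilities is at least \(1\).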
Change of measure and the moments of the log-likelihood ratio
Let \(\mu , \nu \) be two measures on \(\mathcal X\) and let \(E\) be an event on \(\mathcal X\). Let \(\beta \in \mathbb {R}\). Then
Use Lemma 4.1.14 with the choice \(\operatorname{KL}(\mu , \nu ) + \beta \) for \(\beta \).
For \(\alpha \in (0,1)\), let \(\pi _\alpha \in \mathcal P(\{ 0,1\} )\) be the measure \((\alpha , 1 - \alpha )\). Let \(\alpha , \gamma \in (0,1)\), \(\mu , \nu \in \mathcal P(\mathcal X)\) and let \(P : \{ 0,1\} \rightsquigarrow \mathcal X\) be the kernel with \(P(0) = \mu \) and \(P(1) = \nu \). Then for all \(\beta \in \mathbb {R}\),
An application of Lemma 4.1.17 gives
Take \(\xi = \mu ^{(\gamma , \nu )}\) and
We obtain
Let \(\mu , \nu \) be two measures on \(\mathcal X\) such that \(\mu \left[\left(\log \frac{d \mu }{d \nu }\right)^2\right] < \infty \). Let \(E\) be an event on \(\mathcal X\) and let \(\beta > 0\). Then
Use Lemma 4.1.14 with the choice \(\operatorname{KL}(\mu , \nu ) + \sqrt{\operatorname{Var}_\mu [\log \frac{d \mu }{d \nu }]\beta }\) for \(\beta \) and bound the probability of deviation of the log-likelihood ratio with Chebyshev’s inequality.
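The Chebyshev step, spelled out (a sketch, assuming \(\mu \) and \(\nu \) are probability measures, so that the log-likelihood ratio has mean \(\operatorname{KL}(\mu , \nu )\) under \(\mu \)): write \(V = \operatorname{Var}_\mu \big[\log \frac{d\mu }{d\nu }\big]\). For \(V > 0\) (the case \(V = 0\) being immediate),
\[
\mu \Big(\log \frac{d\mu }{d\nu } > \operatorname{KL}(\mu , \nu ) + \sqrt{V \beta }\Big) \le \frac{V}{V \beta } = \frac{1}{\beta } .
\]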
For \(\mu , \nu \) finite measures and \(\alpha , \beta > 0\),
This is a Chernoff bound, using that the cumulant generating function of \(\log \frac{d\mu }{d\nu }\) under \(\mu \) has value \(\alpha R_{1+\alpha }(\mu , \nu )\) at \(\alpha \) by Lemma 2.6.7.
Then \(\mu \left[\left(\frac{d \mu }{d \nu }\right)^\alpha \right] = \nu \left[\left(\frac{d \mu }{d \nu }\right)^{1+\alpha } \right] = e^{\alpha R_{1+\alpha }(\mu , \nu )}\).
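For reference, the Markov step behind the Chernoff bound reads as follows (a sketch; the deviation threshold \(R_{1+\alpha }(\mu , \nu ) + \beta \) is our reading of the statement): for any \(t \in \mathbb {R}\),
\[
\mu \Big(\log \frac{d\mu }{d\nu } > t\Big) = \mu \Big(\Big(\frac{d\mu }{d\nu }\Big)^\alpha > e^{\alpha t}\Big) \le e^{-\alpha t} \, \mu \Big[\Big(\frac{d\mu }{d\nu }\Big)^\alpha \Big] = e^{\alpha R_{1+\alpha }(\mu , \nu ) - \alpha t} ,
\]
and the choice \(t = R_{1+\alpha }(\mu , \nu ) + \beta \) makes the right hand side equal to \(e^{-\alpha \beta }\).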
Let \(\mu , \nu \) be two finite measures on \(\mathcal X\) and let \(E\) be an event on \(\mathcal X\). Let \(\alpha , \beta > 0\). Then
For \(\alpha \in (0,1)\), let \(\pi _\alpha \in \mathcal P(\{ 0,1\} )\) be the measure \((\alpha , 1 - \alpha )\). Let \(\alpha , \gamma \in (0,1)\), \(\mu , \nu \in \mathcal P(\mathcal X)\) and let \(P : \{ 0,1\} \rightsquigarrow \mathcal X\) be the kernel with \(P(0) = \mu \) and \(P(1) = \nu \). Then for all \(\beta > 0\) and \(\varepsilon > 0\),
Applications of the change of measure with 3 points
Let \(\mu , \nu , \xi \) be three probability measures on \(\mathcal X\) and let \(E\) be an event on \(\mathcal X\). For \(\beta > 0\),
Use Lemma 4.1.18 with the choices \(\operatorname{KL}(\xi , \mu ) + \sqrt{\beta \operatorname{Var}_{\xi }\left[\log \frac{d\xi }{d\mu }\right]}\) and \(\operatorname{KL}(\xi , \nu ) + \sqrt{\beta \operatorname{Var}_{\xi }\left[\log \frac{d\xi }{d\nu }\right]}\) for \(\beta _1\) and \(\beta _2\). Then use Chebyshev’s inequality to bound the probabilities of deviation of the log-likelihood ratios.
Let \(\mu , \nu , \xi \) be three probability measures on \(\mathcal X\) and let \(E\) be an event on \(\mathcal X\). Let \(\alpha , \beta \ge 0\). Then
Let \(\mu , \nu \) be two probability measures on \(\mathcal X\) and let \(E\) be an event on \(\mathcal X\). Let \(\alpha > 0\). Then
Apply Lemma 4.1.26 with \(\beta = \log (4)/\alpha \) and use that
Product spaces
Let \(\mu , \nu \) be two probability measures on \(\mathcal X\), let \(n \in \mathbb {N}\) and let \(E\) be an event on \(\mathcal X^n\). For all \(\alpha > 0\),
Let \(\mu , \nu \) be two probability measures on \(\mathcal X\) and let \((E_n)_{n \in \mathbb {N}}\) be a sequence of events, with \(E_n\) an event on \(\mathcal X^n\). For all \(\gamma \in (0,1)\),
Let \(\xi \) be a probability measure on \(\mathcal X\) and \(\beta > 0\). By Corollary 4.1.19,
We sum the two inequalities with weights \(\gamma \) and \(1-\gamma \) respectively, use that each \(\operatorname{KL}\) on the left is at most their maximum, and use \(\xi ^{\otimes n}(E_n) + \xi ^{\otimes n}(E_n^c) = 1\).
Let \(p_{n,\mu }(\beta )\) and \(p_{n, \nu }(\beta )\) be the two probabilities on the right hand side. By the law of large numbers, both tend to 0 when \(n\) tends to \(+\infty \). In particular, for \(n\) large enough, the right hand side is positive and we can take logarithms on both sides. We also use the tensorization of \(\operatorname{KL}\) (Theorem 2.4.18).
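To spell out the law of large numbers step (a sketch, assuming \(\xi \ll \mu \), \(\xi \ll \nu \), finiteness of the corresponding \(\operatorname{KL}\) divergences, and that \(p_{n,\mu }(\beta )\) is the probability that the averaged log-likelihood ratio exceeds its mean by more than \(\beta \)): since \(\log \frac{d\xi ^{\otimes n}}{d\mu ^{\otimes n}}(x) = \sum _{i=1}^n \log \frac{d\xi }{d\mu }(x_i)\) is a sum of i.i.d. terms with mean \(\operatorname{KL}(\xi , \mu )\) under \(\xi \),
\[
p_{n, \mu }(\beta ) = \xi ^{\otimes n}\Big( \frac{1}{n}\sum _{i=1}^n \log \frac{d\xi }{d\mu }(x_i) - \operatorname{KL}(\xi , \mu ) > \beta \Big) \xrightarrow[n \to \infty ]{} 0
\]
by the law of large numbers, and similarly for \(p_{n, \nu }(\beta )\) with \(\nu \) in place of \(\mu \).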
For \(n \to +\infty \),
Since \(\beta > 0\) is arbitrary, we can take a supremum over \(\beta \) on the right.
4.2 Sample complexity
We study the sample complexity of binary hypothesis testing.
The sample complexity of simple binary hypothesis testing with prior \((\pi , 1 - \pi ) \in \mathcal P(\{ 0, 1\} )\) at risk level \(\delta \in \mathbb {R}_{+, \infty }\) is
This is the sample complexity \(n_\xi ^P(\delta )\) of Definition 9 specialized to simple binary hypothesis testing.
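For orientation, unfolding Definition 9 in this special case should give (a paraphrase, to be checked against Definition 9)
\[
n(\mu , \nu , \pi , \delta ) = \min \{ n \in \mathbb {N} : B_\pi (\mu ^{\otimes n}, \nu ^{\otimes n}) \le \delta \} ,
\]
with the convention \(\min \emptyset = \infty \). This is the quantity bounded in the rest of this subsection: the upper bounds exhibit an \(n\) with \(B_\pi (\mu ^{\otimes n}, \nu ^{\otimes n}) \le \delta \), and the lower bounds show that no smaller \(n\) can achieve this.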
For \(\delta \ge \min \{ \pi , 1 - \pi \} \), the sample complexity of simple binary hypothesis testing is \(n(\mu , \nu , \pi , \delta ) = 0\).
The sample complexity of simple binary hypothesis testing satisfies \(n(\mu , \nu , \pi , \delta ) \le n_0\), with \(n_0\) the smallest natural number such that
It suffices to show that \(B_\pi (\mu ^{\otimes n_0}, \nu ^{\otimes n_0}) \le \delta \). By Corollary 4.1.2, for all \(\alpha \in (0,1)\),
The Rényi divergence tensorizes: \(R_\alpha (\mu ^{\otimes n_0}, \nu ^{\otimes n_0}) = n_0 R_\alpha (\mu , \nu )\).
By definition, \(n_0\) is such that the infimum over \(\alpha \) of the right hand side is less than \(\delta \).
For \(\delta \le \pi \le 1/2\), the sample complexity of simple binary hypothesis testing satisfies
By Theorem 4.1.11,
The Rényi divergence tensorizes (Theorem 2.6.25): \(R_{1/2}(\mu ^{\otimes n}, \nu ^{\otimes n}) = n R_{1/2}(\mu , \nu )\). We finally use that \(B_\pi (\mu ^{\otimes n}, \nu ^{\otimes n}) \le \delta \) for \(n = n(\mu , \nu , \pi , \delta )\).
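Putting the steps together (a reconstruction of the inequality chain; the exact constant depends on the statement of Theorem 4.1.11): for \(n = n(\mu , \nu , \pi , \delta )\),
\[
\log \frac{\pi }{\delta } \le \log \frac{\pi }{B_\pi (\mu ^{\otimes n}, \nu ^{\otimes n})} \le R_{1/2}(\mu ^{\otimes n}, \nu ^{\otimes n}) + \log 2 = n\, R_{1/2}(\mu , \nu ) + \log 2 ,
\]
which rearranges into a lower bound on \(n\) of the form \(n \ge \big(\log \frac{\pi }{\delta } - \log 2\big) / R_{1/2}(\mu , \nu )\) when \(R_{1/2}(\mu , \nu ) \in (0, \infty )\).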
For \(\delta \le \min \{ \pi , 1 - \pi \} \), the sample complexity of simple binary hypothesis testing satisfies
in which \(h_2: x \mapsto x\log \frac{1}{x} + (1 - x)\log \frac{1}{1 - x}\) is the binary entropy function.
We start from Theorem 4.1.10.
By Lemma 2.9.6, \(\operatorname{JS}_\pi (\mu ^{\otimes n}, \nu ^{\otimes n}) \le n \operatorname{JS}_\pi (\mu , \nu )\). The final result is obtained by using that \(B_\pi (\mu ^{\otimes n}, \nu ^{\otimes n}) \le \delta \) for \(n = n(\mu , \nu , \pi , \delta )\).
4.3 Sequential testing
For \(\mu \in \mathcal M(\mathcal X)\), let \(\mu ^{\otimes \mathbb {N}}\) be the corresponding product measure on \(\mathcal X^{\mathbb {N}}\). For \(n \in \mathbb {N}\), let \(\mathcal F_n\) be the \(\sigma \)-algebra of \(\mathcal X^{\mathbb {N}}\) generated by the projection to the first \(n\) coordinates; denote that projection by \(\pi _{[n]}\). Then \(\mathcal F = (\mathcal F_n)_{n \in \mathbb {N}}\) is a filtration. Let \(\tau \) be a stopping time with respect to \(\mathcal F\). We write \(\mu _\tau \) for the restriction of the measure \(\mu ^{\otimes \mathbb {N}}\) to \(\mathcal F_\tau \), the \(\sigma \)-algebra generated by the stopping time \(\tau \).
In sequential testing, two components need to be designed for estimation: the stopping time \(\tau \) and the \(\mathcal F_\tau \)-measurable estimator that returns an answer given the data available at \(\tau \). We will be interested in two performance metrics:
The risk, in which the data-generating kernel is \(P_\tau \), taking values \(\mu _\tau \) and \(\nu _\tau \).
The stopping time \(\tau \) (through its mean, other moments, or high-probability bounds).
The idea is that a good estimator stops quickly with low risk.
For \(\mu , \nu \in \mathcal P(\mathcal X)\) with \(\mu \ll \nu \) and \(n \in \mathbb {N}\), \(\nu ^{\otimes \mathbb {N}}_{| \mathcal F_n}\)-almost surely, \(\frac{d \mu ^{\otimes \mathbb {N}}_{| \mathcal F_n}}{d \nu ^{\otimes \mathbb {N}}_{| \mathcal F_n}}(x) = \prod _{m=1}^n \frac{d \mu }{d \nu }(x_m)\).
By Lemma B.4.2, \(\frac{d \mu ^{\otimes \mathbb {N}}_{| \mathcal F_n}}{d \nu ^{\otimes \mathbb {N}}_{| \mathcal F_n}}(x) = \frac{d \pi _{[n]*}\mu ^{\otimes \mathbb {N}}}{d \pi _{[n]*}\nu ^{\otimes \mathbb {N}}}(\pi _{[n]}(x))\) a.s. The pushforwards simplify: \(\pi _{[n]*}\mu ^{\otimes \mathbb {N}} = \mu ^{\otimes n}\) and \(\pi _{[n]*}\nu ^{\otimes \mathbb {N}} = \nu ^{\otimes n}\).
For \(\mu , \nu \in \mathcal P(\mathcal X)\) with \(\mu \ll \nu \), \(\nu _\tau \)-almost surely, \(\frac{d \mu _\tau }{d \nu _\tau }(x) = \prod _{n=1}^\tau \frac{d \mu }{d \nu }(x_n)\).
It suffices to show the equality of their integrals on \(\mathcal F_\tau \)-measurable sets. Let \(E\) be such a set. Then \(E = \sum _{n=1}^\infty E \cap \{ \tau = n\} \) . For \(n \in \mathbb {N}\), let \(E_n = E \cap \{ \tau = n\} \).
We show that \(\nu _\tau \left[\mathbb {I}(E_n) \frac{d \mu _\tau }{d \nu _\tau }\right] = \nu _\tau \left[x \mapsto \mathbb {I}(E_n)\prod _{m=1}^n \frac{d \mu }{d \nu }(x_m)\right]\). The result for \(E\) then follows from the monotone convergence theorem. The product \(\prod _{m=1}^n \frac{d \mu }{d \nu }(x_m)\) is equal to \(\frac{d \mu ^{\otimes \mathbb {N}}_{| \mathcal F_n}}{d \nu ^{\otimes \mathbb {N}}_{| \mathcal F_n}}\) by Lemma 4.3.1.
\(E_n\) is \(\mathcal F_n\)-measurable, hence (using Lemma B.4.1) TODO
For \(\mu , \nu \in \mathcal P(\mathcal X)\), \(\operatorname{KL}(\mu _\tau , \nu _\tau ) = \mu [\tau ] \operatorname{KL}(\mu , \nu )\).
TODO: need Wald’s first identity.
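A sketch of the intended argument, assuming \(\mu \ll \nu \), \(\mu [\tau ] < \infty \) and \(\operatorname{KL}(\mu , \nu ) < \infty \): by the product form of \(\frac{d \mu _\tau }{d \nu _\tau }\) established above,
\[
\operatorname{KL}(\mu _\tau , \nu _\tau ) = \mu _\tau \Big[\log \frac{d \mu _\tau }{d \nu _\tau }\Big]
= \mu ^{\otimes \mathbb {N}}\Big[\sum _{n=1}^{\tau } \log \frac{d \mu }{d \nu }(x_n)\Big]
= \mu [\tau ]\, \mu \Big[\log \frac{d \mu }{d \nu }\Big] = \mu [\tau ]\, \operatorname{KL}(\mu , \nu ) ,
\]
where the third equality is Wald’s first identity applied to the i.i.d. terms \(\log \frac{d \mu }{d \nu }(x_n)\).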
For \(\mu , \nu \) two probability measures on \(\mathcal X\) and \(\alpha \in (0,1)\), \((\mu ^{(\alpha , \nu )})_\tau = (\mu _\tau )^{(\alpha , \nu _\tau )}\).
This is a guess; I did not check it. TODO
For \(\mu , \nu \) two probability measures on \(\mathcal X\) and \(\alpha \in (0,1)\), \(R_\alpha (\mu _\tau , \nu _\tau ) = \mu ^{(\alpha , \nu )}[\tau ] R_\alpha (\mu , \nu )\).
From Corollary 2.6.18, \((1 - \alpha ) R_\alpha (\mu , \nu ) = \alpha \operatorname{KL}(\mu ^{(\alpha , \nu )}, \mu ) + (1 - \alpha )\operatorname{KL}(\mu ^{(\alpha , \nu )}, \nu )\), hence we can use Theorem 4.3.3 to write
It then suffices to remark that \((\mu ^{(\alpha , \nu )})_\tau = (\mu _\tau )^{(\alpha , \nu _\tau )}\), thanks to Lemma 4.3.4.
TODO: if that lemma does not hold, we still have the inequality \(R_\alpha (\mu _\tau , \nu _\tau ) \le \mu ^{(\alpha , \nu )}[\tau ] R_\alpha (\mu , \nu )\) by Lemma 2.6.19.