4 The risk of simple binary hypothesis testing
The goal of this section is to obtain bounds on the risk of simple binary hypothesis testing \(B_\pi (\mu ^{\otimes n}, \nu ^{\otimes n})\), or on \(\inf _{E} \max \{ \mu ^{\otimes n}(E), \nu ^{\otimes n}(E^c)\} \), that show how those quantities depend on the number of samples \(n \in \mathbb {N}\).
The main idea is to relate these to divergences that tensorize, or for which we have tensorization inequalities. Those divergences are \(\operatorname{KL}\), \(R_\alpha \), \(C_\alpha \), \(\operatorname{JS}_\alpha \) (and, more generally, the mutual information if we want to consider more than two hypotheses).
4.1 Bounding the Bayes risk with tensorizing divergences
4.1.1 Upper bound with the Chernoff information
Let \(\zeta \) be a measure such that \(\mu \ll \zeta \) and \(\nu \ll \zeta \). Let \(p = \frac{d \mu }{d\zeta }\) and \(q = \frac{d \nu }{d\zeta }\). For \(\alpha \in (0,1)\), for \(g_\alpha (x) = \min \{ (\alpha -1)x, \alpha x\} \),
For \(\alpha \in (0,1)\),
Use \(g_\alpha (x) \le 0\) in Lemma 4.1.1.
For probability measures,
By definition, since \(\mu \) and \(\nu \) are probability measures, \(1 - \operatorname{TV}(\mu , \nu ) = \mathcal B_{(1,1)}(\mu , \nu )\). Then apply Lemma 4.1.3.
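As an illustration of the quantities at play (this example is not part of the development above, and uses the standard closed form for Rényi divergences between Gaussians with equal variance), take \(\mu = \mathcal N(0, \sigma ^2)\) and \(\nu = \mathcal N(\theta , \sigma ^2)\). Then
\[
R_\alpha (\mu , \nu ) = \frac{\alpha \theta ^2}{2\sigma ^2} \quad \text{for } \alpha \in (0,1),
\qquad
\sup _{\alpha \in (0,1)} (1-\alpha ) R_\alpha (\mu , \nu ) = \frac{\theta ^2}{8\sigma ^2},
\]
with the supremum attained at \(\alpha = 1/2\); the right hand side is the usual Chernoff information between these two Gaussians. Combined with the tensorization \(R_\alpha (\mu ^{\otimes n}, \nu ^{\otimes n}) = n R_\alpha (\mu , \nu )\) used in Section 4.2, this is the exponential rate in \(n\) one expects from the upper bounds of this subsection.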
4.1.2 Lower bounds using the data-processing inequality
The ideas behind the next two theorems are the same, up to the order of the arguments of a Kullback-Leibler divergence. In simple binary hypothesis testing, the hidden parameter and the data are generated from a distribution \(\pi \otimes P \in \mathcal P(\Theta \times \mathcal X)\). The reason an estimator can achieve low risk is that \(P\) depends on the parameter: if it did not, the Bayes risk with prior \((\alpha , 1 - \alpha )\) would be \(\min \{ \alpha , 1 - \alpha \} \). We therefore compare \(\pi \otimes P\) to a product distribution \(\pi ' \times \xi \) for \(\pi ' \in \mathcal P(\Theta )\) and \(\xi \in \mathcal P(\mathcal X)\). To do so, we use the data-processing inequality for \(\operatorname{KL}\), starting either from \(\operatorname{KL}(\pi \otimes P, \pi ' \times \xi )\) or from \(\operatorname{KL}(\pi ' \times \xi , \pi \otimes P)\) (the two theorems correspond to the two choices). The data-processing inequality shows that these divergences dominate the divergences between the laws of the loss in the two corresponding estimation tasks.
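For the special case of two hypotheses, the comparison to product distributions can be made explicit. With prior \(\pi = (\alpha , 1 - \alpha )\) on \(\Theta = \{ 0,1\} \) and \(P(0) = \mu \), \(P(1) = \nu \), a standard computation (stated here for orientation, assuming \(\operatorname{JS}_\alpha \) denotes the skewed Jensen-Shannon divergence \(\alpha \operatorname{KL}(\mu , \alpha \mu + (1-\alpha )\nu ) + (1-\alpha ) \operatorname{KL}(\nu , \alpha \mu + (1-\alpha )\nu )\)) gives
\[
\inf _{\xi \in \mathcal P(\mathcal X)} \operatorname{KL}(\pi \otimes P, \pi \times \xi )
= \inf _{\xi \in \mathcal P(\mathcal X)} \big( \alpha \operatorname{KL}(\mu , \xi ) + (1-\alpha ) \operatorname{KL}(\nu , \xi ) \big)
= \operatorname{JS}_\alpha (\mu , \nu ) ,
\]
with the infimum attained at the mixture \(\xi = \alpha \mu + (1-\alpha )\nu \). This is the kind of quantity that appears in the lower bounds below.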
We start with some preparatory lemmas.
Let \(\pi , \xi \in \mathcal P(\Theta )\) and \(P, Q : \Theta \rightsquigarrow \mathcal X\). Suppose that the loss \(\ell '\) takes values in \([0,1]\). Then
This is Lemma 2.3.63 specialized to the Kullback-Leibler divergence.
Let \(\pi , \xi \in \mathcal P(\Theta )\) and \(P, Q : \Theta \rightsquigarrow \mathcal X\). Suppose that the loss \(\ell '\) takes values in \([0,1]\). Then
Use \(\operatorname{KL}(\pi \otimes Q, \xi \otimes P) = \operatorname{KL}(\pi , \xi ) + \operatorname{KL}(\pi \otimes Q, \pi \otimes P)\) in Lemma 4.1.5.
Let \(\alpha , \beta \in (0, 1)\). Let \(P, Q : \{ 0,1\} \rightsquigarrow \mathcal X\). We write \(\pi _\alpha \) for the probability measure on \(\{ 0,1\} \) with \(\pi _\alpha (\{ 0\} ) = \alpha \). Then
Apply Lemma 4.1.5.
Let \(\mu , \nu \in \mathcal P(\mathcal X)\) and let \(\alpha , \beta \in (0, 1)\). Let \(P, Q : \{ 0,1\} \rightsquigarrow \mathcal X\). We write \(\pi _\alpha \) for the probability measure on \(\{ 0,1\} \) with \(\pi _\alpha (\{ 0\} ) = \alpha \). Then
Apply Corollary 4.1.6.
Let \(\mu , \nu , \xi \in \mathcal P(\mathcal X)\) and let \(\alpha , \beta \in (0, 1)\). Let \(P : \{ 0,1\} \rightsquigarrow \mathcal X\) be the kernel with \(P(0) = \mu \) and \(P(1) = \nu \). We write \(\pi _\alpha \) for the probability measure on \(\{ 0,1\} \) with \(\pi _\alpha (\{ 0\} ) = \alpha \). Let \(\bar{\beta } = \min \{ \beta , 1 - \beta \} \). Then
Let \(\mu , \nu \in \mathcal P(\mathcal X)\) and let \(\alpha \in (0, 1)\).
in which \(h_2: x \mapsto x\log \frac{1}{x} + (1 - x)\log \frac{1}{1 - x}\) is the binary entropy function.
Apply Lemma 2.9.5 and then Lemma 4.1.9 with \(\beta = 1/2\) to get
Note that \(\beta = 1/2\) is the value that results in the best bound: the lower bound on \(\operatorname{JS}_\alpha (\mu , \nu )\) from some \(\beta \le 1/2\) would be
For \(\alpha , \beta \in (0, 1/2)\),
As a consequence,
In particular, \(\log \frac{\alpha }{B_\alpha (\mu , \nu )} \le R_{1/2}(\mu , \nu ) + \log 2\).
By Lemma 2.6.20 and then Lemma 4.1.9,
The first particular case is obtained by using \(\alpha \le 1/2\) and \(B_\pi (\mu , \nu ) \ge 0\). The second one further sets \(\beta = 1/2\).
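To make the use of this bound concrete: applying \(\log \frac{\alpha }{B_\alpha (\mu , \nu )} \le R_{1/2}(\mu , \nu ) + \log 2\) to \(\mu ^{\otimes n}\) and \(\nu ^{\otimes n}\) and using the tensorization \(R_{1/2}(\mu ^{\otimes n}, \nu ^{\otimes n}) = n R_{1/2}(\mu , \nu )\) (Theorem 2.6.25, as in Section 4.2) gives, for \(\alpha \in (0, 1/2)\),
\[
B_\alpha (\mu ^{\otimes n}, \nu ^{\otimes n}) \ge \frac{\alpha }{2}\, e^{- R_{1/2}(\mu ^{\otimes n}, \nu ^{\otimes n})} = \frac{\alpha }{2}\, e^{- n R_{1/2}(\mu , \nu )} ,
\]
that is, the Bayes risk can decay at most exponentially in \(n\), at a rate controlled by \(R_{1/2}(\mu , \nu )\).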
TODO: move or remove
Let \(\mu , \nu \) be two probability measures on \(\mathcal X\) and \(E\) an event. Let \(\alpha \in (0,1)\). Then
Let \(\mu _E\) and \(\nu _E\) be the two Bernoulli distributions with respective means \(\mu (E)\) and \(\nu (E)\). By Lemma 2.6.9, \(R_\alpha (\mu , \nu ) \ge R_\alpha (\mu _E, \nu _E)\). That divergence is
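Spelled out with the usual closed form for Rényi divergences between measures on a two-point space (a reconstruction under the standard definition of \(R_\alpha \)), this is
\[
R_\alpha (\mu _E, \nu _E) = \frac{1}{\alpha - 1} \log \Big( \mu (E)^\alpha \, \nu (E)^{1-\alpha } + \mu (E^c)^\alpha \, \nu (E^c)^{1-\alpha } \Big) .
\]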
Let \(\mu , \nu \) be two probability measures on \(\mathcal X\) and \(E\) an event. Then
4.1.3 Lower bounds using the change of measure lemma
The main tool of this section is the next lemma, which relates the probabilities of an event under two measures to the deviations of the log-likelihood ratio between them.
Let \(\mu , \nu \) be two measures on \(\mathcal X\) with \(\mu \ll \nu \) and let \(E\) be an event on \(\mathcal X\). Let \(\beta \in \mathbb {R}\). Then
Let \(\mu , \nu \) be two measures on \(\mathcal X\) with \(\mu \ll \nu \) and let \(f : \mathcal X \to [0,1]\) be a measurable function. Let \(\beta \in \mathbb {R}\). Then
The first and second inequalities use \(f \ge 0\) and the last inequality uses \(f \le 1\).
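For orientation, the event version (Lemma 4.1.14) can be recovered in one line, assuming it is stated in the usual way: splitting \(E\) according to whether the likelihood ratio exceeds \(e^\beta \),
\[
\mu (E) = \int _{E \cap \{ \frac{d\mu }{d\nu } \le e^\beta \} } \frac{d\mu }{d\nu } \, d\nu + \mu \Big(E \cap \Big\{ \log \frac{d\mu }{d\nu } > \beta \Big\} \Big)
\le e^\beta \, \nu (E) + \mu \Big(\log \frac{d\mu }{d\nu } > \beta \Big) .
\]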
Consider an estimation problem with loss \(\ell ' : \mathcal Y \times \mathcal Z \to [0,1]\). Let \(\pi , \zeta \in \mathcal P(\Theta )\) and \(P, Q : \Theta \rightsquigarrow \mathcal X\) be such that \(\zeta \otimes Q \ll \pi \otimes P\). Then for all \(\beta \in \mathbb {R}\),
Let \(\hat{y}_B\) be a Bayes estimator for \((P, y, \ell ')\). If no such estimator exists, the proof can be adapted by taking an estimator with risk \(\varepsilon \)-close to the Bayes risk and then minimizing over \(\varepsilon > 0\). Apply Lemma 4.1.15 to the function \((\theta , x) \mapsto \hat{y}_B(x)\left[ z \mapsto \ell '(y(\theta ), z) \right]\), which takes values in \([0,1]\), and to the measures \(\pi \otimes P\) and \(\zeta \otimes Q\).
Consider an estimation problem with loss \(\ell ' : \mathcal Y \times \mathcal Z \to [0,1]\). Let \(\pi , \zeta \in \mathcal P(\Theta )\) and \(P : \Theta \rightsquigarrow \mathcal X\). Then for all \(\beta \in \mathbb {R}\),
in which the infimum over \(\xi \) is restricted to probability measures such that \(\zeta \times \xi \ll \pi \otimes P\) and \(d_{\mathcal X} : \Theta \rightsquigarrow *\) is the discard kernel.
Let \(\mu , \nu , \xi \in \mathcal P(\mathcal X)\) and let \(E\) be an event on \(\mathcal X\). Let \(\beta _1, \beta _2 \in \mathbb {R}\). Then
Two applications of Lemma 4.1.14, then sum them and use \(\xi (E)+\xi (E^c) = 1\).
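Spelled out (a sketch of the computation, assuming the usual form of Lemma 4.1.14 and the relevant absolute continuity, or the conventions used for \(\frac{d\xi }{d\mu }\) and \(\frac{d\xi }{d\nu }\)): the two applications give
\[
\xi (E) \le e^{\beta _1} \mu (E) + \xi \Big(\log \frac{d\xi }{d\mu } > \beta _1\Big),
\qquad
\xi (E^c) \le e^{\beta _2} \nu (E^c) + \xi \Big(\log \frac{d\xi }{d\nu } > \beta _2\Big),
\]
and summing them, the left-hand sides add up to \(1\), so \(e^{\beta _1} \mu (E) + e^{\beta _2} \nu (E^c)\) plus the two deviation probabilities is at least \(1\).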
Change of measure and the moments of the log-likelihood ratio
Let \(\mu , \nu \) be two measures on \(\mathcal X\) and let \(E\) be an event on \(\mathcal X\). Let \(\beta \in \mathbb {R}\). Then
Use Lemma 4.1.14 with the choice \(\operatorname{KL}(\mu , \nu ) + \beta \) for \(\beta \).
For \(\alpha \in (0,1)\), let \(\pi _\alpha \in \mathcal P(\{ 0,1\} )\) be the measure \((\alpha , 1 - \alpha )\). Let \(\alpha , \gamma \in (0,1)\), \(\mu , \nu \in \mathcal P(\mathcal X)\) and let \(P : \{ 0,1\} \rightsquigarrow \mathcal X\) be the kernel with \(P(0) = \mu \) and \(P(1) = \nu \). Then for all \(\beta \in \mathbb {R}\),
An application of Lemma 4.1.17 gives
Take \(\xi = \mu ^{(\gamma , \nu )}\) and
We obtain
Let \(\mu , \nu \) be two measures on \(\mathcal X\) such that \(\mu \left[\left(\log \frac{d \mu }{d \nu }\right)^2\right] < \infty \). Let \(E\) be an event on \(\mathcal X\) and let \(\beta > 0\). Then
Use Lemma 4.1.14 with the choice \(\operatorname{KL}(\mu , \nu ) + \sqrt{\operatorname{Var}_\mu [\log \frac{d \mu }{d \nu }]\beta }\) for \(\beta \) and bound the probability of deviation of the log-likelihood ratio with Chebyshev’s inequality.
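The Chebyshev step, spelled out (a sketch, assuming \(\mu \) and \(\nu \) are probability measures, so that the log-likelihood ratio has mean \(\operatorname{KL}(\mu , \nu )\) under \(\mu \)): write \(V = \operatorname{Var}_\mu \big[\log \frac{d\mu }{d\nu }\big]\). For \(V > 0\) (the case \(V = 0\) being immediate),
\[
\mu \Big(\log \frac{d\mu }{d\nu } > \operatorname{KL}(\mu , \nu ) + \sqrt{V \beta }\Big) \le \frac{V}{V \beta } = \frac{1}{\beta } .
\]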
For \(\mu , \nu \) finite measures and \(\alpha , \beta > 0\),
This is a Chernoff bound, using that the cumulant generating function of \(\log \frac{d\mu }{d\nu }\) under \(\mu \) has value \(\alpha R_{1+\alpha }(\mu , \nu )\) at \(\alpha \) by Lemma 2.6.7.
Then \(\mu \left[\left(\frac{d \mu }{d \nu }\right)^\alpha \right] = \nu \left[\left(\frac{d \mu }{d \nu }\right)^{1+\alpha } \right] = e^{\alpha R_{1+\alpha }(\mu , \nu )}\).
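For reference, the Markov step behind the Chernoff bound reads as follows (a sketch; the deviation threshold \(R_{1+\alpha }(\mu , \nu ) + \beta \) is our reading of the statement): for any \(t \in \mathbb {R}\),
\[
\mu \Big(\log \frac{d\mu }{d\nu } > t\Big) = \mu \Big(\Big(\frac{d\mu }{d\nu }\Big)^\alpha > e^{\alpha t}\Big) \le e^{-\alpha t} \, \mu \Big[\Big(\frac{d\mu }{d\nu }\Big)^\alpha \Big] = e^{\alpha R_{1+\alpha }(\mu , \nu ) - \alpha t} ,
\]
and the choice \(t = R_{1+\alpha }(\mu , \nu ) + \beta \) makes the right hand side equal to \(e^{-\alpha \beta }\).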
Let \(\mu , \nu \) be two finite measures on \(\mathcal X\) and let \(E\) be an event on \(\mathcal X\). Let \(\alpha , \beta > 0\). Then
For \(\alpha \in (0,1)\), let \(\pi _\alpha \in \mathcal P(\{ 0,1\} )\) be the measure \((\alpha , 1 - \alpha )\). Let \(\alpha , \gamma \in (0,1)\), \(\mu , \nu \in \mathcal P(\mathcal X)\) and let \(P : \{ 0,1\} \rightsquigarrow \mathcal X\) be the kernel with \(P(0) = \mu \) and \(P(1) = \nu \). Then for all \(\beta > 0\) and \(\varepsilon > 0\),
Applications of the change of measure with 3 points
Let \(\mu , \nu , \xi \) be three probability measures on \(\mathcal X\) and let \(E\) be an event on \(\mathcal X\). For \(\beta > 0\),
Use Lemma 4.1.18 with the choices \(\operatorname{KL}(\xi , \mu ) + \sqrt{\beta \operatorname{Var}_{\xi }\left[\log \frac{d\xi }{d\mu }\right]}\) and \(\operatorname{KL}(\xi , \nu ) + \sqrt{\beta \operatorname{Var}_{\xi }\left[\log \frac{d\xi }{d\nu }\right]}\) for \(\beta _1\) and \(\beta _2\). Then use Chebyshev’s inequality to bound the probabilities of deviation of the log-likelihood ratios.
Let \(\mu , \nu , \xi \) be three probability measures on \(\mathcal X\) and let \(E\) be an event on \(\mathcal X\). Let \(\alpha , \beta \ge 0\). Then
Let \(\mu , \nu \) be two probability measures on \(\mathcal X\) and let \(E\) be an event on \(\mathcal X\). Let \(\alpha > 0\). Then
Apply Lemma 4.1.26 with \(\beta = \log (4)/\alpha \) and use that
Product spaces
Let \(\mu , \nu \) be two probability measures on \(\mathcal X\), let \(n \in \mathbb {N}\) and let \(E\) be an event on \(\mathcal X^n\). For all \(\alpha > 0\),
Let \(\mu , \nu \) be two probability measures on \(\mathcal X\) and let \((E_n)_{n \in \mathbb {N}}\) be a sequence of events, with \(E_n\) an event on \(\mathcal X^n\). For all \(\gamma \in (0,1)\),
Let \(\xi \) be a probability measure on \(\mathcal X\) and \(\beta > 0\). By Corollary 4.1.19,
We sum the two inequalities with weights \(\gamma \) and \(1-\gamma \) respectively, use that each \(\operatorname{KL}\) on the left is at most their maximum, and use \(\xi ^{\otimes n}(E_n) + \xi ^{\otimes n}(E_n^c) = 1\).
Let \(p_{n,\mu }(\beta )\) and \(p_{n, \nu }(\beta )\) be the two probabilities on the right hand side. By the law of large numbers, both tend to 0 when \(n\) tends to \(+\infty \). In particular, for \(n\) large enough, the right hand side is positive and we can take logarithms on both sides. We also use the tensorization of \(\operatorname{KL}\) (Theorem 2.4.18).
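To spell out the law of large numbers step (a sketch, assuming \(\xi \ll \mu \), \(\xi \ll \nu \), finiteness of the corresponding \(\operatorname{KL}\) divergences, and that \(p_{n,\mu }(\beta )\) is the probability that the averaged log-likelihood ratio exceeds its mean by more than \(\beta \)): since \(\log \frac{d\xi ^{\otimes n}}{d\mu ^{\otimes n}}(x) = \sum _{i=1}^n \log \frac{d\xi }{d\mu }(x_i)\) is a sum of i.i.d. terms with mean \(\operatorname{KL}(\xi , \mu )\) under \(\xi \),
\[
p_{n, \mu }(\beta ) = \xi ^{\otimes n}\Big( \frac{1}{n}\sum _{i=1}^n \log \frac{d\xi }{d\mu }(x_i) - \operatorname{KL}(\xi , \mu ) > \beta \Big) \xrightarrow[n \to \infty ]{} 0
\]
by the law of large numbers, and similarly for \(p_{n, \nu }(\beta )\) with \(\nu \) in place of \(\mu \).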
For \(n \to +\infty \),
Since \(\beta > 0\) is arbitrary, we can take a supremum over \(\beta \) on the right.
4.2 Sample complexity
We study the sample complexity of binary hypothesis testing.
The sample complexity of simple binary hypothesis testing with prior \((\pi , 1 - \pi ) \in \mathcal P(\{ 0, 1\} )\) at risk level \(\delta \in \mathbb {R}_{+, \infty }\) is
This is the sample complexity \(n_\xi ^P(\delta )\) of Definition 9 specialized to simple binary hypothesis testing.
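For orientation, unfolding Definition 9 in this special case should give (a paraphrase, to be checked against Definition 9)
\[
n(\mu , \nu , \pi , \delta ) = \min \{ n \in \mathbb {N} : B_\pi (\mu ^{\otimes n}, \nu ^{\otimes n}) \le \delta \} ,
\]
with the convention \(\min \emptyset = \infty \). This is the quantity bounded in the rest of this subsection: the upper bounds exhibit an \(n\) with \(B_\pi (\mu ^{\otimes n}, \nu ^{\otimes n}) \le \delta \), and the lower bounds show that no smaller \(n\) can achieve this.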
For \(\delta \ge \min \{ \pi , 1 - \pi \} \), the sample complexity of simple binary hypothesis testing is \(n(\mu , \nu , \pi , \delta ) = 0\).
The sample complexity of simple binary hypothesis testing satisfies \(n(\mu , \nu , \pi , \delta ) \le n_0\), with \(n_0\) the smallest natural number such that
It suffices to show that \(B_\pi (\mu ^{\otimes n_0}, \nu ^{\otimes n_0}) \le \delta \). By Corollary 4.1.2, for all \(\alpha \in (0,1)\),
The Rényi divergence tensorizes: \(R_\alpha (\mu ^{\otimes n_0}, \nu ^{\otimes n_0}) = n_0 R_\alpha (\mu , \nu )\).
By definition, \(n_0\) is such that the infimum over \(\alpha \) of the right hand side is less than \(\delta \).
For \(\delta \le \pi \le 1/2\), the sample complexity of simple binary hypothesis testing satisfies
By Theorem 4.1.11,
The Rényi divergence tensorizes (Theorem 2.6.25): \(R_{1/2}(\mu ^{\otimes n}, \nu ^{\otimes n}) = n R_{1/2}(\mu , \nu )\). We finally use that \(B_\pi (\mu ^{\otimes n}, \nu ^{\otimes n}) \le \delta \) for \(n = n(\mu , \nu , \pi , \delta )\).
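Putting the steps together (a reconstruction of the inequality chain; the exact constant depends on the statement of Theorem 4.1.11): for \(n = n(\mu , \nu , \pi , \delta )\),
\[
\log \frac{\pi }{\delta } \le \log \frac{\pi }{B_\pi (\mu ^{\otimes n}, \nu ^{\otimes n})} \le R_{1/2}(\mu ^{\otimes n}, \nu ^{\otimes n}) + \log 2 = n\, R_{1/2}(\mu , \nu ) + \log 2 ,
\]
which rearranges into a lower bound on \(n\) of the form \(n \ge \big(\log \frac{\pi }{\delta } - \log 2\big) / R_{1/2}(\mu , \nu )\) when \(R_{1/2}(\mu , \nu ) \in (0, \infty )\).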
For \(\delta \le \min \{ \pi , 1 - \pi \} \), the sample complexity of simple binary hypothesis testing satisfies
in which \(h_2: x \mapsto x\log \frac{1}{x} + (1 - x)\log \frac{1}{1 - x}\) is the binary entropy function.
We start from Theorem 4.1.10.
By Lemma 2.9.6, \(\operatorname{JS}_\pi (\mu ^{\otimes n}, \nu ^{\otimes n}) \le n \operatorname{JS}_\pi (\mu , \nu )\). The final result is obtained by using that \(B_\pi (\mu ^{\otimes n}, \nu ^{\otimes n}) \le \delta \) for \(n = n(\mu , \nu , \pi , \delta )\).
4.3 Sequential testing
For \(\mu \in \mathcal M(\mathcal X)\), let \(\mu ^{\otimes \mathbb {N}}\) be the corresponding product measure on \(\mathcal X^{\mathbb {N}}\). For \(n \in \mathbb {N}\), let \(\mathcal F_n\) be the \(\sigma \)-algebra of \(\mathcal X^{\mathbb {N}}\) generated by the projection to the first \(n\) coordinates; denote that projection by \(\pi _{[n]}\). Then \(\mathcal F = (\mathcal F_n)_{n \in \mathbb {N}}\) is a filtration. Let \(\tau \) be a stopping time with respect to \(\mathcal F\). We write \(\mu _\tau \) for the restriction of the measure \(\mu ^{\otimes \mathbb {N}}\) to \(\mathcal F_\tau \), the \(\sigma \)-algebra generated by the stopping time \(\tau \).
In sequential testing, two components need to be designed for estimation: the stopping time \(\tau \) and the \(\mathcal F_\tau \)-measurable estimator that returns an answer given the data available at \(\tau \). We will be interested in two performance metrics:
The risk, in which the data-generating kernel is \(P_\tau \), taking values \(\mu _\tau \) and \(\nu _\tau \).
The stopping time \(\tau \) (through its mean, other moments, or high-probability bounds).
The idea is that a good estimator stops quickly with low risk.
For \(\mu , \nu \in \mathcal P(\mathcal X)\) with \(\mu \ll \nu \) and \(n \in \mathbb {N}\), \(\nu ^{\otimes \mathbb {N}}_{| \mathcal F_n}\)-almost surely, \(\frac{d \mu ^{\otimes \mathbb {N}}_{| \mathcal F_n}}{d \nu ^{\otimes \mathbb {N}}_{| \mathcal F_n}}(x) = \prod _{m=1}^n \frac{d \mu }{d \nu }(x_m)\).
By Lemma B.4.2, \(\frac{d \mu ^{\otimes \mathbb {N}}_{| \mathcal F_n}}{d \nu ^{\otimes \mathbb {N}}_{| \mathcal F_n}}(x) = \frac{d \pi _{[n]*}\mu ^{\otimes \mathbb {N}}}{d \pi _{[n]*}\nu ^{\otimes \mathbb {N}}}(\pi _{[n]}(x))\) a.s. The pushforwards simplify: \(\pi _{[n]*}\mu ^{\otimes \mathbb {N}} = \mu ^{\otimes n}\) and \(\pi _{[n]*}\nu ^{\otimes \mathbb {N}} = \nu ^{\otimes n}\).
For \(\mu , \nu \in \mathcal P(\mathcal X)\) with \(\mu \ll \nu \), \(\nu _\tau \)-almost surely, \(\frac{d \mu _\tau }{d \nu _\tau }(x) = \prod _{n=1}^\tau \frac{d \mu }{d \nu }(x_n)\).
It suffices to show the equality of their integrals on \(\mathcal F_\tau \)-measurable sets. Let \(E\) be such a set. Then \(E = \sum _{n=1}^\infty E \cap \{ \tau = n\} \) . For \(n \in \mathbb {N}\), let \(E_n = E \cap \{ \tau = n\} \).
We show that \(\nu _\tau \left[\mathbb {I}(E_n) \frac{d \mu _\tau }{d \nu _\tau }\right] = \nu _\tau \left[x \mapsto \mathbb {I}(E_n)\prod _{m=1}^n \frac{d \mu }{d \nu }(x_m)\right]\). The result for \(E\) then follows from the monotone convergence theorem. The product \(\prod _{m=1}^n \frac{d \mu }{d \nu }(x_m)\) is equal to \(\frac{d \mu ^{\otimes \mathbb {N}}_{| \mathcal F_n}}{d \nu ^{\otimes \mathbb {N}}_{| \mathcal F_n}}\) by Lemma 4.3.1.
\(E_n\) is \(\mathcal F_n\)-measurable, hence (using Lemma B.4.1) TODO
For \(\mu , \nu \in \mathcal P(\mathcal X)\), \(\operatorname{KL}(\mu _\tau , \nu _\tau ) = \mu [\tau ] \operatorname{KL}(\mu , \nu )\).
TODO: need Wald’s first identity.
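A sketch of the intended argument, assuming \(\mu \ll \nu \), \(\mu [\tau ] < \infty \) and \(\operatorname{KL}(\mu , \nu ) < \infty \): by the product form of \(\frac{d \mu _\tau }{d \nu _\tau }\) established above,
\[
\operatorname{KL}(\mu _\tau , \nu _\tau ) = \mu _\tau \Big[\log \frac{d \mu _\tau }{d \nu _\tau }\Big]
= \mu ^{\otimes \mathbb {N}}\Big[\sum _{n=1}^{\tau } \log \frac{d \mu }{d \nu }(x_n)\Big]
= \mu [\tau ]\, \mu \Big[\log \frac{d \mu }{d \nu }\Big] = \mu [\tau ]\, \operatorname{KL}(\mu , \nu ) ,
\]
where the third equality is Wald’s first identity applied to the i.i.d. terms \(\log \frac{d \mu }{d \nu }(x_n)\).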
For \(\mu , \nu \) two probability measures on \(\mathcal X\) and \(\alpha \in (0,1)\), \((\mu ^{(\alpha , \nu )})_\tau = (\mu _\tau )^{(\alpha , \nu _\tau )}\).
This is a guess; I did not check it. TODO
For \(\mu , \nu \) two probability measures on \(\mathcal X\) and \(\alpha \in (0,1)\), \(R_\alpha (\mu _\tau , \nu _\tau ) = \mu ^{(\alpha , \nu )}[\tau ] R_\alpha (\mu , \nu )\).
From Corollary 2.6.18, \((1 - \alpha ) R_\alpha (\mu , \nu ) = \alpha \operatorname{KL}(\mu ^{(\alpha , \nu )}, \mu ) + (1 - \alpha )\operatorname{KL}(\mu ^{(\alpha , \nu )}, \nu )\), hence we can use Theorem 4.3.3 to write
It then suffices to remark that \((\mu ^{(\alpha , \nu )})_\tau = (\mu _\tau )^{(\alpha , \nu _\tau )}\), thanks to Lemma 4.3.4.
TODO: if that lemma does not hold, we still have the inequality \(R_\alpha (\mu _\tau , \nu _\tau ) \le \mu ^{(\alpha , \nu )}[\tau ] R_\alpha (\mu , \nu )\) by Lemma 2.6.19.