Lower bounds for hypothesis testing based on information theory

1 Testing, Estimation, Risk

In the estimation problem, we are given a family of measures and we want to estimate, from samples of one of them, a quantity that depends on the unknown parameter of that measure. Let \(\mathcal X\) be the measurable space on which the measures are defined. The measure family is described by a kernel \(P : \Theta \rightsquigarrow \mathcal X\). We write \(P_\theta \) for \(P(\theta )\). In parametric problems, \(\Theta \) is a low-dimensional space. In nonparametric problems, \(\Theta \) is large, for example \(\Theta = \mathcal M (\mathcal X)\) or \(\Theta = \mathcal P(\mathcal X)\).

The goal of the estimation task is to find, given observations sampled from \(P_\theta \) for an unknown \(\theta \), the value \(y(\theta )\) of a function \(y : \Theta \to \mathcal Y\), which maps the parameter \(\theta \) to the quantity we are interested in. For example, \(y(\theta )\) might be the expected value of the distribution \(P_\theta \).

An estimator is a Markov kernel \(\hat{y} : \mathcal X \rightsquigarrow \mathcal Z\), where most often \(\mathcal Z = \mathcal Y\). Note that, while \(y\) is a function of a parameter in \(\Theta \), \(\hat{y}\) has domain \(\mathcal X\): the estimation is done based on a sample of the measure, while the measure itself and its parameter remain unknown. In general this is a randomized estimator, but deterministic estimators (corresponding to deterministic kernels) are often enough to attain optimal performance. The quality of the estimation is quantified by a loss \(\ell ' : \mathcal Y \times \mathcal Z \to \mathbb {R}_{+, \infty }\). We write \(\ell \) for a loss on \(\mathcal Y \times \mathcal Y\) and \(\ell '\) for a loss on \(\mathcal Y \times \mathcal Z\) in the heterogeneous case.

In short, an estimation problem is specified by the data \((P, y, \ell ')\).

Example: simple binary hypothesis testing (SBHT)

We say that an estimation problem is a testing problem if \(\mathcal Y = \mathcal Z\) is discrete and the loss takes values in \(\{ 0, 1\} \). A test is said to be binary if \(\mathcal Y\) has two elements. A test is simple if \(\{ \theta \mid y(\theta ) = y_0\} \) is a singleton for all \(y_0 \in \mathcal Y\), i.e. \(y\) is bijective.

In summary, in simple binary hypothesis testing, \(\Theta = \mathcal Y = \mathcal Z = \{ 0,1\} \), \(y(0) = 0\), \(y(1) = 1\) (that is, \(y\) is the identity). For \(y_0, z \in \{ 0,1\} \), \(\ell (y_0, z) = \mathbb {I}\{ y_0 \ne z\} \).
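
As a concrete instance (an arbitrary choice made here for illustration), take \(\mathcal X = \mathbb {R}\) and let the kernel \(P\) send the two hypotheses to two Gaussian distributions,

\begin{align*} P(0) = \mathcal N(0, 1) \: , \qquad P(1) = \mathcal N(1, 1) \: . \end{align*}

An estimator \(\hat{y} : \mathbb {R} \rightsquigarrow \{ 0,1\} \) then has to decide, from a single real-valued observation, which of the two Gaussians generated it.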

1.1 Risk

Let \(\hat{\mu }_\theta \) be the law of the estimate when the parameter is \(\theta \). That is, \(\hat{\mu } : \Theta \rightsquigarrow \mathcal Z\) is the kernel \(\hat{y} \circ P\) and \(\hat{\mu }_\theta = (\hat{y} \circ P)(\theta )\).

Definition 1 Risk

The risk of an estimator \(\hat{y}\) on the estimation problem \((P, y, \ell ')\) at \(\theta \in \Theta \) is \(r^P_\theta (\hat{y}) = (\hat{y} \circ P)(\theta )\left[z \mapsto \ell '(y(\theta ), z)\right]\) .

Example (SBHT): \(r^P_\theta (\hat{y}) = \hat{\mu }_\theta (\{ z \mid z \ne y(\theta )\} )\), the probability that the test outputs the wrong label.

Definition 2 Bayesian risk

The Bayesian risk of an estimator \(\hat{y}\) on \((P, y, \ell ')\) for a prior \(\pi \in \mathcal M(\Theta )\) is \(R^P_\pi (\hat{y}) = \pi \left[\theta \mapsto r^P_\theta (\hat{y})\right]\) . It can also be expanded as \(R^P_\pi (\hat{y}) = (\pi \otimes (\hat{y} \circ P))\left[ (\theta , z) \mapsto \ell '(y(\theta ), z) \right]\) .

If we write \(\ell '\) and \(y\) for their deterministic kernels, the Bayesian risk of \(\hat{y}\) is the mean of the measure \(\ell ' \circ (y \parallel \hat{y}) \circ (\pi \otimes P)\) .
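
To see why the two expressions agree, one can unfold the composition (a short sketch):

\begin{align*} (\ell ' \circ (y \parallel \hat{y}) \circ (\pi \otimes P))\left[t \mapsto t\right] & = ((y \parallel \hat{y}) \circ (\pi \otimes P))\left[(u, z) \mapsto \ell '(u, z)\right] \\ & = (\pi \otimes P)\left[(\theta , x) \mapsto \hat{y}(x)\left[z \mapsto \ell '(y(\theta ), z)\right]\right] \\ & = (\pi \otimes (\hat{y} \circ P))\left[(\theta , z) \mapsto \ell '(y(\theta ), z)\right] = R^P_\pi (\hat{y}) \: . \end{align*}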

Definition 3 Bayes risk

The Bayes risk of \((P, y, \ell ')\) for prior \(\pi \in \mathcal M(\Theta )\) is \(\mathcal R^P_\pi = \inf _{\hat{y} : \mathcal X \rightsquigarrow \mathcal Z} R^P_\pi (\hat{y})\) , where the infimum is over Markov kernels.

The Bayes risk of \((P, y, \ell ')\) is \(\mathcal R^*_B = \sup _{\pi \in \mathcal P(\Theta )} \mathcal R^P_\pi \: .\)

Example (SBHT): \(\mathcal R^*_B = \sup _{\gamma \in [0,1]}\inf _{\hat{y}}\left(\gamma \hat{\mu }_0(\{ 1\} ) + (1 - \gamma ) \hat{\mu }_1(\{ 0\} )\right)\).

Definition 4 Bayes estimator

An estimator \(\hat{y}\) is said to be a Bayes estimator for a prior \(\pi \in \mathcal P(\Theta )\) if \(R^P_\pi (\hat{y}) = \mathcal R^P_\pi \).

Definition 5 Minimax risk

The minimax risk of \((P, y, \ell ')\) is \(\mathcal R^* = \inf _{\hat{y} : \mathcal X \rightsquigarrow \mathcal Z} \sup _{\theta \in \Theta } r^P_\theta (\hat{y})\) .

Example (SBHT): \(\mathcal R^* = \inf _{\hat{y}} \max \{ \hat{\mu }_0(\{ 1\} ), \hat{\mu }_1(\{ 0\} )\} \).

Lemma 1.1.1

\(\mathcal R_B^* \le \mathcal R^*\).

Proof

For any \(\pi \in \mathcal P(\Theta )\) and any estimator \(\hat{y}\), since \(\pi \) is a probability measure, \(R^P_\pi (\hat{y}) = \pi \left[\theta \mapsto r^P_\theta (\hat{y})\right] \le \sup _{\theta \in \Theta } r^P_\theta (\hat{y})\). Taking the infimum over \(\hat{y}\) and then the supremum over \(\pi \) gives \(\mathcal R_B^* \le \mathcal R^*\).

TODO: “often”, \(\mathcal R^*_B = \mathcal R^*\).

1.1.1 Properties of the Bayes risk of a prior

Lemma 1.1.2

The Bayes risk of a prior \(\pi \in \mathcal M(\Theta )\) on \((P, y, \ell ')\) with \(P\) a Markov kernel satisfies

\begin{align*} \mathcal R^P_\pi \le \inf _{z \in \mathcal Z} \pi \left[ \theta \mapsto \ell '(y(\theta ), z) \right] \: . \end{align*}
Proof

The infimum over all Markov kernels in the definition of the Bayes risk of \(\pi \) is bounded from above by the infimum restricted to constant, deterministic Markov kernels. For the constant kernel equal to \(z \in \mathcal Z\) and \(P\) a Markov kernel, the Bayesian risk is \(\pi \left[\theta \mapsto \ell '(y(\theta ), z)\right]\).
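
For instance, in simple binary hypothesis testing with a probability prior \(\pi \) of weights \((\gamma , 1 - \gamma )\), this bound is the risk of always guessing the a priori more likely hypothesis:

\begin{align*} \mathcal R^P_\pi \le \inf _{z \in \{ 0,1\} } \pi \left[\theta \mapsto \mathbb {I}\{ \theta \ne z\} \right] = \min \{ \gamma , 1 - \gamma \} \: . \end{align*}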

Lemma 1.1.3

The Bayes risk of a prior \(\pi \in \mathcal M(\Theta )\) on \((P, y, \ell ')\) with \(P\) a constant Markov kernel is

\begin{align*} \mathcal R^P_\pi = \inf _{z \in \mathcal Z} \pi \left[\theta \mapsto \ell '(y(\theta ), z)\right] \: . \end{align*}

In particular, it does not depend on \(P\).

Proof

Let \(\xi \) be the measure such that \(P(\theta ) = \xi \) for all \(\theta \).

\begin{align*} \mathcal R^P_\pi & = \inf _{\hat{y}}(\pi \times (\hat{y} \circ \xi ))\left[(\theta , z)\mapsto \ell '(y(\theta ), z)\right] \\ & = \inf _{\mu \in \mathcal P(\mathcal Z)} (\pi \times \mu )\left[(\theta , z)\mapsto \ell '(y(\theta ), z)\right] \\ & = \inf _{\mu \in \mathcal P(\mathcal Z)} \mu \left[ z \mapsto \pi \left[\theta \mapsto \ell '(y(\theta ), z)\right]\right] \\ & = \inf _{z \in \mathcal Z} \pi \left[\theta \mapsto \ell '(y(\theta ), z)\right] \: . \end{align*}

Data-processing inequality and consequences

Theorem 1.1.4 Data-processing inequality

For \(P : \Theta \rightsquigarrow \mathcal X\) and \(\kappa : \mathcal X \rightsquigarrow \mathcal X'\) a Markov kernel, \(\mathcal R^{\kappa \circ P}_\pi \ge \mathcal R^{P}_\pi \) (where the estimation problems differ only in the kernel).

Proof

The risk \(\mathcal R^{\kappa \circ P}_\pi \) is

\begin{align*} \mathcal R^{\kappa \circ P}_\pi & = \inf _{\hat{y} : \mathcal X' \rightsquigarrow \mathcal Z} (\pi \otimes (\hat{y} \circ \kappa \circ P))\left[(\theta , z) \mapsto \ell '(y(\theta ), z)\right] \\ & \ge \inf _{\hat{y} : \mathcal X \rightsquigarrow \mathcal Z} (\pi \otimes (\hat{y} \circ P))\left[(\theta , z) \mapsto \ell '(y(\theta ), z)\right] \\ & = \mathcal R^{P}_\pi \: . \end{align*}

The inequality is due to taking an infimum over a larger set: \(\{ \hat{y} \circ \kappa \mid \hat{y} : \mathcal X' \rightsquigarrow \mathcal Z \text{ Markov}\} \subseteq \{ \hat{y}' : \mathcal X \rightsquigarrow \mathcal Z \mid \hat{y}' \text{ Markov}\} \) .

Lemma 1.1.5

For \(P : \Theta \rightsquigarrow \mathcal X\) and \(\kappa : \Theta \times \mathcal X \rightsquigarrow \mathcal X'\) a Markov kernel, \(\mathcal R^{P \otimes \kappa }_\pi \le \mathcal R^{P}_\pi \) .

Proof

Use Theorem 1.1.4: \(P\) is the composition of \(P \otimes \kappa \) and the deterministic kernel \(\mathcal X \times \mathcal X' \rightsquigarrow \mathcal X\) that forgets the second value.

Lemma 1.1.6

For \(P : \Theta \rightsquigarrow \mathcal X\) and \(\kappa : \Theta \times \mathcal X \rightsquigarrow \mathcal X'\) a Markov kernel, \(\mathcal R^{P \otimes \kappa }_\pi \le \mathcal R^{(P \otimes \kappa )_{\mathcal X'}}_\pi \) , in which \((P \otimes \kappa )_{\mathcal X'} : \Theta \rightsquigarrow \mathcal X'\) is the kernel obtained by marginalizing over \(\mathcal X\) in the output of \(P \otimes \kappa \) .

Proof

Use Theorem 1.1.4: \((P \otimes \kappa )_{\mathcal X'}\) is the composition of \(P \otimes \kappa \) and the deterministic kernel \(\mathcal X \times \mathcal X' \rightsquigarrow \mathcal X'\) that forgets the first value.

Lemma 1.1.7

The Bayes risk \(\mathcal R_\pi ^P\) is concave in \(P : \Theta \rightsquigarrow \mathcal X\) .

Proof

The infimum of a sum is at least the sum of the infima:

\begin{align*} \mathcal R_\pi ^{\lambda P_1 + (1 - \lambda )P_2} & = \inf _{\hat{y} : \mathcal X \rightsquigarrow \mathcal Z} (\pi \otimes (\hat{y} \circ (\lambda P_1 + (1 - \lambda )P_2)))\left[(\theta , z) \mapsto \ell '(y(\theta ), z)\right] \\ & = \inf _{\hat{y} : \mathcal X \rightsquigarrow \mathcal Z} \left( \lambda (\pi \otimes (\hat{y} \circ P_1))\left[(\theta , z) \mapsto \ell '(y(\theta ), z)\right] \right. \\ & \qquad \left. + (1 - \lambda ) (\pi \otimes (\hat{y} \circ P_2))\left[(\theta , z) \mapsto \ell '(y(\theta ), z)\right] \right) \\ & \ge \lambda \inf _{\hat{y} : \mathcal X \rightsquigarrow \mathcal Z} (\pi \otimes (\hat{y} \circ P_1))\left[(\theta , z) \mapsto \ell '(y(\theta ), z)\right] \\ & \qquad + (1 - \lambda ) \inf _{\hat{y} : \mathcal X \rightsquigarrow \mathcal Z} (\pi \otimes (\hat{y} \circ P_2))\left[(\theta , z) \mapsto \ell '(y(\theta ), z)\right] \\ & = \lambda \mathcal R_\pi ^{P_1} + (1 - \lambda )\mathcal R_\pi ^{P_2} \: . \end{align*}

TODO: can we build kernels such that this lemma is a consequence of one of the lemmas above?

Generalized Bayes estimator

Lemma 1.1.8

The Bayesian risk of a Markov kernel \(\hat{y} : \mathcal X \rightsquigarrow \mathcal Z\) with respect to a prior \(\pi \in \mathcal M(\Theta )\) on \((P, y, \ell ')\) satisfies

\begin{align*} R^P_\pi (\hat{y}) = ((P_\pi ^\dagger \times \hat{y}) \circ P \circ \pi )\left[(\theta , z) \mapsto \ell '(y(\theta ), z)\right] \: , \end{align*}

whenever the Bayesian inverse \(P_\pi ^\dagger \) of \(P\) with respect to \(\pi \) exists (Definition 48).

Proof

Use the main property of the Bayesian inverse.

\begin{align*} (P_\pi ^\dagger \times \hat{y}) \circ P \circ \pi & = (\mathrm{id} \parallel \hat{y}) \circ (P_\pi ^\dagger \times \mathrm{id}) \circ (P \circ \pi ) \\ & = (\mathrm{id} \parallel \hat{y}) \circ (\mathrm{id} \times P) \circ \pi \\ & = (\mathrm{id} \times (\hat{y} \circ P)) \circ \pi \\ & = \pi \otimes (\hat{y} \circ P) \: . \end{align*}

Lemma 1.1.9

The Bayesian risk of a Markov kernel \(\hat{y} : \mathcal X \rightsquigarrow \mathcal Z\) with respect to a prior \(\pi \in \mathcal M(\Theta )\) on \((P, y, \ell ')\) satisfies

\begin{align*} R^P_\pi (\hat{y}) \ge (P \circ \pi )\left[x \mapsto \inf _{z \in \mathcal Z} P_\pi ^\dagger (x) \left[\theta \mapsto \ell '(y(\theta ), z)\right]\right] \: . \end{align*}
Proof

Starting from the equality of Lemma 1.1.8, we get

\begin{align*} R^P_\pi (\hat{y}) & = ((P_\pi ^\dagger \times \hat{y}) \circ P \circ \pi )\left[(\theta , z) \mapsto \ell '(y(\theta ), z)\right] \\ & = (P \circ \pi )\left[x \mapsto (P_\pi ^\dagger (x) \times \hat{y}(x)) \left[(\theta , z) \mapsto \ell '(y(\theta ), z)\right]\right] \\ & = (P \circ \pi )\left[x \mapsto \hat{y}(x)\left[z \mapsto P_\pi ^\dagger (x) \left[\theta \mapsto \ell '(y(\theta ), z)\right]\right]\right] \\ & \ge (P \circ \pi )\left[x \mapsto \inf _{z \in \mathcal Z} P_\pi ^\dagger (x) \left[\theta \mapsto \ell '(y(\theta ), z)\right]\right] \: . \end{align*}
Definition 6 Generalized Bayes estimator

The generalized Bayes estimator for prior \(\pi \in \mathcal P(\Theta )\) on \((P, y, \ell ')\) is the deterministic estimator \(\mathcal X \to \mathcal Z\) given by

\begin{align*} x \mapsto \arg \min _z P_\pi ^\dagger (x)\left[\theta \mapsto \ell '(y(\theta ), z)\right] \: , \end{align*}

if there exists such a measurable argmin.
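
As an illustration (with the squared loss, which is not used elsewhere in this section), if \(\mathcal Y = \mathcal Z = \mathbb {R}\) and \(\ell '(u, z) = (u - z)^2\), the minimizer of the posterior expected loss is the posterior mean, so the generalized Bayes estimator is

\begin{align*} x \mapsto P_\pi ^\dagger (x)\left[\theta \mapsto y(\theta )\right] \: , \end{align*}

provided \(P_\pi ^\dagger (x)\) is a probability measure and \(y\) is square-integrable with respect to it.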

Lemma 1.1.10

The Bayesian risk of the generalized Bayes estimator \(\hat{y}_B\) is

\begin{align*} R^P_\pi (\hat{y}_B) = (P \circ \pi )\left[x \mapsto \inf _{z \in \mathcal Z} P_\pi ^\dagger (x) \left[\theta \mapsto \ell '(y(\theta ), z)\right]\right] \: . \end{align*}
Proof

Start from the equality of Lemma 1.1.8 and use the definition of the generalized Bayes estimator.

Theorem 1.1.11

When the generalized Bayes estimator is well defined, it is a Bayes estimator. The value of the Bayes risk with respect to the prior \(\pi \in \mathcal M(\Theta )\) is then

\begin{align*} \mathcal R^P_\pi = (P \circ \pi )\left[x \mapsto \inf _{z \in \mathcal Z} P_\pi ^\dagger (x) \left[\theta \mapsto \ell '(y(\theta ), z)\right]\right] \: . \end{align*}
Proof

By Lemma 1.1.9, the Bayesian risk of any estimator is at least the quantity computed in Lemma 1.1.10, which is the Bayesian risk of the generalized Bayes estimator. Hence the generalized Bayes estimator attains the infimum defining the Bayes risk, and the Bayes risk is given by the expression of Lemma 1.1.10.

Lemma 1.1.12

When the generalized Bayes estimator is well defined, the Bayes risk with respect to the prior \(\pi \in \mathcal M(\Theta )\) is

\begin{align*} \mathcal R^P_\pi = (P \circ \pi )\left[x \mapsto \inf _{z \in \mathcal Z} \pi \left[\theta \mapsto \frac{d P_\theta }{d (P \circ \pi )}(x) \ell '(y(\theta ), z)\right]\right] \: . \end{align*}
Proof

Lemma 1.1.13

When the generalized Bayes estimator is well defined, the Bayes risk with respect to the prior \(\pi \in \mathcal M(\Theta )\) for \(\mathcal Y = \mathcal Z = \Theta \), \(y = \mathrm{id}\) and \(\ell ' = \mathbb {I}\{ \theta \ne z\} \) is

\begin{align*} \mathcal R^P_\pi = (P \circ \pi )[1] - (P \circ \pi )\left[x \mapsto \sup _{z \in \mathcal Z} P_\pi ^\dagger (x) \{ z\} \right] \: . \end{align*}
Proof
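
A short sketch of the computation, assuming the Bayesian inverse is a Markov kernel: for the loss \(\mathbb {I}\{ \theta \ne z\} \), the inner integral of Theorem 1.1.11 is

\begin{align*} P_\pi ^\dagger (x)\left[\theta \mapsto \mathbb {I}\{ \theta \ne z\} \right] = P_\pi ^\dagger (x)(\Theta ) - P_\pi ^\dagger (x)\{ z\} \: , \end{align*}

and taking the infimum over \(z \in \mathcal Z\) turns the last term into a supremum; integrating against \(P \circ \pi \) and using \(P_\pi ^\dagger (x)(\Theta ) = 1\) \((P \circ \pi )\)-almost everywhere gives the stated expression.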

Lemma 1.1.14

When the generalized Bayes estimator is well defined, the Bayes risk with respect to the prior \(\pi \in \mathcal M(\Theta )\) for \(\mathcal Y = \mathcal Z = \Theta \), \(y = \mathrm{id}\) and \(\ell ' = \mathbb {I}\{ \theta \ne z\} \) is

\begin{align*} \mathcal R^P_\pi = (P \circ \pi )[1] - (P \circ \pi )\left[x \mapsto \sup _{z \in \mathcal Z} \frac{d P_z}{d (P \circ \pi )}(x) \pi \{ z\} \right] \: . \end{align*}

When \(\pi \) is a probability measure and \(P\) is a Markov kernel, \((P \circ \pi )[1] = 1\).

Proof
Lemma 1.1.15

The generalized Bayes estimator for prior \(\pi \in \mathcal P(\Theta )\) on the estimation problem defined by \(\mathcal Y = \mathcal Z = \Theta \), \(y = \mathrm{id}\) and \(\ell ' = \mathbb {I}\{ \theta \ne z\} \) is

\begin{align*} x \mapsto \arg \max _z \left( \pi \{ z\} \frac{d P_z}{d (P \circ \pi )}(x) \right) \: . \end{align*}
Proof

Lemma 1.1.16

Suppose that \(\Theta \) is finite and let \(\xi \in \mathcal P(\Theta )\). The Bayes risk with respect to the prior \(\pi \in \mathcal P(\Theta )\) for \(\mathcal Y = \mathcal Z = \Theta \), \(y = \mathrm{id}\), \(P\) a Markov kernel and \(\ell ' = \mathbb {I}\{ \theta \ne z\} \) satisfies

\begin{align*} \mathcal R^P_\pi \le 1 - \left(\prod _{\theta \in \Theta } \pi _\theta ^{\xi _\theta }\right) (P \circ \pi )\left[x \mapsto \prod _{\theta \in \Theta } \left(\frac{d P_\theta }{d (P \circ \pi )}(x)\right)^{\xi _\theta } \right] \: . \end{align*}
Proof
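
A possible route to this bound (a sketch, using Lemma 1.1.14 and \((P \circ \pi )[1] = 1\), and writing \(\pi _\theta \) for \(\pi \{ \theta \} \)): since \(\xi \) is a probability measure on the finite set \(\Theta \), a supremum dominates the corresponding weighted geometric mean,

\begin{align*} \sup _{z \in \Theta } \pi _z \frac{d P_z}{d (P \circ \pi )}(x) \ge \prod _{\theta \in \Theta } \left(\pi _\theta \frac{d P_\theta }{d (P \circ \pi )}(x)\right)^{\xi _\theta } = \left(\prod _{\theta \in \Theta } \pi _\theta ^{\xi _\theta }\right) \prod _{\theta \in \Theta } \left(\frac{d P_\theta }{d (P \circ \pi )}(x)\right)^{\xi _\theta } \: , \end{align*}

and integrating this inequality against \(P \circ \pi \) in Lemma 1.1.14 gives the claim.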

1.1.2 Bayes risk increase

TODO: remove this section? It’s not used for anything now.

The following is a new definition, which is a natural generalization of the DeGroot statistical information.

Definition 7

The Bayes risk increase \(I^P_{\pi }(\kappa )\) of a kernel \(\kappa : \mathcal X \rightsquigarrow \mathcal X'\) with respect to the estimation problem \((P, y, \ell ')\) and the prior \(\pi \in \mathcal M(\Theta )\) is the difference of the Bayes risk of \((\kappa \circ P, y, \ell ')\) and that of \((P, y, \ell ')\). That is,

\begin{align*} I^P_{\pi }(\kappa ) & = \mathcal R^{\kappa \circ P}_\pi - \mathcal R^P_\pi \\ & = \inf _{\hat{y} : \mathcal X' \rightsquigarrow \mathcal Z} (\pi \otimes (\hat{y} \circ \kappa \circ P))\left[(\theta , z) \mapsto \ell '(y(\theta ), z)\right] - \inf _{\hat{y} : \mathcal X \rightsquigarrow \mathcal Z} (\pi \otimes (\hat{y} \circ P))\left[(\theta , z) \mapsto \ell '(y(\theta ), z)\right] \: . \end{align*}

In particular, if we take \(\kappa : \mathcal X \rightsquigarrow *\) equal to the Markov kernel to the point space (denoted by \(d_{\mathcal X}\), for “discard” kernel), we get \(I^P_{\pi }(d_{\mathcal X}) = \mathcal R^{d_{\mathcal X} \circ P}_\pi - \mathcal R^P_\pi \), where \(d_{\mathcal X} \circ P = d_\Theta \) if \(P\) is a Markov kernel. We recover the difference between the risk of the best constant prediction and the Bayes risk, as in the DeGroot statistical information.
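
Concretely, since \(d_\Theta \) is a constant Markov kernel, Lemma 1.1.3 gives the risk of the best constant prediction and

\begin{align*} I^P_{\pi }(d_{\mathcal X}) = \mathcal R^{d_\Theta }_\pi - \mathcal R^P_\pi = \inf _{z \in \mathcal Z} \pi \left[\theta \mapsto \ell '(y(\theta ), z)\right] - \mathcal R^P_\pi \: . \end{align*}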

The risk increase quantifies how much risk we accrue by forgetting information due to the composition with \(\kappa \).

Lemma 1.1.17

For \(\kappa \) a Markov kernel, \(I^P_\pi (\kappa ) \ge 0\) .

Proof

Use Theorem 1.1.4.

Lemma 1.1.18

For \(\kappa : \mathcal X \rightsquigarrow \mathcal X'\) and \(\eta : \mathcal X' \rightsquigarrow \mathcal X''\) two Markov kernels, \(I^P_\pi (\eta \circ \kappa ) = I^P_\pi (\kappa ) + I^{\kappa \circ P}_\pi (\eta )\) .

Proof
\begin{align*} I^P_\pi (\kappa ) + I^{\kappa \circ P}_\pi (\eta ) & = \mathcal R^{\kappa \circ P}_\pi - \mathcal R^{P}_\pi + \mathcal R^{\eta \circ \kappa \circ P}_\pi - \mathcal R^{\kappa \circ P}_\pi \\ & = \mathcal R^{\eta \circ \kappa \circ P}_\pi - \mathcal R^{P}_\pi \\ & = I^P_\pi (\eta \circ \kappa ) \: . \end{align*}
Lemma 1.1.19 Data-processing inequality

For any measurable space \(\mathcal X\), let \(d_{\mathcal X} : \mathcal X \rightsquigarrow *\) be the Markov kernel to the point space. For all Markov kernels \(\kappa : \mathcal X \rightsquigarrow \mathcal X'\),

\begin{align*} I_\pi ^P(d_{\mathcal X}) \ge I_\pi ^{\kappa \circ P}(d_{\mathcal X'}) \: . \end{align*}
Proof

By Lemma 1.1.18, then Lemma 1.1.17,

\begin{align*} I_\pi ^P(d_{\mathcal X'} \circ \kappa ) = I^P_\pi (\kappa ) + I^{\kappa \circ P}_\pi (d_{\mathcal X'}) \ge I_\pi ^{\kappa \circ P}(d_{\mathcal X'}) \: . \end{align*}

Finally, \(d_{\mathcal X'} \circ \kappa = d_{\mathcal X}\) since \(\kappa \) is a Markov kernel.

1.2 Simple binary hypothesis testing

In this section, for a measure \(\xi \) on \(\{ 0,1\} \), we write \(\xi _0\) for \(\xi (\{ 0\} )\) and \(\xi _1\) for \(\xi (\{ 1\} )\) . We sometimes write \((a,b)\) for the measure \(\xi \) on \(\{ 0,1\} \) with \(\xi _0 = a\) and \(\xi _1 = b\).

Lemma 1.2.1

The Bayesian inverse of a kernel \(P : \{ 0,1\} \rightsquigarrow \mathcal X\) with respect to a prior \(\xi \in \mathcal M(\{ 0,1\} )\) is \(P_\xi ^\dagger (x) = \left(\xi _0\frac{d P(0)}{d(P \circ \xi )}(x), \xi _1\frac{d P(1)}{d(P \circ \xi )}(x)\right)\) (almost surely w.r.t. \(P \circ \xi = \xi _0 P(0) + \xi _1 P(1)\)).

Proof
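
As a consistency check (assuming \(\xi _0, \xi _1 > 0\), so that \(P(0)\) and \(P(1)\) are absolutely continuous with respect to \(P \circ \xi \)), the candidate inverse is a Markov kernel \((P \circ \xi )\)-almost everywhere, since the two densities sum to one:

\begin{align*} \xi _0\frac{d P(0)}{d(P \circ \xi )}(x) + \xi _1\frac{d P(1)}{d(P \circ \xi )}(x) = \frac{d (\xi _0 P(0) + \xi _1 P(1))}{d(P \circ \xi )}(x) = 1 \quad (P \circ \xi )\text{-a.e.} \end{align*}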

Lemma 1.2.2

For \(\Theta = \{ 0,1\} \), the Bayes risk of a prior \(\xi \in \mathcal M(\{ 0,1\} )\) is

\begin{align*} \mathcal R^P_\xi = (P \circ \xi )\left[x \mapsto \inf _{z \in \mathcal Z} \left( \xi _0\frac{d P(0)}{d(P \circ \xi )}(x)\ell '(y(0), z) + \xi _1\frac{d P(1)}{d(P \circ \xi )}(x)\ell '(y(1), z) \right) \right] \: . \end{align*}
Proof

Use Theorem 1.1.11, with the value of \(P_\xi ^\dagger \) given by Lemma 1.2.1.

1.2.1 Bayes risk of a prior

Definition 8

The Bayes binary risk between measures \(\mu \) and \(\nu \) with respect to prior \(\xi \in \mathcal M(\{ 0,1\} )\), denoted by \(\mathcal B_\xi (\mu , \nu )\), is the Bayes risk \(\mathcal R^P_\xi \) for \(\Theta = \mathcal Y = \mathcal Z = \{ 0,1\} \), \(\ell (y,z) = \mathbb {I}\{ y \ne z\} \), \(P\) the kernel sending 0 to \(\mu \) and 1 to \(\nu \) and prior \(\xi \). That is,

\begin{align*} \mathcal B_\xi (\mu , \nu ) = \inf _{\hat{y} : \mathcal X \rightsquigarrow \{ 0,1\} }\left(\xi _0 (\hat{y} \circ \mu )(\{ 1\} ) + \xi _1 (\hat{y} \circ \nu )(\{ 0\} )\right) \: , \end{align*}

in which the infimum is over Markov kernels.

If the prior is a probability measure with weights \((\pi , 1 - \pi )\), we write \(B_\pi (\mu , \nu ) = \mathcal B_{(\pi , 1 - \pi )}(\mu , \nu )\) .

Thanks to the following lemma, we can prove properties of the Bayes binary risk for non-zero finite measures by considering only probability measures.

Lemma 1.2.3

For all \(a, b > 0\), \(\mathcal B_\xi (\mu , \nu ) = \mathcal B_{(a \xi _0, b \xi _1)}(a^{-1} \mu , b^{-1} \nu )\) .

Proof

Alternatively, we can reduce the study of the Bayes binary risk for any prior to the study of the prior \((1, 1)\), for which that risk is related to the total variation distance (defined later).

Lemma 1.2.4

\(\mathcal B_\xi (\mu , \nu ) = \mathcal B_{(1,1)}(\xi _0\mu , \xi _1\nu )\) .

Proof
Theorem 1.2.5 Data-processing inequality

For \(\mu , \nu \in \mathcal M(\mathcal X)\) and \(\kappa : \mathcal X \rightsquigarrow \mathcal X'\) a Markov kernel, \(\mathcal B_\xi (\kappa \circ \mu , \kappa \circ \nu ) \ge \mathcal B_\xi (\mu , \nu )\) .

Proof

Apply Theorem 1.1.4.

Lemma 1.2.6

For \(\mu \in \mathcal M(\mathcal X)\), \(\mathcal B_\xi (\mu , \mu ) = \min \{ \xi _0, \xi _1\} \mu (\mathcal X)\) .

Proof
Lemma 1.2.7

For all measures \(\mu , \nu \), \(\mathcal B_\xi (\mu , \nu ) \ge 0\) .

Proof

It is an infimum of non-negative values.

Lemma 1.2.8

For all measures \(\mu , \nu \), \(\mathcal B_\xi (\mu , \nu ) \le \min \{ \xi _0 \mu (\mathcal X), \xi _1 \nu (\mathcal X)\} \) .

Proof

Let \(d_{\mathcal X} : \mathcal X \rightsquigarrow *\) be the discard kernel and let \(\delta _*\) be the unique probability measure on \(*\), that is, a Dirac delta. Then applying the DPI (Theorem 1.2.5), followed by Lemma 1.2.3 and Lemma 1.2.6, we obtain:

\begin{align*} \mathcal B_\xi (\mu , \nu ) & \le \mathcal B_\xi (d_{\mathcal X} \circ \mu , d_{\mathcal X} \circ \nu ) \\ & = \mathcal B_\xi (\mu (\mathcal X) \delta _*, \nu (\mathcal X) \delta _*) \\ & = \mathcal B_{(\mu (\mathcal X) \xi _0, \nu (\mathcal X) \xi _1)}(\delta _*, \delta _*) \\ & = \min \{ \mu (\mathcal X) \xi _0, \nu (\mathcal X) \xi _1\} \delta _*(\mathcal X) \\ & = \min \{ \mu (\mathcal X) \xi _0, \nu (\mathcal X) \xi _1\} \: . \end{align*}
Lemma 1.2.9

For \(\mu , \nu \in \mathcal M(\mathcal X)\) and \(\xi \in \mathcal M(\{ 0,1\} )\), \(\mathcal B_\xi (\mu , \nu ) = \mathcal B_{\xi _{\leftrightarrow }}(\nu , \mu )\) where \(\xi _{\leftrightarrow } \in \mathcal M(\{ 0,1\} )\) is such that \(\xi _{\leftrightarrow }(\{ 0\} ) = \xi _1\) and \(\xi _{\leftrightarrow }(\{ 1\} ) = \xi _0\). For \(\pi \in [0,1]\), \(B_\pi (\mu , \nu ) = B_{1 - \pi }(\nu , \mu )\) .

Proof

Lemma 1.2.10

The generalized Bayes estimator for the Bayes binary risk with prior \(\xi \in \mathcal M(\{ 0,1\} )\) is \(x \mapsto \text{if } \xi _1\frac{d \nu }{d(P \circ \xi )}(x) \le \xi _0\frac{d \mu }{d(P \circ \xi )}(x) \text{ then } 0 \text{ else } 1\), i.e. it is equal to \(\mathbb {I}_E\) for \(E = \{ x \mid \xi _1\frac{d \nu }{d(P \circ \xi )}(x) > \xi _0\frac{d \mu }{d(P \circ \xi )}(x)\} \) .

Proof

The generalized Bayes estimator is defined by

\begin{align*} x \mapsto \arg \min _z P_\xi ^\dagger (x)\left[\theta \mapsto \ell '(y(\theta ), z)\right] \: . \end{align*}

For the simple binary hypothesis testing problem,

\begin{align*} P_\xi ^\dagger (x)\left[\theta \mapsto \ell '(y(\theta ), z)\right] & = P_\xi ^\dagger (x)(\theta \ne z) \: . \end{align*}

For \(z = 0\), this is \(P_\xi ^\dagger (x)(\{ 1\} )\) and for \(z = 1\), this is \(P_\xi ^\dagger (x)(\{ 0\} )\). The generalized Bayes estimator is \(x \mapsto \text{if } P_\xi ^\dagger (x)(\{ 1\} ) \le P_\xi ^\dagger (x)(\{ 0\} ) \text{ then } 0 \text{ else } 1\).

The expression of the lemma statement is obtained by replacing \(P_\xi ^\dagger \) by its value (Lemma 1.2.1).
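
For illustration, consider again two Gaussian hypotheses (an arbitrary choice): \(\mu = \mathcal N(0, 1)\), \(\nu = \mathcal N(1, 1)\) and the uniform prior \(\xi = (1/2, 1/2)\). The densities with respect to \(P \circ \xi \) compare in the same way as the Lebesgue densities, so the event \(E\) is a half-line and the generalized Bayes estimator is a threshold test:

\begin{align*} E = \left\{ x \mid e^{-(x-1)^2/2} > e^{-x^2/2} \right\} = \left\{ x \mid x > 1/2 \right\} \: , \qquad \hat{y}_B(x) = \mathbb {I}\{ x > 1/2\} \: . \end{align*}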

Lemma 1.2.11

\begin{align*} \mathcal B_\xi (\mu , \nu ) = \inf _{E \text{ event}} \left( \xi _0 \mu (E) + \xi _1 \nu (E^c) \right) \: . \end{align*}
Proof

By definition,

\begin{align*} \mathcal B_\xi (\mu , \nu ) = \inf _{\hat{y} : \mathcal X \rightsquigarrow \{ 0,1\} }\left(\xi _0 (\hat{y} \circ \mu )(\{ 1\} ) + \xi _1 (\hat{y} \circ \nu )(\{ 0\} )\right) \: . \end{align*}

If we restrict the infimum to deterministic kernels corresponding to functions of the form \(x \mapsto \mathbb {I}\{ x \in E\} \), we get that \(\mathcal B_\xi (\mu , \nu )\) is bounded from above by the expression for which we want to show equality.

Since the generalized Bayes estimator has that shape (Lemma 1.2.10) and its Bayesian risk equals \(\mathcal B_\xi (\mu , \nu )\) (Theorem 1.1.11), \(\mathcal B_\xi (\mu , \nu )\) is also at least the infimum over events, and we have equality.

Theorem 1.2.12

The Bayes risk of simple binary hypothesis testing for prior \(\xi \in \mathcal M(\{ 0,1\} )\) is

\begin{align*} \mathcal B_\xi (\mu , \nu ) = (P \circ \xi )\left[x \mapsto \min \left\{ \xi _0\frac{d \mu }{d(P \circ \xi )}(x), \xi _1\frac{d \nu }{d(P \circ \xi )}(x)\right\} \right] \: . \end{align*}
Proof

By Theorem 1.1.11, the generalized Bayes estimator is a Bayes estimator, hence it suffices to compute its Bayesian risk. By Lemma 1.1.10,

\begin{align*} R^P_\xi (\hat{y}_B) & = (P \circ \xi )\left[x \mapsto \inf _{z \in \mathcal Z} P_\xi ^\dagger (x) \left[\theta \mapsto \ell '(y(\theta ), z)\right]\right] \\ & = (P \circ \xi )\left[x \mapsto \min \left\{ P_\xi ^\dagger (x)(\{ 0\} ), P_\xi ^\dagger (x)(\{ 1\} )\right\} \right] \\ & = (P \circ \xi )\left[x \mapsto \min \left\{ \xi _0\frac{d \mu }{d(P \circ \xi )}(x), \xi _1\frac{d \nu }{d(P \circ \xi )}(x)\right\} \right] \: . \end{align*}

The last line is obtained by replacing \(P_\xi ^\dagger \) by its value (Lemma 1.2.1).

Corollary 1.2.13
\begin{align*} \mathcal B_\xi (\mu , \nu ) = \frac{1}{2}\left((P \circ \xi )(\mathcal X) - (P \circ \xi )\left[x \mapsto \left\vert \xi _0\frac{d \mu }{d(P \circ \xi )}(x) - \xi _1\frac{d \nu }{d(P \circ \xi )}(x)\right\vert \right] \right) \: . \end{align*}
Proof

Use Theorem 1.2.12 and the equality \(\min \{ a,b\} = \frac{1}{2}(a + b - \vert a - b \vert )\).
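
As a worked example (Bernoulli measures, chosen for illustration), let \(\mu \) and \(\nu \) be the Bernoulli probability measures on \(\{ 0,1\} \) with parameters \(p\) and \(q\), and take the uniform prior \(\xi = (1/2, 1/2)\). Then \((P \circ \xi )(\mathcal X) = 1\) and the integral in Corollary 1.2.13 equals \(\frac{1}{2}\vert p - q\vert + \frac{1}{2}\vert (1-p) - (1-q)\vert = \vert p - q\vert \), so

\begin{align*} B_{1/2}(\mu , \nu ) = \frac{1}{2}\left(1 - \vert p - q\vert \right) \: . \end{align*}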

Lemma 1.2.14

Let \(\pi \in \mathcal P(\{ 0,1\} )\) be a prior and let \(\hat{y}_B\) be the generalized Bayes estimator for simple binary hypothesis testing with that prior. The distribution \(\ell \circ (\mathrm{id} \parallel \hat{y}_B) \circ (\pi \otimes P)\) (in which \(\ell \) stands for the associated deterministic kernel) is a Bernoulli distribution with mean \(\mathcal B_\pi (\mu , \nu )\).

Proof

\(\ell \) takes values in \(\{ 0,1\} \), so the law has to be a Bernoulli distribution. Its mean is the Bayesian risk of \(\hat{y}_B\) (see the remark following Definition 2), which equals \(\mathcal B_\pi (\mu , \nu )\) because \(\hat{y}_B\) is a Bayes estimator (Theorem 1.1.11).

TODO: find a way to hide the following dummy lemma

Dummy node to summarize properties of the Bayes binary risk.

Proof

1.3 Reduction to testing

TODO: lower bounds on estimation by reduction to testing problems.

1.4 Sample complexity

TODO: we could generalize this to a sequence of kernels instead of iid observations.

An estimation problem of the shape \((P^{\otimes n}, y, \ell ')\) corresponds to estimating from \(n \in \mathbb {N}\) i.i.d. samples of the distribution. From the data-processing inequality, we get that the risk (Bayes risk, minimax risk...) is non-increasing in the number of samples (see Lemma 1.4.1 for the Bayes risk). Intuitively, having more information allows for better estimation.

In a sequence of estimation tasks \((P^{\otimes n}, y, \ell ')\) for \(n \in \mathbb {N}\), the sample complexity is the minimal \(n\) such that the risk is at most a desired value.

Lemma 1.4.1

If \(n \le m\) then \(\mathcal R_\pi ^{P^{\otimes n}} \ge \mathcal R_\pi ^{P^{\otimes m}}\).

Proof

Use the data-processing inequality, Theorem 1.1.4, for the deterministic kernel that projects on the first \(n\) coordinates.

Definition 9

The sample complexity of Bayesian estimation with respect to a prior \(\pi \in \mathcal M(\Theta )\) at risk level \(\delta \in \mathbb {R}_{+,\infty }\) is

\begin{align*} n_\pi ^P(\delta ) = \min \{ n \in \mathbb {N} \mid \mathcal R_\pi ^{P^{\otimes n}} \le \delta \} \: . \end{align*}
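
For instance, if the Bayes risk happens to decay geometrically, say \(\mathcal R_\pi ^{P^{\otimes n}} \le C \rho ^n\) for some \(C > 0\) and \(\rho \in (0,1)\) (an assumption made here for illustration, not a consequence of the definitions above), then for any \(\delta \le C\),

\begin{align*} n_\pi ^P(\delta ) \le \left\lceil \frac{\log (C / \delta )}{\log (1 / \rho )} \right\rceil \: . \end{align*}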