1 Testing, Estimation, Risk
In the estimation problem, we are given a family of measures and we want to estimate an unknown parameter for the measure from which we observe samples. Let \(\mathcal X\) be the measurable space on which the measures are defined. The measure family is described by a kernel \(P : \Theta \rightsquigarrow \mathcal X\). We write \(P_\theta \) for \(P(\theta )\). In parametric problems, \(\Theta \) is a low-dimensional space. In nonparametric problems \(\Theta \) is large, for example \(\Theta = \mathcal M (\mathcal X)\) or \(\Theta = \mathcal P(\mathcal X)\).
The goal of the estimation task is to find, given the observation of samples of \(P_\theta \) for an unknown \(\theta \), the value \(y(\theta )\) of a function \(y : \Theta \to \mathcal Y\), which maps the parameter \(\theta \) to the quantity we are interested in. For example, \(y(\theta )\) might be the expected value of the distribution \(P_\theta \).
An estimator is a Markov kernel \(\hat{y} : \mathcal X \rightsquigarrow \mathcal Z\), where most often \(\mathcal Z = \mathcal Y\). Note that, while \(y\) is a function of a parameter in \(\Theta \), \(\hat{y}\) has domain \(\mathcal X\) as the estimation is done based on a sample of the measure, while the measure itself and its parameter remain unknown. In general this is a randomized estimator, but often deterministic estimators (corresponding to deterministic kernels) are enough to attain optimal performance. The quality of the estimation is quantified by a loss \(\ell ' : \mathcal Y \times \mathcal Z \to \mathbb {R}_{+, \infty }\). We write \(\ell \) for a loss on \(\mathcal Y \times \mathcal Y\) and \(\ell '\) in the heterogeneous case.
In short, an estimation problem is specified by the data \((P, y, \ell ')\).
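For instance (an illustration, not part of the formal development), Gaussian mean estimation from \(n\) observations corresponds to \(\Theta = \mathbb R\), \(\mathcal X = \mathbb R^n\), \(P_\theta = \mathcal N(\theta, 1)^{\otimes n}\), \(\mathcal Y = \mathcal Z = \mathbb R\), \(y(\theta ) = \theta \) and the squared loss
\[ \ell (y_0, z) = (y_0 - z)^2 \: , \]
for which the empirical mean \(\hat{y}(x) = \frac{1}{n}\sum_{i=1}^n x_i\), seen as a deterministic kernel, is an estimator.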
Example: simple binary hypothesis testing (SBHT)
We say that an estimation problem is a testing problem if \(\mathcal Y = \mathcal Z\) is discrete and the loss takes values in \(\{ 0, 1\} \). A test is said to be binary if \(\mathcal Y\) has two elements. A test is simple if \(\{ \theta \mid y(\theta ) = y_0\} \) is a singleton for all \(y_0 \in \mathcal Y\), i.e. \(y\) is bijective.
In summary, in simple binary hypothesis testing, \(\Theta = \mathcal Y = \mathcal Z = \{ 0,1\} \), \(y(0) = 0\), \(y(1) = 1\) (that is, \(y\) is the identity). For \(y_0, z \in \{ 0,1\} \), \(\ell (y_0, z) = \mathbb {I}\{ y_0 \ne z\} \).
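Concretely, writing \(\mu = P_0\) and \(\nu = P_1\), a (possibly randomized) test is a Markov kernel \(\hat{y} : \mathcal X \rightsquigarrow \{ 0,1\} \), and the two error probabilities that the next subsection formalizes as risks are
\[ \mu \left[x \mapsto \hat{y}(x)(\{ 1\} )\right] \quad \text{(type I error)} \qquad \text{and} \qquad \nu \left[x \mapsto \hat{y}(x)(\{ 0\} )\right] \quad \text{(type II error)} \: . \]
(This is a reformulation of the setup above, not an additional assumption.)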
1.1 Risk
Let \(\hat{\mu }_\theta \) be the law of the output of the estimator when the parameter is \(\theta \). That is, \(\hat{\mu } : \Theta \rightsquigarrow \mathcal Z\) is the kernel \(\hat{y} \circ P\) and \(\hat{\mu }_\theta = (\hat{y} \circ P)(\theta )\).
The risk of an estimator \(\hat{y}\) on the estimation problem \((P, y, \ell ')\) at \(\theta \in \Theta \) is \(r^P_\theta (\hat{y}) = (\hat{y} \circ P)(\theta )\left[z \mapsto \ell '(y(\theta ), z)\right]\) .
Example (SBHT): \(r^P_\theta (\hat{y}) = \hat{\mu }_\theta (\{ z \mid z \ne \theta \} ) = \hat{\mu }_\theta (\{ 1 - \theta \} )\), the probability that the test answers incorrectly when the parameter is \(\theta \).
The Bayesian risk of an estimator \(\hat{y}\) on \((P, y, \ell ')\) for a prior \(\pi \in \mathcal M(\Theta )\) is \(R^P_\pi (\hat{y}) = \pi \left[\theta \mapsto r^P_\theta (\hat{y})\right]\) . It can also be expanded as \(R^P_\pi (\hat{y}) = (\pi \otimes (\hat{y} \circ P))\left[ (\theta , z) \mapsto \ell '(y(\theta ), z) \right]\) .
If we write \(\ell '\) and \(y\) for their deterministic kernels, the Bayesian risk of \(\hat{y}\) is the mean of the measure \(\ell ' \circ (y \parallel \hat{y}) \circ (\pi \otimes P)\) .
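Unfolding the kernel notation (a sketch; we assume the measures and kernels are such that the iterated integral below is well defined), the Bayesian risk is
\[ R^P_\pi (\hat{y}) = \int_\Theta \pi (\mathrm{d}\theta ) \int_{\mathcal X} P_\theta (\mathrm{d}x) \int_{\mathcal Z} \hat{y}(x)(\mathrm{d}z)\, \ell '(y(\theta ), z) \: . \]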
The Bayes risk of \((P, y, \ell ')\) for prior \(\pi \in \mathcal M(\Theta )\) is \(\mathcal R^P_\pi = \inf _{\hat{y} : \mathcal X \rightsquigarrow \mathcal Z} R^P_\pi (\hat{y})\) , where the infimum is over Markov kernels.
The Bayes risk of \((P, y, \ell ')\) is \(\mathcal R^*_B = \sup _{\pi \in \mathcal P(\Theta )} \mathcal R^P_\pi \: .\)
Example (SBHT): \(\mathcal R^*_B = \sup _{\gamma \in [0,1]}\inf _{\hat{y}}\left(\gamma \, \hat{\mu }_0(\{ 1\} ) + (1 - \gamma )\, \hat{\mu }_1(\{ 0\} )\right)\).
An estimator \(\hat{y}\) is said to be a Bayes estimator for a prior \(\pi \in \mathcal P(\Theta )\) if \(R^P_\pi (\hat{y}) = \mathcal R^P_\pi \).
The minimax risk of \((P, y, \ell ')\) is \(\mathcal R^* = \inf _{\hat{y} : \mathcal X \rightsquigarrow \mathcal Z} \sup _{\theta \in \Theta } r^P_\theta (\hat{y})\) .
Example (SBHT): \(\mathcal R^* = \inf _{\hat{y}} \max \{ \hat{\mu }_0(\{ 1\} ), \hat{\mu }_1(\{ 0\} )\} \).
\(\mathcal R_B^* \le \mathcal R^*\).
For any \(\pi \in \mathcal P(\Theta )\) and any estimator \(\hat{y}\), \(R^P_\pi (\hat{y}) = \pi \left[\theta \mapsto r^P_\theta (\hat{y})\right] \le \sup _{\theta \in \Theta } r^P_\theta (\hat{y})\). Taking the infimum over \(\hat{y}\) gives \(\mathcal R^P_\pi \le \mathcal R^*\), and taking the supremum over \(\pi \) gives the result.
TODO: make precise the conditions under which \(\mathcal R^*_B = \mathcal R^*\) (the “often” case).
1.1.1 Properties of the Bayes risk of a prior
The Bayes risk of a prior \(\pi \in \mathcal M(\Theta )\) on \((P, y, \ell ')\) with \(P\) a Markov kernel satisfies
\[ \mathcal R^P_\pi \le \inf _{z \in \mathcal Z} \pi \left[\theta \mapsto \ell '(y(\theta ), z)\right] \: . \]
The infimum over all Markov kernels in the definition of the Bayes risk of \(\pi \) is bounded from above by the infimum restricted to constant, deterministic Markov kernels.
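In more detail (a short check, not spelled out in the text): for the constant deterministic estimator \(\hat{y}_z : x \mapsto \delta _z\) with \(z \in \mathcal Z\), since \(P\) is a Markov kernel,
\[ R^P_\pi (\hat{y}_z) = \pi \left[\theta \mapsto P_\theta \left[x \mapsto \ell '(y(\theta ), z)\right]\right] = \pi \left[\theta \mapsto \ell '(y(\theta ), z)\right] \: , \]
and taking the infimum over \(z\) gives the bound.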
The Bayes risk of a prior \(\pi \in \mathcal M(\Theta )\) on \((P, y, \ell ')\) with \(P\) a constant Markov kernel is
\[ \mathcal R^P_\pi = \inf _{z \in \mathcal Z} \pi \left[\theta \mapsto \ell '(y(\theta ), z)\right] \: . \]
In particular, it does not depend on \(P\).
Let \(\xi \) be the measure such that \(P(\theta ) = \xi \) for all \(\theta \).
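A sketch of the remaining argument: for any Markov kernel \(\hat{y}\), since \(P_\theta = \xi \) for all \(\theta \),
\[ R^P_\pi (\hat{y}) = \pi \left[\theta \mapsto (\hat{y} \circ \xi )\left[z \mapsto \ell '(y(\theta ), z)\right]\right] \: , \]
and \(\hat{y} \circ \xi \) is a probability measure on \(\mathcal Z\) that does not depend on \(\theta \). The infimum over \(\hat{y}\) therefore reduces to an infimum over \(\zeta \in \mathcal P(\mathcal Z)\) of \(\zeta \left[z \mapsto \pi \left[\theta \mapsto \ell '(y(\theta ), z)\right]\right]\), which equals the infimum over points \(z \in \mathcal Z\) (take \(\zeta = \delta _z\) for the upper bound).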
Data-processing inequality and consequences
For \(P : \Theta \rightsquigarrow \mathcal X\) and \(\kappa : \mathcal X \rightsquigarrow \mathcal X'\) a Markov kernel, \(\mathcal R^{\kappa \circ P}_\pi \ge \mathcal R^{P}_\pi \) (where the estimation problems differ only in the kernel).
The risk \(\mathcal R^{\kappa \circ P}_\pi \) is
\[ \mathcal R^{\kappa \circ P}_\pi = \inf _{\hat{y} : \mathcal X' \rightsquigarrow \mathcal Z} (\pi \otimes (\hat{y} \circ \kappa \circ P))\left[(\theta , z) \mapsto \ell '(y(\theta ), z)\right] = \inf _{\hat{y} : \mathcal X' \rightsquigarrow \mathcal Z} R^P_\pi (\hat{y} \circ \kappa ) \ge \inf _{\hat{y}' : \mathcal X \rightsquigarrow \mathcal Z} R^P_\pi (\hat{y}') = \mathcal R^P_\pi \: . \]
The inequality is due to taking an infimum over a larger set: \(\{ \hat{y} \circ \kappa \mid \hat{y} : \mathcal X' \rightsquigarrow \mathcal Z \text{ Markov}\} \subseteq \{ \hat{y}' : \mathcal X \rightsquigarrow \mathcal Z \mid \hat{y}' \text{ Markov}\} \) .
For \(P : \Theta \rightsquigarrow \mathcal X\) and \(\kappa : \Theta \times \mathcal X \rightsquigarrow \mathcal X'\) a Markov kernel, \(\mathcal R^{P \otimes \kappa }_\pi \le \mathcal R^{P}_\pi \) .
Use Theorem 1.1.4: \(P\) is the composition of \(P \otimes \kappa \) and the deterministic kernel \(\mathcal X \times \mathcal X' \rightsquigarrow \mathcal X\) that forgets the second value.
For \(P : \Theta \rightsquigarrow \mathcal X\) and \(\kappa : \Theta \times \mathcal X \rightsquigarrow \mathcal X'\) a Markov kernel, \(\mathcal R^{P \otimes \kappa }_\pi \le \mathcal R^{(P \otimes \kappa )_{\mathcal X'}}_\pi \) , in which \((P \otimes \kappa )_{\mathcal X'} : \mathcal\Theta \rightsquigarrow \mathcal X'\) is the kernel obtained by marginalizing over \(\mathcal X\) in the output of \(P \otimes \kappa \) .
Use Theorem 1.1.4: \((P \otimes \kappa )_{\mathcal X'}\) is the composition of \(P \otimes \kappa \) and the deterministic kernel \(\mathcal X \times \mathcal X' \rightsquigarrow \mathcal X'\) that forgets the first value.
The Bayes risk \(\mathcal R_\pi ^P\) is concave in \(P : \Theta \rightsquigarrow \mathcal X\) .
The Bayesian risk \(R^P_\pi (\hat{y})\) is linear in \(P\), and the infimum of a sum is larger than the sum of the infimums: for \(\lambda \in [0,1]\) and kernels \(P_1, P_2 : \Theta \rightsquigarrow \mathcal X\),
\[ \mathcal R^{\lambda P_1 + (1 - \lambda ) P_2}_\pi = \inf _{\hat{y}}\left(\lambda R^{P_1}_\pi (\hat{y}) + (1 - \lambda ) R^{P_2}_\pi (\hat{y})\right) \ge \lambda \, \mathcal R^{P_1}_\pi + (1 - \lambda )\, \mathcal R^{P_2}_\pi \: . \]
TODO: can we build kernels such that this lemma is a consequence of one of the lemmas above?
Generalized Bayes estimator
The Bayesian risk of a Markov kernel \(\hat{y} : \mathcal X \rightsquigarrow \mathcal Z\) with respect to a prior \(\pi \in \mathcal M(\Theta )\) on \((P, y, \ell ')\) satisfies
\[ R^P_\pi (\hat{y}) = (P \circ \pi )\left[x \mapsto P_\pi ^\dagger (x)\left[\theta \mapsto \hat{y}(x)\left[z \mapsto \ell '(y(\theta ), z)\right]\right]\right] \]
whenever the Bayesian inverse \(P_\pi ^\dagger \) of \(P\) with respect to \(\pi \) exists (Definition 48).
Use the main property of the Bayesian inverse.
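In more detail (a sketch, assuming that the defining property in Definition 48 is that \(\pi \otimes P\) coincides with \((P \circ \pi ) \otimes P_\pi ^\dagger \) up to swapping the two coordinates):
\[ R^P_\pi (\hat{y}) = (\pi \otimes P)\left[(\theta , x) \mapsto \hat{y}(x)\left[z \mapsto \ell '(y(\theta ), z)\right]\right] = ((P \circ \pi ) \otimes P_\pi ^\dagger )\left[(x, \theta ) \mapsto \hat{y}(x)\left[z \mapsto \ell '(y(\theta ), z)\right]\right] \: , \]
which is the claimed expression once the outer composition-product is unfolded.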
The Bayesian risk of a Markov kernel \(\hat{y} : \mathcal X \rightsquigarrow \mathcal Z\) with respect to a prior \(\pi \in \mathcal M(\Theta )\) on \((P, y, \ell ')\) satisfies, whenever the Bayesian inverse \(P_\pi ^\dagger \) exists,
\[ R^P_\pi (\hat{y}) \ge (P \circ \pi )\left[x \mapsto \inf _{z \in \mathcal Z} P_\pi ^\dagger (x)\left[\theta \mapsto \ell '(y(\theta ), z)\right]\right] \: . \]
Starting from the equality of Lemma 1.1.8 and exchanging the integrals over \(\theta \) and \(z\) (Tonelli, the integrand is non-negative), we get
\[ R^P_\pi (\hat{y}) = (P \circ \pi )\left[x \mapsto \hat{y}(x)\left[z \mapsto P_\pi ^\dagger (x)\left[\theta \mapsto \ell '(y(\theta ), z)\right]\right]\right] \ge (P \circ \pi )\left[x \mapsto \inf _{z \in \mathcal Z} P_\pi ^\dagger (x)\left[\theta \mapsto \ell '(y(\theta ), z)\right]\right] \: , \]
since \(\hat{y}(x)\) is a probability measure for all \(x\).
The generalized Bayes estimator for prior \(\pi \in \mathcal P(\Theta )\) on \((P, y, \ell ')\) is the deterministic estimator \(\mathcal X \to \mathcal Z\) given by
\[ \hat{y}_B(x) = \arg \min _{z \in \mathcal Z} P_\pi ^\dagger (x)\left[\theta \mapsto \ell '(y(\theta ), z)\right] \: , \]
if there exists such a measurable argmin.
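For example (an illustration under extra assumptions not made in the text), if \(\mathcal Y = \mathcal Z = \mathbb R\), \(\ell '(y_0, z) = (y_0 - z)^2\) and \(y(\cdot )\) has a finite second moment under \(P_\pi ^\dagger (x)\), the argmin is the posterior mean:
\[ \hat{y}_B(x) = P_\pi ^\dagger (x)\left[\theta \mapsto y(\theta )\right] \: . \]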
The Bayesian risk of the generalized Bayes estimator \(\hat{y}_B\) is
\[ R^P_\pi (\hat{y}_B) = (P \circ \pi )\left[x \mapsto \inf _{z \in \mathcal Z} P_\pi ^\dagger (x)\left[\theta \mapsto \ell '(y(\theta ), z)\right]\right] \: . \]
Start from the equality of Lemma 1.1.8 and use the definition of the generalized Bayes estimator.
When the generalized Bayes estimator \(\hat{y}_B\) is well defined, it is a Bayes estimator. The value of the Bayes risk with respect to the prior \(\pi \in \mathcal M(\Theta )\) is then
\[ \mathcal R^P_\pi = R^P_\pi (\hat{y}_B) = (P \circ \pi )\left[x \mapsto \inf _{z \in \mathcal Z} P_\pi ^\dagger (x)\left[\theta \mapsto \ell '(y(\theta ), z)\right]\right] \: . \]
When the generalized Bayes estimator is well defined, the Bayes risk with respect to the prior \(\pi \in \mathcal M(\Theta )\) for \(\mathcal Y = \mathcal Z = \Theta \), \(y = \mathrm{id}\) and \(\ell ' = \mathbb {I}\{ \theta \ne z\} \) is
\[ \mathcal R^P_\pi = (P \circ \pi )\left[x \mapsto \left(1 - \sup _{z \in \Theta } P_\pi ^\dagger (x)(\{ z\} )\right)\right] \: . \]
If moreover \(\pi \) is a probability measure and \(P\) is a Markov kernel, this can be written
\[ \mathcal R^P_\pi = 1 - (P \circ \pi )\left[x \mapsto \sup _{z \in \Theta } P_\pi ^\dagger (x)(\{ z\} )\right] \: . \]
When \(\pi \) is a probability measure and \(P\) is a Markov kernel, \((P \circ \pi )[1] = 1\).
The generalized Bayes estimator for prior \(\pi \in \mathcal P(\Theta )\) on the estimation problem defined by \(\mathcal Y = \mathcal Z = \Theta \), \(y = \mathrm{id}\) and \(\ell ' = \mathbb {I}\{ \theta \ne z\} \) is
\[ \hat{y}_B(x) = \arg \max _{z \in \Theta } P_\pi ^\dagger (x)(\{ z\} ) \: , \]
the mode of the posterior \(P_\pi ^\dagger (x)\) (when such a measurable argmax exists).
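For countable \(\Theta \), if the Bayesian inverse takes the form \(P_\pi ^\dagger (x)(\{ \theta \} ) = \pi (\{ \theta \} ) \frac{\mathrm{d} P_\theta }{\mathrm{d}(P \circ \pi )}(x)\) by analogy with Lemma 1.2.1 below (an assumption here: only the binary case is proved in this text), the generalized Bayes estimator is the maximum a posteriori rule
\[ \hat{y}_B(x) = \arg \max _{\theta \in \Theta }\, \pi (\{ \theta \} )\, \frac{\mathrm{d} P_\theta }{\mathrm{d}(P \circ \pi )}(x) \: . \]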
Suppose that \(\Theta \) is finite and let \(\xi \in \mathcal P(\Theta )\). The Bayes risk with respect to the prior \(\pi \in \mathcal P(\Theta )\) for \(\mathcal Y = \mathcal Z = \Theta \), \(y = \mathrm{id}\), \(P\) a Markov kernel and \(\ell ' = \mathbb {I}\{ \theta \ne z\} \) satisfies
1.1.2 Bayes risk increase
TODO: remove this section? It’s not used for anything now.
The following is a new definition, which is a natural generalization of the DeGroot statistical information.
The Bayes risk increase \(I^P_{\pi }(\kappa )\) of a kernel \(\kappa : \mathcal X \rightsquigarrow \mathcal X'\) with respect to the estimation problem \((P, y, \ell ')\) and the prior \(\pi \in \mathcal M(\Theta )\) is the difference of the Bayes risk of \((\kappa \circ P, y, \ell ')\) and that of \((P, y, \ell ')\). That is,
\[ I^P_{\pi }(\kappa ) = \mathcal R^{\kappa \circ P}_\pi - \mathcal R^{P}_\pi \: . \]
In particular, if we take \(\kappa : \mathcal X \rightsquigarrow *\) equal to the Markov kernel to the point space (denoted by \(d_{\mathcal X}\), for “discard” kernel), we get \(I^P_{\pi }(d_{\mathcal X}) = \mathcal R^{d_{\mathcal X} \circ P}_\pi - \mathcal R^P_\pi \), where \(d_{\mathcal X} \circ P = d_\Theta \) if \(P\) is a Markov kernel. We recover the difference between the risk of the best constant estimator and the Bayes risk, as in the DeGroot statistical information.
The risk increase quantifies how much risk we accrue by forgetting information due to the composition with \(\kappa \).
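Combining with the computation for constant Markov kernels above (and assuming \(P\) is a Markov kernel), the risk increase with respect to the discard kernel is explicitly
\[ I^P_\pi (d_{\mathcal X}) = \inf _{z \in \mathcal Z} \pi \left[\theta \mapsto \ell '(y(\theta ), z)\right] - \mathcal R^P_\pi \: . \]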
For \(\kappa \) a Markov kernel, \(I^P_\pi (\kappa ) \ge 0\) .
Use Theorem 1.1.4.
For \(\kappa : \mathcal X \rightsquigarrow \mathcal X'\) and \(\eta : \mathcal X' \rightsquigarrow \mathcal X''\) two Markov kernels, \(I^P_\pi (\eta \circ \kappa ) = I^P_\pi (\kappa ) + I^{\kappa \circ P}_\pi (\eta )\) .
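This is a telescoping identity (a one-line check, assuming the Bayes risks involved are finite so that the differences can be rearranged):
\[ I^P_\pi (\eta \circ \kappa ) = \mathcal R^{\eta \circ \kappa \circ P}_\pi - \mathcal R^{P}_\pi = \left(\mathcal R^{\eta \circ (\kappa \circ P)}_\pi - \mathcal R^{\kappa \circ P}_\pi \right) + \left(\mathcal R^{\kappa \circ P}_\pi - \mathcal R^{P}_\pi \right) = I^{\kappa \circ P}_\pi (\eta ) + I^P_\pi (\kappa ) \: . \]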
For any measurable space \(\mathcal X\), let \(d_{\mathcal X} : \mathcal X \rightsquigarrow *\) be the Markov kernel to the point space. For all Markov kernels \(\kappa : \mathcal X \rightsquigarrow \mathcal X'\),
\[ I^P_\pi (\kappa ) \le I^P_\pi (d_{\mathcal X}) \: . \]
By Lemma 1.1.18, then Lemma 1.1.17,
\[ I^P_\pi (d_{\mathcal X'} \circ \kappa ) = I^P_\pi (\kappa ) + I^{\kappa \circ P}_\pi (d_{\mathcal X'}) \ge I^P_\pi (\kappa ) \: . \]
Finally, \(d_{\mathcal X'} \circ \kappa = d_{\mathcal X}\) since \(\kappa \) is a Markov kernel.
1.2 Simple binary hypothesis testing
In this section, for a measure \(\xi \) on \(\{ 0,1\} \), we write \(\xi _0\) for \(\xi (\{ 0\} )\) and \(\xi _1\) for \(\xi (\{ 1\} )\) . We sometimes write \((a,b)\) for the measure \(\xi \) on \(\{ 0,1\} \) with \(\xi _0 = a\) and \(\xi _1 = b\).
The Bayesian inverse of a kernel \(P : \{ 0,1\} \rightsquigarrow \mathcal X\) with respect to a prior \(\xi \in \mathcal M(\{ 0,1\} )\) is \(P_\xi ^\dagger (x) = \left(\xi _0\frac{d P(0)}{d(P \circ \xi )}(x), \xi _1\frac{d P(1)}{d(P \circ \xi )}(x)\right)\) (almost surely w.r.t. \(P \circ \xi = \xi _0 P(0) + \xi _1 P(1)\)).
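For instance (a reformulation under an extra domination assumption), if \(P(0) = \mu \) and \(P(1) = \nu \) have densities \(p\) and \(q\) with respect to a common measure, then \(P \circ \xi \) has density \(\xi _0 p + \xi _1 q\) and, \((P \circ \xi )\)-almost surely,
\[ P_\xi ^\dagger (x) = \left(\frac{\xi _0\, p(x)}{\xi _0\, p(x) + \xi _1\, q(x)},\ \frac{\xi _1\, q(x)}{\xi _0\, p(x) + \xi _1\, q(x)}\right) \: . \]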
For \(\Theta = \{ 0,1\} \), the Bayes risk of a prior \(\xi \in \mathcal M(\{ 0,1\} )\) is
\[ \mathcal R^P_\xi = \inf _{\hat{y} : \mathcal X \rightsquigarrow \mathcal Z}\left(\xi _0\, r^P_0(\hat{y}) + \xi _1\, r^P_1(\hat{y})\right) \: . \]
1.2.1 Bayes risk of a prior
The Bayes binary risk between measures \(\mu \) and \(\nu \) with respect to prior \(\xi \in \mathcal M(\{ 0,1\} )\), denoted by \(\mathcal B_\xi (\mu , \nu )\), is the Bayes risk \(\mathcal R^P_\xi \) for \(\Theta = \mathcal Y = \mathcal Z = \{ 0,1\} \), \(\ell (y,z) = \mathbb {I}\{ y \ne z\} \), \(P\) the kernel sending 0 to \(\mu \) and 1 to \(\nu \) and prior \(\xi \). That is,
\[ \mathcal B_\xi (\mu , \nu ) = \inf _{\hat{y} : \mathcal X \rightsquigarrow \{ 0,1\} }\left(\xi _0\, \mu \left[x \mapsto \hat{y}(x)(\{ 1\} )\right] + \xi _1\, \nu \left[x \mapsto \hat{y}(x)(\{ 0\} )\right]\right) \: , \]
in which the infimum is over Markov kernels.
If the prior is a probability measure with weights \((\pi , 1 - \pi )\), we write \(B_\pi (\mu , \nu ) = \mathcal B_{(\pi , 1 - \pi )}(\mu , \nu )\) .
Thanks to the following lemma, we can prove properties of the Bayes binary risk for non-zero finite measures by considering only probability measures.
For all \(a, b {\gt} 0\), \(\mathcal B_\xi (\mu , \nu ) = \mathcal B_{(a \xi _0, b \xi _1)}(a^{-1} \mu , b^{-1} \nu )\) .
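A sketch of why this holds: by the explicit form of the Bayes binary risk above, the Bayesian risk of any test depends on \((\xi , \mu , \nu )\) only through the measures \(\xi _0 \mu \) and \(\xi _1 \nu \),
\[ \xi _0\, \mu \left[x \mapsto \hat{y}(x)(\{ 1\} )\right] + \xi _1\, \nu \left[x \mapsto \hat{y}(x)(\{ 0\} )\right] = (a \xi _0)\, (a^{-1}\mu )\left[x \mapsto \hat{y}(x)(\{ 1\} )\right] + (b \xi _1)\, (b^{-1}\nu )\left[x \mapsto \hat{y}(x)(\{ 0\} )\right] \: , \]
and taking the infimum over \(\hat{y}\) gives the equality. The same observation gives the next lemma.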
Alternatively, we can reduce the study of the Bayes binary risk for any prior to the study of the prior \((1, 1)\), for which that risk is related to the total variation distance (defined later).
\(\mathcal B_\xi (\mu , \nu ) = \mathcal B_{(1,1)}(\xi _0\mu , \xi _1\nu )\) .
For \(\mu , \nu \in \mathcal M(\mathcal X)\) and \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) a Markov kernel, \(\mathcal B_\xi (\kappa \circ \mu , \kappa \circ \nu ) \ge \mathcal B_\xi (\mu , \nu )\) .
Apply Theorem 1.1.4.
For \(\mu \in \mathcal M(\mathcal X)\), \(\mathcal B_\xi (\mu , \mu ) = \min \{ \xi _0, \xi _1\} \mu (\mathcal X)\) .
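A sketch of the computation: for any test \(\hat{y}\), since \(\hat{y}(x)(\{ 0\} ) + \hat{y}(x)(\{ 1\} ) = 1\) for all \(x\),
\[ \xi _0\, \mu \left[x \mapsto \hat{y}(x)(\{ 1\} )\right] + \xi _1\, \mu \left[x \mapsto \hat{y}(x)(\{ 0\} )\right] = \mu \left[x \mapsto \xi _0\, \hat{y}(x)(\{ 1\} ) + \xi _1\, \hat{y}(x)(\{ 0\} )\right] \ge \min \{ \xi _0, \xi _1\} \, \mu (\mathcal X) \: , \]
with equality for the constant test that always outputs the hypothesis with the larger prior weight.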
For all measures \(\mu , \nu \), \(\mathcal B_\xi (\mu , \nu ) \ge 0\) .
It is an infimum of non-negative values.
For all measures \(\mu , \nu \), \(\mathcal B_\xi (\mu , \nu ) \le \min \{ \xi _0 \mu (\mathcal X), \xi _1 \nu (\mathcal X)\} \) .
Let \(d_{\mathcal X} : \mathcal X \rightsquigarrow *\) be the discard kernel and \(\delta _*\) be the only probability measure over \(*\), that is a Dirac delta. Then applying the DPI (Theorem 1.2.5), followed by Lemma 1.2.3 and Lemma 1.2.6, we obtain:
\[ \mathcal B_\xi (\mu , \nu ) \le \mathcal B_\xi (d_{\mathcal X} \circ \mu , d_{\mathcal X} \circ \nu ) = \mathcal B_\xi (\mu (\mathcal X)\, \delta _*, \nu (\mathcal X)\, \delta _*) = \mathcal B_{(\xi _0 \mu (\mathcal X), \xi _1 \nu (\mathcal X))}(\delta _*, \delta _*) = \min \{ \xi _0\, \mu (\mathcal X), \xi _1\, \nu (\mathcal X)\} \: . \]
For \(\mu , \nu \in \mathcal M(\mathcal X)\) and \(\xi \in \mathcal M(\{ 0,1\} )\), \(\mathcal B_\xi (\mu , \nu ) = \mathcal B_{\xi _{\leftrightarrow }}(\nu , \mu )\) where \(\xi _{\leftrightarrow } \in \mathcal M(\{ 0,1\} )\) is such that \(\xi _{\leftrightarrow }(\{ 0\} ) = \xi _1\) and \(\xi _{\leftrightarrow }(\{ 1\} ) = \xi _0\). For \(\pi \in [0,1]\), \(B_\pi (\mu , \nu ) = B_{1 - \pi }(\nu , \mu )\) .
The generalized Bayes estimator for the Bayes binary risk with prior \(\xi \in \mathcal M(\{ 0,1\} )\) is \(x \mapsto \text{if } \xi _1\frac{d \nu }{d(P \circ \xi )}(x) \le \xi _0\frac{d \mu }{d(P \circ \xi )}(x) \text{ then } 0 \text{ else } 1\), i.e. it is equal to \(\mathbb {I}_E\) for \(E = \{ x \mid \xi _1\frac{d \nu }{d(P \circ \xi )}(x) {\gt} \xi _0\frac{d \mu }{d(P \circ \xi )}(x)\} \) .
The generalized Bayes estimator is defined by
\[ \hat{y}_B(x) = \arg \min _{z \in \{ 0,1\} } P_\xi ^\dagger (x)\left[\theta \mapsto \mathbb {I}\{ \theta \ne z\} \right] \: . \]
For the simple binary hypothesis testing problem,
\[ P_\xi ^\dagger (x)\left[\theta \mapsto \mathbb {I}\{ \theta \ne z\} \right] = P_\xi ^\dagger (x)(\{ z' \in \{ 0,1\} \mid z' \ne z\} ) \: . \]
For \(z = 0\), this is \(P_\xi ^\dagger (x)(\{ 1\} )\) and for \(z = 1\), this is \(P_\xi ^\dagger (x)(\{ 0\} )\). The generalized Bayes estimator is \(x \mapsto \text{if } P_\xi ^\dagger (x)(\{ 1\} ) \le P_\xi ^\dagger (x)(\{ 0\} ) \text{ then } 0 \text{ else } 1\).
The expression of the lemma statement is obtained by replacing \(P_\xi ^\dagger \) by its value (Lemma 1.2.1).
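In terms of densities (an equivalent formulation under an extra domination assumption, not required by the lemma): if \(\mu \) and \(\nu \) have densities \(p\) and \(q\) with respect to a common measure, then up to a \((P \circ \xi )\)-null set
\[ E = \{ x \mid \xi _1\, q(x) > \xi _0\, p(x)\} \: , \]
so the generalized Bayes estimator is a likelihood-ratio test with threshold \(\xi _0 / \xi _1\).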
By definition,
\[ \mathcal B_\xi (\mu , \nu ) = \inf _{\hat{y} : \mathcal X \rightsquigarrow \{ 0,1\} }\left(\xi _0\, \mu \left[x \mapsto \hat{y}(x)(\{ 1\} )\right] + \xi _1\, \nu \left[x \mapsto \hat{y}(x)(\{ 0\} )\right]\right) \: . \]
If we restrict the infimum to deterministic kernels corresponding to functions of the form \(x \mapsto \mathbb {I}\{ x \in E\} \) for measurable sets \(E\), we get that \(\mathcal B_\xi (\mu , \nu )\) is bounded from above by \(\inf _{E}\left(\xi _0\, \mu (E) + \xi _1\, \nu (E^c)\right)\), the expression for which we want to show equality.
Since the generalized Bayes estimator has that shape (Lemma 1.2.10) and attains the Bayes risk, \(\mathcal B_\xi (\mu , \nu )\) is also at least this restricted infimum, and we have equality.
The Bayes risk of simple binary hypothesis testing for prior \(\xi \in \mathcal M(\{ 0,1\} )\) is
\[ \mathcal B_\xi (\mu , \nu ) = (P \circ \xi )\left[x \mapsto \min \left\{ \xi _0\, \frac{\mathrm{d}\mu }{\mathrm{d}(P \circ \xi )}(x),\ \xi _1\, \frac{\mathrm{d}\nu }{\mathrm{d}(P \circ \xi )}(x)\right\} \right] \: , \]
where \(P\) is the kernel sending \(0\) to \(\mu \) and \(1\) to \(\nu \), so that \(P \circ \xi = \xi _0 \mu + \xi _1 \nu \).
By Theorem 1.1.11, the generalized Bayes estimator is a Bayes estimator, hence it suffices to compute its Bayesian risk. By Lemma 1.1.10,
\[ \mathcal B_\xi (\mu , \nu ) = (P \circ \xi )\left[x \mapsto \inf _{z \in \{ 0,1\} } P_\xi ^\dagger (x)\left[\theta \mapsto \mathbb {I}\{ \theta \ne z\} \right]\right] = (P \circ \xi )\left[x \mapsto \min \left\{ P_\xi ^\dagger (x)(\{ 0\} ), P_\xi ^\dagger (x)(\{ 1\} )\right\} \right] = (P \circ \xi )\left[x \mapsto \min \left\{ \xi _0\, \frac{\mathrm{d}\mu }{\mathrm{d}(P \circ \xi )}(x),\ \xi _1\, \frac{\mathrm{d}\nu }{\mathrm{d}(P \circ \xi )}(x)\right\} \right] \: . \]
The last line is obtained by replacing \(P_\xi ^\dagger \) by its value (Lemma 1.2.1).
Use Theorem 1.2.12 and the equality \(\min \{ a,b\} = \frac{1}{2}(a + b - \vert a - b \vert )\).
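Spelling this out (writing \(\| \rho \| \) for the total mass of the total variation of a signed measure \(\rho \); the total variation distance itself is only introduced later in the text), Theorem 1.2.12 and this identity give
\[ \mathcal B_\xi (\mu , \nu ) = \frac{1}{2}\left(\xi _0\, \mu (\mathcal X) + \xi _1\, \nu (\mathcal X) - \left\| \xi _0 \mu - \xi _1 \nu \right\| \right) \: . \]
In particular, for probability measures \(\mu , \nu \) and \(\xi = (1,1)\), \(\mathcal B_{(1,1)}(\mu , \nu ) = 1 - \frac{1}{2}\| \mu - \nu \| \).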
Let \(\hat{y}_B\) be the generalized Bayes estimator for simple binary hypothesis testing. The distribution \(\ell \circ (\mathrm{id} \parallel \hat{y}_B) \circ (\pi \otimes P)\) (in which \(\ell \) stands for the associated deterministic kernel) is a Bernoulli with mean \(\mathcal B_\pi (\mu , \nu )\).
\(\ell \) takes values in \(\{ 0,1\} \) so the law has to be Bernoulli. It has mean \(\mathcal B_\pi (\mu , \nu )\) because the expected risk of the generalized Bayes estimator is \(\mathcal B_\pi (\mu , \nu )\).
TODO: find a way to hide the following dummy lemma
Dummy node to summarize properties of the Bayes binary risk.
1.3 Reduction to testing
TODO: lower bounds on estimation by reduction to testing problems.
1.4 Sample complexity
TODO: we could generalize this to a sequence of kernels instead of iid observations.
An estimation problem of the shape \((P^{\otimes n}, y, \ell ')\) corresponds to estimating from \(n \in \mathbb {N}\) i.i.d. samples of the distribution. From the data-processing inequality, we get that the risk (Bayes risk, minimax risk...) does not increase with the number of samples (see for example Lemma 1.4.1). Intuitively, having more information allows for better estimation.
In a sequence of estimation tasks \((P^{\otimes n}, y, \ell ')\) for \(n \in \mathbb {N}\), the sample complexity is the minimal \(n\) such that the risk is less than a desired value.
If \(n \le m\) then \(\mathcal R_\pi ^{P^{\otimes n}} \ge \mathcal R_\pi ^{P^{\otimes m}}\).
Use the data-processing inequality, Theorem 1.1.4, for the deterministic kernel that projects on the first \(n\) coordinates.
The sample complexity of Bayesian estimation with respect to a prior \(\pi \in \mathcal M(\Theta )\) at risk level \(\delta \in \mathbb {R}_{+,\infty }\) is
\[ n_\pi (\delta ) = \min \{ n \in \mathbb {N} \mid \mathcal R_\pi ^{P^{\otimes n}} \le \delta \} \: . \]
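For example (a generic illustration, not from the text): if the risk decays geometrically in the number of samples, \(\mathcal R_\pi ^{P^{\otimes n}} \le C e^{-c n}\) for some \(C, c > 0\), then for \(\delta < C\)
\[ n_\pi (\delta ) \le \left\lceil \frac{1}{c} \log \frac{C}{\delta } \right\rceil \: . \]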