1 Testing, Estimation, Risk
In the estimation problem, we are given a family of measures and we want to estimate an unknown parameter for the measure from which we observe samples. Let \(\mathcal X\) be the measurable space on which the measures are defined. The measure family is described by a kernel \(P : \Theta \rightsquigarrow \mathcal X\). We write \(P_\theta \) for \(P(\theta )\). In parametric problems, \(\Theta \) is a low-dimensional space. In nonparametric problems \(\Theta \) is large, for example \(\Theta = \mathcal M (\mathcal X)\) or \(\Theta = \mathcal P(\mathcal X)\).
The goal of the estimation task is to find, given the observation of samples of \(P_\theta \) for an unknown \(\theta \), the value \(y(\theta )\) of a function \(y : \Theta \to \mathcal Y\), which maps the parameter \(\theta \) to the quantity we are interested in. For example, \(y(\theta )\) might be the expected value of the distribution \(P_\theta \).
An estimator is a Markov kernel \(\hat{y} : \mathcal X \rightsquigarrow \mathcal Z\), where most often \(\mathcal Z = \mathcal Y\). Note that, while \(y\) is a function of a parameter in \(\Theta \), \(\hat{y}\) has domain \(\mathcal X\) as the estimation is done based on a sample of the measure, while the measure itself and its parameter remain unknown. In general this is a randomized estimator, but often deterministic estimators (corresponding to deterministic kernels) are enough to attain optimal performance. The quality of the estimation is quantified by a loss \(\ell ' : \mathcal Y \times \mathcal Z \to \mathbb {R}_{+, \infty }\). We write \(\ell \) for a loss on \(\mathcal Y \times \mathcal Y\) and \(\ell '\) in the heterogeneous case.
In short, an estimation problem is specified by the data \((P, y, \ell ')\).
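For instance (an illustration, not part of the formal development), Gaussian mean estimation from \(n\) observations corresponds to \(\Theta = \mathbb R\), \(\mathcal X = \mathbb R^n\), \(P_\theta = \mathcal N(\theta, 1)^{\otimes n}\), \(\mathcal Y = \mathcal Z = \mathbb R\), \(y(\theta ) = \theta \) and the squared loss
\[ \ell (y_0, z) = (y_0 - z)^2 \: , \]
for which the empirical mean \(\hat{y}(x) = \frac{1}{n}\sum_{i=1}^n x_i\), seen as a deterministic kernel, is an estimator.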
Example: simple binary hypothesis testing (SBHT)
We say that an estimation problem is a testing problem if \(\mathcal Y = \mathcal Z\) is discrete and the loss takes values in \(\{ 0, 1\} \). A test is said to be binary if \(\mathcal Y\) has two elements. A test is simple if \(\{ \theta \mid y(\theta ) = y_0\} \) is a singleton for all \(y_0 \in \mathcal Y\), i.e. \(y\) is bijective.
In summary, in simple binary hypothesis testing, \(\Theta = \mathcal Y = \mathcal Z = \{ 0,1\} \), \(y(0) = 0\), \(y(1) = 1\) (that is, \(y\) is the identity). For \(y_0, z \in \{ 0,1\} \), \(\ell (y_0, z) = \mathbb {I}\{ y_0 \ne z\} \).
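Concretely, writing \(\mu = P_0\) and \(\nu = P_1\), a (possibly randomized) test is a Markov kernel \(\hat{y} : \mathcal X \rightsquigarrow \{ 0,1\} \), and the two error probabilities that the next subsection formalizes as risks are
\[ \mu \left[x \mapsto \hat{y}(x)(\{ 1\} )\right] \quad \text{(type I error)} \qquad \text{and} \qquad \nu \left[x \mapsto \hat{y}(x)(\{ 0\} )\right] \quad \text{(type II error)} \: . \]
(This is a reformulation of the setup above, not an additional assumption.)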
1.1 Risk
Let \(\hat{\mu }_\theta \) be the law of the output of the estimator when the parameter is \(\theta \). That is, \(\hat{\mu } : \Theta \rightsquigarrow \mathcal Z\) is the kernel \(\hat{y} \circ P\) and \(\hat{\mu }_\theta = (\hat{y} \circ P)(\theta )\).
The risk of an estimator \(\hat{y}\) on the estimation problem \((P, y, \ell ')\) at \(\theta \in \Theta \) is \(r^P_\theta (\hat{y}) = (\hat{y} \circ P)(\theta )\left[z \mapsto \ell '(y(\theta ), z)\right]\) .
Example (SBHT): \(r^P_\theta (\hat{y}) = \hat{\mu }_\theta (\{ z \mid z \ne \theta \} ) = \hat{\mu }_\theta (\{ 1 - \theta \} )\), the probability that the test answers incorrectly when the parameter is \(\theta \).
The Bayesian risk of an estimator \(\hat{y}\) on \((P, y, \ell ')\) for a prior \(\pi \in \mathcal M(\Theta )\) is \(R^P_\pi (\hat{y}) = \pi \left[\theta \mapsto r^P_\theta (\hat{y})\right]\) . It can also be expanded as \(R^P_\pi (\hat{y}) = (\pi \otimes (\hat{y} \circ P))\left[ (\theta , z) \mapsto \ell '(y(\theta ), z) \right]\) .
If we write \(\ell '\) and \(y\) for their deterministic kernels, the Bayesian risk of \(\hat{y}\) is the mean of the measure \(\ell ' \circ (y \parallel \hat{y}) \circ (\pi \otimes P)\) .
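Unfolding the kernel notation (a sketch; we assume the measures and kernels are such that the iterated integral below is well defined), the Bayesian risk is
\[ R^P_\pi (\hat{y}) = \int_\Theta \pi (\mathrm{d}\theta ) \int_{\mathcal X} P_\theta (\mathrm{d}x) \int_{\mathcal Z} \hat{y}(x)(\mathrm{d}z)\, \ell '(y(\theta ), z) \: . \]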
The Bayes risk of \((P, y, \ell ')\) for prior \(\pi \in \mathcal M(\Theta )\) is \(\mathcal R^P_\pi = \inf _{\hat{y} : \mathcal X \rightsquigarrow \mathcal Z} R^P_\pi (\hat{y})\) , where the infimum is over Markov kernels.
The Bayes risk of \((P, y, \ell ')\) is \(\mathcal R^*_B = \sup _{\pi \in \mathcal P(\Theta )} \mathcal R^P_\pi \: .\)
Example (SBHT): \(\mathcal R^*_B = \sup _{\gamma \in [0,1]}\inf _{\hat{y}}\left(\gamma \, \hat{\mu }_0(\{ 1\} ) + (1 - \gamma )\, \hat{\mu }_1(\{ 0\} )\right)\).
An estimator \(\hat{y}\) is said to be a Bayes estimator for a prior \(\pi \in \mathcal P(\Theta )\) if \(R^P_\pi (\hat{y}) = \mathcal R^P_\pi \).
The minimax risk of \((P, y, \ell ')\) is \(\mathcal R^* = \inf _{\hat{y} : \mathcal X \rightsquigarrow \mathcal Z} \sup _{\theta \in \Theta } r^P_\theta (\hat{y})\) .
Example (SBHT): \(\mathcal R^* = \inf _{\hat{y}} \max \{ \hat{\mu }_0(\{ 1\} ), \hat{\mu }_1(\{ 0\} )\} \).
\(\mathcal R_B^* \le \mathcal R^*\).
For any \(\pi \in \mathcal P(\Theta )\) and any estimator \(\hat{y}\), \(R^P_\pi (\hat{y}) = \pi \left[\theta \mapsto r^P_\theta (\hat{y})\right] \le \sup _{\theta \in \Theta } r^P_\theta (\hat{y})\). Taking the infimum over \(\hat{y}\) gives \(\mathcal R^P_\pi \le \mathcal R^*\), and taking the supremum over \(\pi \) gives the result.
TODO: make precise the conditions under which \(\mathcal R^*_B = \mathcal R^*\) (the “often” case).
1.1.1 Properties of the Bayes risk of a prior
The Bayes risk of a prior \(\pi \in \mathcal M(\Theta )\) on \((P, y, \ell ')\) with \(P\) a Markov kernel satisfies
\[ \mathcal R^P_\pi \le \inf _{z \in \mathcal Z} \pi \left[\theta \mapsto \ell '(y(\theta ), z)\right] \: . \]
The infimum over all Markov kernels in the definition of the Bayes risk of \(\pi \) is bounded from above by the infimum restricted to constant, deterministic Markov kernels.
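In more detail (a short check, not spelled out in the text): for the constant deterministic estimator \(\hat{y}_z : x \mapsto \delta _z\) with \(z \in \mathcal Z\), since \(P\) is a Markov kernel,
\[ R^P_\pi (\hat{y}_z) = \pi \left[\theta \mapsto P_\theta \left[x \mapsto \ell '(y(\theta ), z)\right]\right] = \pi \left[\theta \mapsto \ell '(y(\theta ), z)\right] \: , \]
and taking the infimum over \(z\) gives the bound.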
The Bayes risk of a prior \(\pi \in \mathcal M(\Theta )\) on \((P, y, \ell ')\) with \(P\) a constant Markov kernel is
\[ \mathcal R^P_\pi = \inf _{z \in \mathcal Z} \pi \left[\theta \mapsto \ell '(y(\theta ), z)\right] \: . \]
In particular, it does not depend on \(P\).
Let \(\xi \) be the measure such that \(P(\theta ) = \xi \) for all \(\theta \).
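A sketch of the remaining argument: for any Markov kernel \(\hat{y}\), since \(P_\theta = \xi \) for all \(\theta \),
\[ R^P_\pi (\hat{y}) = \pi \left[\theta \mapsto (\hat{y} \circ \xi )\left[z \mapsto \ell '(y(\theta ), z)\right]\right] \: , \]
and \(\hat{y} \circ \xi \) is a probability measure on \(\mathcal Z\) that does not depend on \(\theta \). The infimum over \(\hat{y}\) therefore reduces to an infimum over \(\zeta \in \mathcal P(\mathcal Z)\) of \(\zeta \left[z \mapsto \pi \left[\theta \mapsto \ell '(y(\theta ), z)\right]\right]\), which equals the infimum over points \(z \in \mathcal Z\) (take \(\zeta = \delta _z\) for the upper bound).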
Data-processing inequality and consequences
For \(P : \Theta \rightsquigarrow \mathcal X\) and \(\kappa : \mathcal X \rightsquigarrow \mathcal X'\) a Markov kernel, \(\mathcal R^{\kappa \circ P}_\pi \ge \mathcal R^{P}_\pi \) (where the estimation problems differ only in the kernel).
The risk \(\mathcal R^{\kappa \circ P}_\pi \) is
\[ \mathcal R^{\kappa \circ P}_\pi = \inf _{\hat{y} : \mathcal X' \rightsquigarrow \mathcal Z} (\pi \otimes (\hat{y} \circ \kappa \circ P))\left[(\theta , z) \mapsto \ell '(y(\theta ), z)\right] = \inf _{\hat{y} : \mathcal X' \rightsquigarrow \mathcal Z} R^P_\pi (\hat{y} \circ \kappa ) \ge \inf _{\hat{y}' : \mathcal X \rightsquigarrow \mathcal Z} R^P_\pi (\hat{y}') = \mathcal R^P_\pi \: . \]
The inequality is due to taking an infimum over a larger set: \(\{ \hat{y} \circ \kappa \mid \hat{y} : \mathcal X' \rightsquigarrow \mathcal Z \text{ Markov}\} \subseteq \{ \hat{y}' : \mathcal X \rightsquigarrow \mathcal Z \mid \hat{y}' \text{ Markov}\} \) .
For \(P : \Theta \rightsquigarrow \mathcal X\) and \(\kappa : \Theta \times \mathcal X \rightsquigarrow \mathcal X'\) a Markov kernel, \(\mathcal R^{P \otimes \kappa }_\pi \le \mathcal R^{P}_\pi \) .
Use Theorem 1.1.4: \(P\) is the composition of \(P \otimes \kappa \) and the deterministic kernel \(\mathcal X \times \mathcal X' \rightsquigarrow \mathcal X\) that forgets the second value.
For \(P : \Theta \rightsquigarrow \mathcal X\) and \(\kappa : \Theta \times \mathcal X \rightsquigarrow \mathcal X'\) a Markov kernel, \(\mathcal R^{P \otimes \kappa }_\pi \le \mathcal R^{(P \otimes \kappa )_{\mathcal X'}}_\pi \) , in which \((P \otimes \kappa )_{\mathcal X'} : \mathcal\Theta \rightsquigarrow \mathcal X'\) is the kernel obtained by marginalizing over \(\mathcal X\) in the output of \(P \otimes \kappa \) .
Use Theorem 1.1.4: \((P \otimes \kappa )_{\mathcal X'}\) is the composition of \(P \otimes \kappa \) and the deterministic kernel \(\mathcal X \times \mathcal X' \rightsquigarrow \mathcal X'\) that forgets the first value.
The Bayes risk \(\mathcal R_\pi ^P\) is concave in \(P : \Theta \rightsquigarrow \mathcal X\) .
The Bayesian risk \(R^P_\pi (\hat{y})\) is linear in \(P\), and the infimum of a sum is larger than the sum of the infimums: for \(\lambda \in [0,1]\) and kernels \(P_1, P_2 : \Theta \rightsquigarrow \mathcal X\),
\[ \mathcal R^{\lambda P_1 + (1 - \lambda ) P_2}_\pi = \inf _{\hat{y}}\left(\lambda R^{P_1}_\pi (\hat{y}) + (1 - \lambda ) R^{P_2}_\pi (\hat{y})\right) \ge \lambda \, \mathcal R^{P_1}_\pi + (1 - \lambda )\, \mathcal R^{P_2}_\pi \: . \]
TODO: can we build kernels such that this lemma is a consequence of one of the lemmas above?
Generalized Bayes estimator
The Bayesian risk of a Markov kernel \(\hat{y} : \mathcal X \rightsquigarrow \mathcal Z\) with respect to a prior \(\pi \in \mathcal M(\Theta )\) on \((P, y, \ell ')\) satisfies
\[ R^P_\pi (\hat{y}) = (P \circ \pi )\left[x \mapsto P_\pi ^\dagger (x)\left[\theta \mapsto \hat{y}(x)\left[z \mapsto \ell '(y(\theta ), z)\right]\right]\right] \]
whenever the Bayesian inverse \(P_\pi ^\dagger \) of \(P\) with respect to \(\pi \) exists (Definition 48).
Use the main property of the Bayesian inverse.
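In more detail (a sketch, assuming that the defining property in Definition 48 is that \(\pi \otimes P\) coincides with \((P \circ \pi ) \otimes P_\pi ^\dagger \) up to swapping the two coordinates):
\[ R^P_\pi (\hat{y}) = (\pi \otimes P)\left[(\theta , x) \mapsto \hat{y}(x)\left[z \mapsto \ell '(y(\theta ), z)\right]\right] = ((P \circ \pi ) \otimes P_\pi ^\dagger )\left[(x, \theta ) \mapsto \hat{y}(x)\left[z \mapsto \ell '(y(\theta ), z)\right]\right] \: , \]
which is the claimed expression once the outer composition-product is unfolded.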
The Bayesian risk of a Markov kernel \(\hat{y} : \mathcal X \rightsquigarrow \mathcal Z\) with respect to a prior \(\pi \in \mathcal M(\Theta )\) on \((P, y, \ell ')\) satisfies, whenever the Bayesian inverse \(P_\pi ^\dagger \) exists,
\[ R^P_\pi (\hat{y}) \ge (P \circ \pi )\left[x \mapsto \inf _{z \in \mathcal Z} P_\pi ^\dagger (x)\left[\theta \mapsto \ell '(y(\theta ), z)\right]\right] \: . \]
Starting from the equality of Lemma 1.1.8 and exchanging the integrals over \(\theta \) and \(z\) (Tonelli, the integrand is non-negative), we get
\[ R^P_\pi (\hat{y}) = (P \circ \pi )\left[x \mapsto \hat{y}(x)\left[z \mapsto P_\pi ^\dagger (x)\left[\theta \mapsto \ell '(y(\theta ), z)\right]\right]\right] \ge (P \circ \pi )\left[x \mapsto \inf _{z \in \mathcal Z} P_\pi ^\dagger (x)\left[\theta \mapsto \ell '(y(\theta ), z)\right]\right] \: , \]
since \(\hat{y}(x)\) is a probability measure for all \(x\).
The generalized Bayes estimator for prior \(\pi \in \mathcal P(\Theta )\) on \((P, y, \ell ')\) is the deterministic estimator \(\mathcal X \to \mathcal Z\) given by
\[ \hat{y}_B(x) = \arg \min _{z \in \mathcal Z} P_\pi ^\dagger (x)\left[\theta \mapsto \ell '(y(\theta ), z)\right] \: , \]
if there exists such a measurable argmin.
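For example (an illustration under extra assumptions not made in the text), if \(\mathcal Y = \mathcal Z = \mathbb R\), \(\ell '(y_0, z) = (y_0 - z)^2\) and \(y(\cdot )\) has a finite second moment under \(P_\pi ^\dagger (x)\), the argmin is the posterior mean:
\[ \hat{y}_B(x) = P_\pi ^\dagger (x)\left[\theta \mapsto y(\theta )\right] \: . \]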
The Bayesian risk of the generalized Bayes estimator \(\hat{y}_B\) is
\[ R^P_\pi (\hat{y}_B) = (P \circ \pi )\left[x \mapsto \inf _{z \in \mathcal Z} P_\pi ^\dagger (x)\left[\theta \mapsto \ell '(y(\theta ), z)\right]\right] \: . \]
Start from the equality of Lemma 1.1.8 and use the definition of the generalized Bayes estimator.
When the generalized Bayes estimator \(\hat{y}_B\) is well defined, it is a Bayes estimator. The value of the Bayes risk with respect to the prior \(\pi \in \mathcal M(\Theta )\) is then
\[ \mathcal R^P_\pi = R^P_\pi (\hat{y}_B) = (P \circ \pi )\left[x \mapsto \inf _{z \in \mathcal Z} P_\pi ^\dagger (x)\left[\theta \mapsto \ell '(y(\theta ), z)\right]\right] \: . \]
When the generalized Bayes estimator is well defined, the Bayes risk with respect to the prior \(\pi \in \mathcal M(\Theta )\) for \(\mathcal Y = \mathcal Z = \Theta \), \(y = \mathrm{id}\) and \(\ell ' = \mathbb {I}\{ \theta \ne z\} \) is
\[ \mathcal R^P_\pi = (P \circ \pi )\left[x \mapsto \left(1 - \sup _{z \in \Theta } P_\pi ^\dagger (x)(\{ z\} )\right)\right] \: . \]
If moreover \(\pi \) is a probability measure and \(P\) is a Markov kernel, this can be written
\[ \mathcal R^P_\pi = 1 - (P \circ \pi )\left[x \mapsto \sup _{z \in \Theta } P_\pi ^\dagger (x)(\{ z\} )\right] \: . \]
When \(\pi \) is a probability measure and \(P\) is a Markov kernel, \((P \circ \pi )[1] = 1\).
The generalized Bayes estimator for prior \(\pi \in \mathcal P(\Theta )\) on the estimation problem defined by \(\mathcal Y = \mathcal Z = \Theta \), \(y = \mathrm{id}\) and \(\ell ' = \mathbb {I}\{ \theta \ne z\} \) is
\[ \hat{y}_B(x) = \arg \max _{z \in \Theta } P_\pi ^\dagger (x)(\{ z\} ) \: , \]
the mode of the posterior \(P_\pi ^\dagger (x)\) (when such a measurable argmax exists).
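For countable \(\Theta \), if the Bayesian inverse takes the form \(P_\pi ^\dagger (x)(\{ \theta \} ) = \pi (\{ \theta \} ) \frac{\mathrm{d} P_\theta }{\mathrm{d}(P \circ \pi )}(x)\) by analogy with Lemma 1.2.1 below (an assumption here: only the binary case is proved in this text), the generalized Bayes estimator is the maximum a posteriori rule
\[ \hat{y}_B(x) = \arg \max _{\theta \in \Theta }\, \pi (\{ \theta \} )\, \frac{\mathrm{d} P_\theta }{\mathrm{d}(P \circ \pi )}(x) \: . \]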
Suppose that \(\Theta \) is finite and let \(\xi \in \mathcal P(\Theta )\). The Bayes risk with respect to the prior \(\pi \in \mathcal P(\Theta )\) for \(\mathcal Y = \mathcal Z = \Theta \), \(y = \mathrm{id}\), \(P\) a Markov kernel and \(\ell ' = \mathbb {I}\{ \theta \ne z\} \) satisfies
1.1.2 Bayes risk increase
TODO: remove this section? It’s not used for anything now.
The following is a new definition, which is a natural generalization of the DeGroot statistical information.
The Bayes risk increase \(I^P_{\pi }(\kappa )\) of a kernel \(\kappa : \mathcal X \rightsquigarrow \mathcal X'\) with respect to the estimation problem \((P, y, \ell ')\) and the prior \(\pi \in \mathcal M(\Theta )\) is the difference of the Bayes risk of \((\kappa \circ P, y, \ell ')\) and that of \((P, y, \ell ')\). That is,
\[ I^P_{\pi }(\kappa ) = \mathcal R^{\kappa \circ P}_\pi - \mathcal R^{P}_\pi \: . \]
In particular, if we take \(\kappa : \mathcal X \rightsquigarrow *\) equal to the Markov kernel to the point space (denoted by \(d_{\mathcal X}\), for “discard” kernel), we get \(I^P_{\pi }(d_{\mathcal X}) = \mathcal R^{d_{\mathcal X} \circ P}_\pi - \mathcal R^P_\pi \), where \(d_{\mathcal X} \circ P = d_\Theta \) if \(P\) is a Markov kernel. We recover the difference between the risk of the best constant estimator and the Bayes risk, as in the DeGroot statistical information.
The risk increase quantifies how much risk we accrue by forgetting information due to the composition with \(\kappa \).
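Combining with the computation for constant Markov kernels above (and assuming \(P\) is a Markov kernel), the risk increase with respect to the discard kernel is explicitly
\[ I^P_\pi (d_{\mathcal X}) = \inf _{z \in \mathcal Z} \pi \left[\theta \mapsto \ell '(y(\theta ), z)\right] - \mathcal R^P_\pi \: . \]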
For \(\kappa \) a Markov kernel, \(I^P_\pi (\kappa ) \ge 0\) .
Use Theorem 1.1.4.
For \(\kappa : \mathcal X \rightsquigarrow \mathcal X'\) and \(\eta : \mathcal X' \rightsquigarrow \mathcal X''\) two Markov kernels, \(I^P_\pi (\eta \circ \kappa ) = I^P_\pi (\kappa ) + I^{\kappa \circ P}_\pi (\eta )\) .
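This is a telescoping identity (a one-line check, assuming the Bayes risks involved are finite so that the differences can be rearranged):
\[ I^P_\pi (\eta \circ \kappa ) = \mathcal R^{\eta \circ \kappa \circ P}_\pi - \mathcal R^{P}_\pi = \left(\mathcal R^{\eta \circ (\kappa \circ P)}_\pi - \mathcal R^{\kappa \circ P}_\pi \right) + \left(\mathcal R^{\kappa \circ P}_\pi - \mathcal R^{P}_\pi \right) = I^{\kappa \circ P}_\pi (\eta ) + I^P_\pi (\kappa ) \: . \]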
For any measurable space \(\mathcal X\), let \(d_{\mathcal X} : \mathcal X \rightsquigarrow *\) be the Markov kernel to the point space. For all Markov kernels \(\kappa : \mathcal X \rightsquigarrow \mathcal X'\),
\[ I^P_\pi (\kappa ) \le I^P_\pi (d_{\mathcal X}) \: . \]
By Lemma 1.1.18, then Lemma 1.1.17,
\[ I^P_\pi (d_{\mathcal X'} \circ \kappa ) = I^P_\pi (\kappa ) + I^{\kappa \circ P}_\pi (d_{\mathcal X'}) \ge I^P_\pi (\kappa ) \: . \]
Finally, \(d_{\mathcal X'} \circ \kappa = d_{\mathcal X}\) since \(\kappa \) is a Markov kernel.
1.2 Simple binary hypothesis testing
In this section, for a measure \(\xi \) on \(\{ 0,1\} \), we write \(\xi _0\) for \(\xi (\{ 0\} )\) and \(\xi _1\) for \(\xi (\{ 1\} )\) . We sometimes write \((a,b)\) for the measure \(\xi \) on \(\{ 0,1\} \) with \(\xi _0 = a\) and \(\xi _1 = b\).
The Bayesian inverse of a kernel \(P : \{ 0,1\} \rightsquigarrow \mathcal X\) with respect to a prior \(\xi \in \mathcal M(\{ 0,1\} )\) is \(P_\xi ^\dagger (x) = \left(\xi _0\frac{d P(0)}{d(P \circ \xi )}(x), \xi _1\frac{d P(1)}{d(P \circ \xi )}(x)\right)\) (almost surely w.r.t. \(P \circ \xi = \xi _0 P(0) + \xi _1 P(1)\)).
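For instance (a reformulation under an extra domination assumption), if \(P(0) = \mu \) and \(P(1) = \nu \) have densities \(p\) and \(q\) with respect to a common measure, then \(P \circ \xi \) has density \(\xi _0 p + \xi _1 q\) and, \((P \circ \xi )\)-almost surely,
\[ P_\xi ^\dagger (x) = \left(\frac{\xi _0\, p(x)}{\xi _0\, p(x) + \xi _1\, q(x)},\ \frac{\xi _1\, q(x)}{\xi _0\, p(x) + \xi _1\, q(x)}\right) \: . \]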
For \(\Theta = \{ 0,1\} \), the Bayes risk of a prior \(\xi \in \mathcal M(\{ 0,1\} )\) is
\[ \mathcal R^P_\xi = \inf _{\hat{y} : \mathcal X \rightsquigarrow \mathcal Z}\left(\xi _0\, r^P_0(\hat{y}) + \xi _1\, r^P_1(\hat{y})\right) \: . \]
1.2.1 Bayes risk of a prior
The Bayes binary risk between measures \(\mu \) and \(\nu \) with respect to prior \(\xi \in \mathcal M(\{ 0,1\} )\), denoted by \(\mathcal B_\xi (\mu , \nu )\), is the Bayes risk \(\mathcal R^P_\xi \) for \(\Theta = \mathcal Y = \mathcal Z = \{ 0,1\} \), \(\ell (y,z) = \mathbb {I}\{ y \ne z\} \), \(P\) the kernel sending 0 to \(\mu \) and 1 to \(\nu \) and prior \(\xi \). That is,
\[ \mathcal B_\xi (\mu , \nu ) = \inf _{\hat{y} : \mathcal X \rightsquigarrow \{ 0,1\} }\left(\xi _0\, \mu \left[x \mapsto \hat{y}(x)(\{ 1\} )\right] + \xi _1\, \nu \left[x \mapsto \hat{y}(x)(\{ 0\} )\right]\right) \: , \]
in which the infimum is over Markov kernels.
If the prior is a probability measure with weights \((\pi , 1 - \pi )\), we write \(B_\pi (\mu , \nu ) = \mathcal B_{(\pi , 1 - \pi )}(\mu , \nu )\) .
Thanks to the following lemma, we can prove properties of the Bayes binary risk for non-zero finite measures by considering only probability measures.
For all \(a, b {\gt} 0\), \(\mathcal B_\xi (\mu , \nu ) = \mathcal B_{(a \xi _0, b \xi _1)}(a^{-1} \mu , b^{-1} \nu )\) .
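A sketch of why this holds: by the explicit form of the Bayes binary risk above, the Bayesian risk of any test depends on \((\xi , \mu , \nu )\) only through the measures \(\xi _0 \mu \) and \(\xi _1 \nu \),
\[ \xi _0\, \mu \left[x \mapsto \hat{y}(x)(\{ 1\} )\right] + \xi _1\, \nu \left[x \mapsto \hat{y}(x)(\{ 0\} )\right] = (a \xi _0)\, (a^{-1}\mu )\left[x \mapsto \hat{y}(x)(\{ 1\} )\right] + (b \xi _1)\, (b^{-1}\nu )\left[x \mapsto \hat{y}(x)(\{ 0\} )\right] \: , \]
and taking the infimum over \(\hat{y}\) gives the equality. The same observation gives the next lemma.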
Alternatively, we can reduce the study of the Bayes binary risk for any prior to the study of the prior \((1, 1)\), for which that risk is related to the total variation distance (defined later).
\(\mathcal B_\xi (\mu , \nu ) = \mathcal B_{(1,1)}(\xi _0\mu , \xi _1\nu )\) .
For \(\mu , \nu \in \mathcal M(\mathcal X)\) and \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) a Markov kernel, \(\mathcal B_\xi (\kappa \circ \mu , \kappa \circ \nu ) \ge \mathcal B_\xi (\mu , \nu )\) .
Apply Theorem 1.1.4.
For \(\mu \in \mathcal M(\mathcal X)\), \(\mathcal B_\xi (\mu , \mu ) = \min \{ \xi _0, \xi _1\} \mu (\mathcal X)\) .
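A sketch of the computation: for any test \(\hat{y}\), since \(\hat{y}(x)(\{ 0\} ) + \hat{y}(x)(\{ 1\} ) = 1\) for all \(x\),
\[ \xi _0\, \mu \left[x \mapsto \hat{y}(x)(\{ 1\} )\right] + \xi _1\, \mu \left[x \mapsto \hat{y}(x)(\{ 0\} )\right] = \mu \left[x \mapsto \xi _0\, \hat{y}(x)(\{ 1\} ) + \xi _1\, \hat{y}(x)(\{ 0\} )\right] \ge \min \{ \xi _0, \xi _1\} \, \mu (\mathcal X) \: , \]
with equality for the constant test that always outputs the hypothesis with the larger prior weight.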
For all measures \(\mu , \nu \), \(\mathcal B_\xi (\mu , \nu ) \ge 0\) .
It is an infimum of non-negative values.
For all measures \(\mu , \nu \), \(\mathcal B_\xi (\mu , \nu ) \le \min \{ \xi _0 \mu (\mathcal X), \xi _1 \nu (\mathcal X)\} \) .
Let \(d_{\mathcal X} : \mathcal X \rightsquigarrow *\) be the discard kernel and \(\delta _*\) be the only probability measure over \(*\), that is a Dirac delta. Then applying the DPI (Theorem 1.2.5), followed by Lemma 1.2.3 and Lemma 1.2.6, we obtain:
\[ \mathcal B_\xi (\mu , \nu ) \le \mathcal B_\xi (d_{\mathcal X} \circ \mu , d_{\mathcal X} \circ \nu ) = \mathcal B_\xi (\mu (\mathcal X)\, \delta _*, \nu (\mathcal X)\, \delta _*) = \mathcal B_{(\xi _0 \mu (\mathcal X), \xi _1 \nu (\mathcal X))}(\delta _*, \delta _*) = \min \{ \xi _0\, \mu (\mathcal X), \xi _1\, \nu (\mathcal X)\} \: . \]
For \(\mu , \nu \in \mathcal M(\mathcal X)\) and \(\xi \in \mathcal M(\{ 0,1\} )\), \(\mathcal B_\xi (\mu , \nu ) = \mathcal B_{\xi _{\leftrightarrow }}(\nu , \mu )\) where \(\xi _{\leftrightarrow } \in \mathcal M(\{ 0,1\} )\) is such that \(\xi _{\leftrightarrow }(\{ 0\} ) = \xi _1\) and \(\xi _{\leftrightarrow }(\{ 1\} ) = \xi _0\). For \(\pi \in [0,1]\), \(B_\pi (\mu , \nu ) = B_{1 - \pi }(\nu , \mu )\) .
The generalized Bayes estimator for the Bayes binary risk with prior \(\xi \in \mathcal M(\{ 0,1\} )\) is \(x \mapsto \text{if } \xi _1\frac{d \nu }{d(P \circ \xi )}(x) \le \xi _0\frac{d \mu }{d(P \circ \xi )}(x) \text{ then } 0 \text{ else } 1\), i.e. it is equal to \(\mathbb {I}_E\) for \(E = \{ x \mid \xi _1\frac{d \nu }{d(P \circ \xi )}(x) {\gt} \xi _0\frac{d \mu }{d(P \circ \xi )}(x)\} \) .
The generalized Bayes estimator is defined by
\[ \hat{y}_B(x) = \arg \min _{z \in \{ 0,1\} } P_\xi ^\dagger (x)\left[\theta \mapsto \mathbb {I}\{ \theta \ne z\} \right] \: . \]
For the simple binary hypothesis testing problem,
\[ P_\xi ^\dagger (x)\left[\theta \mapsto \mathbb {I}\{ \theta \ne z\} \right] = P_\xi ^\dagger (x)(\{ z' \in \{ 0,1\} \mid z' \ne z\} ) \: . \]
For \(z = 0\), this is \(P_\xi ^\dagger (x)(\{ 1\} )\) and for \(z = 1\), this is \(P_\xi ^\dagger (x)(\{ 0\} )\). The generalized Bayes estimator is \(x \mapsto \text{if } P_\xi ^\dagger (x)(\{ 1\} ) \le P_\xi ^\dagger (x)(\{ 0\} ) \text{ then } 0 \text{ else } 1\).
The expression of the lemma statement is obtained by replacing \(P_\xi ^\dagger \) by its value (Lemma 1.2.1).
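In terms of densities (an equivalent formulation under an extra domination assumption, not required by the lemma): if \(\mu \) and \(\nu \) have densities \(p\) and \(q\) with respect to a common measure, then up to a \((P \circ \xi )\)-null set
\[ E = \{ x \mid \xi _1\, q(x) > \xi _0\, p(x)\} \: , \]
so the generalized Bayes estimator is a likelihood-ratio test with threshold \(\xi _0 / \xi _1\).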
By definition,
\[ \mathcal B_\xi (\mu , \nu ) = \inf _{\hat{y} : \mathcal X \rightsquigarrow \{ 0,1\} }\left(\xi _0\, \mu \left[x \mapsto \hat{y}(x)(\{ 1\} )\right] + \xi _1\, \nu \left[x \mapsto \hat{y}(x)(\{ 0\} )\right]\right) \: . \]
If we restrict the infimum to deterministic kernels corresponding to functions of the form \(x \mapsto \mathbb {I}\{ x \in E\} \) for measurable sets \(E\), we get that \(\mathcal B_\xi (\mu , \nu )\) is bounded from above by \(\inf _{E}\left(\xi _0\, \mu (E) + \xi _1\, \nu (E^c)\right)\), the expression for which we want to show equality.
Since the generalized Bayes estimator has that shape (Lemma 1.2.10) and attains the Bayes risk, \(\mathcal B_\xi (\mu , \nu )\) is also at least this restricted infimum, and we have equality.
The Bayes risk of simple binary hypothesis testing for prior \(\xi \in \mathcal M(\{ 0,1\} )\) is
\[ \mathcal B_\xi (\mu , \nu ) = (P \circ \xi )\left[x \mapsto \min \left\{ \xi _0\, \frac{\mathrm{d}\mu }{\mathrm{d}(P \circ \xi )}(x),\ \xi _1\, \frac{\mathrm{d}\nu }{\mathrm{d}(P \circ \xi )}(x)\right\} \right] \: , \]
where \(P\) is the kernel sending \(0\) to \(\mu \) and \(1\) to \(\nu \), so that \(P \circ \xi = \xi _0 \mu + \xi _1 \nu \).
By Theorem 1.1.11, the generalized Bayes estimator is a Bayes estimator, hence it suffices to compute its Bayesian risk. By Lemma 1.1.10,
\[ \mathcal B_\xi (\mu , \nu ) = (P \circ \xi )\left[x \mapsto \inf _{z \in \{ 0,1\} } P_\xi ^\dagger (x)\left[\theta \mapsto \mathbb {I}\{ \theta \ne z\} \right]\right] = (P \circ \xi )\left[x \mapsto \min \left\{ P_\xi ^\dagger (x)(\{ 0\} ), P_\xi ^\dagger (x)(\{ 1\} )\right\} \right] = (P \circ \xi )\left[x \mapsto \min \left\{ \xi _0\, \frac{\mathrm{d}\mu }{\mathrm{d}(P \circ \xi )}(x),\ \xi _1\, \frac{\mathrm{d}\nu }{\mathrm{d}(P \circ \xi )}(x)\right\} \right] \: . \]
The last line is obtained by replacing \(P_\xi ^\dagger \) by its value (Lemma 1.2.1).
Use Theorem 1.2.12 and the equality \(\min \{ a,b\} = \frac{1}{2}(a + b - \vert a - b \vert )\).
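Spelling this out (writing \(\| \rho \| \) for the total mass of the total variation of a signed measure \(\rho \); the total variation distance itself is only introduced later in the text), Theorem 1.2.12 and this identity give
\[ \mathcal B_\xi (\mu , \nu ) = \frac{1}{2}\left(\xi _0\, \mu (\mathcal X) + \xi _1\, \nu (\mathcal X) - \left\| \xi _0 \mu - \xi _1 \nu \right\| \right) \: . \]
In particular, for probability measures \(\mu , \nu \) and \(\xi = (1,1)\), \(\mathcal B_{(1,1)}(\mu , \nu ) = 1 - \frac{1}{2}\| \mu - \nu \| \).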
Let \(\hat{y}_B\) be the generalized Bayes estimator for simple binary hypothesis testing. The distribution \(\ell \circ (\mathrm{id} \parallel \hat{y}_B) \circ (\pi \otimes P)\) (in which \(\ell \) stands for the associated deterministic kernel) is a Bernoulli with mean \(\mathcal B_\pi (\mu , \nu )\).
\(\ell \) takes values in \(\{ 0,1\} \) so the law has to be Bernoulli. It has mean \(\mathcal B_\pi (\mu , \nu )\) because the expected risk of the generalized Bayes estimator is \(\mathcal B_\pi (\mu , \nu )\).
TODO: find a way to hide the following dummy lemma
Dummy node to summarize properties of the Bayes binary risk.
1.3 Reduction to testing
TODO: lower bounds on estimation by reduction to testing problems.
1.4 Sample complexity
TODO: we could generalize this to a sequence of kernels instead of iid observations.
An estimation problem of the shape \((P^{\otimes n}, y, \ell ')\) corresponds to estimating from \(n \in \mathbb {N}\) i.i.d. samples of the distribution. From the data-processing inequality, we get that the risk (Bayes risk, minimax risk...) does not increase with the number of samples (see for example Lemma 1.4.1). Intuitively, having more information allows for better estimation.
In a sequence of estimation tasks \((P^{\otimes n}, y, \ell ')\) for \(n \in \mathbb {N}\), the sample complexity is the minimal \(n\) such that the risk is less than a desired value.
If \(n \le m\) then \(\mathcal R_\pi ^{P^{\otimes n}} \ge \mathcal R_\pi ^{P^{\otimes m}}\).
Use the data-processing inequality, Theorem 1.1.4, for the deterministic kernel that projects on the first \(n\) coordinates.
The sample complexity of Bayesian estimation with respect to a prior \(\pi \in \mathcal M(\Theta )\) at risk level \(\delta \in \mathbb {R}_{+,\infty }\) is
\[ n_\pi (\delta ) = \min \{ n \in \mathbb {N} \mid \mathcal R_\pi ^{P^{\otimes n}} \le \delta \} \: . \]
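For example (a generic illustration, not from the text): if the risk decays geometrically in the number of samples, \(\mathcal R_\pi ^{P^{\otimes n}} \le C e^{-c n}\) for some \(C, c > 0\), then for \(\delta < C\)
\[ n_\pi (\delta ) \le \left\lceil \frac{1}{c} \log \frac{C}{\delta } \right\rceil \: . \]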