1 Testing, Estimation, Risk
In the estimation problem, we are given a family of measures and we want to estimate an unknown parameter for the measure from which we observe samples. Let $(\mathcal{X}, \mathcal{A})$ be the measurable space on which the measures are defined. The measure family is described by a kernel $P : \Theta \rightsquigarrow \mathcal{X}$ from a parameter space $\Theta$. We write $P_\theta$ for $P(\theta)$. In parametric problems, $\Theta$ is a low-dimensional space. In nonparametric problems $\Theta$ is large, for example a space of measures or of functions.
The goal of the estimation task is to find, given the observation of samples of $P_\theta$ for an unknown $\theta \in \Theta$, the value of a function $y : \Theta \to \mathcal{Y}$, which maps the parameter to the quantity we are interested in. For example, $y(\theta)$ might be the expected value of the distribution $P_\theta$.
An estimator is a Markov kernel $\hat{y} : \mathcal{X} \rightsquigarrow \mathcal{Z}$, where most often $\mathcal{Z} = \mathcal{Y}$. Note that, while $y$ is a function of a parameter in $\Theta$, $\hat{y}$ has domain $\mathcal{X}$, as the estimation is done based on a sample of the measure, while the measure itself and its parameter remain unknown. In general this is a randomized estimator, but deterministic estimators (corresponding to deterministic kernels) are often enough to attain optimal performance. The quality of the estimation is quantified by a loss $\ell : \mathcal{Y} \times \mathcal{Y} \to [0, +\infty]$. We write $\ell'$ for a loss on $\mathcal{Y}$ and $\mathcal{Z}$ in the heterogeneous case $\mathcal{Z} \neq \mathcal{Y}$.
In short, an estimation problem is specified by the data $(P, y, \ell')$.
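As a concrete illustration (an example chosen here, not prescribed by the text): estimation of the mean of a Gaussian with known variance under squared loss corresponds to
\[
\Theta = \mathbb{R}, \qquad P_\theta = \mathcal{N}(\theta, 1), \qquad \mathcal{Y} = \mathcal{Z} = \mathbb{R}, \qquad y(\theta) = \theta, \qquad \ell(u, z) = (u - z)^2 .
\]
A deterministic estimator such as $\hat{y}(x) = x$ (report the observation itself) is then a particular deterministic Markov kernel $\hat{y} : \mathcal{X} \rightsquigarrow \mathbb{R}$, here with $\mathcal{X} = \mathbb{R}$.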
Example: simple binary hypothesis testing (SBHT)
We say that an estimation problem is a testing problem if $\mathcal{Y}$ is discrete and the loss takes values in $\{0, 1\}$. A test is said to be binary if $\mathcal{Y}$ has two elements. A test is simple if $y^{-1}(\{u\})$ is a singleton for all $u \in \mathcal{Y}$, i.e. $y$ is bijective.
In summary, in simple binary hypothesis testing, $\Theta = \{0, 1\}$, $\mathcal{Y} = \mathcal{Z} = \{0, 1\}$ and $y = \mathrm{id}$ (that is, $y$ is the identity). For $u, z \in \{0, 1\}$, $\ell(u, z) = \mathbb{I}\{u \neq z\}$.
1.1 Risk
Let $\hat{y} \circ P : \Theta \rightsquigarrow \mathcal{Z}$ be the law of the estimate. That is, $\hat{y} \circ P$ is the kernel obtained by composing $P$ and $\hat{y}$: $(\hat{y} \circ P)(\theta)$ is the distribution of the output of $\hat{y}$ applied to a sample of $P_\theta$.
Definition 1 (Risk)
The risk of an estimator $\hat{y}$ on the estimation problem $(P, y, \ell')$ at $\theta \in \Theta$ is
\[
R_\theta(\hat{y}) = \int_{\mathcal{Z}} \ell'(y(\theta), z) \, d\big((\hat{y} \circ P)(\theta)\big)(z) ,
\]
the expected loss under the law of the estimate.
Example (SBHT): $R_0(\hat{y}) = (\hat{y} \circ P)(0)(\{1\})$ and $R_1(\hat{y}) = (\hat{y} \circ P)(1)(\{0\})$, the probabilities of answering $1$ when $\theta = 0$ and of answering $0$ when $\theta = 1$.
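To make this explicit (an illustrative rewriting, writing $\mu = P_0$ and $\nu = P_1$): for a deterministic test that answers $1$ exactly on a measurable set $E \subseteq \mathcal{X}$,
\[
R_0(\hat{y}) = \mu(E), \qquad R_1(\hat{y}) = \nu(E^c),
\]
the type I and type II error probabilities of the test.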
Definition 2 (Bayesian risk)
The Bayesian risk of an estimator $\hat{y}$ on $(P, y, \ell')$ for a prior $\pi$ (a measure on $\Theta$) is $R_\pi(\hat{y}) = \int_\Theta R_\theta(\hat{y}) \, d\pi(\theta)$. It can also be expanded as
\[
R_\pi(\hat{y}) = \int_\Theta \int_{\mathcal{Z}} \ell'(y(\theta), z) \, d\big((\hat{y} \circ P)(\theta)\big)(z) \, d\pi(\theta) .
\]
If we write $y$ and $\ell'$ as deterministic kernels, the Bayesian risk of $\hat{y}$ is the mean of the measure obtained by composing the prior $\pi$, the pair of kernels $(y, \hat{y} \circ P)$ and the loss kernel.
Definition 3 (Bayes risk)
The Bayes risk of $(P, y, \ell')$ for prior $\pi$ is $\mathcal{R}^*_\pi = \inf_{\hat{y}} R_\pi(\hat{y})$, where the infimum is over Markov kernels $\hat{y} : \mathcal{X} \rightsquigarrow \mathcal{Z}$. We write $\mathcal{R}^*_\pi(P)$ when the dependence on the kernel has to be made explicit.
Expanded, the Bayes risk of $(P, y, \ell')$ is
\[
\mathcal{R}^*_\pi = \inf_{\hat{y} : \mathcal{X} \rightsquigarrow \mathcal{Z}} \int_\Theta \int_{\mathcal{Z}} \ell'(y(\theta), z) \, d\big((\hat{y} \circ P)(\theta)\big)(z) \, d\pi(\theta) .
\]
Example (SBHT): $\mathcal{R}^*_\pi = \inf_{\hat{y}} \big( \pi_0 \, (\hat{y} \circ P)(0)(\{1\}) + \pi_1 \, (\hat{y} \circ P)(1)(\{0\}) \big)$, where $\pi_0 = \pi(\{0\})$ and $\pi_1 = \pi(\{1\})$.
Definition 4 (Bayes estimator)
An estimator $\hat{y}$ is said to be a Bayes estimator for a prior $\pi$ if $R_\pi(\hat{y}) = \mathcal{R}^*_\pi$.
Definition 5 (Minimax risk)
The minimax risk of $(P, y, \ell')$ is $\mathcal{R}^* = \inf_{\hat{y}} \sup_{\theta \in \Theta} R_\theta(\hat{y})$, where the infimum is over Markov kernels $\hat{y} : \mathcal{X} \rightsquigarrow \mathcal{Z}$.
Example (SBHT): $\mathcal{R}^* = \inf_{\hat{y}} \max\big\{ (\hat{y} \circ P)(0)(\{1\}), \ (\hat{y} \circ P)(1)(\{0\}) \big\}$.
Proof
For any probability prior $\pi$ and any estimator $\hat{y}$, $R_\pi(\hat{y}) \le \sup_{\theta \in \Theta} R_\theta(\hat{y})$; taking infima over estimators shows that the Bayes risk of any probability prior is at most the minimax risk.
TODO: “often”, .
1.1.1 Properties of the Bayes risk of a prior
Lemma 1.1.2
The Bayes risk of a prior $\pi$ on $\Theta$ with a Markov kernel $P : \Theta \rightsquigarrow \mathcal{X}$ satisfies
\[
\mathcal{R}^*_\pi(P) \le \inf_{z \in \mathcal{Z}} \int_\Theta \ell'(y(\theta), z) \, d\pi(\theta) .
\]
Proof
The infimum over all Markov kernels in the definition of the Bayes risk of $\pi$ is bounded from above by the infimum restricted to constant, deterministic Markov kernels, i.e. kernels of the form $x \mapsto \delta_z$ for a fixed $z \in \mathcal{Z}$. For such a kernel the Bayesian risk is $\int_\Theta \ell'(y(\theta), z) \, d\pi(\theta)$.
Lemma 1.1.3
The Bayes risk of a prior $\pi$ on $\Theta$ with a constant Markov kernel $P : \theta \mapsto \xi$ (for a fixed probability measure $\xi$ on $\mathcal{X}$) is
\[
\mathcal{R}^*_\pi(P) = \inf_{z \in \mathcal{Z}} \int_\Theta \ell'(y(\theta), z) \, d\pi(\theta) .
\]
In particular, it does not depend on $\xi$.
Proof
Let $\xi$ be the measure such that $P_\theta = \xi$ for all $\theta \in \Theta$. Then the law of the estimate $(\hat{y} \circ P)(\theta) = \hat{y} \circ \xi$ does not depend on $\theta$, so the Bayesian risk of any estimator is an average over $z \sim \hat{y} \circ \xi$ of $\int_\Theta \ell'(y(\theta), z) \, d\pi(\theta)$, hence at least the infimum over $z$. The matching upper bound is Lemma 1.1.2.
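For instance (an illustration of the statement above in the binary case): in simple binary hypothesis testing with a probability prior $(\pi_0, \pi_1)$, an observation carrying no information leaves us with the best constant guess, whose Bayesian risk is
\[
\inf_{z \in \{0,1\}} \big( \pi_0 \, \ell(0, z) + \pi_1 \, \ell(1, z) \big) = \min\{\pi_0, \pi_1\} .
\]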
Data-processing inequality and consequences
Theorem 1.1.4 (Data-processing inequality)
For $P : \Theta \rightsquigarrow \mathcal{X}$ and $\kappa : \mathcal{X} \rightsquigarrow \mathcal{X}'$ a Markov kernel, $\mathcal{R}^*_\pi(\kappa \circ P) \ge \mathcal{R}^*_\pi(P)$ (where the two estimation problems differ only in the kernel).
Proof
The Bayesian risk of an estimator $\hat{y} : \mathcal{X}' \rightsquigarrow \mathcal{Z}$ for the problem with kernel $\kappa \circ P$ equals the Bayesian risk of $\hat{y} \circ \kappa : \mathcal{X} \rightsquigarrow \mathcal{Z}$ for the problem with kernel $P$, since $\hat{y} \circ (\kappa \circ P) = (\hat{y} \circ \kappa) \circ P$.
The inequality is due to taking an infimum over a larger set: the kernels of the form $\hat{y} \circ \kappa$ form a subset of the Markov kernels from $\mathcal{X}$ to $\mathcal{Z}$.
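As a sketch of how the inequality is typically applied (an illustration, not taken from the text): if $f : \mathcal{X} \to \mathcal{X}'$ is a measurable map (for example a binning of a real-valued observation) and $\kappa$ is the deterministic kernel $x \mapsto \delta_{f(x)}$, then any estimator based on $f(X)$ is also an estimator based on $X$ (compose with $\kappa$), so
\[
\mathcal{R}^*_\pi(\kappa \circ P) \ge \mathcal{R}^*_\pi(P) ,
\]
and coarsening the observation through $f$ can only increase the best achievable Bayesian risk.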
Lemma 1.1.5
For $P : \Theta \rightsquigarrow \mathcal{X}$ and $\kappa : \mathcal{X} \rightsquigarrow \mathcal{X}'$ a Markov kernel, $\mathcal{R}^*_\pi(P \otimes \kappa) \le \mathcal{R}^*_\pi(P)$, where $P \otimes \kappa : \Theta \rightsquigarrow \mathcal{X} \times \mathcal{X}'$ denotes the kernel giving the joint law of the observation and of its image by $\kappa$.
Proof
Use Theorem 1.1.4: $P$ is the composition of $P \otimes \kappa$ and the deterministic kernel that forgets the second value.
Lemma 1.1.6
For $P : \Theta \rightsquigarrow \mathcal{X}$ and $\kappa : \mathcal{X} \rightsquigarrow \mathcal{X}'$ a Markov kernel, $\mathcal{R}^*_\pi(P \otimes \kappa) \le \mathcal{R}^*_\pi(\kappa \circ P)$, in which $\kappa \circ P$ is the kernel obtained by marginalizing over $\mathcal{X}$ in the output of $P \otimes \kappa$.
Proof
Use Theorem 1.1.4: $\kappa \circ P$ is the composition of $P \otimes \kappa$ and the deterministic kernel that forgets the first value.
Lemma 1.1.7
The Bayes risk $\pi \mapsto \mathcal{R}^*_\pi(P)$ is concave in the prior $\pi$.
Proof
For a fixed estimator, the Bayesian risk is linear in the prior. The infimum of a sum is larger than the sum of the infimums: for priors $\pi_1, \pi_2$ and $\lambda \in [0, 1]$,
\[
\mathcal{R}^*_{\lambda \pi_1 + (1 - \lambda) \pi_2} = \inf_{\hat{y}} \big( \lambda R_{\pi_1}(\hat{y}) + (1 - \lambda) R_{\pi_2}(\hat{y}) \big) \ge \lambda \mathcal{R}^*_{\pi_1} + (1 - \lambda) \mathcal{R}^*_{\pi_2} .
\]
TODO: can we build kernels such that this lemma is a consequence of one of the lemmas above?
Generalized Bayes estimator
Lemma 1.1.8
The Bayesian risk of a Markov kernel $\hat{y} : \mathcal{X} \rightsquigarrow \mathcal{Z}$ with respect to a prior $\pi$ on $\Theta$ satisfies
\[
R_\pi(\hat{y}) = \int_{\mathcal{X}} \int_{\mathcal{Z}} \int_\Theta \ell'(y(\theta), z) \, d\big(P^\dagger_\pi(x)\big)(\theta) \, d\big(\hat{y}(x)\big)(z) \, dm_\pi(x) ,
\]
where $m_\pi = \int_\Theta P_\theta \, d\pi(\theta)$ is the law of the observation, whenever the Bayesian inverse $P^\dagger_\pi : \mathcal{X} \rightsquigarrow \Theta$ of $P$ with respect to $\pi$ exists (Definition 48).
Proof
Use the main property of the Bayesian inverse: the joint law of $(\theta, x)$ given by $\pi$ and $P$ coincides with the joint law given by $m_\pi$ and $P^\dagger_\pi$, so the integral defining $R_\pi(\hat{y})$ can be rewritten with the order of integration reversed.
Lemma 1.1.9
The Bayesian risk of a Markov kernel $\hat{y} : \mathcal{X} \rightsquigarrow \mathcal{Z}$ with respect to a prior $\pi$ on $\Theta$ satisfies
\[
R_\pi(\hat{y}) \ge \int_{\mathcal{X}} \inf_{z \in \mathcal{Z}} \int_\Theta \ell'(y(\theta), z) \, d\big(P^\dagger_\pi(x)\big)(\theta) \, dm_\pi(x) .
\]
Proof
Starting from the equality of Lemma 1.1.8, we get a lower bound by replacing the inner integral over $\mathcal{Z}$ by an infimum: since $\hat{y}(x)$ is a probability measure,
\[
\int_{\mathcal{Z}} \int_\Theta \ell'(y(\theta), z) \, d\big(P^\dagger_\pi(x)\big)(\theta) \, d\big(\hat{y}(x)\big)(z) \ge \inf_{z \in \mathcal{Z}} \int_\Theta \ell'(y(\theta), z) \, d\big(P^\dagger_\pi(x)\big)(\theta) .
\]
Definition 6 (Generalized Bayes estimator)
The generalized Bayes estimator for prior $\pi$ on $\Theta$ is the deterministic estimator given by
\[
\hat{y}_\pi(x) = \arg\min_{z \in \mathcal{Z}} \int_\Theta \ell'(y(\theta), z) \, d\big(P^\dagger_\pi(x)\big)(\theta) ,
\]
if there exists such a measurable argmin.
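For example (a standard specialization, with choices made here for illustration: $\mathcal{Y} = \mathcal{Z} = \Theta \subseteq \mathbb{R}$, $y = \mathrm{id}$ and squared loss), the generalized Bayes estimator is the posterior mean:
\[
\hat{y}_\pi(x) = \arg\min_{z \in \mathbb{R}} \int_\Theta (\theta - z)^2 \, d\big(P^\dagger_\pi(x)\big)(\theta) = \int_\Theta \theta \, d\big(P^\dagger_\pi(x)\big)(\theta) ,
\]
provided the posterior $P^\dagger_\pi(x)$ has a finite second moment.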
Lemma 1.1.10
The Bayesian risk of the generalized Bayes estimator is
\[
R_\pi(\hat{y}_\pi) = \int_{\mathcal{X}} \inf_{z \in \mathcal{Z}} \int_\Theta \ell'(y(\theta), z) \, d\big(P^\dagger_\pi(x)\big)(\theta) \, dm_\pi(x) .
\]
Proof
Start from the equality of Lemma 1.1.8 and use the definition of the generalized Bayes estimator.
Theorem 1.1.11
When the generalized Bayes estimator is well defined, it is a Bayes estimator. The value of the Bayes risk with respect to the prior is then
\[
\mathcal{R}^*_\pi = \int_{\mathcal{X}} \inf_{z \in \mathcal{Z}} \int_\Theta \ell'(y(\theta), z) \, d\big(P^\dagger_\pi(x)\big)(\theta) \, dm_\pi(x) .
\]
Proof
By Lemma 1.1.9, the Bayesian risk of any estimator is bounded from below by the value computed in Lemma 1.1.10 for the generalized Bayes estimator, which therefore attains the infimum.
Lemma 1.1.12
When the generalized Bayes estimator is well defined, the Bayes risk with respect to the prior is
Lemma 1.1.13
When the generalized Bayes estimator is well defined, the Bayes risk with respect to the prior for , and is
Lemma 1.1.14
When the generalized Bayes estimator is well defined, the Bayes risk with respect to the prior for , and is
When is a probability measure and is a Markov kernel, .
Lemma 1.1.15
The generalized Bayes estimator for prior on the estimation problem defined by , and is
Lemma 1.1.16
Suppose that is finite and let . The Bayes risk with respect to the prior for , , a Markov kernel and satisfies
1.1.2 Bayes risk increase
TODO: remove this section? It’s not used for anything now.
The following is a new definition, which is a natural generalization of the DeGroot statistical information.
Definition 7
The Bayes risk increase of a kernel $\eta : \mathcal{X} \rightsquigarrow \mathcal{X}'$ with respect to the estimation problem $(P, y, \ell')$ and the prior $\pi$ is the difference of the Bayes risk of $\eta \circ P$ and that of $P$. That is,
\[
\mathcal{R}^*_\pi(\eta \circ P) - \mathcal{R}^*_\pi(P) .
\]
In particular, if we take $\eta$ equal to the Markov kernel to the point space (denoted by $\mathbf{d}_{\mathcal{X}}$, for "discard" kernel), we get the difference $\mathcal{R}^*_\pi(\mathbf{d}_{\mathcal{X}} \circ P) - \mathcal{R}^*_\pi(P)$, where $\mathbf{d}_{\mathcal{X}} \circ P$ is a constant kernel if $P$ is a Markov kernel. We recover a difference between the risk of the best constant kernel and the risk of the Bayes estimator, as in the DeGroot statistical information.
The risk increase quantifies how much risk we accrue by forgetting information due to the composition with $\eta$.
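As an illustrative special case (combining the remark above with the binary setting of Section 1.2, for probability measures and a probability prior $(\pi_0, \pi_1)$): since $\mathbf{d}_{\mathcal{X}} \circ P$ is a constant kernel, Lemma 1.1.3 gives the risk increase of the discard kernel as
\[
\mathcal{R}^*_\pi(\mathbf{d}_{\mathcal{X}} \circ P) - \mathcal{R}^*_\pi(P) = \min\{\pi_0, \pi_1\} - \mathcal{R}^*_\pi(P) ,
\]
which is DeGroot's statistical information for the testing problem.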
Lemma 1.1.18
For $\eta : \mathcal{X} \rightsquigarrow \mathcal{X}'$ and $\eta' : \mathcal{X}' \rightsquigarrow \mathcal{X}''$ two Markov kernels, the Bayes risk increase of $\eta' \circ \eta$ with respect to $(P, \pi)$ is the sum of the Bayes risk increase of $\eta$ with respect to $(P, \pi)$ and that of $\eta'$ with respect to $(\eta \circ P, \pi)$:
\[
\mathcal{R}^*_\pi(\eta' \circ \eta \circ P) - \mathcal{R}^*_\pi(P) = \big( \mathcal{R}^*_\pi(\eta \circ P) - \mathcal{R}^*_\pi(P) \big) + \big( \mathcal{R}^*_\pi(\eta' \circ \eta \circ P) - \mathcal{R}^*_\pi(\eta \circ P) \big) .
\]
Lemma 1.1.19 (Data-processing inequality)
For any measurable space $\mathcal{X}'$, let $\mathbf{d}_{\mathcal{X}'}$ be the Markov kernel to the point space. For all Markov kernels $\eta : \mathcal{X} \rightsquigarrow \mathcal{X}'$, the Bayes risk increase of $\mathbf{d}_{\mathcal{X}'}$ with respect to $(\eta \circ P, \pi)$ is at most that of $\mathbf{d}_{\mathcal{X}}$ with respect to $(P, \pi)$:
\[
\mathcal{R}^*_\pi(\mathbf{d}_{\mathcal{X}'} \circ \eta \circ P) - \mathcal{R}^*_\pi(\eta \circ P) \le \mathcal{R}^*_\pi(\mathbf{d}_{\mathcal{X}} \circ P) - \mathcal{R}^*_\pi(P) .
\]
Proof
By Lemma 1.1.18, then Lemma 1.1.17, the risk increase of $\mathbf{d}_{\mathcal{X}'} \circ \eta$ with respect to $(P, \pi)$ is the sum of the risk increases of $\eta$ (with respect to $(P, \pi)$) and of $\mathbf{d}_{\mathcal{X}'}$ (with respect to $(\eta \circ P, \pi)$), and the risk increase of $\eta$ is non-negative.
Finally, $\mathbf{d}_{\mathcal{X}'} \circ \eta = \mathbf{d}_{\mathcal{X}}$ since $\eta$ is a Markov kernel.
1.2 Simple binary hypothesis testing
In this section, for a measure $\pi$ on $\{0, 1\}$, we write $\pi_0$ for $\pi(\{0\})$ and $\pi_1$ for $\pi(\{1\})$. We sometimes write $(a, b)$ for the measure $\pi$ on $\{0, 1\}$ with $\pi_0 = a$ and $\pi_1 = b$.
Lemma 1.2.1
The Bayesian inverse of a kernel $P : \{0, 1\} \rightsquigarrow \mathcal{X}$ with $P_0 = \mu$ and $P_1 = \nu$, with respect to a prior $\pi = (\pi_0, \pi_1)$, is
\[
P^\dagger_\pi(x) = \left( \pi_0 \frac{d\mu}{d(\pi_0 \mu + \pi_1 \nu)}(x), \ \pi_1 \frac{d\nu}{d(\pi_0 \mu + \pi_1 \nu)}(x) \right)
\]
(almost surely w.r.t. $\pi_0 \mu + \pi_1 \nu$).
Lemma 1.2.2
For simple binary hypothesis testing with $P_0 = \mu$ and $P_1 = \nu$, the Bayes risk of a prior $\pi = (\pi_0, \pi_1)$ is
\[
\mathcal{R}^*_\pi = \int_{\mathcal{X}} \min\left\{ \pi_0 \frac{d\mu}{dm}, \ \pi_1 \frac{d\nu}{dm} \right\} dm , \qquad m = \pi_0 \mu + \pi_1 \nu .
\]
Proof
Use Theorem 1.1.11, with the value of $P^\dagger_\pi$ given by Lemma 1.2.1.
1.2.1 Bayes risk of a prior
Definition 8
The Bayes binary risk between measures $\mu$ and $\nu$ with respect to prior $\pi$, denoted by $B_\pi(\mu, \nu)$, is the Bayes risk for $\Theta = \mathcal{Y} = \mathcal{Z} = \{0, 1\}$, $\ell(u, z) = \mathbb{I}\{u \neq z\}$, the kernel $P$ sending 0 to $\mu$ and 1 to $\nu$, and prior $\pi$. That is,
\[
B_\pi(\mu, \nu) = \inf_{\hat{y} : \mathcal{X} \rightsquigarrow \{0,1\}} \big( \pi_0 \, (\hat{y} \circ \mu)(\{1\}) + \pi_1 \, (\hat{y} \circ \nu)(\{0\}) \big) ,
\]
in which the infimum is over Markov kernels and $\hat{y} \circ \mu$ denotes the image of $\mu$ by the kernel $\hat{y}$.
If the prior is a probability measure with weights , we write .
Thanks to the following lemma, we can prove properties of the Bayes binary risk for non-zero finite measures by considering only probability measures.
Alternatively, we can reduce the study of the Bayes binary risk for any prior to the study of the prior $(1, 1)$, for which that risk is related to the total variation distance (defined later).
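To make the relation concrete (a sketch using the restriction of the infimum to deterministic tests, which is shown below to attain the Bayes risk; the total variation distance $\mathrm{TV}$ is only defined later): for probability measures $\mu$ and $\nu$,
\[
B_{(1,1)}(\mu, \nu) = \inf_{E \text{ measurable}} \big( \mu(E) + \nu(E^c) \big) = 1 - \sup_{E} \big( \nu(E) - \mu(E) \big) = 1 - \mathrm{TV}(\mu, \nu) .
\]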
Theorem 1.2.5 (Data-processing inequality)
For $\mu, \nu$ finite measures on $\mathcal{X}$ and $\kappa : \mathcal{X} \rightsquigarrow \mathcal{X}'$ a Markov kernel, $B_\pi(\kappa \circ \mu, \kappa \circ \nu) \ge B_\pi(\mu, \nu)$.
Proof
It is an infimum of non-negative values.
Proof
Let $\mathbf{d}$ be the discard kernel and $\delta$ be the only probability measure over the point space, that is, a Dirac delta. Then applying the DPI (Theorem 1.2.5), followed by Lemma 1.2.3 and Lemma 1.2.6, we obtain:
Lemma 1.2.9
For and , where is such that and . For , .
Lemma 1.2.10
The generalized Bayes estimator for the Bayes binary risk with prior $\pi$ is
\[
\hat{y}_\pi(x) = \mathbb{I}\left\{ \pi_1 \frac{d\nu}{dm}(x) \ge \pi_0 \frac{d\mu}{dm}(x) \right\}, \qquad m = \pi_0 \mu + \pi_1 \nu ,
\]
i.e. it is equal to $1$ for $\pi_1 \frac{d\nu}{dm}(x) \ge \pi_0 \frac{d\mu}{dm}(x)$ and to $0$ otherwise.
Proof
The generalized Bayes estimator is defined by
\[
\hat{y}_\pi(x) = \arg\min_{z \in \{0,1\}} \int_\Theta \ell(\theta, z) \, d\big(P^\dagger_\pi(x)\big)(\theta) .
\]
For the simple binary hypothesis testing problem,
\[
\int_\Theta \ell(\theta, z) \, d\big(P^\dagger_\pi(x)\big)(\theta) = \big(P^\dagger_\pi(x)\big)(\{1 - z\}) .
\]
For $z = 0$, this is $\big(P^\dagger_\pi(x)\big)(\{1\})$ and for $z = 1$, this is $\big(P^\dagger_\pi(x)\big)(\{0\})$. The generalized Bayes estimator is $\hat{y}_\pi(x) = \mathbb{I}\big\{ \big(P^\dagger_\pi(x)\big)(\{1\}) \ge \big(P^\dagger_\pi(x)\big)(\{0\}) \big\}$.
The expression of the lemma statement is obtained by replacing $P^\dagger_\pi$ by its value (Lemma 1.2.1).
Proof
By definition, the Bayes binary risk is an infimum over all Markov kernels $\hat{y} : \mathcal{X} \rightsquigarrow \{0, 1\}$.
If we restrict the infimum to deterministic kernels corresponding to functions of the form $x \mapsto \mathbb{I}\{x \in E\}$ for measurable $E$, we get that the Bayes binary risk is bounded from above by the expression for which we want to show equality.
Since the Bayes estimator has that shape (Lemma 1.2.10), the Bayes binary risk is also greater than or equal to the restricted infimum, and we have equality.
Theorem 1.2.12
The Bayes risk of simple binary hypothesis testing for prior $\pi = (\pi_0, \pi_1)$ is
\[
B_\pi(\mu, \nu) = \int_{\mathcal{X}} \min\left\{ \pi_0 \frac{d\mu}{dm}, \ \pi_1 \frac{d\nu}{dm} \right\} dm , \qquad m = \pi_0 \mu + \pi_1 \nu .
\]
Proof
By Theorem 1.1.11, the generalized Bayes estimator is a Bayes estimator, hence it suffices to compute its Bayesian risk. By Lemma 1.1.10,
\[
B_\pi(\mu, \nu) = \int_{\mathcal{X}} \min_{z \in \{0,1\}} \big(P^\dagger_\pi(x)\big)(\{1 - z\}) \, dm(x) = \int_{\mathcal{X}} \min\left\{ \pi_0 \frac{d\mu}{dm}, \ \pi_1 \frac{d\nu}{dm} \right\} dm .
\]
The last equality is obtained by replacing $P^\dagger_\pi$ by its value (Lemma 1.2.1).
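A small numerical illustration (with measures chosen here for the example): take $\mathcal{X} = \{0, 1\}$, $\mu = \mathrm{Ber}(0.2)$, $\nu = \mathrm{Ber}(0.6)$ and the probability prior $\pi = (1/2, 1/2)$. Then
\[
B_\pi(\mu, \nu) = \sum_{x \in \{0,1\}} \min\left\{ \tfrac{1}{2} \mu(\{x\}), \ \tfrac{1}{2} \nu(\{x\}) \right\} = \tfrac{1}{2} \min\{0.8, 0.4\} + \tfrac{1}{2} \min\{0.2, 0.6\} = 0.3 .
\]
The optimal test answers $1$ exactly when $x = 1$, and its Bayesian risk is $\tfrac{1}{2} \mu(\{1\}) + \tfrac{1}{2} \nu(\{0\}) = 0.3$, as expected.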
Proof
Use Theorem 1.2.12 and the equality $\min(a, b) = \tfrac{1}{2}(a + b - |a - b|)$.
Lemma 1.2.14
Let $\hat{y}_\pi$ be the generalized Bayes estimator for simple binary hypothesis testing. For a probability prior $\pi$, the distribution of the loss $\ell(\theta, \hat{y}_\pi(X))$ for $\theta \sim \pi$ and $X \sim P_\theta$ (in which $\hat{y}_\pi$ stands for the associated deterministic kernel) is a Bernoulli with mean $B_\pi(\mu, \nu)$.
Proof
The loss takes values in $\{0, 1\}$ so the law has to be Bernoulli. It has mean $B_\pi(\mu, \nu)$ because the expected risk of the generalized Bayes estimator is the Bayes risk $B_\pi(\mu, \nu)$ (Theorem 1.1.11).
TODO: find a way to hide the following dummy lemma
Lemma 1.2.15 (Dummy lemma: bayesBinaryRisk properties)
Dummy node to summarize properties of the Bayes binary risk.
1.3 Reduction to testing
TODO: lower bounds on estimation by reduction to testing problems.
1.4 Sample complexity
TODO: we could generalize this to a sequence of kernels instead of iid observations.
An estimation problem of the shape $(P^{\otimes n}, y, \ell')$, in which the kernel $P^{\otimes n} : \Theta \rightsquigarrow \mathcal{X}^n$ maps $\theta$ to the product measure $P_\theta^{\otimes n}$, corresponds to estimating $y(\theta)$ from $n$ i.i.d. samples of the distribution $P_\theta$. From the data-processing inequality, we get that the risk (Bayes risk, minimax risk...) decreases with the number of samples (Lemma 1.4.1 for example). Intuitively, having more information allows for better estimation.
In a sequence of estimation tasks indexed by the number of samples $n$, the sample complexity is the minimal $n$ such that the risk is less than a desired value.
Proof
Use the data-processing inequality, Theorem 1.1.4, for the deterministic kernel that projects on the first $n$ coordinates. □
Definition 9
The sample complexity of Bayesian estimation with respect to a prior $\pi$ at risk level $\delta$ is
\[
n_\pi(\delta) = \min\{ n \in \mathbb{N} : \mathcal{R}^*_\pi(P^{\otimes n}) \le \delta \} .
\]
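As an illustration (a standard Gaussian example chosen here, not taken from the text): for mean estimation with $P_\theta = \mathcal{N}(\theta, 1)$, squared loss and prior $\pi = \mathcal{N}(0, 1)$, the posterior after $n$ i.i.d. observations is Gaussian with variance $1/(n+1)$, so the Bayes risk is $\mathcal{R}^*_\pi(P^{\otimes n}) = 1/(n+1)$ and the sample complexity at level $\delta \in (0, 1]$ is
\[
n_\pi(\delta) = \lceil 1/\delta \rceil - 1 .
\]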