Lower bounds for hypothesis testing based on information theory

1 Testing, Estimation, Risk

In the estimation problem, we are given a family of measures and we want to estimate an unknown parameter of the measure from which we observe samples. Let $\mathcal{X}$ be the measurable space on which the measures are defined. The measure family is described by a kernel $P : \Theta \rightsquigarrow \mathcal{X}$. We write $P_\theta$ for $P(\theta)$. In parametric problems, $\Theta$ is a low-dimensional space. In nonparametric problems $\Theta$ is large, for example $\Theta = \mathcal{M}(\mathcal{X})$ (the measures on $\mathcal{X}$) or $\Theta = \mathcal{P}(\mathcal{X})$ (the probability measures on $\mathcal{X}$).

The goal of the estimation task is to find, given the observation of samples of $P_\theta$ for an unknown $\theta$, the value of a function $y : \Theta \to \mathcal{Y}$, which maps the parameter $\theta$ to the quantity we are interested in. For example, $y(\theta)$ might be the expected value of the distribution $P_\theta$.

An estimator is a Markov kernel $\hat{y} : \mathcal{X} \rightsquigarrow \mathcal{Z}$, where most often $\mathcal{Z} = \mathcal{Y}$. Note that, while $y$ is a function of a parameter in $\Theta$, $\hat{y}$ has domain $\mathcal{X}$: the estimation is done from a sample of the measure, while the measure itself and its parameter remain unknown. In general this is a randomized estimator, but deterministic estimators (corresponding to deterministic kernels) are often enough to attain optimal performance. The quality of the estimation is quantified by a loss $\ell' : \mathcal{Y} \times \mathcal{Z} \to \mathbb{R}_{+,\infty}$. We write $\ell$ for a loss on $\mathcal{Y} \times \mathcal{Y}$ and $\ell'$ in the heterogeneous case.

In short, an estimation problem is specified by the data $(P, y, \ell')$.

Example: simple binary hypothesis testing (SBHT)

We say that an estimation problem is a testing problem if $\mathcal{Y} = \mathcal{Z}$ is discrete and the loss takes values in $\{0, 1\}$. A test is said to be binary if $\mathcal{Y}$ has two elements. A test is simple if $\{\theta \mid y(\theta) = y_0\}$ is a singleton for all $y_0 \in \mathcal{Y}$, i.e. $y$ is bijective.

In summary, in simple binary hypothesis testing, $\Theta = \mathcal{Y} = \mathcal{Z} = \{0, 1\}$, $y(0) = 0$, $y(1) = 1$ (that is, $y$ is the identity). For $y_0, z \in \{0, 1\}$, $\ell(y_0, z) = \mathbb{I}\{y_0 \ne z\}$.

1.1 Risk

Let $\hat\mu_\theta$ be the law of the estimate when the parameter is $\theta$. That is, $\hat\mu : \Theta \rightsquigarrow \mathcal{Z}$ is the kernel $\hat{y} \circ P$.

Definition 1 Risk

The risk of an estimator $\hat{y}$ on the estimation problem $(P, y, \ell')$ at $\theta \in \Theta$ is $r_\theta^P(\hat{y}) = (\hat{y} \circ P)(\theta)\left[z \mapsto \ell'(y(\theta), z)\right]$ .

Example (SBHT): $r_\theta^P(\hat{y}) = \hat\mu_\theta(\{y(\theta)\}^c)$, the probability that the estimate differs from $y(\theta)$.

Definition 2 Bayesian risk

The Bayesian risk of an estimator $\hat{y}$ on $(P, y, \ell')$ for a prior $\pi \in \mathcal{M}(\Theta)$ is $R_\pi^P(\hat{y}) = \pi\left[\theta \mapsto r_\theta^P(\hat{y})\right]$ . It can also be expanded as $R_\pi^P(\hat{y}) = (\pi \otimes (\hat{y} \circ P))\left[(\theta, z) \mapsto \ell'(y(\theta), z)\right]$ .

If we write $\ell'$ and $y$ for their associated deterministic kernels, the Bayesian risk of $\hat{y}$ is the mean of the measure $(\ell' \circ (y \parallel \hat{y})) \circ (\pi \otimes P)$, where $y \parallel \hat{y} : \Theta \times \mathcal{X} \rightsquigarrow \mathcal{Y} \times \mathcal{Z}$ applies $y$ to the first coordinate and $\hat{y}$ to the second.
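To make these definitions concrete, here is a small numerical sketch in Python (purely illustrative: the distributions, the estimator and all variable names are invented for the example). It computes the risk at each parameter and the Bayesian risk for simple binary hypothesis testing on a three-point sample space, with a randomized estimator represented by a row-stochastic matrix.

import numpy as np

# Simple binary hypothesis testing on a finite sample space X = {0, 1, 2}.
# The kernel P : Theta ~> X is a row-stochastic matrix: row theta is P_theta.
P = np.array([[0.5, 0.3, 0.2],   # P_0
              [0.1, 0.3, 0.6]])  # P_1

# A randomized estimator y_hat : X ~> Z = {0, 1}, also a row-stochastic matrix.
y_hat = np.array([[0.9, 0.1],
                  [0.5, 0.5],
                  [0.1, 0.9]])

# mu_hat = y_hat ∘ P : Theta ~> Z, the law of the estimate under each hypothesis.
mu_hat = P @ y_hat

# Risk at theta for the 0-1 loss: mass that mu_hat_theta puts outside {y(theta)} = {theta}.
risk = np.array([1.0 - mu_hat[0, 0], 1.0 - mu_hat[1, 1]])

# Bayesian risk for the prior pi = (gamma, 1 - gamma).
gamma = 0.4
print(risk, gamma * risk[0] + (1 - gamma) * risk[1])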

Definition 3 Bayes risk

The Bayes risk of $(P, y, \ell')$ for prior $\pi \in \mathcal{M}(\Theta)$ is $R_\pi^P = \inf_{\hat{y} : \mathcal{X} \rightsquigarrow \mathcal{Z}} R_\pi^P(\hat{y})$ , where the infimum is over Markov kernels.

The Bayes risk of $(P, y, \ell')$ is $R_B^* = \sup_{\pi \in \mathcal{P}(\Theta)} R_\pi^P$.

Example (SBHT): $R_B^* = \sup_{\gamma \in [0,1]} \inf_{\hat{y}} \left(\gamma\, \hat\mu_0(\{1\}) + (1 - \gamma)\, \hat\mu_1(\{0\})\right)$.

Definition 4 Bayes estimator

An estimator $\hat{y}$ is said to be a Bayes estimator for a prior $\pi \in \mathcal{P}(\Theta)$ if $R_\pi^P(\hat{y}) = R_\pi^P$.

Definition 5 Minimax risk

The minimax risk of $(P, y, \ell')$ is $R^* = \inf_{\hat{y} : \mathcal{X} \rightsquigarrow \mathcal{Z}} \sup_{\theta \in \Theta} r_\theta^P(\hat{y})$ .

Example (SBHT): $R^* = \inf_{\hat{y}} \max\{\hat\mu_0(\{1\}),\ \hat\mu_1(\{0\})\}$.

Lemma 1.1.1

$R_B^* \le R^*$.

Proof

TODO: “often”, $R_B^* = R^*$.

1.1.1 Properties of the Bayes risk of a prior

Lemma 1.1.2

The Bayes risk of a prior $\pi \in \mathcal{M}(\Theta)$ on $(P, y, \ell')$ with $P$ a Markov kernel satisfies

$$R_\pi^P \le \inf_{z \in \mathcal{Z}} \pi\left[\theta \mapsto \ell'(y(\theta), z)\right] .$$
Proof
Lemma 1.1.3

The Bayes risk of a prior $\pi \in \mathcal{M}(\Theta)$ on $(P, y, \ell')$ with $P$ a constant Markov kernel is

$$R_\pi^P = \inf_{z \in \mathcal{Z}} \pi\left[\theta \mapsto \ell'(y(\theta), z)\right] .$$

In particular, it does not depend on P.

Proof
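For instance, in simple binary hypothesis testing with a probability prior $\pi = (\gamma, 1 - \gamma)$, the right-hand side in the two lemmas above is the risk of the best data-independent guess:

$$\inf_{z \in \{0,1\}} \pi\left[\theta \mapsto \mathbb{I}\{\theta \ne z\}\right] = \min\{\gamma,\ 1 - \gamma\} ,$$

so Lemma 1.1.2 gives $R_\pi^P \le \min\{\gamma, 1 - \gamma\}$ for every Markov kernel $P$, with equality when $P$ is constant (Lemma 1.1.3).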

Data-processing inequality and consequences

Theorem 1.1.4 Data-processing inequality

For $P : \Theta \rightsquigarrow \mathcal{X}$ and $\kappa : \mathcal{X} \rightsquigarrow \mathcal{X}'$ a Markov kernel, $R_\pi^{\kappa \circ P} \ge R_\pi^P$ (where the estimation problems differ only in the kernel).

Proof

Lemma 1.1.5

For $P : \Theta \rightsquigarrow \mathcal{X}$ and $\kappa : \Theta \times \mathcal{X} \rightsquigarrow \mathcal{X}'$ a Markov kernel, $R_\pi^{P \otimes \kappa} \le R_\pi^P$ .

Proof
Lemma 1.1.6

For $P : \Theta \rightsquigarrow \mathcal{X}$ and $\kappa : \Theta \times \mathcal{X} \rightsquigarrow \mathcal{X}'$ a Markov kernel, $R_\pi^{P \otimes \kappa} \le R_\pi^{(P \otimes \kappa)_{\mathcal{X}'}}$ , in which $(P \otimes \kappa)_{\mathcal{X}'} : \Theta \rightsquigarrow \mathcal{X}'$ is the kernel obtained by marginalizing over $\mathcal{X}$ in the output of $P \otimes \kappa$ .

Proof
Lemma 1.1.7

The Bayes risk $R_\pi^P$ is concave in $P : \Theta \rightsquigarrow \mathcal{X}$ .

Proof

Generalized Bayes estimator

Lemma 1.1.8

The Bayesian risk of a Markov kernel $\hat{y} : \mathcal{X} \rightsquigarrow \mathcal{Z}$ with respect to a prior $\pi \in \mathcal{M}(\Theta)$ on $(P, y, \ell')$ satisfies

$$R_\pi^P(\hat{y}) = \left((P_\pi^\dagger \times \hat{y}) \circ (P \circ \pi)\right)\left[(\theta, z) \mapsto \ell'(y(\theta), z)\right] ,$$

whenever the Bayesian inverse $P_\pi^\dagger$ of $P$ with respect to $\pi$ exists (Definition 48).

Proof

Lemma 1.1.9

The Bayesian risk of a Markov kernel $\hat{y} : \mathcal{X} \rightsquigarrow \mathcal{Z}$ with respect to a prior $\pi \in \mathcal{M}(\Theta)$ on $(P, y, \ell')$ satisfies

$$R_\pi^P(\hat{y}) \ge (P \circ \pi)\left[x \mapsto \inf_{z \in \mathcal{Z}} P_\pi^\dagger(x)\left[\theta \mapsto \ell'(y(\theta), z)\right]\right] .$$
Proof
Definition 6 Generalized Bayes estimator

The generalized Bayes estimator for prior $\pi \in \mathcal{P}(\Theta)$ on $(P, y, \ell')$ is the deterministic estimator $\mathcal{X} \to \mathcal{Z}$ given by

$$x \mapsto \arg\min_{z \in \mathcal{Z}} P_\pi^\dagger(x)\left[\theta \mapsto \ell'(y(\theta), z)\right] ,$$

if there exists such a measurable argmin.

Lemma 1.1.10

The Bayesian risk of the generalized Bayes estimator $\hat{y}_B$ is

$$R_\pi^P(\hat{y}_B) = (P \circ \pi)\left[x \mapsto \inf_{z \in \mathcal{Z}} P_\pi^\dagger(x)\left[\theta \mapsto \ell'(y(\theta), z)\right]\right] .$$
Proof

Lemma 1.1.11

When the generalized Bayes estimator is well defined, it is a Bayes estimator. The value of the Bayes risk with respect to the prior $\pi \in \mathcal{M}(\Theta)$ is then

$$R_\pi^P = (P \circ \pi)\left[x \mapsto \inf_{z \in \mathcal{Z}} P_\pi^\dagger(x)\left[\theta \mapsto \ell'(y(\theta), z)\right]\right] .$$
Proof

Lemma 1.1.12

When the generalized Bayes estimator is well defined, the Bayes risk with respect to the prior $\pi \in \mathcal{M}(\Theta)$ is

$$R_\pi^P = (P \circ \pi)\left[x \mapsto \inf_{z \in \mathcal{Z}} \pi\left[\theta \mapsto \frac{\mathrm{d}P_\theta}{\mathrm{d}(P \circ \pi)}(x)\, \ell'(y(\theta), z)\right]\right] .$$
Proof

Lemma 1.1.13

When the generalized Bayes estimator is well defined, the Bayes risk with respect to the prior $\pi \in \mathcal{M}(\Theta)$ for $\mathcal{Y} = \mathcal{Z} = \Theta$, $y = \mathrm{id}$ and $\ell = \mathbb{I}\{\theta \ne z\}$ is

$$R_\pi^P = (P \circ \pi)[1] - (P \circ \pi)\left[x \mapsto \sup_{z \in \mathcal{Z}} P_\pi^\dagger(x)\{z\}\right] .$$
Proof

Lemma 1.1.14

When the generalized Bayes estimator is well defined, the Bayes risk with respect to the prior $\pi \in \mathcal{M}(\Theta)$ for $\mathcal{Y} = \mathcal{Z} = \Theta$, $y = \mathrm{id}$ and $\ell = \mathbb{I}\{\theta \ne z\}$ is

$$R_\pi^P = (P \circ \pi)[1] - (P \circ \pi)\left[x \mapsto \sup_{z \in \mathcal{Z}} \frac{\mathrm{d}P_z}{\mathrm{d}(P \circ \pi)}(x)\, \pi\{z\}\right] .$$

When $\pi$ is a probability measure and $P$ is a Markov kernel, $(P \circ \pi)[1] = 1$.

Proof
Lemma 1.1.15

The generalized Bayes estimator for prior $\pi \in \mathcal{P}(\Theta)$ on the estimation problem defined by $\mathcal{Y} = \mathcal{Z} = \Theta$, $y = \mathrm{id}$ and $\ell = \mathbb{I}\{\theta \ne z\}$ is

$$x \mapsto \arg\max_{z} \left(\pi\{z\}\, \frac{\mathrm{d}P_z}{\mathrm{d}(P \circ \pi)}(x)\right) .$$
Proof
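The Python sketch below (informal, with made-up numbers; the names posterior, map_estimator, etc. are ours) illustrates Lemma 1.1.15 and the preceding risk formulas on finite spaces: it builds the Bayesian inverse as a posterior matrix, takes the maximum a posteriori point as the generalized Bayes estimator, and checks that $1 - (P \circ \pi)\left[x \mapsto \sup_z P_\pi^\dagger(x)\{z\}\right]$ agrees with a brute-force minimization of the Bayesian risk over deterministic estimators (which suffices here, since the generalized Bayes estimator is itself deterministic).

import numpy as np
from itertools import product

rng = np.random.default_rng(0)
n_theta, n_x = 3, 4

P = rng.dirichlet(np.ones(n_x), size=n_theta)   # Markov kernel Theta ~> X, rows are P_theta
pi = rng.dirichlet(np.ones(n_theta))            # probability prior on Theta

joint = pi[:, None] * P            # the measure pi ⊗ P on Theta x X
marginal = joint.sum(axis=0)       # P ∘ pi, the marginal on X
posterior = joint / marginal       # column x is the Bayesian inverse P_pi^dagger(x)

map_estimator = posterior.argmax(axis=0)   # generalized Bayes estimator for the 0-1 loss

# Bayes risk via the closed form of the lemmas above.
bayes_risk = 1.0 - (marginal * posterior.max(axis=0)).sum()

# Brute force over all deterministic estimators X -> Theta.
def bayesian_risk(estimator):
    return sum(joint[t, x] for x in range(n_x) for t in range(n_theta) if estimator[x] != t)

best = min(bayesian_risk(e) for e in product(range(n_theta), repeat=n_x))
print(map_estimator, bayes_risk, best)   # bayes_risk and best agree (up to rounding)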

Lemma 1.1.16

Suppose that $\Theta$ is finite and let $\xi \in \mathcal{P}(\Theta)$. The Bayes risk with respect to the prior $\pi \in \mathcal{P}(\Theta)$ for $\mathcal{Y} = \mathcal{Z} = \Theta$, $y = \mathrm{id}$, $P$ a Markov kernel and $\ell = \mathbb{I}\{\theta \ne z\}$ satisfies

$$R_\pi^P \le 1 - \left(\prod_{\theta \in \Theta} \pi_\theta^{\xi_\theta}\right) (P \circ \pi)\left[x \mapsto \prod_{\theta \in \Theta} \left(\frac{\mathrm{d}P_\theta}{\mathrm{d}(P \circ \pi)}(x)\right)^{\xi_\theta}\right] .$$
Proof

1.1.2 Bayes risk increase

TODO: remove this section? It’s not used for anything now.

The following is a new definition, which is a natural generalization of the DeGroot statistical information.

Definition 7

The Bayes risk increase $I_\pi^P(\kappa)$ of a kernel $\kappa : \mathcal{X} \rightsquigarrow \mathcal{X}'$ with respect to the estimation problem $(P, y, \ell')$ and the prior $\pi \in \mathcal{M}(\Theta)$ is the difference of the Bayes risk of $(\kappa \circ P, y, \ell')$ and that of $(P, y, \ell')$. That is,

$$I_\pi^P(\kappa) = R_\pi^{\kappa \circ P} - R_\pi^P = \inf_{\hat{y} : \mathcal{X}' \rightsquigarrow \mathcal{Z}} (\pi \otimes (\hat{y} \circ \kappa \circ P))\left[(\theta, z) \mapsto \ell'(y(\theta), z)\right] - \inf_{\hat{y} : \mathcal{X} \rightsquigarrow \mathcal{Z}} (\pi \otimes (\hat{y} \circ P))\left[(\theta, z) \mapsto \ell'(y(\theta), z)\right] .$$

In particular, if we take $\kappa : \mathcal{X} \rightsquigarrow *$ equal to the Markov kernel to the point space (denoted by $d_{\mathcal{X}}$, for “discard” kernel), we get $I_\pi^P(d_{\mathcal{X}}) = R_\pi^{d_{\mathcal{X}} \circ P} - R_\pi^P$, where $d_{\mathcal{X}} \circ P = d_\Theta$ if $P$ is a Markov kernel. We recover the difference between the risk of the best constant kernel and that of the Bayes estimator, as in the DeGroot statistical information.

The risk increase quantifies how much risk we incur by forgetting information through the composition with $\kappa$.
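As a quick numerical illustration (the measures and the channel below are made up, and the helper bayes_binary_risk uses the closed form for the Bayes binary risk established in Section 1.2), the risk increase of a kernel that coarsens a three-point observation into a binary one can be computed directly; it is nonnegative, as guaranteed by Lemma 1.1.17 below.

import numpy as np

def bayes_binary_risk(xi, mu, nu):
    # B_xi(mu, nu) on a finite space: sum over x of min(xi_0 * mu(x), xi_1 * nu(x)).
    return np.minimum(xi[0] * mu, xi[1] * nu).sum()

mu = np.array([0.7, 0.2, 0.1])
nu = np.array([0.1, 0.3, 0.6])
xi = np.array([0.5, 0.5])

# kappa : X ~> X' is a Markov kernel (row-stochastic matrix) merging observations.
kappa = np.array([[0.8, 0.2],
                  [0.5, 0.5],
                  [0.2, 0.8]])

risk_increase = bayes_binary_risk(xi, mu @ kappa, nu @ kappa) - bayes_binary_risk(xi, mu, nu)
print(risk_increase)   # I_xi^P(kappa) >= 0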

Lemma 1.1.17

For $\kappa$ a Markov kernel, $I_\pi^P(\kappa) \ge 0$ .

Proof
Lemma 1.1.18

For $\kappa : \mathcal{X} \rightsquigarrow \mathcal{X}'$ and $\eta : \mathcal{X}' \rightsquigarrow \mathcal{X}''$ two Markov kernels, $I_\pi^P(\eta \circ \kappa) = I_\pi^P(\kappa) + I_\pi^{\kappa \circ P}(\eta)$ .

Proof
Lemma 1.1.19 Data-processing inequality

For any measurable space $\mathcal{X}$, let $d_{\mathcal{X}} : \mathcal{X} \rightsquigarrow *$ be the Markov kernel to the point space. For all Markov kernels $\kappa : \mathcal{X} \rightsquigarrow \mathcal{X}'$,

$$I_\pi^P(d_{\mathcal{X}}) \ge I_\pi^{\kappa \circ P}(d_{\mathcal{X}'}) .$$
Proof

1.2 Simple binary hypothesis testing

In this section, for a measure $\xi$ on $\{0, 1\}$, we write $\xi_0$ for $\xi(\{0\})$ and $\xi_1$ for $\xi(\{1\})$. We sometimes write $(a, b)$ for the measure $\xi$ on $\{0, 1\}$ with $\xi_0 = a$ and $\xi_1 = b$.

Lemma 1.2.1

The Bayesian inverse of a kernel $P : \{0, 1\} \rightsquigarrow \mathcal{X}$ with respect to a prior $\xi \in \mathcal{M}(\{0, 1\})$ is $P_\xi^\dagger(x) = \left(\xi_0 \frac{\mathrm{d}P(0)}{\mathrm{d}(P \circ \xi)}(x),\ \xi_1 \frac{\mathrm{d}P(1)}{\mathrm{d}(P \circ \xi)}(x)\right)$ (almost surely with respect to $P \circ \xi = \xi_0 P(0) + \xi_1 P(1)$).

Proof

Lemma 1.2.2

For $\Theta = \{0, 1\}$, the Bayes risk of a prior $\xi \in \mathcal{M}(\{0, 1\})$ is

$$R_\xi^P = (P \circ \xi)\left[x \mapsto \inf_{z \in \mathcal{Z}} \left(\xi_0 \frac{\mathrm{d}P(0)}{\mathrm{d}(P \circ \xi)}(x)\, \ell'(y(0), z) + \xi_1 \frac{\mathrm{d}P(1)}{\mathrm{d}(P \circ \xi)}(x)\, \ell'(y(1), z)\right)\right] .$$
Proof

1.2.1 Bayes risk of a prior

Definition 8

The Bayes binary risk between measures $\mu$ and $\nu$ with respect to prior $\xi \in \mathcal{M}(\{0, 1\})$, denoted by $B_\xi(\mu, \nu)$, is the Bayes risk $R_\xi^P$ for $\Theta = \mathcal{Y} = \mathcal{Z} = \{0, 1\}$, $\ell(y, z) = \mathbb{I}\{y \ne z\}$, $P$ the kernel sending $0$ to $\mu$ and $1$ to $\nu$, and prior $\xi$. That is,

$$B_\xi(\mu, \nu) = \inf_{\hat{y} : \mathcal{X} \rightsquigarrow \{0, 1\}} \left(\xi_0\, (\hat{y} \circ \mu)(\{1\}) + \xi_1\, (\hat{y} \circ \nu)(\{0\})\right) ,$$

in which the infimum is over Markov kernels.

If the prior is a probability measure with weights $(\pi, 1 - \pi)$, we write $B_\pi(\mu, \nu) = B_{(\pi, 1 - \pi)}(\mu, \nu)$ .
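On a finite sample space the definition can be evaluated directly. The short Python sketch below (with illustrative numbers) computes $B_\xi(\mu, \nu)$ both as an infimum over events, i.e. over deterministic tests, and through the pointwise-minimum formula; both characterizations are established later in this subsection, and the two values coincide.

import numpy as np
from itertools import product

mu = np.array([0.5, 0.3, 0.2])   # measure for hypothesis 0
nu = np.array([0.1, 0.3, 0.6])   # measure for hypothesis 1
xi = np.array([0.3, 0.7])        # prior on {0, 1}

# Infimum over events E of xi_0 * mu(E) + xi_1 * nu(E^c).
risk_events = min(xi[0] * mu[np.array(E, dtype=bool)].sum()
                  + xi[1] * nu[~np.array(E, dtype=bool)].sum()
                  for E in product([0, 1], repeat=len(mu)))

# Pointwise-minimum formula.
risk_pointwise = np.minimum(xi[0] * mu, xi[1] * nu).sum()
print(risk_events, risk_pointwise)   # equal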

Thanks to the following lemma, we can prove properties of the Bayes binary risk for non-zero finite measures by considering only probability measures.

Lemma 1.2.3

For all $a, b > 0$, $B_\xi(\mu, \nu) = B_{(a \xi_0,\, b \xi_1)}(a^{-1} \mu,\, b^{-1} \nu)$ .

Proof

Alternatively, we can reduce the study of the Bayes binary risk for any prior to the study of the prior (1,1), for which that risk is related to the total variation distance (defined later).

Lemma 1.2.4

$B_\xi(\mu, \nu) = B_{(1,1)}(\xi_0 \mu, \xi_1 \nu)$ .

Proof
Theorem 1.2.5 Data-processing inequality

For $\mu, \nu \in \mathcal{M}(\mathcal{X})$ and $\kappa : \mathcal{X} \rightsquigarrow \mathcal{Y}$ a Markov kernel, $B_\xi(\kappa \circ \mu, \kappa \circ \nu) \ge B_\xi(\mu, \nu)$ .

Proof
Lemma 1.2.6

For $\mu \in \mathcal{M}(\mathcal{X})$, $B_\xi(\mu, \mu) = \min\{\xi_0, \xi_1\}\, \mu(\mathcal{X})$ .

Proof
Lemma 1.2.7

For all measures $\mu, \nu$, $B_\xi(\mu, \nu) \ge 0$ .

Proof

Lemma 1.2.8

For all measures $\mu, \nu$, $B_\xi(\mu, \nu) \le \min\{\xi_0\, \mu(\mathcal{X}),\ \xi_1\, \nu(\mathcal{X})\}$ .

Proof
Lemma 1.2.9

For $\mu, \nu \in \mathcal{M}(\mathcal{X})$ and $\xi \in \mathcal{M}(\{0, 1\})$, $B_\xi(\mu, \nu) = B_{\xi'}(\nu, \mu)$ where $\xi' \in \mathcal{M}(\{0, 1\})$ is such that $\xi'(\{0\}) = \xi_1$ and $\xi'(\{1\}) = \xi_0$. For $\pi \in [0, 1]$, $B_\pi(\mu, \nu) = B_{1 - \pi}(\nu, \mu)$ .

Proof

Lemma 1.2.10

The generalized Bayes estimator for the Bayes binary risk with prior $\xi \in \mathcal{M}(\{0, 1\})$ is $x \mapsto$ if $\xi_1 \frac{\mathrm{d}\nu}{\mathrm{d}(P \circ \xi)}(x) \le \xi_0 \frac{\mathrm{d}\mu}{\mathrm{d}(P \circ \xi)}(x)$ then $0$ else $1$, i.e. it is equal to $\mathbb{I}_E$ for $E = \left\{x \,\middle|\, \xi_1 \frac{\mathrm{d}\nu}{\mathrm{d}(P \circ \xi)}(x) > \xi_0 \frac{\mathrm{d}\mu}{\mathrm{d}(P \circ \xi)}(x)\right\}$ .

Proof
Lemma 1.2.11

$$B_\xi(\mu, \nu) = \inf_{E \text{ event}} \left(\xi_0\, \mu(E) + \xi_1\, \nu(E^c)\right) .$$
Proof

Lemma 1.2.12

The Bayes risk of simple binary hypothesis testing for prior $\xi \in \mathcal{M}(\{0, 1\})$ is

$$B_\xi(\mu, \nu) = (P \circ \xi)\left[x \mapsto \min\left\{\xi_0 \frac{\mathrm{d}\mu}{\mathrm{d}(P \circ \xi)}(x),\ \xi_1 \frac{\mathrm{d}\nu}{\mathrm{d}(P \circ \xi)}(x)\right\}\right] .$$
Proof
Corollary 1.2.13
$$B_\xi(\mu, \nu) = \frac{1}{2}\left((P \circ \xi)(\mathcal{X}) - (P \circ \xi)\left[x \mapsto \left|\xi_0 \frac{\mathrm{d}\mu}{\mathrm{d}(P \circ \xi)}(x) - \xi_1 \frac{\mathrm{d}\nu}{\mathrm{d}(P \circ \xi)}(x)\right|\right]\right) .$$
Proof
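Numerically, the corollary is the pointwise identity $\min\{a, b\} = \frac{1}{2}(a + b - |a - b|)$ integrated against $P \circ \xi$; here is a quick check in Python on the same kind of finite example as above (illustrative numbers only).

import numpy as np

mu = np.array([0.5, 0.3, 0.2])
nu = np.array([0.1, 0.3, 0.6])
xi = np.array([0.3, 0.7])

lhs = np.minimum(xi[0] * mu, xi[1] * nu).sum()                    # B_xi(mu, nu)
rhs = 0.5 * (xi[0] * mu.sum() + xi[1] * nu.sum()                  # (P ∘ xi)(X)
             - np.abs(xi[0] * mu - xi[1] * nu).sum())             # minus the integrated |difference|
print(lhs, rhs)   # equal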
Lemma 1.2.14

Let $\hat{y}_B$ be the generalized Bayes estimator for simple binary hypothesis testing. The distribution $(\ell \circ (\mathrm{id} \parallel \hat{y}_B)) \circ (\pi \otimes P)$ (in which $\ell$ and $\mathrm{id}$ stand for the associated deterministic kernels) is a Bernoulli with mean $B_\pi(\mu, \nu)$.

Proof

TODO: find a way to hide the following dummy lemma

Dummy node to summarize properties of the Bayes binary risk.

Proof

1.3 Reduction to testing

TODO: lower bounds on estimation by reduction to testing problems.

1.4 Sample complexity

TODO: we could generalize this to a sequence of kernels instead of iid observations.

An estimation problem of the shape $(P^{\otimes n}, y, \ell')$ corresponds to estimating from $n \in \mathbb{N}$ i.i.d. samples of the distribution. From the data-processing inequality, we get that the risk (Bayes risk, minimax risk...) decreases with the number of samples (Lemma 1.4.1 for example). Intuitively, having more information allows for better estimation.

In a sequence of estimation tasks $(P^{\otimes n}, y, \ell')$ for $n \in \mathbb{N}$, the sample complexity is the minimal $n$ such that the risk is at most a desired value.

Lemma 1.4.1

If $n \le m$ then $R_\pi^{P^{\otimes n}} \ge R_\pi^{P^{\otimes m}}$.

Proof
Definition 9

The sample complexity of Bayesian estimation with respect to a prior $\pi \in \mathcal{M}(\Theta)$ at risk level $\delta \in \mathbb{R}_{+,\infty}$ is

$$n_\pi^P(\delta) = \min\{n \in \mathbb{N} \mid R_\pi^{P^{\otimes n}} \le \delta\} .$$
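For example (an informal Python computation with invented parameters), consider simple binary hypothesis testing between Bernoulli(0.4) and Bernoulli(0.6) with the uniform prior. The Bayes risk of $P^{\otimes n}$ can be computed exactly from the pointwise-minimum formula of Section 1.2, because the product densities depend on the sample only through the number of successes; the sample complexity at level $\delta$ is then the first $n$ at which this risk drops below $\delta$.

from math import comb

def bayes_risk_n(p, q, n, prior=0.5):
    # B_prior(mu^n, nu^n) for mu = Bernoulli(p) and nu = Bernoulli(q):
    # sum over the number of successes k of min(prior * P_p(k), (1 - prior) * P_q(k)).
    return sum(comb(n, k) * min(prior * p**k * (1 - p)**(n - k),
                                (1 - prior) * q**k * (1 - q)**(n - k))
               for k in range(n + 1))

def sample_complexity(p, q, delta, n_max=10_000):
    # Smallest n such that the Bayes risk with n i.i.d. samples is at most delta
    # (the risk is non-increasing in n, cf. Lemma 1.4.1).
    return next(n for n in range(n_max + 1) if bayes_risk_n(p, q, n) <= delta)

print(sample_complexity(0.4, 0.6, delta=0.05))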