Lower bounds for hypothesis testing based on information theory

3 Mutual information

3.1 Mutual information

Definitions

Definition 31

For \(\rho \in \mathcal M(\mathcal X \times \mathcal Y)\), the mutual information of \(\rho \) is

\begin{align*} I(\rho ) = \operatorname{KL}(\rho , \rho _X \times \rho _Y) \: . \end{align*}
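As a concrete illustration (an example added here, not part of the formal development), take \(\mathcal X = \mathcal Y = \{ 0,1\} \) and let \(\rho \) put mass \(1/2\) on each of \((0,0)\) and \((1,1)\). Then \(\rho _X = \rho _Y\) is uniform, \(\rho _X \times \rho _Y\) puts mass \(1/4\) on each of the four pairs, and

\begin{align*} I(\rho ) = \operatorname{KL}(\rho , \rho _X \times \rho _Y) = 2 \cdot \tfrac{1}{2} \log \tfrac{1/2}{1/4} = \log 2 \: , \end{align*}

the mutual information of two perfectly correlated fair bits.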
Definition 32

Let \(\kappa : \mathcal Z \rightsquigarrow \mathcal X \times \mathcal Y\). The conditional mutual information of \(\kappa \) with respect to \(\nu \in \mathcal M(\mathcal Z)\) is

\begin{align*} I(\kappa \mid \nu ) = \operatorname{KL}(\kappa , \kappa _X \times \kappa _Y \mid \nu ) \: . \end{align*}
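For example, if \(\kappa (z) = \kappa _X(z) \times \kappa _Y(z)\) for \(\nu \)-almost every \(z\), meaning that the two coordinates are independent conditionally on \(z\), then

\begin{align*} I(\kappa \mid \nu ) = \operatorname{KL}(\kappa , \kappa _X \times \kappa _Y \mid \nu ) = 0 \: , \end{align*}

since the conditional divergence \(\operatorname{KL}(\kappa , \kappa \mid \nu )\) of a kernel with respect to itself is zero.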

Properties

Lemma 3.1.1

For \(\mu \in \mathcal M(\mathcal X)\) and \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) a Markov kernel,

\begin{align*} I(\mu \otimes \kappa ) = \operatorname{KL}(\mu \otimes \kappa , \mu \times (\kappa \circ \mu )) = \operatorname{KL}(\kappa , \kappa \circ \mu \mid \mu ) \: , \end{align*}

where in the conditional divergence the measure \(\kappa \circ \mu \) should be understood as the constant kernel from \(\mathcal X\) to \(\mathcal Y\) with that value.

Proof

The first equality is the definition of \(I\), together with \((\mu \otimes \kappa )_X = \mu \) (since \(\kappa \) is Markov) and \((\mu \otimes \kappa )_Y = \kappa \circ \mu \). The second equality is from \(\operatorname{KL}(\kappa , \eta \mid \mu ) = \operatorname{KL}(\mu \otimes \kappa , \mu \otimes \eta )\). (TODO: now from a lemma, soon by definition.)
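For completeness, the second marginal identity can be checked directly on measurable sets: for \(B \subseteq \mathcal Y\) measurable,

\begin{align*} (\mu \otimes \kappa )_Y(B) = (\mu \otimes \kappa )(\mathcal X \times B) = \int _{\mathcal X} \kappa (x)(B) \, \mu (\mathrm{d}x) = (\kappa \circ \mu )(B) \: , \end{align*}

and similarly \((\mu \otimes \kappa )_X(A) = \int _A \kappa (x)(\mathcal Y) \, \mu (\mathrm{d}x) = \mu (A)\) because \(\kappa \) is Markov.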

Lemma 3.1.2

\(I(\rho _\leftrightarrow ) = I(\rho )\) .

Proof

Lemma 3.1.3

Let \(\mu , \nu \in \mathcal P(\mathcal X)\) and let \(\alpha \in (0, 1)\). Let \(\pi _\alpha = (\alpha , 1 - \alpha ) \in \mathcal P(\{ 0,1\} )\) and let \(P : \{ 0,1\} \rightsquigarrow \mathcal X\) be the kernel with \(P(0) = \mu \) and \(P(1) = \nu \). Then

\begin{align*} I(\pi _\alpha \otimes P) = \operatorname{JS}_\alpha (\mu , \nu ) \: . \end{align*}
Proof

Use Lemma 3.1.1 and Lemma 2.9.1.
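As a sanity check of this identity, consider the special case (an added example) where \(\mu \) and \(\nu \) are mutually singular, and write \(m = \alpha \mu + (1 - \alpha ) \nu = P \circ \pi _\alpha \). Then \(\frac{\mathrm{d}\mu }{\mathrm{d}m} = \frac{1}{\alpha }\) \(\mu \)-almost everywhere and \(\frac{\mathrm{d}\nu }{\mathrm{d}m} = \frac{1}{1 - \alpha }\) \(\nu \)-almost everywhere, so by Lemma 3.1.1 and the chain rule for \(\operatorname{KL}\),

\begin{align*} I(\pi _\alpha \otimes P) = \alpha \operatorname{KL}(\mu , m) + (1 - \alpha ) \operatorname{KL}(\nu , m) = \alpha \log \frac{1}{\alpha } + (1 - \alpha ) \log \frac{1}{1 - \alpha } \: , \end{align*}

the entropy of \((\alpha , 1 - \alpha )\): when the two distributions are perfectly distinguishable, observing the sample reveals the label completely.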

Theorem 3.1.4 Data-processing

For \(\rho \in \mathcal M(\mathcal X \times \mathcal Y)\), \(\kappa : \mathcal X \rightsquigarrow \mathcal X'\) and \(\eta : \mathcal Y \rightsquigarrow \mathcal Y'\) two Markov kernels,

\begin{align*} I((\kappa \parallel \eta ) \circ \rho ) \le I(\rho ) \: . \end{align*}
Proof

\(((\kappa \parallel \eta ) \circ \rho )_{X'} = \kappa \circ \rho _X\) and \(((\kappa \parallel \eta ) \circ \rho )_{Y'} = \eta \circ \rho _Y\) hence

\begin{align*} ((\kappa \parallel \eta ) \circ \rho )_{X'} \times ((\kappa \parallel \eta ) \circ \rho )_{Y'} & = (\kappa \circ \rho _X) \times (\eta \circ \rho _Y) = (\kappa \parallel \eta ) \circ (\rho _X \times \rho _Y) \: . \end{align*}

By the data-processing inequality for \(\operatorname{KL}\) (Theorem 2.4.6),

\begin{align*} I((\kappa \parallel \eta ) \circ \rho ) & = \operatorname{KL}((\kappa \parallel \eta ) \circ \rho , (\kappa \parallel \eta ) \circ (\rho _X \times \rho _Y)) \\ & \le \operatorname{KL}(\rho , \rho _X \times \rho _Y) \\ & = I(\rho ) \: . \end{align*}
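The inequality can be strict. For instance (an added example), suppose \(\rho \in \mathcal P(\mathcal X \times \mathcal Y)\) and \(\kappa \) is the constant kernel \(\kappa (x) = \xi \) for some fixed \(\xi \in \mathcal P(\mathcal X')\). Then for all measurable \(A \subseteq \mathcal X'\) and \(B \subseteq \mathcal Y'\),

\begin{align*} ((\kappa \parallel \eta ) \circ \rho )(A \times B) = \int \xi (A) \, \eta (y)(B) \, \rho (\mathrm{d}x, \mathrm{d}y) = \xi (A) \, (\eta \circ \rho _Y)(B) \: , \end{align*}

so \((\kappa \parallel \eta ) \circ \rho = \xi \times (\eta \circ \rho _Y)\) is a product measure and \(I((\kappa \parallel \eta ) \circ \rho ) = 0 \le I(\rho )\): erasing the first coordinate destroys all mutual information.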
Corollary 3.1.5

For \(\mu \in \mathcal M(\mathcal X)\), \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) and \(\eta : \mathcal Y \rightsquigarrow \mathcal Z\) two Markov kernels,

\begin{align*} I(\mu \otimes (\eta \circ \kappa )) \le I(\mu \otimes \kappa ) \: . \end{align*}

This corollary corresponds to the following situation: three random variables \(X, Y, Z\) have the dependency structure \(X \to Y \to Z\), where \(\mu \) is the law of \(X\), \(\mu \otimes \kappa \) is the law of \((X, Y)\) and \(\mu \otimes (\eta \circ \kappa )\) is the law of \((X, Z)\). With the usual \(I(X, Y)\) notation, the corollary states that \(I(X, Z) \le I(X, Y)\) .

Proof

First, rewrite \(\mu \otimes (\eta \circ \kappa ) = (\mathrm{id} \parallel \eta ) \circ (\mu \otimes \kappa )\). Then apply Theorem 3.1.4.
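The rewriting step can be verified on rectangles: for measurable \(A \subseteq \mathcal X\) and \(B \subseteq \mathcal Z\),

\begin{align*} ((\mathrm{id} \parallel \eta ) \circ (\mu \otimes \kappa ))(A \times B) & = \int 1_A(x) \, \eta (y)(B) \, (\mu \otimes \kappa )(\mathrm{d}x, \mathrm{d}y) \\ & = \int _A \int _{\mathcal Y} \eta (y)(B) \, \kappa (x)(\mathrm{d}y) \, \mu (\mathrm{d}x) \\ & = \int _A (\eta \circ \kappa )(x)(B) \, \mu (\mathrm{d}x) = (\mu \otimes (\eta \circ \kappa ))(A \times B) \: . \end{align*}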

3.2 New definitions related to the mutual information

Definition 33

Let \(D\) be a divergence between measures. The left \(D\)-mutual information for a measure \(\mu \in \mathcal M(\mathcal X)\) and a kernel \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) is

\begin{align*} I_D^L(\mu , \kappa ) = \inf _{\xi \in \mathcal P(\mathcal Y)} D(\mu \times \xi , \mu \otimes \kappa ) \: . \end{align*}
Definition 34

Let \(D\) be a divergence between measures. The right \(D\)-mutual information for a measure \(\mu \in \mathcal M(\mathcal X)\) and a kernel \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) is

\begin{align*} I_D^R(\mu , \kappa ) = \inf _{\xi \in \mathcal P(\mathcal Y)} D(\mu \otimes \kappa , \mu \times \xi ) \: . \end{align*}
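For instance (an added example), if \(\kappa \) is a constant kernel, \(\kappa (x) = \xi _0\) for some \(\xi _0 \in \mathcal P(\mathcal Y)\), then \(\mu \otimes \kappa = \mu \times \xi _0\) and choosing \(\xi = \xi _0\) in the infimum gives

\begin{align*} I_D^L(\mu , \kappa ) \le D(\mu \times \xi _0, \mu \times \xi _0) \quad \text{and} \quad I_D^R(\mu , \kappa ) \le D(\mu \times \xi _0, \mu \times \xi _0) \: . \end{align*}

If \(D\) is nonnegative and satisfies \(D(\rho , \rho ) = 0\) (as is the case for \(\operatorname{KL}\)), a kernel that carries no information about its input therefore has zero \(D\)-mutual information on both sides.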

Lemma 3.2.1

For \(\mu \in \mathcal P(\{ 0,1\} )\) and \(\kappa : \{ 0,1\} \rightsquigarrow \mathcal Y\),

\begin{align*} I_{\operatorname{KL}}^L(\mu , \kappa ) = (1 - \mu _0) R_{\mu _0}(\kappa _0, \kappa _1) \: . \end{align*}
Proof
Lemma 3.2.2

For \(\mu \in \mathcal P(\mathcal X)\) and \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\),

\begin{align*} I_{\operatorname{KL}}^R(\mu , \kappa ) = I(\mu \otimes \kappa ) \: . \end{align*}
Proof

Lemma 3.2.3

For \(\mu \in \mathcal P(\{ 0,1\} )\) and \(\kappa : \{ 0,1\} \rightsquigarrow \mathcal Y\),

\begin{align*} I_{\operatorname{KL}}^R(\mu , \kappa ) = \operatorname{JS}_{\mu _0}(\kappa _0, \kappa _1) \: . \end{align*}
Proof

Data-processing inequality

Theorem 3.2.4

If the divergence \(D\) satisfies the data-processing inequality, then for all \(\mu \in \mathcal M(\mathcal X)\), \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) and all Markov kernels \(\eta : \mathcal Y \rightsquigarrow \mathcal Z\),

\begin{align*} I_D^L(\mu , \eta \circ \kappa ) \le I_D^L(\mu , \kappa ) \: , \\ I_D^R(\mu , \eta \circ \kappa ) \le I_D^R(\mu , \kappa ) \: . \end{align*}
Proof

Theorem 3.2.5

Let \(\pi \in \mathcal P(\Theta )\) and \(P : \Theta \rightsquigarrow \mathcal X\). Suppose that the loss \(\ell '\) of an estimation task with kernel \(P\) takes values in \([0,1]\). Then

\begin{align*} I_{D_f}^L(\pi , P) & \ge d_f(\mathcal R_\pi ^{d_\Theta }, \mathcal R_\pi ^P) \: , \\ I_{D_f}^R(\pi , P) & \ge d_f(\mathcal R_\pi ^P, \mathcal R_\pi ^{d_\Theta }) \: . \end{align*}
Proof

The left mutual information \(I_{D_f}^L(\pi , P)\) is an infimum: \(I_{D_f}^L(\pi , P) = \inf _{\xi \in \mathcal P(\mathcal X)} D_f(\pi \times \xi , \pi \otimes P)\). For all \(\xi \in \mathcal P(\mathcal X)\), by Lemma 2.3.63,

\begin{align*} D_f(\pi \times \xi , \pi \otimes P) & \ge d_f(\mathcal R_\pi ^{\theta \mapsto \xi }, \mathcal R_\pi ^P) \: . \end{align*}

The risk of a constant Markov kernel is equal to the risk of the discard kernel \(d_\Theta \) (TODO ref), hence \(\mathcal R_\pi ^{\theta \mapsto \xi } = \mathcal R_\pi ^{d_\Theta }\) and \(D_f(\pi \times \xi , \pi \otimes P) \ge d_f(\mathcal R_\pi ^{d_\Theta }, \mathcal R_\pi ^P)\) for every \(\xi \). Taking the infimum over \(\xi \in \mathcal P(\mathcal X)\) gives the first inequality.

The proof for the right mutual information is similar.