Lower bounds for hypothesis testing based on information theory

3 Mutual information

3.1 Mutual information

Definitions

Definition 31

The mutual information is, for \(\rho \in \mathcal M(\mathcal X \times \mathcal Y)\) ,

\begin{align*} I(\rho ) = \operatorname{KL}(\rho , \rho _X \times \rho _Y) \: . \end{align*}
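Two simple instances (a sketch, assuming probability measures and natural logarithms): if \(\rho = \mu \times \nu \) is a product measure with \(\mu \in \mathcal P(\mathcal X)\) and \(\nu \in \mathcal P(\mathcal Y)\), then \(\rho _X = \mu \), \(\rho _Y = \nu \) and

\begin{align*} I(\mu \times \nu ) = \operatorname{KL}(\mu \times \nu , \mu \times \nu ) = 0 \: , \end{align*}

while for the perfectly correlated coin \(\rho = \frac{1}{2}\delta _{(0,0)} + \frac{1}{2}\delta _{(1,1)}\) on \(\{ 0,1\} ^2\), both marginals are uniform, \(\rho _X \times \rho _Y\) puts mass \(1/4\) on each of the four points, and

\begin{align*} I(\rho ) = \operatorname{KL}(\rho , \rho _X \times \rho _Y) = \frac{1}{2}\log \frac{1/2}{1/4} + \frac{1}{2}\log \frac{1/2}{1/4} = \log 2 \: . \end{align*}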
Definition 32

Let \(\kappa : \mathcal Z \rightsquigarrow \mathcal X \times \mathcal Y\). The conditional mutual information of \(\kappa \) with respect to \(\nu \in \mathcal M(\mathcal Z)\) is

\begin{align*} I(\kappa \mid \nu ) = \operatorname{KL}(\kappa , \kappa _X \times \kappa _Y \mid \nu ) \: . \end{align*}
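When the conditional divergence is the \(\nu \)-average of the pointwise divergences (a standard property of \(\operatorname{KL}\), assumed here), Definition 32 unfolds as

\begin{align*} I(\kappa \mid \nu ) = \int _{\mathcal Z} \operatorname{KL}\big(\kappa (z), (\kappa (z))_X \times (\kappa (z))_Y\big) \, \mathrm d\nu (z) = \int _{\mathcal Z} I(\kappa (z)) \, \mathrm d\nu (z) \: , \end{align*}

that is, the \(\nu \)-average of the mutual informations of the measures \(\kappa (z) \in \mathcal M(\mathcal X \times \mathcal Y)\).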

Properties

Lemma 3.1.1

For \(\mu \in \mathcal M(\mathcal X)\) and \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) a Markov kernel,

\begin{align*} I(\mu \otimes \kappa ) = \operatorname{KL}(\mu \otimes \kappa , \mu \times (\kappa \circ \mu )) = \operatorname{KL}(\kappa , \kappa \circ \mu \mid \mu ) \: , \end{align*}

where in the conditional divergence the measure \(\kappa \circ \mu \) should be understood as the constant kernel from \(\mathcal X\) to \(\mathcal Y\) with that value.

Proof

The first equality is the definition of \(I\), together with \((\mu \otimes \kappa )_X = \mu \) (since \(\kappa \) is Markov) and \((\mu \otimes \kappa )_Y = \kappa \circ \mu \). The second equality is from \(\operatorname{KL}(\kappa , \eta \mid \mu ) = \operatorname{KL}(\mu \otimes \kappa , \mu \otimes \eta )\). (TODO: now from a lemma, soon by definition).
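As a sanity check (a sketch, assuming \(\mu \in \mathcal P(\mathcal X)\) and \(\kappa \) the constant Markov kernel with value \(\xi \in \mathcal P(\mathcal Y)\)): then \(\mu \otimes \kappa = \mu \times \xi \) and \(\kappa \circ \mu = \xi \), so the identity above gives

\begin{align*} I(\mu \otimes \kappa ) = \operatorname{KL}(\mu \times \xi , \mu \times \xi ) = 0 \: , \end{align*}

as expected when the second coordinate is independent of the first.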

Lemma 3.1.2

\(I(\rho _\leftrightarrow ) = I(\rho )\) .

Proof

Lemma 3.1.3

Let \(\mu , \nu \in \mathcal P(\mathcal X)\) and let \(\alpha \in (0, 1)\). Let \(\pi _\alpha = (\alpha , 1 - \alpha ) \in \mathcal P(\{ 0,1\} )\) and let \(P : \{ 0,1\} \rightsquigarrow \mathcal X\) be the kernel with \(P(0) = \mu \) and \(P(1) = \nu \). Then

\begin{align*} I(\pi _\alpha \otimes P) = \operatorname{JS}_\alpha (\mu , \nu ) \: . \end{align*}
Proof

Use Lemma 3.1.1 and Lemma 2.9.1.
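For reference, a sketch of the computation, assuming the conditional divergence averages the pointwise divergences and that Lemma 2.9.1 identifies \(\operatorname{JS}_\alpha (\mu , \nu )\) with \(\alpha \operatorname{KL}(\mu , \alpha \mu + (1 - \alpha ) \nu ) + (1 - \alpha ) \operatorname{KL}(\nu , \alpha \mu + (1 - \alpha ) \nu )\): by Lemma 3.1.1, with \(P \circ \pi _\alpha = \alpha \mu + (1 - \alpha ) \nu \),

\begin{align*} I(\pi _\alpha \otimes P) & = \operatorname{KL}(P, P \circ \pi _\alpha \mid \pi _\alpha ) \\ & = \alpha \operatorname{KL}(\mu , \alpha \mu + (1 - \alpha ) \nu ) + (1 - \alpha ) \operatorname{KL}(\nu , \alpha \mu + (1 - \alpha ) \nu ) = \operatorname{JS}_\alpha (\mu , \nu ) \: . \end{align*}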

Theorem 3.1.4 Data-processing

For \(\rho \in \mathcal M(\mathcal X \times \mathcal Y)\), \(\kappa : \mathcal X \rightsquigarrow \mathcal X'\) and \(\eta : \mathcal Y \rightsquigarrow \mathcal Y'\) two Markov kernels,

\begin{align*} I((\kappa \parallel \eta ) \circ \rho ) \le I(\rho ) \: . \end{align*}
Proof

\(((\kappa \parallel \eta ) \circ \rho )_{X'} = \kappa \circ \rho _X\) and \(((\kappa \parallel \eta ) \circ \rho )_{Y'} = \eta \circ \rho _Y\) hence

\begin{align*} ((\kappa \parallel \eta ) \circ \rho )_{X'} \times ((\kappa \parallel \eta ) \circ \rho )_{Y'} & = (\kappa \circ \rho _X) \times (\eta \circ \rho _Y) = (\kappa \parallel \eta ) \circ (\rho _X \times \rho _Y) \: . \end{align*}

By the data-processing inequality for \(\operatorname{KL}\) (Theorem 2.4.6),

\begin{align*} I((\kappa \parallel \eta ) \circ \rho ) & = \operatorname{KL}((\kappa \parallel \eta ) \circ \rho , (\kappa \parallel \eta ) \circ (\rho _X \times \rho _Y)) \\ & \le \operatorname{KL}(\rho , \rho _X \times \rho _Y) \\ & = I(\rho ) \: . \end{align*}
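For instance (writing \(\delta _f\) for the deterministic Markov kernel induced by a measurable map \(f\), a notational assumption), taking \(\kappa = \delta _f\) and \(\eta = \delta _g\) for measurable \(f : \mathcal X \to \mathcal X'\) and \(g : \mathcal Y \to \mathcal Y'\) gives

\begin{align*} I((\delta _f \parallel \delta _g) \circ \rho ) \le I(\rho ) \: , \end{align*}

where \((\delta _f \parallel \delta _g) \circ \rho \) is the pushforward of \(\rho \) by \((x, y) \mapsto (f(x), g(y))\); in random-variable notation, \(I(f(X), g(Y)) \le I(X, Y)\).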
Corollary 3.1.5

For \(\mu \in \mathcal M(\mathcal X)\), \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) and \(\eta : \mathcal Y \rightsquigarrow \mathcal Z\) two Markov kernels,

\begin{align*} I(\mu \otimes (\eta \circ \kappa )) \le I(\mu \otimes \kappa ) \: . \end{align*}

This corollary corresponds to the following situation: three random variables \(X, Y, Z\) have the dependency structure \(X \to Y \to Z\), where \(\mu \) is the law of \(X\), \(\mu \otimes \kappa \) is the law of \((X, Y)\) and \(\mu \otimes (\eta \circ \kappa )\) is the law of \((X, Z)\). With the usual \(I(X, Y)\) notation, the corollary states that \(I(X, Z) \le I(X, Y)\).

Proof

First, rewrite \(\mu \otimes (\eta \circ \kappa ) = (\mathrm{id} \parallel \eta ) \circ (\mu \otimes \kappa )\). Then apply Theorem 3.1.4.
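As a concrete instance (a sketch, with the hypothetical notation \(\delta _f\) for the deterministic Markov kernel of a measurable map \(f : \mathcal Y \to \mathcal Z\)): taking \(\eta = \delta _f\), the measure \(\mu \otimes (\eta \circ \kappa )\) is the law of \((X, f(Y))\) and the corollary reads

\begin{align*} I(X, f(Y)) \le I(X, Y) \: . \end{align*}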

3.2 New definitions related to the mutual information

Definition 33

Let \(D\) be a divergence between measures. The left \(D\)-mutual information for a measure \(\mu \in \mathcal M(\mathcal X)\) and a kernel \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) is

\begin{align*} I_D^L(\mu , \kappa ) = \inf _{\xi \in \mathcal P(\mathcal Y)} D(\mu \times \xi , \mu \otimes \kappa ) \: . \end{align*}
Definition 34

Let \(D\) be a divergence between measures. The right \(D\)-mutual information for a measure \(\mu \in \mathcal M(\mathcal X)\) and a kernel \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) is

\begin{align*} I_D^R(\mu , \kappa ) = \inf _{\xi \in \mathcal P(\mathcal Y)} D(\mu \otimes \kappa , \mu \times \xi ) \: . \end{align*}
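A basic sanity check (a sketch, assuming \(D \ge 0\) and \(D(\rho , \rho ) = 0\) for every measure \(\rho \)): if \(\kappa \) is the constant kernel with value \(\xi _0 \in \mathcal P(\mathcal Y)\), then \(\mu \otimes \kappa = \mu \times \xi _0\), and taking \(\xi = \xi _0\) in both infima gives

\begin{align*} I_D^L(\mu , \kappa ) = I_D^R(\mu , \kappa ) = 0 \: . \end{align*}

A kernel that ignores its input carries no information about it, for any divergence \(D\) with these two properties.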

Lemma 3.2.1

For \(\mu \in \mathcal P(\{ 0,1\} )\) and \(\kappa : \{ 0,1\} \rightsquigarrow \mathcal Y\),

\begin{align*} I_{\operatorname{KL}}^L(\mu , \kappa ) = (1 - \mu _0) R_{\mu _0}(\kappa _0, \kappa _1) \: . \end{align*}
Proof
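A sketch of the computation behind this identity, assuming \(\kappa _0, \kappa _1\) have densities \(p_0, p_1\) with respect to a common dominating measure and that \(R_\lambda \) is the Rényi divergence with the normalization \(R_\lambda (\kappa _0, \kappa _1) = \frac{1}{\lambda - 1} \log \int p_0^\lambda p_1^{1 - \lambda }\): since \(\mu \times \xi \) and \(\mu \otimes \kappa \) decompose over the disjoint slices \(\{ 0\} \times \mathcal Y\) and \(\{ 1\} \times \mathcal Y\) with the same weights \(\mu _0, \mu _1\),

\begin{align*} \operatorname{KL}(\mu \times \xi , \mu \otimes \kappa ) = \mu _0 \operatorname{KL}(\xi , \kappa _0) + \mu _1 \operatorname{KL}(\xi , \kappa _1) \: . \end{align*}

The infimum over \(\xi \) is attained at the normalized geometric mixture \(\xi ^* \propto p_0^{\mu _0} p_1^{\mu _1}\), with value

\begin{align*} I_{\operatorname{KL}}^L(\mu , \kappa ) = - \log \int p_0^{\mu _0} p_1^{\mu _1} = (1 - \mu _0) R_{\mu _0}(\kappa _0, \kappa _1) \: . \end{align*}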
Lemma 3.2.2

For \(\mu \in \mathcal P(\mathcal X)\) and \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\),

\begin{align*} I_{\operatorname{KL}}^R(\mu , \kappa ) = I(\mu \otimes \kappa ) \: . \end{align*}
Proof

Lemma 3.2.3

For \(\mu \in \mathcal P(\{ 0,1\} )\) and \(\kappa : \{ 0,1\} \rightsquigarrow \mathcal Y\),

\begin{align*} I_{\operatorname{KL}}^R(\mu , \kappa ) = \operatorname{JS}_{\mu _0}(\kappa _0, \kappa _1) \: . \end{align*}
Proof
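A sketch of the computation, assuming \(\operatorname{JS}_{\mu _0}\) is the weighted Jensen-Shannon divergence of Section 2.9: decomposing over the slices \(\{ 0\} \times \mathcal Y\) and \(\{ 1\} \times \mathcal Y\),

\begin{align*} \operatorname{KL}(\mu \otimes \kappa , \mu \times \xi ) = \mu _0 \operatorname{KL}(\kappa _0, \xi ) + \mu _1 \operatorname{KL}(\kappa _1, \xi ) \: , \end{align*}

and by the compensation identity this equals \(\mu _0 \operatorname{KL}(\kappa _0, \xi ^*) + \mu _1 \operatorname{KL}(\kappa _1, \xi ^*) + \operatorname{KL}(\xi ^*, \xi )\) with \(\xi ^* = \mu _0 \kappa _0 + \mu _1 \kappa _1\). The infimum over \(\xi \) is therefore attained at \(\xi = \xi ^*\), with value

\begin{align*} I_{\operatorname{KL}}^R(\mu , \kappa ) = \mu _0 \operatorname{KL}(\kappa _0, \mu _0 \kappa _0 + \mu _1 \kappa _1) + \mu _1 \operatorname{KL}(\kappa _1, \mu _0 \kappa _0 + \mu _1 \kappa _1) = \operatorname{JS}_{\mu _0}(\kappa _0, \kappa _1) \: . \end{align*}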

Data-processing inequality

Theorem 3.2.4

If the divergence \(D\) satisfies the data-processing inequality, then for all \(\mu \in \mathcal M(\mathcal X)\), all kernels \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) and all Markov kernels \(\eta : \mathcal Y \rightsquigarrow \mathcal Z\),

\begin{align*} I_D^L(\mu , \eta \circ \kappa ) \le I_D^L(\mu , \kappa ) \: , \\ I_D^R(\mu , \eta \circ \kappa ) \le I_D^R(\mu , \kappa ) \: . \end{align*}
Proof
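A sketch of one way to obtain the left inequality, assuming the data-processing property of \(D\) applies to the Markov kernel \(\mathrm{id} \parallel \eta \) as in Theorem 3.1.4: restricting the infimum to measures of the form \(\eta \circ \xi '\) with \(\xi ' \in \mathcal P(\mathcal Y)\),

\begin{align*} I_D^L(\mu , \eta \circ \kappa ) & = \inf _{\xi \in \mathcal P(\mathcal Z)} D(\mu \times \xi , \mu \otimes (\eta \circ \kappa )) \\ & \le \inf _{\xi ' \in \mathcal P(\mathcal Y)} D((\mathrm{id} \parallel \eta ) \circ (\mu \times \xi '), (\mathrm{id} \parallel \eta ) \circ (\mu \otimes \kappa )) \\ & \le \inf _{\xi ' \in \mathcal P(\mathcal Y)} D(\mu \times \xi ', \mu \otimes \kappa ) = I_D^L(\mu , \kappa ) \: , \end{align*}

using \((\mathrm{id} \parallel \eta ) \circ (\mu \times \xi ') = \mu \times (\eta \circ \xi ')\) and \((\mathrm{id} \parallel \eta ) \circ (\mu \otimes \kappa ) = \mu \otimes (\eta \circ \kappa )\). The right inequality follows in the same way with the two arguments of \(D\) swapped.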

Theorem 3.2.5

Let \(\pi \in \mathcal P(\Theta )\) and \(P : \Theta \rightsquigarrow \mathcal X\). Suppose that the loss \(\ell '\) of an estimation task with kernel \(P\) takes values in \([0,1]\). Then

\begin{align*} I_{D_f}^L(\pi , P) & \ge d_f(\mathcal R_\pi ^{d_\Theta }, \mathcal R_\pi ^P) \: , \\ I_{D_f}^R(\pi , P) & \ge d_f(\mathcal R_\pi ^P, \mathcal R_\pi ^{d_\Theta }) \: . \end{align*}
Proof

The left mutual information \(I_{D_f}^L(\pi , P)\) is an infimum: \(I_{D_f}^L(\pi , P) = \inf _{\xi \in \mathcal P(\mathcal X)} D_f(\pi \times \xi , \pi \otimes P)\). For all \(\xi \in \mathcal P(\mathcal X)\), by Lemma 2.3.62,

\begin{align*} D_f(\pi \times \xi , \pi \otimes P) & \ge d_f(\mathcal R_\pi ^{\theta \mapsto \xi }, \mathcal R_\pi ^P) \: . \end{align*}

The risk of a constant Markov kernel is equal to the risk of the discard kernel \(d_\Theta \) (TODO ref), hence \(\mathcal R_\pi ^{\theta \mapsto \xi } = \mathcal R_\pi ^{d_\Theta }\) and \(D_f(\pi \times \xi , \pi \otimes P) \ge d_f(\mathcal R_\pi ^{d_\Theta }, \mathcal R_\pi ^P)\). Taking the infimum over \(\xi \) gives the first inequality.

The proof for the right mutual information is similar.