3 Mutual information
3.1 Mutual information
Definitions
The mutual information of a measure \(\rho \in \mathcal M(\mathcal X \times \mathcal Y)\) is
\[ I(\rho ) = \operatorname{KL}(\rho , \rho _X \times \rho _Y) . \]
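As a quick sanity check (an immediate consequence of the definition, not stated in the text): for a product of probability measures \(\rho = \mu \times \nu \) with \(\mu \in \mathcal P(\mathcal X)\) and \(\nu \in \mathcal P(\mathcal Y)\), the marginals are \(\rho _X = \mu \) and \(\rho _Y = \nu \), hence
\[ I(\mu \times \nu ) = \operatorname{KL}(\mu \times \nu , \mu \times \nu ) = 0 , \]
so independent components carry zero mutual information.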
Let \(\kappa : \mathcal Z \rightsquigarrow \mathcal X \times \mathcal Y\). The conditional mutual information of \(\kappa \) with respect to \(\nu \in \mathcal M(\mathcal Z)\) is
Properties
For \(\mu \in \mathcal M(\mathcal X)\) and \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) a Markov kernel,
\[ I(\mu \otimes \kappa ) = \operatorname{KL}(\kappa , \kappa \circ \mu \mid \mu ) , \]
where in the conditional divergence the measure \(\kappa \circ \mu \) should be understood as the constant kernel from \(\mathcal X\) to \(\mathcal Y\) with that value.
The first equality in
\[ I(\mu \otimes \kappa ) = \operatorname{KL}(\mu \otimes \kappa , \mu \times (\kappa \circ \mu )) = \operatorname{KL}(\kappa , \kappa \circ \mu \mid \mu ) \]
is the definition of \(I\), together with \((\mu \otimes \kappa )_X = \mu \) (since \(\kappa \) is Markov) and \((\mu \otimes \kappa )_Y = \kappa \circ \mu \). The second equality is \(\operatorname{KL}(\kappa , \eta \mid \mu ) = \operatorname{KL}(\mu \otimes \kappa , \mu \otimes \eta )\), applied to the constant kernel \(\eta \) with value \(\kappa \circ \mu \), for which \(\mu \otimes \eta = \mu \times (\kappa \circ \mu )\). (TODO: now from a lemma, soon by definition).
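To make the lemma concrete, here is how it reads in the discrete case, assuming (consistently with the equality above) that the conditional divergence is the \(\mu \)-average of the pointwise divergences:
\[ I(\mu \otimes \kappa ) = \sum _{x \in \mathcal X} \mu (\{ x\} ) \operatorname{KL}(\kappa (x), \kappa \circ \mu ) , \]
that is, the mutual information is the average divergence of the conditional laws \(\kappa (x)\) from their mixture \(\kappa \circ \mu \).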
\(I(\rho _\leftrightarrow ) = I(\rho )\) .
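A short justification (not spelled out here), assuming \(\rho _\leftrightarrow \) denotes the pushforward of \(\rho \) by the swap map \(s : \mathcal X \times \mathcal Y \to \mathcal Y \times \mathcal X\): the two marginals of \(\rho _\leftrightarrow = s_*\rho \) are \(\rho _Y\) and \(\rho _X\), and \(s_*(\rho _X \times \rho _Y) = \rho _Y \times \rho _X\). Since \(s\) is a measurable bijection with measurable inverse, the data-processing inequality for \(\operatorname{KL}\) applies in both directions, so
\[ I(\rho _\leftrightarrow ) = \operatorname{KL}(s_*\rho , s_*(\rho _X \times \rho _Y)) = \operatorname{KL}(\rho , \rho _X \times \rho _Y) = I(\rho ) . \]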
Let \(\mu , \nu \in \mathcal P(\mathcal X)\) and let \(\alpha \in (0, 1)\). Let \(\pi _\alpha = (\alpha , 1 - \alpha ) \in \mathcal P(\{ 0,1\} )\) and let \(P : \{ 0,1\} \rightsquigarrow \mathcal X\) be the kernel with \(P(0) = \mu \) and \(P(1) = \nu \). Then
\[ I(\pi _\alpha \otimes P) = \alpha \operatorname{KL}(\mu , \alpha \mu + (1 - \alpha ) \nu ) + (1 - \alpha ) \operatorname{KL}(\nu , \alpha \mu + (1 - \alpha ) \nu ) . \]
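A worked special case (a standard consequence, not stated in the text): for \(\alpha = 1/2\), since \(P \circ \pi _{1/2} = \frac{\mu + \nu }{2}\), the right-hand side is the Jensen-Shannon divergence,
\[ I(\pi _{1/2} \otimes P) = \frac{1}{2} \operatorname{KL}\left(\mu , \frac{\mu + \nu }{2}\right) + \frac{1}{2} \operatorname{KL}\left(\nu , \frac{\mu + \nu }{2}\right) . \]
In words, the mutual information between a fair coin and a sample drawn from \(\mu \) or \(\nu \) according to that coin quantifies how distinguishable \(\mu \) and \(\nu \) are.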
For \(\rho \in \mathcal M(\mathcal X \times \mathcal Y)\), \(\kappa : \mathcal X \rightsquigarrow \mathcal X'\) and \(\eta : \mathcal Y \rightsquigarrow \mathcal Y'\) two Markov kernels,
\[ I((\kappa \parallel \eta ) \circ \rho ) \le I(\rho ) . \]
\(((\kappa \parallel \eta ) \circ \rho )_{X'} = \kappa \circ \rho _X\) and \(((\kappa \parallel \eta ) \circ \rho )_{Y'} = \eta \circ \rho _Y\), hence
\[ I((\kappa \parallel \eta ) \circ \rho ) = \operatorname{KL}((\kappa \parallel \eta ) \circ \rho , (\kappa \circ \rho _X) \times (\eta \circ \rho _Y)) = \operatorname{KL}((\kappa \parallel \eta ) \circ \rho , (\kappa \parallel \eta ) \circ (\rho _X \times \rho _Y)) . \]
By the data-processing inequality for \(\operatorname{KL}\) (Theorem 2.4.6),
\[ \operatorname{KL}((\kappa \parallel \eta ) \circ \rho , (\kappa \parallel \eta ) \circ (\rho _X \times \rho _Y)) \le \operatorname{KL}(\rho , \rho _X \times \rho _Y) = I(\rho ) . \]
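An illustration of the theorem (a specialization, not in the text): if \(\kappa \) and \(\eta \) are the deterministic kernels induced by measurable maps \(f : \mathcal X \to \mathcal X'\) and \(g : \mathcal Y \to \mathcal Y'\), and \(\rho \) is the law of a pair of random variables \((X, Y)\), then \((\kappa \parallel \eta ) \circ \rho \) is the law of \((f(X), g(Y))\) and the theorem reads
\[ I(f(X), g(Y)) \le I(X, Y) \]
in the usual random-variable notation: transforming each coordinate separately cannot increase the mutual information.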
For \(\mu \in \mathcal M(\mathcal X)\), \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) and \(\eta : \mathcal Y \rightsquigarrow \mathcal Z\) two Markov kernels,
\[ I(\mu \otimes (\eta \circ \kappa )) \le I(\mu \otimes \kappa ) . \]
This corollary corresponds to the following situation: three random variables \(X, Y, Z\) have the dependency structure \(X \to Y \to Z\), where \(\mu \) is the law of \(X\), \(\mu \otimes \kappa \) is the law of \((X, Y)\) and \(\mu \otimes (\eta \circ \kappa )\) is the law of \((X, Z)\). With the usual \(I(X, Y)\) notation, the corollary states that \(I(X, Z) \le I(X, Y)\).
First, rewrite \(\mu \otimes (\eta \circ \kappa ) = (\mathrm{id} \parallel \eta ) \circ (\mu \otimes \kappa )\). Then apply Theorem 3.1.4.
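A degenerate but instructive instance (not in the text): let \(\mu \in \mathcal P(\mathcal X)\) and let \(\eta \) be a constant Markov kernel with value \(\xi \in \mathcal P(\mathcal Z)\), so that \(Z\) is independent of \(Y\). Then \(\eta \circ \kappa \) is the constant kernel with value \(\xi \), hence \(\mu \otimes (\eta \circ \kappa ) = \mu \times \xi \) and
\[ I(\mu \otimes (\eta \circ \kappa )) = \operatorname{KL}(\mu \times \xi , \mu \times \xi ) = 0 \le I(\mu \otimes \kappa ) . \]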
3.2 New definitions related to the mutual information
Let \(D\) be a divergence between measures. The left \(D\)-mutual information for a measure \(\mu \in \mathcal M(\mathcal X)\) and a kernel \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) is
\[ I_D^L(\mu , \kappa ) = \inf _{\xi \in \mathcal P(\mathcal Y)} D(\mu \times \xi , \mu \otimes \kappa ) . \]
Let \(D\) be a divergence between measures. The right \(D\)-mutual information for a measure \(\mu \in \mathcal M(\mathcal X)\) and a kernel \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) is
\[ I_D^R(\mu , \kappa ) = \inf _{\xi \in \mathcal P(\mathcal Y)} D(\mu \otimes \kappa , \mu \times \xi ) . \]
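As an example of how these quantities relate to \(I\) (a standard computation, under finiteness and absolute-continuity assumptions, and for \(\mu \in \mathcal P(\mathcal X)\)): for \(D = \operatorname{KL}\), the decomposition
\[ \operatorname{KL}(\mu \otimes \kappa , \mu \times \xi ) = \operatorname{KL}(\mu \otimes \kappa , \mu \times (\kappa \circ \mu )) + \operatorname{KL}(\kappa \circ \mu , \xi ) \]
shows that \(\inf _{\xi \in \mathcal P(\mathcal Y)} \operatorname{KL}(\mu \otimes \kappa , \mu \times \xi )\) is attained at \(\xi = \kappa \circ \mu \) and equals \(I(\mu \otimes \kappa )\): the variational expression with \(\mu \otimes \kappa \) in the first argument of the divergence recovers the usual mutual information.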
For \(\mu \in \mathcal P(\{ 0,1\} )\) and \(\kappa : \{ 0,1\} \rightsquigarrow \mathcal Y\),
For \(\mu \in \mathcal P(\mathcal X)\) and \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\),
For \(\mu \in \mathcal P(\{ 0,1\} )\) and \(\kappa : \{ 0,1\} \rightsquigarrow \mathcal Y\),
Data-processing inequality
If the divergence \(D\) satisfies the data-processing inequality, then for all \(\mu \in \mathcal M(\mathcal X)\), \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) and all Markov kernels \(\eta : \mathcal Y \rightsquigarrow \mathcal Z\),
\[ I_D^L(\mu , \eta \circ \kappa ) \le I_D^L(\mu , \kappa ) \quad \text{and} \quad I_D^R(\mu , \eta \circ \kappa ) \le I_D^R(\mu , \kappa ) . \]
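A proof sketch for the left version (assuming the variational definition above): for any \(\xi \in \mathcal P(\mathcal Y)\), we have \(\mu \times (\eta \circ \xi ) = (\mathrm{id} \parallel \eta ) \circ (\mu \times \xi )\) and \(\mu \otimes (\eta \circ \kappa ) = (\mathrm{id} \parallel \eta ) \circ (\mu \otimes \kappa )\), hence by the data-processing inequality for \(D\),
\[ I_D^L(\mu , \eta \circ \kappa ) \le D(\mu \times (\eta \circ \xi ), \mu \otimes (\eta \circ \kappa )) \le D(\mu \times \xi , \mu \otimes \kappa ) . \]
Taking the infimum over \(\xi \) gives the claim; the argument for the right version is the same with the two arguments of \(D\) swapped.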
Let \(\pi \in \mathcal P(\Theta )\) and \(P : \Theta \rightsquigarrow \mathcal X\). Suppose that the loss \(\ell '\) of an estimation task with kernel \(P\) takes values in \([0,1]\). Then
The left mutual information \(I_D^L(\pi , P)\) is an infimum: \(I_D^L(\pi , P) = \inf _{\xi \in \mathcal P(\mathcal X)} D(\pi \times \xi , \pi \otimes P)\). For all \(\xi \in \mathcal P(\mathcal X)\), by Lemma 2.3.63,
The risk of a constant Markov kernel is equal to the risk of the discard kernel \(d_\Theta \) (TODO ref).
The proof for the right mutual information is similar.