2 Information divergences
A divergence between measures is a function which maps two measures on a common space \(\mathcal X\) to a value in \([-\infty , +\infty ]\). For two probability measures \(\mu \) and \(\nu \), we will want the divergence \(D(\mu , \nu )\) to be non-negative, and to be zero if \(\mu = \nu \): divergences are meant to be notions of dissimilarity between measures.
We will pay particular attention to how the divergence behaves with respect to transformations of the measures. For a Markov kernel \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) , a fundamental property of all divergences we will consider is the data-processing inequality, which states that \(D(\kappa \circ \mu , \kappa \circ \nu ) \le D(\mu , \nu )\). That is, processing data with a (possibly random) common transformation \(\kappa \) can only lose information about the difference between the two measures.
Another important concept is the conditional divergence of two kernels \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) given a measure \(\mu \in \mathcal M(\mathcal X)\). This is the value \(D(\mu \otimes \kappa , \mu \otimes \eta )\).
TODO: two possible concepts of conditional divergence are \(D(\mu \otimes \kappa , \mu \otimes \eta )\) and \(\mu \left[x \mapsto D(\kappa (x), \eta (x))\right]\). They are equal for \(f\)-divergences (which also gives convexity of the divergence) but not in general.
2.1 Generic divergences
A divergence between measures is a function \(D\) which for any measurable space \(\mathcal X\) and any two measures \(\mu , \nu \in \mathcal M(\mathcal X)\), returns a value \(D(\mu , \nu ) \in \mathbb {R} \cup \{ +\infty \} \).
In Lean, such a definition for a divergence would have to restrict the possible measurable spaces to be in a common universe. We don’t implement that definition, but will only implement particular divergences.
A divergence \(D\) is said to satisfy the data-processing inequality (DPI) if for all measurable spaces \(\mathcal X, \mathcal Y\), all \(\mu , \nu \in \mathcal M(\mathcal X)\) and all Markov kernels \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\),
\[D(\kappa \circ \mu , \kappa \circ \nu ) \le D(\mu , \nu ) .\]
Let \(D\) be a divergence that satisfies the DPI. Let \(\mu , \nu \in \mathcal M(\mathcal X)\) and let \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\). Then
\[D(\kappa \circ \mu , \eta \circ \nu ) \le D(\mu \otimes \kappa , \nu \otimes \eta ) .\]
Use the DPI for the deterministic kernel of the function \((x,y) \mapsto y\).
Let \(D\) be a divergence that satisfies the DPI. Let \(\mu , \nu \in \mathcal M(\mathcal X)\) and let \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) be Markov kernels. Then
\[D(\mu , \nu ) \le D(\mu \otimes \kappa , \nu \otimes \eta ) .\]
Use the DPI for the deterministic kernel of the function \((x,y) \mapsto x\).
Let \(D\) be a divergence that satisfies the DPI. Let \(\mu , \nu \in \mathcal M(\mathcal X)\) and let \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) be a Markov kernel. Then
\[D(\mu \otimes \kappa , \nu \otimes \kappa ) = D(\mu , \nu ) .\]
Lemma 2.1.2 gives one inequality. The other inequality is the DPI: \(\mu \otimes \kappa = (\mathrm{id} \times \kappa ) \circ \mu \).
Let \(D\) be a divergence. The conditional divergence of kernels \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) with respect to a measure \(\mu \in \mathcal M(\mathcal X)\) is \(\mu [x \mapsto D(\kappa (x), \eta (x))]\). It is denoted by \(D(\kappa , \eta \mid \mu )\).
For the important class of \(f\)-divergences that we will introduce later, \(D(\kappa , \eta \mid \mu ) = D(\mu \otimes \kappa , \mu \otimes \eta )\).
Let \(D\) be a divergence that satisfies the DPI and for which \(D(\kappa , \eta \mid \mu ) = D(\mu \otimes \kappa , \mu \otimes \eta )\). Let \(\mu \in \mathcal M(\mathcal X)\) and let \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) be Markov kernels. Then
\[D(\kappa \circ \mu , \eta \circ \mu ) \le D(\kappa , \eta \mid \mu ) .\]
This is a special case of Lemma 2.1.1.
2.2 Statistical divergences
We define a family of divergences based on the risk of a binary estimation task.
2.2.1 Statistical information
The statistical information between measures \(\mu \) and \(\nu \) with respect to prior \(\xi \in \mathcal M(\{ 0,1\} )\) is \(\mathcal I_\xi (\mu , \nu ) = \min \{ \xi _0 \mu (\mathcal X), \xi _1 \nu (\mathcal X)\} - \mathcal B_\xi (\mu , \nu )\). This is the risk increase \(I_\xi ^P(d_{\mathcal X})\) in the binary hypothesis testing problem for \(d_{\mathcal X} : \mathcal X \rightsquigarrow *\) the Markov kernel to the point space.
The statistical information is the difference between the minimal risk of an estimator that does not depend on the data and that of a Bayes estimator. This is a simple generalization of both the DeGroot statistical information and the hockey-stick (or \(E_\gamma \)) divergence.
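As an illustrative sanity check (not part of the formal development), the definition can be evaluated for finite discrete measures. The sketch below assumes the discrete form of the Bayes risk, \(\mathcal B_\xi (\mu , \nu ) = \sum _x \min \{ \xi _0 \mu (\{ x\} ), \xi _1 \nu (\{ x\} )\} \); the helper names are illustrative only.

```python
# Numerical sketch of the statistical information for finite discrete measures,
# assuming the discrete form of the Bayes risk (names are illustrative).

def bayes_risk(xi, mu, nu):
    """Discrete Bayes risk: sum over the common support of the pointwise min."""
    xi0, xi1 = xi
    return sum(min(xi0 * mu.get(x, 0.0), xi1 * nu.get(x, 0.0)) for x in set(mu) | set(nu))

def stat_info(xi, mu, nu):
    """I_xi(mu, nu) = min(xi0*mu(X), xi1*nu(X)) - B_xi(mu, nu)."""
    xi0, xi1 = xi
    return min(xi0 * sum(mu.values()), xi1 * sum(nu.values())) - bayes_risk(xi, mu, nu)

mu = {"a": 0.5, "b": 0.3, "c": 0.2}
nu = {"a": 0.2, "b": 0.2, "c": 0.6}
xi = (0.4, 0.6)

assert abs(stat_info(xi, mu, mu)) < 1e-12             # I_xi(mu, mu) = 0
assert 0.0 <= stat_info(xi, mu, nu) <= min(0.4, 0.6)  # 0 <= I_xi <= min{xi0 mu(X), xi1 nu(X)}
```

The three assertions mirror the first three lemmas below: the divergence of a measure against itself vanishes, and the statistical information sits between \(0\) and \(\min \{ \xi _0 \mu (\mathcal X), \xi _1 \nu (\mathcal X)\} \).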
For \(\mu \in \mathcal M(\mathcal X)\), \(\mathcal I_\xi (\mu , \mu ) = 0\) .
Use Lemma 1.2.6.
For \(\mu , \nu \in \mathcal M(\mathcal X)\), \(\mathcal I_\xi (\mu , \nu ) \ge 0\) .
Use Lemma 1.2.8.
For \(\mu , \nu \in \mathcal M(\mathcal X)\), \(\mathcal I_\xi (\mu , \nu ) \le \min \{ \xi _0 \mu (\mathcal X), \xi _1 \nu (\mathcal X)\} \) .
Use Lemma 1.2.7.
For \(\mu , \nu \in \mathcal M(\mathcal X)\) and \(\xi \in \mathcal M(\{ 0,1\} )\), \(\mathcal I_\xi (\mu , \nu ) = \mathcal I_{\xi _\leftrightarrow }(\nu , \mu )\) .
Use Lemma 1.2.9.
For \(\mu , \nu \in \mathcal M(\mathcal X)\), \(\xi \in \mathcal M(\{ 0,1\} )\) and \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) a Markov kernel, \(\mathcal I_\xi (\kappa \circ \mu , \kappa \circ \nu ) \le \mathcal I_\xi (\mu , \nu )\) .
Since \(\kappa \) is a Markov kernel, \((\kappa \circ \mu )(\mathcal Y) = \mu (\mathcal X)\) and \((\kappa \circ \nu )(\mathcal Y) = \nu (\mathcal X)\).
It suffices to prove \(\mathcal B_\xi (\kappa \circ \mu , \kappa \circ \nu ) \ge \mathcal B_\xi (\mu , \nu )\) since \(\mathcal I_\xi (\mu , \nu ) = \min \{ \xi _0\mu (\mathcal X), \xi _1\nu (\mathcal X)\} - \mathcal B_\xi (\mu , \nu )\). The inequality was proved in Theorem 1.2.5.
Alternatively, we can use Lemma 1.1.19.
Let \(\mu , \nu \) be two measures on \(\mathcal X\), \(\xi \in \mathcal M(\{ 0,1\} )\) and let \(E\) be an event on \(\mathcal X\). Let \(\mu _E\) and \(\nu _E\) be the two Bernoulli distributions with respective means \(\mu (E)\) and \(\nu (E)\). Then \(\mathcal I_\xi (\mu , \nu ) \ge \mathcal I_\xi (\mu _E, \nu _E)\).
Apply Theorem 2.2.5 to the deterministic kernel \(\mathcal X \rightsquigarrow \{ 0,1\} \) given by the function \(x \mapsto \mathbb {I}\{ x \in E\} \).
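A discrete numerical sketch of this corollary (assuming the discrete form of the Bayes risk; the names are illustrative): restricting to the Bernoulli distributions induced by an event can only lose statistical information.

```python
# Comparing via an event E can only lose statistical information (discrete sketch).

def stat_info(xi, mu, nu):
    xi0, xi1 = xi
    bayes = sum(min(xi0 * mu.get(x, 0.0), xi1 * nu.get(x, 0.0)) for x in set(mu) | set(nu))
    return min(xi0 * sum(mu.values()), xi1 * sum(nu.values())) - bayes

mu = {1: 0.5, 2: 0.3, 3: 0.2}
nu = {1: 0.1, 2: 0.3, 3: 0.6}
xi = (0.5, 0.5)
E = {2}

# Bernoulli measures with means mu(E) and nu(E)
mu_E = {1: sum(mu[x] for x in E), 0: sum(m for x, m in mu.items() if x not in E)}
nu_E = {1: sum(nu[x] for x in E), 0: sum(m for x, m in nu.items() if x not in E)}

assert stat_info(xi, mu, nu) >= stat_info(xi, mu_E, nu_E) - 1e-12
```

Here \(E = \{ 2\} \) is a poorly chosen event (\(\mu (E) = \nu (E)\)), so the inequality is strict: the induced Bernoulli pair carries no statistical information while the original pair does.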
For finite measures \(\mu , \nu \) and \(\xi \in \mathcal M(\{ 0,1\} )\), for any measure \(\zeta \) with \(\mu \ll \zeta \) and \(\nu \ll \zeta \),
\[\mathcal I_\xi (\mu , \nu ) = \min \{ \xi _0 \mu (\mathcal X), \xi _1 \nu (\mathcal X)\} - \int _x \min \left\{ \xi _0 \frac{d \mu }{d \zeta }(x), \xi _1 \frac{d \nu }{d \zeta }(x)\right\} \, \partial \zeta .\]
This holds in particular for \(\zeta = P \circ \xi \).
Apply Theorem 1.2.12 to get that result for \(\zeta = P \circ \xi \). Then for other \(\zeta \) with \(\mu \ll \zeta \) and \(\nu \ll \zeta \) , use \(\frac{d \mu }{d\zeta } = \frac{d \mu }{d(P \circ \xi )} / \frac{d (P \circ \xi )}{d\zeta }\) .
For finite measures \(\mu , \nu \) and \(\xi \in \mathcal M(\{ 0,1\} )\), for any measure \(\zeta \) with \(\mu \ll \zeta \) and \(\nu \ll \zeta \),
\[\mathcal I_\xi (\mu , \nu ) = \frac{1}{2}\left(\int _x \left\vert \xi _0 \frac{d \mu }{d \zeta }(x) - \xi _1 \frac{d \nu }{d \zeta }(x)\right\vert \, \partial \zeta - \left\vert \xi _0 \mu (\mathcal X) - \xi _1 \nu (\mathcal X)\right\vert \right) .\]
This holds in particular for \(\zeta = P \circ \xi \).
Apply Corollary 1.2.13 and write \(\min \{ \xi _0\mu (\mathcal X), \xi _1\nu (\mathcal X)\} = \frac{1}{2}\left((P \circ \xi )(\mathcal X) - \left\vert \xi _0\mu (\mathcal X) - \xi _1\nu (\mathcal X) \right\vert \right)\) to get the result for \(\zeta = P \circ \xi \). The result for other \(\zeta \) follows from simple manipulations of Radon-Nikodym derivatives.
For finite measures \(\mu , \nu \) and \(\xi \in \mathcal M(\{ 0,1\} )\),
For probability measures, this theorem makes the statistical information an \(f\)-divergence (which is defined later in this document).
By Theorem 1.2.12,
Suppose that \(\xi _0\mu (\mathcal X) \le \xi _1\nu (\mathcal X)\) . Then
If on the other hand \(\xi _0\mu (\mathcal X) \ge \xi _1\nu (\mathcal X)\) , with similar computations,
For finite measures \(\mu , \nu \) and \(\xi \in \mathcal M(\{ 0,1\} )\),
For finite measures \(\mu , \nu \) and \(\xi \in \mathcal M(\{ 0,1\} )\),
This is a direct application of Lemma 1.2.11.
For finite measures \(\mu , \nu \) and \(\xi \in \mathcal M(\{ 0,1\} )\),
TODO: find a way to hide the following dummy lemma
Dummy node to summarize properties of the statistical information.
The DeGroot statistical information between finite measures \(\mu \) and \(\nu \) for \(\pi \in [0,1]\) is \(I_\pi (\mu , \nu ) = \mathcal I_{(\pi , 1 - \pi )}(\mu , \nu )\) .
The \(E_\gamma \) or hockey-stick divergence between finite measures \(\mu \) and \(\nu \) for \(\gamma \in (0,+\infty )\) is \(E_\gamma (\mu , \nu ) = \mathcal I_{(1,\gamma )}(\mu , \nu )\) .
Note that our definition of the \(E_\gamma \) divergence extends the one from the literature: the divergence is usually defined for \(\gamma \ge 1\) only. The extension makes sense in light of Theorem 2.3.35. It is also extended from probability measures to finite measures.
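As a quick consistency check, for probability measures and \(\gamma \ge 1\) the definition \(E_\gamma = \mathcal I_{(1,\gamma )}\) should recover the usual hockey-stick expression \(\sum _x \max \{ \mu (\{ x\} ) - \gamma \nu (\{ x\} ), 0\} \). A discrete numerical sketch (assuming the discrete form of the Bayes risk; names are illustrative):

```python
# E_gamma via the statistical-information definition vs. the hockey-stick form.

def E_gamma(gamma, mu, nu):
    support = set(mu) | set(nu)
    bayes = sum(min(mu.get(x, 0.0), gamma * nu.get(x, 0.0)) for x in support)
    return min(sum(mu.values()), gamma * sum(nu.values())) - bayes

def hockey_stick(gamma, mu, nu):
    support = set(mu) | set(nu)
    return sum(max(mu.get(x, 0.0) - gamma * nu.get(x, 0.0), 0.0) for x in support)

mu = {0: 0.7, 1: 0.3}
nu = {0: 0.4, 1: 0.6}
for gamma in [1.0, 1.25, 1.5, 2.0]:
    assert abs(E_gamma(gamma, mu, nu) - hockey_stick(gamma, mu, nu)) < 1e-12
```

The identity follows from \(\min \{ a, b\} = a - \max \{ a - b, 0\} \) applied pointwise, together with \(\min \{ 1, \gamma \} = 1\) for \(\gamma \ge 1\).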
2.2.2 Total variation distance
The total variation distance between finite measures \(\mu \) and \(\nu \) is \(\operatorname{TV}(\mu , \nu ) = \mathcal I_{(1,1)}(\mu , \nu )\) .
This is also equal to \(2 I_{1/2}(\mu , \nu )\) and to \(E_1(\mu , \nu )\).
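For discrete probability measures the definition \(\operatorname{TV} = \mathcal I_{(1,1)}\) should reduce to the familiar half-\(L^1\) distance. A numerical sketch (assuming the discrete form of the Bayes risk; names are illustrative):

```python
# TV as I_{(1,1)} vs. the half-L1 distance, for discrete measures.

def tv(mu, nu):
    support = set(mu) | set(nu)
    bayes = sum(min(mu.get(x, 0.0), nu.get(x, 0.0)) for x in support)
    return min(sum(mu.values()), sum(nu.values())) - bayes

mu = {"a": 0.5, "b": 0.5}
nu = {"a": 0.8, "b": 0.1, "c": 0.1}
half_l1 = 0.5 * sum(abs(mu.get(x, 0.0) - nu.get(x, 0.0)) for x in set(mu) | set(nu))
assert abs(tv(mu, nu) - half_l1) < 1e-12
```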
For \(\mu \in \mathcal M(\mathcal X)\), \(\operatorname{TV}(\mu , \mu ) = 0\) .
Use Lemma 2.2.1.
For \(\mu , \nu \in \mathcal M(\mathcal X)\), \(\operatorname{TV}(\mu , \nu ) \ge 0\) .
Use Lemma 2.2.2.
For \(\mu , \nu \in \mathcal M(\mathcal X)\), \(\operatorname{TV}(\mu , \nu ) \le \min \{ \mu (\mathcal X), \nu (\mathcal X)\} \) .
Use Lemma 2.2.3.
For \(\mu , \nu \in \mathcal M(\mathcal X)\) and \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) a Markov kernel, \(\operatorname{TV}(\kappa \circ \mu , \kappa \circ \nu ) \le \operatorname{TV}(\mu , \nu )\) .
Apply Theorem 2.2.5.
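In the finite discrete case a Markov kernel is a row-stochastic matrix, and the DPI for TV can be observed numerically (an illustrative sketch; the helper names are not from the formal development):

```python
# DPI sketch: pushing mu and nu through a common Markov kernel
# (a row-stochastic matrix) cannot increase their TV distance.

def tv(p, q):
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))  # probability vectors

def push(kernel, p):
    """(kernel . p)(y) = sum_x p(x) * kernel[x][y]."""
    return [sum(p[x] * kernel[x][y] for x in range(len(p))) for y in range(len(kernel[0]))]

mu = [0.6, 0.3, 0.1]
nu = [0.2, 0.2, 0.6]
kernel = [  # rows sum to 1: a Markov kernel from {0,1,2} to {0,1}
    [0.9, 0.1],
    [0.5, 0.5],
    [0.2, 0.8],
]
assert tv(push(kernel, mu), push(kernel, nu)) <= tv(mu, nu) + 1e-12
```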
TODO: find a way to hide the following dummy lemma
Dummy node to summarize properties of the total variation distance.
For finite measures \(\mu , \nu \), for any measure \(\zeta \) with \(\mu \ll \zeta \) and \(\nu \ll \zeta \),
\[\operatorname{TV}(\mu , \nu ) = \min \{ \mu (\mathcal X), \nu (\mathcal X)\} - \int _x \min \left\{ \frac{d \mu }{d \zeta }(x), \frac{d \nu }{d \zeta }(x)\right\} \, \partial \zeta .\]
This holds in particular for \(\zeta = \mu + \nu \).
Apply Lemma 2.2.7.
For finite measures \(\mu , \nu \), for any measure \(\zeta \) with \(\mu \ll \zeta \) and \(\nu \ll \zeta \),
\[\operatorname{TV}(\mu , \nu ) = \frac{1}{2}\left(\int _x \left\vert \frac{d \mu }{d \zeta }(x) - \frac{d \nu }{d \zeta }(x)\right\vert \, \partial \zeta - \left\vert \mu (\mathcal X) - \nu (\mathcal X)\right\vert \right) .\]
This holds in particular for \(\zeta = \mu + \nu \).
Apply Lemma 2.2.8.
For finite measures \(\mu , \nu \),
Apply Theorem 2.2.9.
For finite measures \(\mu , \nu \),
Apply Corollary 2.2.10.
For finite measures \(\mu , \nu \),
Apply Lemma 2.2.11.
Let \(\mu , \nu \) be two finite measures on \(\mathcal X\) with \(\mu (\mathcal X) \leq \nu (\mathcal X)\).
Then \(\operatorname{TV}(\mu , \nu ) = \sup _{E \text{ event}} \left( \mu (E) - \nu (E) \right)\).
Starting from Lemma 2.2.23,
Let \(\mathcal F = \{ f : \mathcal X \to \mathbb {R} \mid \Vert f \Vert _\infty \le 1\} \). Then for \(\mu , \nu \) finite measures with \(\mu (\mathcal X) = \nu (\mathcal X)\), \(\frac{1}{2} \sup _{f \in \mathcal F} \left( \mu [f] - \nu [f] \right) \le \operatorname{TV}(\mu , \nu )\).
Let \(p,q\) be the respective densities of \(\mu , \nu \) with respect to \(\xi =\mu +\nu \). For any \(f \in \mathcal F\),
The last equality is Lemma 2.2.20.
Let \(\mathcal F = \{ f : \mathcal X \to \mathbb {R} \mid \Vert f \Vert _\infty \le 1\} \). Then for \(\mu , \nu \) finite measures with \(\mu (\mathcal X) = \nu (\mathcal X)\), \(\operatorname{TV}(\mu , \nu ) = \frac{1}{2} \sup _{f \in \mathcal F} \left( \mu [f] - \nu [f] \right)\).
Lemma 2.2.25 gives \(\operatorname{TV}(\mu , \nu ) \ge \frac{1}{2} \sup _{f \in \mathcal F} \left( \mu [f] - \nu [f] \right)\).
The Bayes estimator is a function of the form \(x \mapsto \mathbb {I}\{ x \in E\} \) for some event \(E\), hence it belongs to \(\mathcal F\). This gives equality.
Let \(\mu , \nu \) be two measures on \(\mathcal X\) and let \(E\) be an event. Let \(\mu _E\) and \(\nu _E\) be the two Bernoulli distributions with respective means \(\mu (E)\) and \(\nu (E)\). Then \(\operatorname{TV}(\mu , \nu ) \ge \operatorname{TV}(\mu _E, \nu _E)\).
Apply Corollary 2.2.6.
2.3 f-divergences
2.3.1 Definition and basic properties
Everywhere in this document, the functions used in divergences are assumed measurable, continuous and convex on \([0, +\infty )\). For such a function \(f : \mathbb {R} \to \mathbb {R}\), we write \(f'(\infty ) = \lim _{x \to +\infty } f(x)/x\) (which can be infinite).
For \(\mu , \nu \) two measures, we denote by \(\mu _{\perp \nu }\) the singular part of \(\mu \) with respect to \(\nu \).
Let \(f : \mathbb {R} \to \mathbb {R}\) and let \(\mu , \nu \) be two measures on a measurable space \(\mathcal X\). The f-divergence between \(\mu \) and \(\nu \) is
\[D_f(\mu , \nu ) = \int _x f\left(\frac{d \mu }{d \nu }(x)\right) \, \partial \nu + f'(\infty ) \, \mu _{\perp \nu }(\mathcal X)\]
if \(x \mapsto f\left(\frac{d \mu }{d \nu }(x)\right)\) is \(\nu \)-integrable and \(+\infty \) otherwise.
For \(\mu \) and \(\nu \) two finite measures, \(D_f(\mu , \nu )\) is finite if and only if \(x \mapsto f\left(\frac{d \mu }{d \nu }(x)\right)\) is \(\nu \)-integrable and either \(f'(\infty ) {\lt} \infty \) or \(\mu \ll \nu \).
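In the discrete case the definition, including its singular part, reads \(D_f(\mu , \nu ) = \sum _{\nu (x) > 0} \nu (x) f(\mu (x)/\nu (x)) + f'(\infty ) \, \mu (\{ \nu = 0\} )\). An illustrative sketch with \(f(x) = \vert x - 1\vert \) (convex, continuous, \(f'(\infty ) = 1\)), which recovers \(\sum _x \vert \mu (x) - \nu (x)\vert \); the helper names are not from the formal development:

```python
# Discrete f-divergence with its singular part:
# D_f(mu, nu) = sum_{nu(x)>0} nu(x) f(mu(x)/nu(x)) + f'(inf) * mu({nu = 0}).

def f_div(f, f_prime_inf, mu, nu):
    support = set(mu) | set(nu)
    integral = sum(nu[x] * f(mu.get(x, 0.0) / nu[x]) for x in support if nu.get(x, 0.0) > 0)
    singular = sum(mu[x] for x in mu if nu.get(x, 0.0) == 0)
    # guard so that f'(inf) = inf with zero singular mass contributes 0
    return integral + (f_prime_inf * singular if singular > 0 else 0.0)

mu = {"a": 0.5, "b": 0.5}
nu = {"a": 0.25, "c": 0.75}   # nu puts no mass on "b": mu is not << nu
f = lambda x: abs(x - 1.0)    # f'(inf) = 1

value = f_div(f, 1.0, mu, nu)
l1 = sum(abs(mu.get(x, 0.0) - nu.get(x, 0.0)) for x in set(mu) | set(nu))
assert abs(value - l1) < 1e-12
```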
Divergences associated with basic (transformations of) functions
For \(\nu \) a finite measure, for all \(a \in \mathbb {R}\), \(D_{x \mapsto a}(\mu , \nu ) = a \nu (\mathcal X)\).
Compute the integral.
If \(f(1) = 0\) then \(D_{f}(\mu , \mu ) = 0\).
\(\frac{d \mu }{d \mu }(x) = 1\) almost everywhere and \(f(1) = 0\).
Let \(\mu , \nu \) be two probability measures. Assume \(f'(\infty ) = + \infty \) and \(f(1) = 0\). Then \(D_f(\mu , \nu ) = 0\) if and only if \(\mu = \nu \).
If \(\mu = \nu \) then \(D_f(\mu , \nu ) = 0\) by Lemma 2.3.3. For the other direction use the strict version of Jensen’s inequality.
\(D_{x \mapsto x}(\mu , \nu ) = \mu (\mathcal X)\).
Compute the integral: its value is \((\frac{d\mu }{d\nu }\cdot \nu )(\mathcal X)\). Then \(D_{x\mapsto x}(\mu , \nu ) = (\frac{d\mu }{d\nu }\cdot \nu )(\mathcal X) + \mu _{\perp \nu }(\mathcal X) = \mu (\mathcal X)\).
For all \(a \ge 0\), \(D_{a f}(\mu , \nu ) = a D_{f}(\mu , \nu )\).
Linearity of the integral.
\(D_{f + g}(\mu , \nu ) = D_f(\mu , \nu ) + D_g(\mu , \nu )\).
Linearity of the integral.
For finite measures \(\mu \) and \(\nu \) with \(\mu (\mathcal X) = \nu (\mathcal X)\), for all \(a \in \mathbb {R}\), \(D_{f + a(x - 1)}(\mu , \nu ) = D_{f}(\mu , \nu )\).
Let \(\mu \) and \(\nu \) be two measures on \(\mathcal X\) and let \(g : \mathcal X \to \mathcal Y\) be a measurable embedding. Then \(D_f(g_* \mu , g_* \nu ) = D_f(\mu , \nu )\).
\(D_f(\mu , \nu ) = D_{x \mapsto xf(1/x)}(\nu , \mu )\) .
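This duality can be checked numerically in the discrete case with full common support. With \(f(x) = x \log x\) (the KL case), \(x f(1/x) = -\log x\); the sketch below is illustrative only:

```python
# Duality D_f(mu, nu) = D_{x f(1/x)}(nu, mu), discrete full-support sketch.
import math

def f_div(f, mu, nu):  # assumes mu << nu with nu(x) > 0 everywhere
    return sum(nu[x] * f(mu[x] / nu[x]) for x in nu)

mu = {0: 0.3, 1: 0.7}
nu = {0: 0.6, 1: 0.4}

f = lambda x: x * math.log(x)
g = lambda x: x * f(1.0 / x)   # = -log(x)

assert abs(f_div(f, mu, nu) - f_div(g, nu, mu)) < 1e-12
```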
\((\mu , \nu ) \mapsto D_f(a \mu + b \nu , \nu )\) is an \(f\)-divergence for the function \(x \mapsto f(ax + b)\).
\((\mu , \nu ) \mapsto D_f(\mu , a \mu + b \nu )\) is an \(f\)-divergence for the function \(x \mapsto (ax+b)f\left(\frac{x}{ax+b}\right)\) .
\((\mu , \nu ) \mapsto D_f(\nu , a \mu + b \nu )\) is an \(f\)-divergence for the function \(x \mapsto (ax+b)f\left(\frac{1}{ax+b}\right)\) .
Upper and lower bounds on \(D_f\)
Let \(\mu _1, \mu _2\) and \(\nu \) be finite measures on \(\mathcal X\), with \(\mu _1 \ll \nu \) and \(\mu _2 \perp \nu \). Then \(D_f(\mu _1 + \mu _2, \nu ) = D_f(\mu _1, \nu ) + \mu _2(\mathcal X) f'(\infty )\).
\(\frac{d(\mu _1 + \mu _2)}{d \nu } = \frac{d \mu _1}{d \nu }\) a.e. and \((\mu _1 + \mu _2)_{\perp \nu } = \mu _2\).
Let \(\mu _1, \mu _2, \nu \) be three finite measures on \(\mathcal X\) with \(\mu _1 \ll \nu \) and \(\mu _2 \ll \nu \). Then \(D_f(\mu _1 + \mu _2, \nu ) \le D_f(\mu _1, \nu ) + \mu _2(\mathcal X) f'(\infty )\).
Let \(\mu _1, \mu _2, \nu \) be three finite measures on \(\mathcal X\). Then \(D_f(\mu _1 + \mu _2, \nu ) \le D_f(\mu _1, \nu ) + \mu _2(\mathcal X) f'(\infty )\).
From Lemma 2.3.14, then Lemma 2.3.15,
Let \(\mu \) and \(\nu \) be two finite measures on \(\mathcal X\). Then \(D_f(\mu , \nu ) = D_f(\frac{d\mu }{d\nu }\cdot \nu , \nu ) + f'(\infty ) \mu _{\perp \nu }(\mathcal X)\).
Apply Lemma 2.3.14 to \(\frac{d\mu }{d\nu }\cdot \nu \) and \(\mu _{\perp \nu }\).
Let \(\mu \) be a finite measure and \(\nu \) be a probability measure on the same space \(\mathcal X\), such that \(\mu \ll \nu \). Then \(f(\mu (\mathcal X)) \le D_f(\mu , \nu )\).
Since \(\mu \ll \nu \), the \(f\)-divergence reduces to the integral part. Then by Jensen's inequality, since \(\nu \) is a probability measure,
\[f(\mu (\mathcal X)) = f\left(\int _x \frac{d \mu }{d \nu }(x) \, \partial \nu \right) \le \int _x f\left(\frac{d \mu }{d \nu }(x)\right) \partial \nu = D_f(\mu , \nu ) .\]
Let \(\mu \) be a finite measure and \(\nu \) be a probability measure on the same space \(\mathcal X\). Then \(f(\mu (\mathcal X)) \le D_f(\mu , \nu )\).
By convexity, then Lemma 2.3.18 and finally Lemma 2.3.17,
Let \(\mu , \nu \) be two probability measures. If \(f(1) = 0\) then \(D_f(\mu , \nu ) \ge 0\).
Apply Lemma 2.3.19 and use \(f(\mu (\mathcal X)) = f(1) = 0\).
TODO: find a way to hide the following dummy lemma
Dummy node to summarize properties of \(f\)-divergences.
2.3.2 Conditional f-divergence
TODO: replace the definition by \(D(\mu \otimes \kappa , \mu \otimes \eta )\), which allows for a more general proof of the “conditioning increases divergence” inequality.
Let \(f : \mathbb {R} \to \mathbb {R}\), \(\mu \) a measure on \(\mathcal X\) and \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) two Markov kernels from \(\mathcal X\) to \(\mathcal Y\). The conditional f-divergence between \(\kappa \) and \(\eta \) with respect to \(\mu \) is
\[D_f(\kappa , \eta \mid \mu ) = \int _x D_f(\kappa (x), \eta (x)) \, \partial \mu \]
if \(x \mapsto D_f(\kappa (x), \eta (x))\) is \(\mu \)-integrable and \(+\infty \) otherwise.
Let \(\mu \) be a measure on \(\mathcal X\) and \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) two Markov kernels. If \(f(1) = 0\) then \(D_f(\kappa , \eta \mid \mu ) \ge 0\).
Apply Lemma 2.3.20.
Let \(\mu , \nu \) be measures on \(\mathcal X\), where \(\mu \) is finite, and let \(\xi \) be a finite measure on \(\mathcal Y\). Then \(D_f(x \mapsto \mu , x \mapsto \nu \mid \xi ) = D_f(\mu , \nu ) \xi (\mathcal X)\).
Let \(\mu \) be a finite measure on \(\mathcal X\) and \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) be two finite kernels from \(\mathcal X\) to \(\mathcal Y\). Then \(p \mapsto f \left(\frac{d(\mu \otimes \kappa )}{d(\mu \otimes \eta )}(p)\right)\) is \((\mu \otimes \eta )\)-integrable iff
\(x \mapsto D_f(\kappa (x), \eta (x))\) is \(\mu \)-integrable and
for \(\mu \)-almost all \(x\), \(y \mapsto f \left( \frac{d\kappa (x)}{d\eta (x)}(y) \right)\) is \(\eta (x)\)-integrable.
Since \(p \mapsto f \left(\frac{d(\mu \otimes \kappa )}{d(\mu \otimes \eta )}(p) \right)\) is measurable, the integrability condition w.r.t. \(\mu \otimes \eta \) is equivalent to
\(x \mapsto \int _y \left\Vert f \left( \frac{d\kappa (x)}{d\eta (x)}(y) \right) \right\Vert \partial \eta (x)\) is \(\mu \)-integrable and
for \(\mu \)-almost all \(x\), \(y \mapsto f \left( \frac{d\kappa (x)}{d\eta (x)}(y) \right)\) is \(\eta (x)\)-integrable.
It suffices to use convexity to show that integrability of \(x \mapsto \int _y f \left( \frac{d\kappa (x)}{d\eta (x)}(y) \right) \partial \eta (x)\) implies integrability of \(x \mapsto \int _y \left\Vert f \left( \frac{d\kappa (x)}{d\eta (x)}(y) \right) \right\Vert \partial \eta (x)\).
TODO
Let \(\mu \) be a finite measure on \(\mathcal X\) and let \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) be two finite kernels, where \(\kappa \) is a Markov kernel. Then \(D_f(\kappa , \eta \mid \mu ) \ne \infty \) if and only if
for \(\mu \)-almost all \(x\), \(y \mapsto f \left( \frac{d\kappa (x)}{d\eta (x)}(y) \right)\) is \(\eta (x)\)-integrable,
\(x \mapsto \int _y f \left( \frac{d\kappa (x)}{d\eta (x)}(y) \right) \partial \eta (x)\) is \(\mu \)-integrable,
either \(f'(\infty ) {\lt} \infty \) or for \(\mu \)-almost all \(x\), \(\kappa (x) \ll \eta (x)\).
Let \(\mu \) be a finite measure on \(\mathcal X\) and let \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) be two finite kernels, such that \(\kappa (x) \ne 0\) for all \(x\). Then \(D_f(\mu \otimes \kappa , \mu \otimes \eta ) \ne \infty \iff D_f(\kappa , \eta \mid \mu ) \ne \infty \).
Let \(\mu \) be a finite measure on \(\mathcal X\) and let \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) be two finite kernels, such that \(\kappa (x) \ne 0\) for all \(x\). Then \(D_f(\mu \otimes \kappa , \mu \otimes \eta ) = D_f(\kappa , \eta \mid \mu )\).
By Lemma 2.3.26, the two divergences are finite under exactly the same conditions. We may thus assume the integrability properties of Lemma 2.3.25, under which both are finite.
TODO: the following proof assumes \(\kappa (x) \ll \eta (x)\) for \(\mu \)-almost all \(x\). Describe the general case.
By Lemma B.2.9 and Corollary B.1.2,
TODO: find a way to hide the following dummy lemma
Dummy node to summarize properties of conditional \(f\)-divergences.
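The equality between the joint and conditional \(f\)-divergences can be observed numerically in the discrete, full-support case (an illustrative sketch; the names are not from the formal development):

```python
# D_f(mu ⊗ kappa, mu ⊗ eta) = D_f(kappa, eta | mu), discrete full-support sketch.
import math

def f_div(f, p, q):  # densities as dicts with identical support
    return sum(q[x] * f(p[x] / q[x]) for x in q)

f = lambda x: x * math.log(x)  # KL-type convex function

mu = {0: 0.4, 1: 0.6}
kappa = {0: {"u": 0.5, "v": 0.5}, 1: {"u": 0.2, "v": 0.8}}  # Markov kernels
eta = {0: {"u": 0.9, "v": 0.1}, 1: {"u": 0.5, "v": 0.5}}

joint_k = {(x, y): mu[x] * kappa[x][y] for x in mu for y in kappa[x]}
joint_e = {(x, y): mu[x] * eta[x][y] for x in mu for y in eta[x]}

lhs = f_div(f, joint_k, joint_e)                         # D_f(mu ⊗ kappa, mu ⊗ eta)
rhs = sum(mu[x] * f_div(f, kappa[x], eta[x]) for x in mu)  # D_f(kappa, eta | mu)
assert abs(lhs - rhs) < 1e-12
```

The identity is immediate here because the density of \(\mu \otimes \kappa \) with respect to \(\mu \otimes \eta \) is \((x, y) \mapsto \frac{d\kappa (x)}{d\eta (x)}(y)\), and the outer sum is exactly the \(\mu \)-average.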
2.3.3 Integral representation of \(f\)-divergences
We prove that any \(f\)-divergence is an integral over priors of statistical informations, in which the measure depends on \(f\). The main references for this part are [LV06] and [Lie12].
Let \(f: \mathbb {R} \to \mathbb {R}\) be a convex function. Then its right derivative \(f'_+(x) \coloneqq \lim _{y \downarrow x}\frac{f(y) - f(x)}{y - x}\) is a Stieltjes function (a monotone, right-continuous function) and it defines a measure \(\gamma _f\) on \(\mathbb {R}\) by \(\gamma _f((x,y]) \coloneqq f'_+(y) - f'_+(x)\). [Lie12] calls \(\gamma _f\) the curvature measure of \(f\).
For \(a \ge 0\) and \(f: \mathbb {R} \to \mathbb {R}\) a convex function, the curvature measure of \(af\) is \(\gamma _{af} = a \gamma _f\) .
For \(f,g: \mathbb {R} \to \mathbb {R}\) two convex functions, the curvature measure of \(f+g\) is \(\gamma _{f+g} = \gamma _f + \gamma _g\) .
If \(f\) and \(g\) are two Stieltjes functions with associated measures \(\mu _f\) and \(\mu _g\), and \(f\) is continuous on \([a, b]\), then
\[\int _{(a,b]} f \, \partial \mu _g + \int _{(a,b]} g \, \partial \mu _f = f(b)g(b) - f(a)g(a) .\]
For \(f: \mathbb {R} \to \mathbb {R}\) a convex function and \(x,y \in \mathbb {R}\),
\[f(y) = f(x) + f'_+(x)(y - x) + \int _{(x,y]} (y - t) \, \partial \gamma _f(t) \quad \text{if } x \le y \: ,\]
\[f(y) = f(x) + f'_+(x)(y - x) + \int _{(y,x]} (t - y) \, \partial \gamma _f(t) \quad \text{if } y \le x \: .\]
Let \(\Lambda \) be the Lebesgue measure and let \(x {\lt} y\). Since \(f\) has right derivative \(f'_+\) in \((x,y)\) and \(f'_+\) is integrable on that interval,
We now integrate by parts, using Theorem 2.3.31. \(\Lambda \) is the measure obtained from the Stieltjes function \(z \mapsto z - y\).
Putting the two equalities together, we get the result for \(x {\lt} y\). The proof for \(x {\gt} y\) is similar, and for \(x = y\) both sides of both equations are zero.
For \(a,b \in (0, +\infty )\) let \(\phi _{a,b} : \mathbb {R} \to \mathbb {R}\) be the function defined by
\[\phi _{a,b}(x) = \begin{cases} \max \{ a x - b, 0\} & \text{if } b > a \: , \\ \max \{ b - a x, 0\} & \text{if } b \le a \: . \end{cases}\]
For \(f: \mathbb {R} \to \mathbb {R}\) a convex function, for all \(x \in \mathbb {R}\),
\[f(x) = f(1) + f'_+(1)(x - 1) + \int _y \phi _{1,y}(x) \, \partial \gamma _f .\]
By Lemma 2.3.32, for \(x {\gt} 1\),
That final integral is equal to \(\int _y \phi _{1,y}(x) \partial \gamma _f\) (without the restriction to \((1, +\infty )\)) because \(\phi _{1,y}(x)\) is zero outside of \((1, +\infty )\) for \(x \ge 1\).
The case of \(x {\lt} 1\) is similar.
The curvature measure of the function \(\phi _{a,b}\) is \(\gamma _{\phi _{a,b}} = a\delta _{b/a}\) , where \(\delta _x\) is the Dirac measure at \(x\).
The following theorem is inspired by [LV06, Lie12] but gives a different integral representation.
For two finite measures \(\mu , \nu \in \mathcal M(\mathcal X)\),
\[D_f(\mu , \nu ) = f(1) \, \nu (\mathcal X) + f'_+(1) \left(\mu (\mathcal X) - \nu (\mathcal X)\right) + \int _y D_{\phi _{1,y}}(\mu , \nu ) \, \partial \gamma _f .\]
For probability measures, \(D_{\phi _{1,y}}(\mu , \nu ) = E_y(\mu , \nu )\) , but these functions can differ if \(\mu \) and \(\nu \) are finite measures with different masses.
We prove the result for \(f(1) = 0\) and \(f'_+(1) = 0\) for simplicity of exposition.
From Corollary 2.3.33 we get that \(f(x) = \int _y \phi _{1,y}(x) \partial \gamma _f\) . From this and the theorem of Fubini,
For the mutually singular part of the divergence \(D_f\),
The last integral is equal to \(\int _{y \in \mathbb {R}} \phi _{1,y}'(\infty ) \partial \gamma _f\) since \(\phi _{1,y}'(\infty ) = 0\) for \(y \le 1\). Finally,
TODO: define \(\Gamma _f\).
TODO: in the lemma below, \(I_\pi (\mu , \nu )\) is the correct value for probability measures, but in general it should be replaced by \(D_{\phi _{\pi , 1 - \pi }}(\mu , \nu )\) . These functions can differ if \(\mu \) and \(\nu \) are finite measures with different masses.
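For probability measures and a twice-differentiable \(f\) normalized so that \(f(1) = 0\) and \(f'_+(1) = 0\), the curvature measure has density \(f''\) and the representation reduces to \(D_f(\mu , \nu ) = \int _0^\infty E_\gamma (\mu , \nu ) f''(\gamma ) \, d\gamma \). The sketch below checks this numerically for \(f(x) = x \log x - x + 1\) (so \(f'' (x) = 1/x\) and \(D_f\) is the KL divergence); it is an illustration under those smoothness assumptions, not part of the formal development.

```python
# Numerical check of D_f = ∫ E_gamma f''(gamma) dgamma for the KL-normalized
# f(x) = x log x - x + 1, on a pair of two-point probability measures.
import math

mu = [0.3, 0.7]
nu = [0.6, 0.4]

def E(gamma):  # extended hockey-stick divergence E_gamma = I_{(1, gamma)}
    if gamma >= 1:
        return sum(max(p - gamma * q, 0.0) for p, q in zip(mu, nu))
    return sum(max(gamma * q - p, 0.0) for p, q in zip(mu, nu))

kl = sum(p * math.log(p / q) for p, q in zip(mu, nu))

# Trapezoidal integration of E(gamma) * f''(gamma) with f'' = 1/gamma; the
# integrand vanishes outside [min_x p/q, max_x p/q] = [0.5, 1.75].
a, b, n = 0.5, 1.75, 20_000
h = (b - a) / n
integral = sum(E(a + i * h) / (a + i * h) * (h if 0 < i < n else h / 2)
               for i in range(n + 1))

assert abs(integral - kl) < 1e-6
```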
2.3.4 Data-processing inequality
This section contains three proofs of the data-processing inequality for \(f\)-divergences. That inequality states that for \(\mu , \nu \) two measures on \(\mathcal X\) and \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) a Markov kernel, \(D_f(\kappa \circ \mu , \kappa \circ \nu ) \le D_f(\mu , \nu )\).
A first proof uses the convexity of \(f\) through Jensen’s inequality for conditional expectations. We first prove the DPI for measurable functions (or deterministic kernels) then use it to extend to the general case.
A second proof deals with the general case and uses Radon-Nikodym derivatives of kernels. This restricts its applicability to measurable spaces that satisfy particular countability assumptions.
The last proof uses the representation of \(f\)-divergence as an integral over statistical (risk-based) divergences, then uses the data-processing inequality for those divergences.
Currently we don’t have Lean proofs of Jensen’s inequality for conditional expectations (used in proof 1) or of an integration by parts lemma that is used in proof 3. Proof 2 gives the weakest result since it is restricted to spaces that satisfy countability assumptions, but it is the only one for which our Lean proof is sorry-free.
We first remark that it is sufficient to prove the DPI under an absolute continuity assumption on the measures.
Let \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) be a Markov kernel and \(\nu \in \mathcal M(\mathcal X)\) be a finite measure. Suppose that for all finite measures \(\mu \in \mathcal M(\mathcal X)\) with \(\mu \ll \nu \), \(D_f(\kappa \circ \mu , \kappa \circ \nu ) \le D_f(\mu , \nu )\). Then the same is true without the absolute continuity hypothesis.
We use \(\kappa \circ \mu = \kappa \circ (\frac{d\mu }{d\nu }\cdot \nu ) + \kappa \circ \mu _{\perp \nu }\). By Lemma 2.3.16,
\[D_f(\kappa \circ \mu , \kappa \circ \nu ) \le D_f\left(\kappa \circ (\tfrac{d\mu }{d\nu }\cdot \nu ), \kappa \circ \nu \right) + (\kappa \circ \mu _{\perp \nu })(\mathcal Y) \, f'(\infty ) .\]
Since \(\kappa \) is a Markov kernel, \((\kappa \circ \mu _{\perp \nu })(\mathcal Y) = \mu _{\perp \nu }(\mathcal X)\).
By hypothesis, since \(\frac{d\mu }{d\nu }\cdot \nu \ll \nu \), \(D_f\left(\kappa \circ (\frac{d\mu }{d\nu }\cdot \nu ), \kappa \circ \nu \right) \le D_f\left(\frac{d\mu }{d\nu }\cdot \nu , \nu \right)\). We get
\[D_f(\kappa \circ \mu , \kappa \circ \nu ) \le D_f\left(\tfrac{d\mu }{d\nu }\cdot \nu , \nu \right) + \mu _{\perp \nu }(\mathcal X) \, f'(\infty ) = D_f(\mu , \nu ) \: ,\]
where the last equality is Lemma 2.3.17.
The next lemma implies that in order to prove the DPI, it is sufficient to prove \(D_f(\kappa \circ \mu , \kappa \circ \nu ) \le D_f(\mu \otimes \kappa , \nu \otimes \kappa )\).
Let \(\mu , \nu \) be two measures on \(\mathcal X\) and let \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) be a Markov kernel. Then \(D_f(\mu \otimes \kappa , \nu \otimes \kappa ) = D_f(\mu , \nu )\).
By Lemma B.2.5 and the fact that \((\mu \otimes \kappa )_{\perp \nu \otimes \kappa } = \mu _{\perp \nu } \otimes \kappa \) (Lemma TODO),
Let \(\mu , \nu \) be two measures on \(\mathcal X\) and let \(\xi \) be a measure on \(\mathcal Y\). Then \(D_f(\mu \times \xi , \nu \times \xi ) = D_f(\mu , \nu )\).
Apply Lemma 2.3.38 with \(\kappa \) the constant kernel with value \(\xi \).
DPI: proof using Jensen’s inequality for the conditional expectation
This proof lifts the DPI on deterministic kernels to a DPI on all kernels. We first prove the DPI in the special case of composition with a deterministic kernel, which is the same as taking the map of the measures by a function.
Let \(\mu , \nu \in \mathcal M (\mathcal X)\) be two finite measures and let \(g : \mathcal X \to \mathcal Y\) be a measurable function. Then \(D_f(g_* \mu , g_* \nu ) \le D_f(\mu , \nu )\).
By Lemma 2.3.37, it is sufficient to prove the inequality with the additional hypothesis \(\mu \ll \nu \) (hence also \(g_* \mu \ll g_*\nu \)). In that case,
By Lemma B.3.1, \(\nu \)-almost everywhere,
Using that expression, then Jensen’s inequality for the conditional expectation (Theorem C.0.2), we get
The integral of a conditional expectation is equal to the integral of the function, hence
This is equal to \(D_f(\mu , \nu )\).
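A discrete numerical illustration of this map DPI (the helper names are illustrative only): pushing both measures forward through a non-injective function cannot increase their KL divergence.

```python
# Map DPI sketch: D_f(g_* mu, g_* nu) <= D_f(mu, nu) for the KL divergence.
import math

def kl(p, q):  # discrete distributions as dicts
    return sum(m * math.log(m / q[x]) for x, m in p.items() if m > 0)

def pushforward(g, p):
    out = {}
    for x, m in p.items():
        out[g(x)] = out.get(g(x), 0.0) + m
    return out

mu = {0: 0.5, 1: 0.25, 2: 0.25}
nu = {0: 0.1, 1: 0.6, 2: 0.3}
g = lambda x: min(x, 1)  # merges the points 1 and 2

assert kl(pushforward(g, mu), pushforward(g, nu)) <= kl(mu, nu) + 1e-12
```

Here the inequality is strict, since merging the points \(1\) and \(2\) discards information about the likelihood ratio inside the merged set.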
Let \(\mu , \nu \) be two finite measures on \(\mathcal X\) and let \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) be a Markov kernel. Then \(D_f(\kappa \circ \mu , \kappa \circ \nu ) \le D_f(\mu , \nu )\).
Let \(\pi _Y : \mathcal X \times \mathcal Y \to \mathcal Y\) be the projection \(\pi _Y((x,y)) = y\). Then \(\kappa \circ \mu = \pi _{Y*}(\mu \otimes \kappa )\) and similarly for \(\nu \). We can use the DPI for that function (Theorem 2.3.40), then Lemma 2.3.38,
DPI: proof using derivatives of kernels
This proof does not use Jensen’s inequality for the conditional expectation, but only for the ordinary expectation. It starts by establishing \(D_f(\mu , \nu ) \le D_f(\mu \otimes \kappa , \nu \otimes \eta )\) for Markov kernels \(\kappa \) and \(\eta \), then deduces the DPI. Indeed, if the measurable spaces are standard Borel, we can write \(\mu \otimes \kappa = ((\kappa \circ \mu ) \otimes \kappa ^\dagger _\mu )_\leftrightarrow \) for a Markov kernel \(\kappa ^\dagger _\mu \) (see Definition 48; \((...)_\leftrightarrow \) is notation for composition with the map that swaps the coordinates) and then use the inequality to get
The main drawback of this approach is that it requires the Bayesian inverse to exist, which is not always the case; it does exist, for example, when the measurable spaces are standard Borel.
The inequality \(D_f(\mu , \nu ) \le D_f(\mu \otimes \kappa , \nu \otimes \eta )\) is obtained by describing the Radon-Nikodym derivative \(\frac{d(\mu \otimes \kappa )}{d(\nu \otimes \eta )}\) and using convexity properties of the function \(f\).
Let \(\mu , \nu \) be two finite measures on \(\mathcal X\) and let \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) be two Markov kernels, with either \(\mathcal X\) countable or \(\mathcal{Y}\) countably generated. Then \(D_f(\mu , \nu ) \le D_f(\mu \otimes \kappa , \nu \otimes \eta )\).
TODO: the following proof assumes \(\mu \ll \nu \) and \(\kappa (x) \ll \eta (x)\) for \(\nu \)-almost all \(x\). Describe the general case.
Since \(f\) is convex, by Jensen’s inequality,
Since \(\kappa (x) \ll \eta (x)\) for \(\nu \)-almost all \(x\), \(\int _y \frac{d \kappa (x)}{d \eta (x)}(y) \, \partial \eta (x) = \int _y 1 \, \partial \kappa (x) = 1\) almost everywhere. We have obtained
Let \(\mu \) and \(\nu \) be two measures on \(\mathcal X \times \mathcal Y\) where \(\mathcal Y\) is standard Borel, and let \(\mu _X, \nu _X\) be their marginals on \(\mathcal X\). Then \(D_f(\mu _X, \nu _X) \le D_f(\mu , \nu )\). Similarly, for \(\mathcal X\) standard Borel and \(\mu _Y, \nu _Y\) the marginals on \(\mathcal Y\), \(D_f(\mu _Y, \nu _Y) \le D_f(\mu , \nu )\).
Let \(\mu , \nu \) be two finite measures on a standard Borel space \(\mathcal X\) and let \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) be two finite kernels. Then \(D_f(\kappa \circ \mu , \eta \circ \nu ) \le D_f(\mu \otimes \kappa , \nu \otimes \eta )\).
By definition, \(\kappa \circ \mu \) is the marginal of \(\mu \otimes \kappa \) (a measure on \(\mathcal X \times \mathcal Y\)) on \(\mathcal Y\), and similarly for the other measure. Hence by Theorem 2.3.43, \(D_f(\kappa \circ \mu , \eta \circ \nu ) \le D_f(\mu \otimes \kappa , \nu \otimes \eta )\).
Let \(\mu \) be a measure on a standard Borel space \(\mathcal X\) and let \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) be two finite kernels, such that \(\kappa (x) \ne 0\) for all \(x\). Then \(D_f(\kappa \circ \mu , \eta \circ \mu ) \le D_f(\mu \otimes \kappa , \mu \otimes \eta ) = D_f(\kappa , \eta \mid \mu )\).
Let \(\mu , \nu \) be two measures on \(\mathcal X\) and let \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) be a Markov kernel, where both \(\mathcal X\) and \(\mathcal Y\) are standard Borel. Then \(D_f(\kappa \circ \mu , \kappa \circ \nu ) \le D_f(\mu , \nu )\).
DPI: proof using the integral representation
This proof uses the fact that every \(f\)-divergence is an integral over \(\gamma \ge 0\) of \(f\)-divergences for the functions \(\phi _{1,\gamma }\) of Definition 20. It then suffices to prove the DPI for those particular \(f\)-divergences. Those are equal to the \(E_\gamma \) divergence, up to a term that is invariant by composition with a Markov kernel. Finally, we already know that the \(E_\gamma \) divergence satisfies the DPI, which completes the proof.
Let \(a,b \in [0, +\infty )\) and let \(\mu , \nu \) be two measures on \(\mathcal X\). Then
\[D_{\phi _{a,b}}(\mu , \nu ) = \mathcal I_{(a,b)}(\mu , \nu ) + \max \{ \text{sign}(b-a)\left(a \mu (\mathcal X) - b \nu (\mathcal X)\right), 0\} \]
in which \(\text{sign}(b-a)\) is \(1\) if \(b-a {\gt} 0\) and \(-1\) if \(b-a \le 0\).
If \(a \le b\),
Similar computations lead to the result for \(b \le a\).
Let \(a,b \in [0, +\infty )\) and let \(\mu , \nu \) be two measures on \(\mathcal X\).
Let \(a,b \in [0, +\infty )\). Let \(\mu , \nu \) be two finite measures on \(\mathcal X\) and let \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) be a Markov kernel. Then \(D_{\phi _{a,b}}(\kappa \circ \mu , \kappa \circ \nu ) \le D_{\phi _{a,b}}(\mu , \nu )\).
By Corollary 2.3.48,
The statistical information satisfies the data-processing inequality (Theorem 2.2.5) and the other terms are invariant by composition with a Markov kernel, hence \(D_{\phi _{a,b}}\) also satisfies the data-processing inequality.
Let \(\mu , \nu \) be two finite measures on \(\mathcal X\) and let \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) be a Markov kernel. Then \(D_f(\kappa \circ \mu , \kappa \circ \nu ) \le D_f(\mu , \nu )\).
By Theorem 2.3.35,
The first two terms are unchanged by composition with a Markov kernel. It thus suffices to prove the data-processing inequality for the divergences \(D_{\phi _{1, \gamma }}\). This was done in Lemma 2.3.49.
2.3.5 Sigma-algebras
For \(\mathcal A\) a sub-\(\sigma \)-algebra and \(\mu \) a measure, we write \(\mu _{| \mathcal A}\) for the restriction of \(\mu \) to that \(\sigma \)-algebra. The measure \(\mu _{| \mathcal A}\) is equal to the map of \(\mu \) by the identity, seen as a function to \(\mathcal X\) equipped with the \(\sigma \)-algebra \(\mathcal A\). We can thus get a DPI for sub-\(\sigma \)-algebras from the DPI for maps.
Let \(\mu , \nu \) be two finite measures on \(\mathcal X\) and let \(\mathcal A\) be a sub-\(\sigma \)-algebra of \(\mathcal X\). Then \(D_f(\mu _{| \mathcal A}, \nu _{| \mathcal A}) \le D_f(\mu , \nu )\).
The measure \(\mu _{| \mathcal A}\) is equal to the map of \(\mu \) by the identity, seen as a function to \(\mathcal X\) equipped with the \(\sigma \)-algebra \(\mathcal A\). That function is measurable since \(\mathcal A\) is a sub-\(\sigma \)-algebra. We can apply Theorem 2.3.40.
Let \(\mu , \nu \) be two finite measures on \(\mathcal X\). Then \(\sup _{\mathcal A \text{ finite}} D_f(\mu _{| \mathcal A}, \nu _{| \mathcal A}) = D_f(\mu , \nu )\).
Let \(\mu , \nu \in \mathcal M(\mathcal X)\) be finite measures with \(\mu \ll \nu \) and let \(g : \mathcal X \to \mathcal Y\) be a measurable function. Denote by \(g^* \mathcal Y\) the comap of the \(\sigma \)-algebra on \(\mathcal Y\) by \(g\). Then \(D_f(g_* \mu , g_* \nu ) = D_f(\mu _{| g^* \mathcal Y}, \nu _{| g^* \mathcal Y})\) .
Using Lemma B.4.2,
2.3.6 Consequences of the data-processing inequality
Direct applications
Let \(\mu , \nu \) be two finite measures on \(\mathcal X\) and let \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) be two Markov kernels. Then \(D_f(\mu , \nu ) \le D_f(\mu \otimes \kappa , \nu \otimes \eta )\).
Since \(\kappa \) is a Markov kernel, \(\mu \) is the composition of \(\mu \otimes \kappa \) and the deterministic kernel for the function \((x,y) \mapsto x\). The same applies for \(\nu \) and \(\nu \otimes \eta \). The result is then an application of the data-processing inequality, Theorem 2.3.50.
Let \(\mu , \nu \) be two finite measures on \(\mathcal X\) and let \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) be two finite kernels. Then \(D_f(\kappa \circ \mu , \eta \circ \nu ) \le D_f(\mu \otimes \kappa , \nu \otimes \eta )\).
\(\kappa \circ \mu \) is the composition of \(\mu \otimes \kappa \) and the deterministic kernel for the function \((x,y) \mapsto y\). The same applies for \(\eta \circ \nu \) and \(\nu \otimes \eta \). The result is then an application of the data-processing inequality, Theorem 2.3.50.
Let \(\mu \) be a finite measure on \(\mathcal X\) and let \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) be two finite kernels. Then \(D_f(\kappa \circ \mu , \eta \circ \mu ) \le D_f(\mu \otimes \kappa , \mu \otimes \eta ) = D_f(\kappa , \eta \mid \mu )\).
This is a particular case of Theorem 2.3.55.
Let \(\mu \) and \(\nu \) be two measures on \(\mathcal X \times \mathcal Y\), and let \(\mu _X, \nu _X\) be their marginals on \(\mathcal X\). Then \(D_f(\mu _X, \nu _X) \le D_f(\mu , \nu )\). Similarly, for \(\mu _Y, \nu _Y\) the marginals on \(\mathcal Y\), \(D_f(\mu _Y, \nu _Y) \le D_f(\mu , \nu )\).
The measure \(\mu _X\) is the composition of \(\mu \) and the deterministic kernel for the function \((x,y) \mapsto x\), and similarly \(\mu _Y\) is the composition of \(\mu \) and a deterministic kernel. The results are both applications of the data-processing inequality, Theorem 2.3.50.
Let \(\mu , \nu \) be two measures on \(\mathcal X\) and let \(\kappa : \mathcal X \rightsquigarrow (\mathcal X \times \mathcal Y)\) be a Markov kernel such that for all \(x\), \((\kappa (x))_X = \delta _x\). Then \(D_f(\kappa \circ \mu , \kappa \circ \nu ) = D_f(\mu , \nu )\).
Data-processing and mean values
In this section \(d_f(p, q)\) denotes the \(f\)-divergence between two Bernoulli distributions with means \(p\) and \(q\).
Let \(\mu , \nu \) be two measures on \(\mathcal X\) and let \(E\) be an event. Then \(D_f(\mu , \nu ) \ge d_f(\mu (E), \nu (E))\).
Use the deterministic kernel \(\kappa : \mathcal X \rightsquigarrow \{ 0, 1\} \) with \(\kappa (x) = \delta _1 \mathbb {I}\{ x \in E\} + \delta _0 \mathbb {I}\{ x \notin E\} \) in Theorem 2.3.50.
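As a quick numerical sanity check (not part of the formal development), the lemma can be illustrated for the KL divergence (the \(f\)-divergence with \(f(x) = x \log x\)) on a three-point space; the distributions and the event below are arbitrary choices.

```python
import math

def kl(p, q):
    # KL divergence between discrete distributions (assumes support(p) within support(q))
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

mu = [0.5, 0.3, 0.2]
nu = [0.2, 0.3, 0.5]
E = {0, 1}  # an arbitrary event

# Push both measures through the deterministic kernel x -> indicator(x in E):
mu_E = sum(p for i, p in enumerate(mu) if i in E)
nu_E = sum(q for i, q in enumerate(nu) if i in E)
d_f = kl([mu_E, 1 - mu_E], [nu_E, 1 - nu_E])  # KL between the two Bernoulli images

# Data processing can only lose information:
assert d_f <= kl(mu, nu)
```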
For all \(y \in [0,1]\), \(x \mapsto d_f(x, y)\) is convex and attains a minimum at \(x = y\).
\(f'_+\) is non-decreasing, hence we get that \((x \mapsto d_f(x, y))'_+\) is also non-decreasing and \(x \mapsto d_f(x, y)\) is convex.
For \(x \le y\), \(x/y \le 1 \le (1-x)/(1-y)\), hence \(f'_+(\frac{x}{y}) \le f'_+(\frac{1 - x}{1 - y})\) and \((x \mapsto d_f(x, y))'_+(x) \le 0\). The opposite holds for \(x \ge y\).
Let \(\mu , \nu \in \mathcal P([0,1])\). Then
Let \(u\) be the uniform distribution on \([0,1]\). Then \(D_f(\mu , \nu ) = D_f(\mu \times u, \nu \times u)\) (by Lemma 2.3.58). Let \(g : [0,1] \times [0,1] \to \{ 0,1\} \) be the measurable function with \(g(x, y) = \mathbb {I}\{ y \le x\} \). We also write \(g\) for the corresponding deterministic kernel. By the data-processing inequality (Theorem 2.3.50),
Those two last measures are Bernoulli distributions with means \(\mu [X]\) and \(\nu [X]\).
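The resulting comparison between \(D_f\) of measures on \([0,1]\) and \(d_f\) of their means can be checked numerically; this sketch again uses the KL divergence and two arbitrary distributions supported on \(\{ 0.1, 0.9\} \subset [0,1]\).

```python
import math

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Two distributions on the points {0.1, 0.9} of [0,1] (arbitrary choice)
points = [0.1, 0.9]
mu = [0.5, 0.5]
nu = [0.8, 0.2]

mean_mu = sum(x * p for x, p in zip(points, mu))  # mean of X under mu
mean_nu = sum(x * q for x, q in zip(points, nu))  # mean of X under nu

# d_f between the Bernoulli distributions with those means:
d_f = kl([mean_mu, 1 - mean_mu], [mean_nu, 1 - mean_nu])
assert d_f <= kl(mu, nu)
```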
Let \(\mu , \nu \in \mathcal P(\mathcal X)\) and let \(\kappa : \mathcal X \rightsquigarrow [0,1]\). Then
Let \(\pi , \xi \in \mathcal P(\Theta )\) and \(P, Q : \Theta \rightsquigarrow \mathcal X\). Suppose that the loss \(\ell '\) takes values in \([0,1]\). Then
Suppose first that \(\mathcal R_\pi ^Q \ge \mathcal R_\xi ^P\). Let \(\hat{y}_B\) be a Bayes estimator for the estimation task \((P, y, \ell ')\) with prior \(\xi \). If it does not exist, the proof can be adapted by taking an estimator with risk \(\varepsilon \)-close to the Bayes risk for any \(\varepsilon {\gt} 0\). By Corollary 2.3.62,
By definition of the Bayes risk as an infimum,
Since \(x \mapsto d_f(x, y)\) is monotone on \([y, 1]\),
This concludes the proof for \(\mathcal R_\pi ^Q \ge \mathcal R_\xi ^P\). If that inequality is reversed, we proceed similarly but take the Bayes estimator for \(Q\).
2.3.7 Convexity
The function \((\mu , \nu ) \mapsto D_f(\mu , \nu )\) is convex.
Let \(\mu _0, \mu _1, \nu _0, \nu _1\) be four measures. Let \(\lambda \in [0,1]\). Let \(\xi \) be the probability measure on \(\{ 0,1\} \) with \(\xi (\{ 1\} ) = \lambda \). Let \(\kappa \) be the kernel \(\{ 0,1\} \rightsquigarrow \mathcal X\) defined by \(\kappa (0) = \mu _0\) and \(\kappa (1) = \mu _1\). Let \(\eta \) be the kernel \(\{ 0,1\} \rightsquigarrow \mathcal X\) defined by \(\eta (0) = \nu _0\) and \(\eta (1) = \nu _1\).
By Theorem 2.3.56, \(D_f(\kappa \circ \xi , \eta \circ \xi ) \le D_f(\xi \otimes \kappa , \xi \otimes \eta )\).
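The joint convexity obtained this way can be observed numerically; the following sketch checks the convexity inequality for the KL divergence with arbitrary two-point measures and mixing weight.

```python
import math

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

mu0, nu0 = [0.7, 0.3], [0.5, 0.5]
mu1, nu1 = [0.2, 0.8], [0.4, 0.6]
lam = 0.3  # mixing weight, corresponding to xi({1}) = lam

mu_mix = [lam * a + (1 - lam) * b for a, b in zip(mu1, mu0)]
nu_mix = [lam * a + (1 - lam) * b for a, b in zip(nu1, nu0)]

lhs = kl(mu_mix, nu_mix)
rhs = lam * kl(mu1, nu1) + (1 - lam) * kl(mu0, nu0)
assert lhs <= rhs  # joint convexity of (mu, nu) -> KL(mu, nu)
```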
2.3.8 Variational representations
TODO
2.3.9 Comparisons between divergences
We collate here general tools that can be used to compare \(D_f\) and \(D_g\) for two functions \(f\) and \(g\). Specific divergences are discussed in the next sections.
If \(f \le g\), then \(D_f(\mu , \nu ) \le D_g(\mu , \nu )\).
Monotonicity of the integral and of \(f \mapsto f'(\infty )\).
If \(f(1) = 0\), \(g(1) = 0\), \(f'(1) = 0\), \(g'(1) = 0\), and both \(f\) and \(g\) have a second derivative, then
By Taylor with integral remainder, if \(f'' \le \beta g''\),
Then use Lemma 2.3.65.
Suppose that \(f(1) = g(1) = 0\) and that \(f'(1) = g'(1) = 0\), and that \(g{\gt}0\) on \((0,1) \cup (1, +\infty )\). Suppose that the space \(\mathcal X\) has at least two disjoint non-empty measurable sets. Then
Remark: if \(\mathcal X\) does not have two disjoint non-empty measurable sets then its measurable sets are only \(\{ \emptyset , \mathcal X\} \) and the ratio on the left is \(0/0\).
TODO: proof done for measurable singletons?
By Lemma 2.3.65, \(\sup _{\mu , \nu \in \mathcal P(\mathcal X)} \frac{D_f(\mu , \nu )}{D_g(\mu , \nu )} \le \sup _{t {\gt} 0, t \ne 1} \frac{f(t)}{g(t)}\). Let’s prove the other inequality.
Let \(x\) and \(y\) be two points in disjoint non-empty measurable sets. For \(t \in (0,1) \cup (1, +\infty )\) and \(\varepsilon \in (0, 1/t]\) let \(\mu _{t, \varepsilon } = t \varepsilon \delta _x + (1 - t \varepsilon ) \delta _y\) and \(\nu _\varepsilon = \varepsilon \delta _x + (1 - \varepsilon ) \delta _y\) .
Let \(\alpha = \frac{\varepsilon (1 - t)}{1 - \varepsilon }\). Then since \(f(1) = g(1) = 0\),
As \(\varepsilon \to 0\), \(\alpha \) also tends to 0 and
Bretagnolle-Huber inequalities
Let \(\mu , \nu \in \mathcal P(\mathcal X)\). If \(f(1) = 0\),
Let \(V: x \mapsto \max \left\{ 0, \frac{d\mu }{d\nu }(x) - 1\right\} \) and \(W: x \mapsto \max \left\{ 0, 1 - \frac{d\mu }{d\nu }(x)\right\} \) . At each point, one of \(1+V\) and \(1-W\) is equal to \(\frac{d\mu }{d\nu }\) and the other is equal to 1. Since \(f(1) = 0\), we obtain \(f(\frac{d\mu }{d\nu }) = f(1+V) + f(1-W)\) . The \(f\)-divergence is then
By convexity of \(f\), \(f(1+V) + f'(\infty )\mu _{\perp \nu }(\mathcal X) \ge f(1+V + \mu _{\perp \nu }(\mathcal X))\) . Using convexity again, we get
By Lemma 2.2.21, \(\nu [V] + \mu _{\perp \nu }(\mathcal X) = \nu [W] = \operatorname{TV}(\mu , \nu )\). We obtain the first equality.
To get the second equality, remark that \(D_f(\mu , \nu ) = D_{x\mapsto xf(1/x)}(\nu , \mu )\) (Theorem TODO), apply the first inequality to that function and use the symmetry of \(\operatorname{TV}\).
TODO: move this somewhere after the definition of \(KL\).
For \(\mu , \nu \in \mathcal P(\mathcal X)\),
Use the second inequality of Theorem 2.3.68, for \(f: x \mapsto x \log x\) .
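For probability measures, the resulting bound for KL can be put in the classical Bretagnolle–Huber form \(\operatorname{TV}(\mu , \nu ) \le \sqrt{1 - e^{-\operatorname{KL}(\mu , \nu )}}\). A numerical sanity check (the specific constants in the document's statement are elided above, so this sketch checks the classical form only):

```python
import math

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def tv(p, q):
    # total variation distance, with the normalization TV in [0, 1]
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

# Bretagnolle-Huber: TV(mu, nu) <= sqrt(1 - exp(-KL(mu, nu)))
for mu, nu in [([0.9, 0.1], [0.1, 0.9]), ([0.6, 0.4], [0.4, 0.6])]:
    assert tv(mu, nu) <= math.sqrt(1 - math.exp(-kl(mu, nu)))
```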
TODO: move this somewhere after the definition of \(\operatorname{H}_\alpha \).
Let \(\mu , \nu \in \mathcal P(\mathcal X)\). For \(\alpha {\gt} 0\),
Use the second inequality of Theorem 2.3.68, for \(f: x \mapsto \frac{x^{1+\alpha } - 1}{\alpha }\) to get
We then neglect one term: \((1 + \operatorname{TV}(\mu , \nu ))^{-\alpha } \ge 0\).
Joint range
TODO [ HV11 ]
2.3.10 Statistical divergences
On probability measures, the statistical information \(\mathcal I_\xi \) is an \(f\)-divergence for the function \(\phi _{\xi _0, \xi _1}\) of Definition 20.
This is a reformulation of Theorem 2.2.9.
On probability measures, the total variation distance \(\operatorname{TV}\) is an \(f\)-divergence for the function \(x \mapsto \frac{1}{2}\vert x - 1 \vert \).
This is a reformulation of Lemma 2.2.22.
2.4 Kullback-Leibler divergence
Let \(\mu , \nu \) be two measures on \(\mathcal X\). The Kullback-Leibler divergence between \(\mu \) and \(\nu \) is
if \(\mu \ll \nu \) and \(x \mapsto \log \frac{d \mu }{d \nu }(x)\) is \(\mu \)-integrable and \(+\infty \) otherwise.
\(\operatorname{KL}(\mu , \nu ) = D_f(\mu , \nu )\) for \(f: x \mapsto x \log x\).
Simple computation.
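The computation in question is the pointwise identity \(\nu \left[f\left(\frac{d\mu }{d\nu }\right)\right] = \mu \left[\log \frac{d\mu }{d\nu }\right]\) for \(f : x \mapsto x \log x\), which can be checked numerically for discrete measures with \(\mu \ll \nu \):

```python
import math

mu = [0.5, 0.3, 0.2]
nu = [0.25, 0.25, 0.5]

f = lambda x: x * math.log(x) if x > 0 else 0.0

# D_f(mu, nu) = nu[f(dmu/dnu)] with f(x) = x log x ...
D_f = sum(q * f(p / q) for p, q in zip(mu, nu))
# ... equals the defining expression KL(mu, nu) = mu[log dmu/dnu]
KL = sum(p * math.log(p / q) for p, q in zip(mu, nu) if p > 0)
assert abs(D_f - KL) < 1e-12
```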
Let \(\mu \) be a measure on \(\mathcal X\) and \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) be two kernels. The conditional Kullback-Leibler divergence between \(\kappa \) and \(\eta \) with respect to \(\mu \) is
if \(x \mapsto \operatorname{KL}(\kappa (x), \eta (x))\) is \(\mu \)-integrable and \(+\infty \) otherwise.
\(\operatorname{KL}(\kappa , \eta \mid \mu ) = D_f(\kappa , \eta \mid \mu )\) for \(f: x \mapsto x \log x\).
Simple computation.
2.4.1 Properties inherited from f-divergences
Since \(\operatorname{KL}\) is an f-divergence, every inequality for f-divergences can be translated to \(\operatorname{KL}\).
\(\operatorname{KL}(\mu , \mu ) = 0\).
Let \(\mu , \nu \) be finite measures on \(\mathcal X\), \(\xi \) be a finite measure on \(\mathcal Y\). Then \(\operatorname{KL}(x \mapsto \mu , x \mapsto \nu \mid \xi ) = \operatorname{KL}(\mu , \nu ) \xi (\mathcal X)\).
Let \(\mu \) and \(\nu \) be two measures on \(\mathcal X \times \mathcal Y\), and let \(\mu _X, \nu _X\) be their marginals on \(\mathcal X\). Then \(\operatorname{KL}(\mu _X, \nu _X) \le \operatorname{KL}(\mu , \nu )\). Similarly, for \(\mu _Y, \nu _Y\) the marginals on \(\mathcal Y\), \(\operatorname{KL}(\mu _Y, \nu _Y) \le \operatorname{KL}(\mu , \nu )\).
Apply Theorem 2.3.57.
Let \(\mu , \nu \) be two measures on \(\mathcal X\) and let \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) be a Markov kernel. Then \(\operatorname{KL}(\kappa \circ \mu , \kappa \circ \nu ) \le \operatorname{KL}(\mu , \nu )\).
Apply Theorem 2.3.50.
Let \(\mu , \nu \) be two finite measures on \(\mathcal X\). Then \(\operatorname{KL}(\mu , \nu ) \ge \mu (\mathcal X) \log \frac{\mu (\mathcal X)}{\nu (\mathcal X)}\).
Let \(\mu , \nu \) be two probability measures. Then \(\operatorname{KL}(\mu , \nu ) \ge 0\).
Apply Lemma 2.4.7 and use \(\mu (\mathcal X) = \nu (\mathcal X)\).
Let \(\mu , \nu \) be two probability measures. Then \(\operatorname{KL}(\mu , \nu ) = 0\) if and only if \(\mu = \nu \).
\((\mu , \nu ) \mapsto \operatorname{KL}(\mu , \nu )\) is convex.
Apply Theorem 2.3.64
Let \(\mu , \nu \) be two finite measures on \(\mathcal X\) and \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) be two Markov kernels such that \(\mu \otimes \kappa \ll \nu \otimes \eta \). Then \(p \mapsto \log \frac{d \mu \otimes \kappa }{d \nu \otimes \eta }(p)\) is \(\mu \otimes \kappa \)-integrable if and only if the following hold:
\(x \mapsto \log \frac{d \mu }{d \nu }(x)\) is \(\mu \)-integrable
\(x \mapsto \int _y \log \frac{d \kappa (x)}{d \eta (x)}(y) \partial \kappa (x)\) is \(\mu \)-integrable
for \(\mu \)-almost all \(x\), \(y \mapsto \log \frac{d \kappa (x)}{d \eta (x)}(y)\) is \(\kappa (x)\)-integrable
Note that from the hypothesis it easily follows that \(\mu \ll \nu \) and \(\mu \)-a.e. \(\kappa (x) \ll \eta (x)\).
Applying Lemma 2.3.24 we know that \(\log \frac{d \mu \otimes \kappa }{d \nu \otimes \eta }\) is \(\nu \otimes \eta \)-integrable iff the following hold:
for \(\nu \)-almost all \(x\), \(y \mapsto \frac{d \mu }{d \nu }(x) \frac{d \kappa (x)}{d \eta (x)}(y) \log \left(\frac{d \mu }{d \nu }(x) \frac{d \kappa (x)}{d \eta (x)}(y) \right) \) is \(\eta (x)\)-integrable
\(x \mapsto \int _y \frac{d \mu }{d \nu }(x) \frac{d \kappa (x)}{d \eta (x)}(y) \log \left(\frac{d \mu }{d \nu }(x) \frac{d \kappa (x)}{d \eta (x)}(y) \right) \partial \eta (x)\) is \(\nu \)-integrable
(i.\(\Leftarrow \)) 3. implies that \(\nu \)-a.e. \(y \mapsto \frac{d \mu }{d \nu }(x) \log \frac{d \kappa (x)}{d \eta (x)}(y)\) is \(\kappa (x)\)-integrable, hence \(y \mapsto \frac{d \mu }{d \nu }(x) \frac{d \kappa (x)}{d \eta (x)}(y) \log \frac{d \kappa (x)}{d \eta (x)}(y)\) is \(\eta (x)\)-integrable \(\nu \)-a.e. Moreover \(y \mapsto \frac{d \kappa (x)}{d \eta (x)}(y)\) is \(\eta (x)\)-integrable, then, since \(\frac{d \mu }{d \nu }(x) \log \frac{d \mu }{d \nu }(x)\) is constant in \(y\), \(y \mapsto \frac{d \mu }{d \nu }(x) \frac{d \kappa (x)}{d \eta (x)}(y) \log \frac{d \mu }{d \nu }(x)\) is \(\eta (x)\)-integrable. This allows us to conclude that \(y \mapsto \frac{d \mu }{d \nu }(x) \frac{d \kappa (x)}{d \eta (x)}(y) \log \frac{d \kappa (x)}{d \eta (x)}(y) + \frac{d \mu }{d \nu }(x) \frac{d \kappa (x)}{d \eta (x)}(y) \log \frac{d \mu }{d \nu }(x) = \frac{d \mu }{d \nu }(x) \frac{d \kappa (x)}{d \eta (x)}(y) \log \left(\frac{d \mu }{d \nu }(x) \frac{d \kappa (x)}{d \eta (x)}(y) \right)\) is \(\eta (x)\)-integrable \(\nu \)-a.e.
(ii.\(\Leftarrow \)) 1. implies that \(x \mapsto \frac{d \mu }{d \nu }(x) \log \frac{d \mu }{d \nu }(x)\) is \(\nu \)-integrable. 2. implies that \(x \mapsto \frac{d \mu }{d \nu }(x) \int _y \frac{d \kappa (x)}{d \eta (x)}(y) \log \frac{d \kappa (x)}{d \eta (x)}(y) \partial \eta (x)\) is \(\nu \)-integrable. Therefore
is \(\nu \)-integrable.
(\(\Rightarrow \)1.) This follows from the hypothesis by applying a more general lemma for \(f\)-divergences; see ProbabilityTheory.integrable_f_rnDeriv_of_integrable_compProd’ in the Lean code.
(\(\Rightarrow \)3.) i. implies that \(\mu \)-a.e. \(y \mapsto \frac{d \kappa (x)}{d \eta (x)}(y) \log \left(\frac{d \mu }{d \nu }(x) \frac{d \kappa (x)}{d \eta (x)}(y) \right)\) is \(\eta (x)\)-integrable, hence \(y \mapsto \log \left(\frac{d \mu }{d \nu }(x) \frac{d \kappa (x)}{d \eta (x)}(y) \right) = \log \frac{d \mu }{d \nu }(x) + \log \frac{d \kappa (x)}{d \eta (x)}(y)\) is \(\kappa (x)\)-integrable. Then, since \(\log \frac{d \mu }{d \nu }(x)\) is constant in \(y\) and \(\kappa (x)\) is a finite measure, \(y \mapsto \log \frac{d \kappa (x)}{d \eta (x)}(y)\) is \(\kappa (x)\)-integrable.
(\(\Rightarrow \)2.) ii. implies that \(x \mapsto \int _y \frac{d \kappa (x)}{d \eta (x)}(y) \log \left(\frac{d \mu }{d \nu }(x) \frac{d \kappa (x)}{d \eta (x)}(y) \right) \partial \eta (x) = \int _y \log \left(\frac{d \mu }{d \nu }(x) \frac{d \kappa (x)}{d \eta (x)}(y) \right) \partial \kappa (x)\) is \(\mu \)-integrable. Thanks to 3. we can split the integral:
Then using 1. we can conclude that \(x \mapsto \int _y \log \frac{d \kappa (x)}{d \eta (x)}(y) \partial \kappa (x)\) is \(\mu \)-integrable.
2.4.2 Chain rule and tensorization
Let \(\mu , \nu \) be two finite measures on \(\mathcal X\) and \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) two Markov kernels. Then \(\operatorname{KL}(\mu \otimes \kappa , \nu \otimes \eta ) = \operatorname{KL}(\mu , \nu ) + \operatorname{KL}(\mu \otimes \kappa , \mu \otimes \eta )\).
By Lemma B.2.1, \(\mu \otimes \kappa \ll \nu \otimes \eta \iff \left( \mu \ll \nu \ \wedge \ \mu \otimes \kappa \ll \mu \otimes \eta \right)\). Hence if any of the two sides of the equality is infinite due to a lack of absolute continuity, the other is as well. We now suppose that \(\mu \otimes \kappa \ll \nu \otimes \eta \), \(\mu \ll \nu \) and \(\mu \otimes \kappa \ll \mu \otimes \eta \).
TODO: deal with the non-integrable case, without using Lemma 2.4.11 that requires countably generated sigma-algebras.
Suppose then that all three log-likelihood ratios are integrable. We use the chain rule for Radon-Nikodym derivatives (Theorem B.2.7).
Let \(\mu , \nu \) be two finite measures on \(\mathcal X\) and \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) two Markov kernels, with either \(\mathcal X\) countable or \(\mathcal{Y}\) countably generated. Then \(\operatorname{KL}(\mu \otimes \kappa , \nu \otimes \eta ) = \operatorname{KL}(\mu , \nu ) + \operatorname{KL}(\kappa , \eta \mid \mu )\).
Handle the case where \(\mu \otimes \kappa \) is not absolutely continuous with respect to \(\nu \otimes \eta \) first, then the case where \(p \mapsto \log \frac{d \mu \otimes \kappa }{d \nu \otimes \eta }(p)\) is not \(\mu \otimes \kappa \)-integrable using Lemma 2.4.11.
For the case where \(\mu \otimes \kappa \ll \nu \otimes \eta \) and \(p \mapsto \log \frac{d \mu \otimes \kappa }{d \nu \otimes \eta }(p)\) is \(\mu \otimes \kappa \)-integrable use Lemma B.2.11 and Corollary B.1.2 in a computation:
Let \(\mu , \nu \) be two finite measures on \(\mathcal X\) and \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) two Markov kernels with Bayesian inverses \(\kappa _\mu ^\dagger \) and \(\eta _\nu ^\dagger \). Then
Let \(\mu , \nu \) be two finite measures on \(\mathcal X \times \mathcal Y\), where \(\mathcal Y\) is standard Borel. Then \(\operatorname{KL}(\mu , \nu ) = \operatorname{KL}(\mu _X, \nu _X) + \operatorname{KL}(\mu _{Y|X}, \nu _{Y|X} \mid \mu _X)\).
Write \(\mu = \mu _X \otimes \mu _{Y|X}\) and \(\nu = \nu _X \otimes \nu _{Y|X}\), then use Theorem 2.4.13.
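For discrete joint measures the chain rule can be verified directly; in the sketch below the joint distributions on \(\{ 0,1\} \times \{ 0,1\} \) are arbitrary, and the conditional KL term is the integral form valid in this discrete setting.

```python
import math

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Joint distributions on {0,1} x {0,1}, as nested lists mu[x][y]
mu = [[0.2, 0.3], [0.1, 0.4]]
nu = [[0.25, 0.25], [0.25, 0.25]]

mu_X = [sum(row) for row in mu]  # marginal on X
nu_X = [sum(row) for row in nu]
mu_cond = [[v / sum(row) for v in row] for row in mu]  # mu_{Y|X}
nu_cond = [[v / sum(row) for v in row] for row in nu]  # nu_{Y|X}

joint = kl([v for row in mu for v in row], [v for row in nu for v in row])
chain = kl(mu_X, nu_X) + sum(
    mu_X[x] * kl(mu_cond[x], nu_cond[x]) for x in range(len(mu)))
assert abs(joint - chain) < 1e-12  # KL(mu, nu) = KL(mu_X, nu_X) + KL(mu_{Y|X}, nu_{Y|X} | mu_X)
```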
Let \(\mu _1, \nu _1\) be finite measures on \(\mathcal X\) and \(\mu _2, \nu _2\) probability measures on \(\mathcal{Y}\). Then
For \(\mu _1\) a probability measure on \(\mathcal X\), \(\nu _1\) a finite measure on \(\mathcal{X}\) and \(\mu _2, \nu _2\) two probability measures on \(\mathcal Y\),
This is a particular case of Lemma 2.4.16.
Let \(I\) be a finite index set. Let \((\mu _i)_{i \in I}, (\nu _i)_{i \in I}\) be probability measures on spaces \((\mathcal X_i)_{i \in I}\). Then
Induction on the size of \(I\), using Theorem 2.4.17.
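The tensorization identity admits a direct numerical check for a product of two discrete pairs (arbitrary choices below):

```python
import math

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

mu1, nu1 = [0.7, 0.3], [0.5, 0.5]
mu2, nu2 = [0.4, 0.6], [0.9, 0.1]

prod = lambda a, b: [x * y for x in a for y in b]  # product measure
lhs = kl(prod(mu1, mu2), prod(nu1, nu2))
rhs = kl(mu1, nu1) + kl(mu2, nu2)
assert abs(lhs - rhs) < 1e-12  # KL of products is the sum of the KLs
```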
Let \(\mu , \nu \) be two measures and \(E\) an event. Let \(\mu _{|E}\) be the measure defined by \(\mu _{|E}(A) = \frac{\mu (A \cap E)}{\mu (E)}\) and define \(\nu _{|E}\), \(\mu _{| E^c}\) and \(\nu _{| E^c}\) similarly. Let \(\operatorname{kl}(p,q) = p\log \frac{p}{q} + (1-p)\log \frac{1-p}{1-q}\) be the Kullback-Leibler divergence between Bernoulli distributions with means \(p\) and \(q\). Then
Let \(\mu _E\) be the Bernoulli distribution with \(\mu _E(\{ 1\} ) = \mu (E)\), and define \(\nu _E\) similarly. Let \(\kappa : \{ 0,1\} \rightsquigarrow \mathcal X\) be the kernel defined by \(\kappa (1) = \mu _{|E}\) and \(\kappa (0) = \mu _{|E^c}\). Define a kernel \(\eta \) similarly for \(\nu \). Then \(\mu = \kappa \circ \mu _E\) and \(\nu = \eta \circ \nu _E\).
\(\mu _E \otimes \kappa \) is equal to the composition of \(\mu \) and the deterministic kernel given by the function \(f : x \mapsto (\mathbb {I}_E(x), x)\). By Lemma 2.3.58, that gives \(\operatorname{KL}(\mu _E \otimes \kappa , \nu _E \otimes \eta ) = \operatorname{KL}(\mu , \nu )\).
Using the chain rule (Theorem 2.4.13),
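For probability measures the resulting decomposition reads \(\operatorname{KL}(\mu , \nu ) = \operatorname{kl}(\mu (E), \nu (E)) + \mu (E) \operatorname{KL}(\mu _{|E}, \nu _{|E}) + \mu (E^c) \operatorname{KL}(\mu _{|E^c}, \nu _{|E^c})\); this standard form can be checked numerically on a discrete space (arbitrary distributions and event):

```python
import math

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

mu = [0.4, 0.1, 0.3, 0.2]
nu = [0.1, 0.2, 0.3, 0.4]
E = {0, 1}
Ec = set(range(4)) - E

def restrict(p, S):
    # conditional distribution given S, and the mass of S
    mass = sum(p[i] for i in S)
    return [p[i] / mass for i in sorted(S)], mass

(mu_E, pE), (nu_E, qE) = restrict(mu, E), restrict(nu, E)
(mu_Ec, _), (nu_Ec, _) = restrict(mu, Ec), restrict(nu, Ec)

bern = kl([pE, 1 - pE], [qE, 1 - qE])  # kl(mu(E), nu(E))
decomposed = bern + pE * kl(mu_E, nu_E) + (1 - pE) * kl(mu_Ec, nu_Ec)
assert abs(decomposed - kl(mu, nu)) < 1e-12
```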
Let \(\mu , \nu \) be two measures and \(E\) an event. Then \(\mu (E)\log \frac{\mu (E)}{\nu (E)} \le \mu \left[\mathbb {I}(E)\log \frac{d \mu }{d \nu }\right]\) .
Apply Lemma 2.4.7 to the measures \(\mu \) and \(\nu \) restricted to \(E\). \(\mu \left[\mathbb {I}(E)\log \frac{d \mu }{d \nu }\right]\) is the KL between those two distributions.
TODO: find a way to hide the following dummy lemma
Dummy node to summarize properties of the Kullback-Leibler divergence.
2.5 Hellinger alpha-divergences
Let \(\mu , \nu \) be two measures on \(\mathcal X\). The Hellinger divergence of order \(\alpha \in [0,+\infty )\) between \(\mu \) and \(\nu \) is
with \(f_\alpha : x \mapsto \frac{x^{\alpha } - 1}{\alpha - 1}\).
For \(\alpha \in [0, 1)\) and finite measures \(\mu , \nu \), \(\operatorname{H}_\alpha (\mu , \nu ) {\lt} \infty \).
For \(\alpha \in (0,1)\cup (1, \infty )\), \(\mu \) a finite measure and \(\nu \) a probability measure, if \(\operatorname{H}_\alpha (\mu , \nu ) {\lt} \infty \) then
For \(\alpha \in (0,1)\cup (1, \infty )\), \(\mu , \nu \) two sigma-finite measures,
For \(\alpha \in (0, 1)\) and finite measures \(\mu , \nu \) with \(\mu (\mathcal X) = \nu (\mathcal X)\), \((1 - \alpha ) \operatorname{H}_\alpha (\mu , \nu ) = \alpha \operatorname{H}_{1 - \alpha }(\nu , \mu )\).
Use Lemma 2.5.3.
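The skew-symmetry relation can be checked numerically for discrete probability measures, using the closed form \(\operatorname{H}_\alpha (\mu , \nu ) = \frac{1}{\alpha - 1}\left(\sum _i p_i^\alpha q_i^{1-\alpha } - 1\right)\) valid in this setting:

```python
import math

def hellinger(alpha, p, q):
    # H_alpha(mu, nu) = nu[f_alpha(dmu/dnu)] with f_alpha(x) = (x^alpha - 1)/(alpha - 1)
    return (sum(pi**alpha * qi**(1 - alpha) for pi, qi in zip(p, q)) - 1) / (alpha - 1)

mu = [0.7, 0.3]
nu = [0.4, 0.6]
alpha = 0.3

lhs = (1 - alpha) * hellinger(alpha, mu, nu)
rhs = alpha * hellinger(1 - alpha, nu, mu)
assert abs(lhs - rhs) < 1e-12  # (1 - alpha) H_alpha(mu, nu) = alpha H_{1-alpha}(nu, mu)
```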
Let \(\mu \) be a measure on \(\mathcal X\) and \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) be two kernels. The conditional Hellinger divergence of order \(\alpha \in (0,+\infty ) \backslash \{ 1\} \) between \(\kappa \) and \(\eta \) conditionally to \(\mu \) is
for \(f_\alpha : x \mapsto \frac{x^{\alpha } - 1}{\alpha - 1}\).
2.5.1 Properties inherited from f-divergences
Since \(\operatorname{H}_\alpha \) is an f-divergence, every inequality for f-divergences can be translated to \(\operatorname{H}_\alpha \).
Let \(\alpha {\gt} 0\), \(\mu , \nu \) be two finite measures on \(\mathcal X\) and let \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) be a Markov kernel. Then \(\operatorname{H}_\alpha (\kappa \circ \mu , \kappa \circ \nu ) \le \operatorname{H}_\alpha (\mu , \nu )\).
Apply Theorem 2.3.50.
Let \(\mu , \nu \) be two probability measures. Then \(\operatorname{H}_\alpha (\mu , \nu ) \ge 0\).
Apply Lemma 2.3.20.
\((\mu , \nu ) \mapsto \operatorname{H}_\alpha (\mu , \nu )\) is convex.
Apply Theorem 2.3.64
TODO: find a way to hide the following dummy lemma
Dummy node to summarize properties of the Hellinger \(\alpha \)-divergence.
2.6 Rényi divergences
Let \(\mu , \nu \) be two measures on \(\mathcal X\). The Rényi divergence of order \(\alpha \in \mathbb {R}\) between \(\mu \) and \(\nu \) is
For \(\mu \) a sigma-finite measure and \(\nu \) a finite measure
For \(\alpha \in [0, 1)\) and finite measures \(\mu , \nu \),
For \(\alpha \in (0,1)\cup (1, \infty )\) and finite measures \(\mu , \nu \), if \(\left(\frac{d \mu }{d \nu }\right)^\alpha \) is integrable with respect to \(\nu \) and \(\mu \ll \nu \) then
For \(\alpha \in (0,1)\cup (1, \infty )\) and finite measures \(\mu , \nu \), if \(\left(\frac{d \mu }{d \nu }\right)^\alpha \) is integrable with respect to \(\nu \) and \(\mu \ll \nu \) then
For \(\alpha \in (0, 1)\) and finite measures \(\mu , \nu \) with \(\mu (\mathcal X) = \nu (\mathcal X)\),
Use Lemma 2.5.4.
The cumulant generating function of \(\log \frac{d\mu }{d\nu }\) under \(\nu \) is \(\alpha \mapsto (\alpha - 1) R_\alpha (\mu , \nu )\).
Unfold the definitions, using Lemma 2.6.3 for the Rényi divergence.
Set \(\alpha {\gt} 0\). If \(R_{1+\alpha }(\mu , \nu ) {\lt} \infty \), the cumulant generating function of \(\log \frac{d\mu }{d\nu }\) under \(\mu \) has value \(\alpha R_{1+\alpha }(\mu , \nu )\) at \(\alpha \).
Unfold the definitions, using Lemma 2.6.4 for the Rényi divergence.
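The cumulant-generating-function identity can be checked numerically for discrete probability measures with full support, where \(R_\alpha (\mu , \nu ) = \frac{1}{\alpha - 1} \log \sum _i p_i^\alpha q_i^{1-\alpha }\):

```python
import math

def renyi(alpha, p, q):
    return math.log(sum(pi**alpha * qi**(1 - alpha) for pi, qi in zip(p, q))) / (alpha - 1)

mu = [0.7, 0.3]
nu = [0.4, 0.6]
alpha = 2.0

# CGF of log(dmu/dnu) under nu, evaluated at alpha:
cgf = math.log(sum(q * math.exp(alpha * math.log(p / q)) for p, q in zip(mu, nu)))
assert abs(cgf - (alpha - 1) * renyi(alpha, mu, nu)) < 1e-12
```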
Let \(\mu \) be a measure on \(\mathcal X\) and \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) be two kernels. The conditional Rényi divergence of order \(\alpha \in (0,+\infty ) \backslash \{ 1\} \) between \(\kappa \) and \(\eta \) conditionally to \(\mu \) is
TODO: the reason for this definition is that it is equal to \(R_\alpha (\mu \otimes \kappa , \mu \otimes \eta )\), which is the generic conditional divergence definition, and also that this is the definition for which we have a chain rule.
2.6.1 Properties inherited from f-divergences
Since Rényi divergences are monotone transformations of f-divergences, every inequality for f-divergences can be translated to Rényi divergences.
Let \(\mu , \nu \) be two finite measures on \(\mathcal X\) and let \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) be a Markov kernel. Then \(R_\alpha (\kappa \circ \mu , \kappa \circ \nu ) \le R_\alpha (\mu , \nu )\).
The function \(x \mapsto \frac{1}{\alpha - 1}\log (1 + (\alpha - 1)x)\) is non-decreasing and \(D_f\) satisfies the DPI (Theorem 2.3.50), hence we get the DPI for \(R_\alpha \).
Let \(\mu , \nu \) be two measures on \(\mathcal X\) and let \(E\) be an event. Let \(\mu _E\) and \(\nu _E\) be the two Bernoulli distributions with respective means \(\mu (E)\) and \(\nu (E)\). Then \(R_\alpha (\mu , \nu ) \ge R_\alpha (\mu _E, \nu _E)\).
By Corollary 2.3.59, \(D_f(\mu , \nu ) \ge D_f(\mu _E, \nu _E)\), hence \(R_\alpha (\mu , \nu ) \ge R_\alpha (\mu _E, \nu _E)\).
2.6.2 Comparisons between orders
Let \(\mu , \nu \) be two finite measures. \(R_0(\mu , \nu ) = \lim _{\alpha \downarrow 0} R_\alpha (\mu , \nu )\).
Let \(\mu , \nu \) be two finite measures. \(R_1(\mu , \nu ) = \lim _{\alpha \uparrow 1} R_\alpha (\mu , \nu )\).
Let \(\mu , \nu \) be two finite measures such that there exists \(\alpha {\gt} 1\) with \(R_\alpha (\mu , \nu )\) finite. Then \(R_1(\mu , \nu ) = \lim _{\alpha \downarrow 1} R_\alpha (\mu , \nu )\).
Let \(\mu , \nu \) be two finite measures. Then \(\alpha \mapsto R_\alpha (\mu , \nu )\) is nondecreasing on \([0, + \infty )\).
Let \(\mu , \nu \) be two finite measures. Then \(\alpha \mapsto R_\alpha (\mu , \nu )\) is continuous on \([0, 1]\) and on \([0, \sup \{ \alpha \mid R_\alpha (\mu , \nu ) {\lt} \infty \} )\).
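The monotonicity in the order \(\alpha \) can be observed numerically for discrete probability measures with full support; the value at \(\alpha = 1\) is the KL divergence.

```python
import math

def renyi(alpha, p, q):
    if abs(alpha - 1) < 1e-9:  # R_1 is the KL divergence
        return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return math.log(sum(pi**alpha * qi**(1 - alpha) for pi, qi in zip(p, q))) / (alpha - 1)

mu = [0.7, 0.3]
nu = [0.4, 0.6]
values = [renyi(a, mu, nu) for a in [0.25, 0.5, 0.75, 1.0, 1.5, 2.0, 3.0]]
assert all(x <= y + 1e-12 for x, y in zip(values, values[1:]))  # nondecreasing in alpha
```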
2.6.3 The tilted measure
Let \(\mu , \nu \) be two measures on \(\mathcal X\) and let \(\alpha \in (0, +\infty ) \backslash \{ 1\} \). Let \(p = \frac{d \mu }{d (\mu + \nu )}\) and \(q = \frac{d \nu }{d (\mu + \nu )}\). We define a measure \(\mu ^{(\alpha , \nu )}\), absolutely continuous with respect to \(\mu + \nu \) with density
For \(\alpha \in (0,1)\), \(\mu ^{(\alpha , \nu )} \ll \mu \) and \(\mu ^{(\alpha , \nu )} \ll \nu \).
\(\frac{d \mu ^{(\alpha , \nu )}}{d \nu } = \left(\frac{d\mu }{d\nu }\right)^\alpha e^{-(\alpha - 1) R_\alpha (\mu , \nu )}\) , \(\nu \)-a.e., and \(\frac{d \mu ^{(\alpha , \nu )}}{d \mu } = \left(\frac{d\nu }{d\mu }\right)^{1 - \alpha } e^{-(\alpha - 1) R_\alpha (\mu , \nu )}\) , \(\mu \)-a.e..
Let \(\mu , \nu , \xi \) be three measures on \(\mathcal X\) and let \(\alpha \in (0, 1)\). Then
Unfold definitions and compute?
Let \(\mu , \nu , \xi \) be three measures on \(\mathcal X\) and let \(\alpha \in (0, 1)\). Then
Apply Lemma 2.6.17 to \(\xi = \mu ^{(\alpha , \nu )}\).
Let \(\mu , \nu \) be two probability measures on \(\mathcal X\) and let \(\alpha \in (0, 1)\). Then
The infimum is attained at \(\xi = \mu ^{(\alpha , \nu )}\).
By Lemma 2.6.17 and Lemma 2.4.8, for all \(\xi \),
We have then \((1 - \alpha ) R_\alpha (\mu , \nu ) \le \inf _{\xi \in \mathcal P(\mathcal X)}\left( \alpha \operatorname{KL}(\xi , \mu ) + (1 - \alpha )\operatorname{KL}(\xi , \nu ) \right)\). Corollary 2.6.18 proves the other inequality.
Let \(\mu , \nu \in \mathcal P(\mathcal X)\) and let \(\alpha \in (0, 1)\). Let \(\pi _\alpha = (\alpha , 1 - \alpha ) \in \mathcal P(\{ 0,1\} )\) and let \(P : \{ 0,1\} \rightsquigarrow \mathcal X\) be the kernel with \(P(0) = \mu \) and \(P(1) = \nu \). Then
The infimum is attained at \(\xi = \mu ^{(\alpha , \nu )}\).
Compare with Lemma 2.9.5 on the Jensen-Shannon divergence.
The result is a rephrasing of Lemma 2.6.19.
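The variational identity \((1 - \alpha ) R_\alpha (\mu , \nu ) = \inf _{\xi } \left( \alpha \operatorname{KL}(\xi , \mu ) + (1 - \alpha ) \operatorname{KL}(\xi , \nu ) \right)\) and the fact that the infimum is attained at the tilted measure can be checked numerically for discrete probability measures; here the tilted measure is the normalized geometric mixture with density proportional to \(p^\alpha q^{1-\alpha }\).

```python
import math

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def renyi(alpha, p, q):
    return math.log(sum(pi**alpha * qi**(1 - alpha) for pi, qi in zip(p, q))) / (alpha - 1)

mu = [0.7, 0.3]
nu = [0.4, 0.6]
alpha = 0.3

# Tilted measure: density proportional to p^alpha * q^(1 - alpha)
w = [p**alpha * q**(1 - alpha) for p, q in zip(mu, nu)]
xi_star = [x / sum(w) for x in w]

target = (1 - alpha) * renyi(alpha, mu, nu)
at_star = alpha * kl(xi_star, mu) + (1 - alpha) * kl(xi_star, nu)
assert abs(target - at_star) < 1e-12  # the infimum is attained at the tilted measure

# Any other xi gives a value at least as large:
xi = [0.5, 0.5]
assert alpha * kl(xi, mu) + (1 - alpha) * kl(xi, nu) >= target - 1e-12
```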
2.6.4 Chain rule and tensorization
Let \(\mu , \nu \) be two measures on \(\mathcal X\) and let \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) be two Markov kernels. Then \(R_\alpha (\mu \otimes \kappa , \nu \otimes \eta ) = R_\alpha (\mu , \nu ) + R_\alpha (\kappa , \eta \mid \mu ^{(\alpha , \nu )})\).
Let \(\mu , \nu \) be two finite measures on \(\mathcal X\) and \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) two Markov kernels with Bayesian inverses \(\kappa _\mu ^\dagger \) and \(\eta _\nu ^\dagger \). Then
Let \(\mu , \nu \in \mathcal P(\mathcal X)\) and \(\xi , \lambda \in \mathcal P(\mathcal Y)\). Then \(R_\alpha (\mu \times \xi , \nu \times \lambda ) = R_\alpha (\mu , \nu ) + R_\alpha (\xi , \lambda )\).
Apply Theorem 2.6.21 (the product of measures is a special case of \(\otimes \)). TODO: more details.
Let \(I\) be a finite index set. Let \((\mu _i)_{i \in I}, (\nu _i)_{i \in I}\) be probability measures on measurable spaces \((\mathcal X_i)_{i \in I}\). Then \(R_\alpha (\prod _{i \in I} \mu _i, \prod _{i \in I} \nu _i) = \sum _{i \in I} R_\alpha (\mu _i, \nu _i)\).
Induction over the finite index set, using Corollary 2.6.23.
Let \(\mu , \nu \) be two probability measures on \(\mathcal X\). Let \(n \in \mathbb {N}\) and write \(\mu ^{\otimes n}\) for the \(n\)-fold product measure of \(\mu \) on \(\mathcal X^n\). Then \(R_\alpha (\mu ^{\otimes n}, \nu ^{\otimes n}) = n R_\alpha (\mu , \nu )\).
Apply Theorem 2.6.24.
Let \(I\) be a countable index set. Let \((\mu _i)_{i \in I}, (\nu _i)_{i \in I}\) be probability measures on measurable spaces \((\mathcal X_i)_{i \in I}\). Then \(R_\alpha (\prod _{i \in I} \mu _i, \prod _{i \in I} \nu _i) = \sum _{i \in I} R_\alpha (\mu _i, \nu _i)\).
TODO: find a way to hide the following dummy lemma
Dummy node to summarize properties of the Rényi divergence.
2.7 Hellinger distance
TODO: refactor this section to link it to the Hellinger \(\alpha \)-divergence.
Let \(\mu , \nu \) be two measures. The squared Hellinger distance between \(\mu \) and \(\nu \) is
Let \(\mu , \nu \) be two probability measures. Then \(R_{1/2}(\mu , \nu ) = -2\log (1 - \operatorname{H^2}(\mu , \nu ))\).
\(R_{1/2}(\mu , \nu ) = -2 \log (1 - \frac{1}{2} D_f(\nu , \mu ))\) for \(f : x \mapsto -2 (\sqrt{x} - 1)\). Using Lemma 2.3.6, \(R_{1/2}(\mu , \nu ) = -2 \log (1 - D_g(\nu , \mu ))\) for \(g(x) = 1 - \sqrt{x}\).
It suffices then to show that \(\operatorname{H^2}(\mu , \nu ) = D_g(\mu , \nu )\), which is true by an application of Lemma 2.3.8.
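On a finite space the relation between the squared Hellinger distance and the Rényi divergence of order \(1/2\) can be checked directly, using \(\operatorname{H^2}(\mu , \nu ) = 1 - \sum _x \sqrt{\mu (x) \nu (x)}\) for probability measures. A small Python sketch (illustrative values, not part of the formalization):

```python
import numpy as np

mu = np.array([0.2, 0.5, 0.3])
nu = np.array([0.4, 0.4, 0.2])

bc = np.sum(np.sqrt(mu * nu))     # Bhattacharyya coefficient
H2 = 1 - bc                       # squared Hellinger distance (probability case)

# Rényi divergence of order 1/2: (1/(alpha-1)) log sum mu^alpha nu^(1-alpha).
R_half = np.log(np.sum(mu**0.5 * nu**0.5)) / (0.5 - 1)

# Lemma: R_{1/2}(mu, nu) = -2 log(1 - H^2(mu, nu)).
assert np.isclose(R_half, -2 * np.log(1 - H2))
```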
Let \(\mu , \nu \) be two probability measures. Then \(2 \operatorname{H^2}(\mu , \nu ) \le R_{1/2}(\mu , \nu )\).
Use Lemma 2.7.1, then \(-\log (1 - x) \ge x\).
Let \(\mu , \nu \) be two probability measures. Then \(\operatorname{H^2}(\mu , \nu ) \le \operatorname{TV}(\mu , \nu )\).
Let \(\mu , \nu \) be two probability measures. Then \(\operatorname{TV}(\mu , \nu ) \le \sqrt{\operatorname{H^2}(\mu , \nu )(2 - \operatorname{H^2}(\mu , \nu ))}\).
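The two-sided comparison between total variation and squared Hellinger distance can be illustrated numerically on a finite space, assuming the usual normalization \(\operatorname{TV}(\mu , \nu ) = \frac{1}{2} \sum _x |\mu (x) - \nu (x)|\) (a sketch with illustrative values, not part of the formalization):

```python
import numpy as np

mu = np.array([0.2, 0.5, 0.3])
nu = np.array([0.4, 0.4, 0.2])

tv = 0.5 * np.sum(np.abs(mu - nu))   # total variation distance
H2 = 1 - np.sum(np.sqrt(mu * nu))    # squared Hellinger distance

# Sandwich: H^2 <= TV <= sqrt(H^2 (2 - H^2)).
assert H2 <= tv <= np.sqrt(H2 * (2 - H2))
```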
Let \(\mu , \nu \) be two probability measures. Then
We use Lemma 2.7.4.
Let \(\mu , \nu \) be two probability measures. Then
Let \(\mu , \nu \) be two probability measures on \(\mathcal X\) and let \(n \in \mathbb {N}\), and \(\mu ^{\otimes n}, \nu ^{\otimes n}\) be their product measures on \(\mathcal X^n\). Then
2.8 Chernoff divergence
The Chernoff divergence of order \(\alpha > 0\) between two measures \(\mu \) and \(\nu \) on \(\mathcal X\) is \(C_\alpha (\mu , \nu ) = \inf _{\xi \in \mathcal P(\mathcal X)} \max \{ R_\alpha (\xi , \mu ), R_\alpha (\xi , \nu )\} \).
Note: the name Chernoff divergence is usually reserved for \(C_1\), and any reference to \(C\) without a subscript in this document refers to \(C_1\). The extension to other orders \(\alpha \) is motivated by the appearance of \(C_\alpha \) in a result further down. It is not a well-established notion, and it remains to be seen whether this extension is meaningful.
\(C_1(\mu , \nu ) = \inf _{\xi \in \mathcal P(\mathcal X)}\max \{ \operatorname{KL}(\xi , \mu ), \operatorname{KL}(\xi , \nu )\} \) .
This is the case \(\alpha = 1\) of the definition, since \(R_1 = \operatorname{KL}\).
\(C_\alpha (\mu , \nu ) = C_\alpha (\nu , \mu )\).
Immediate from the definition.
The function \(\alpha \mapsto C_\alpha (\mu , \nu )\) is nondecreasing.
Consequence of Lemma 2.6.13.
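On a two-point space the infimum defining \(C_\alpha \) can be approximated by brute force, which illustrates both the definition and the monotonicity in \(\alpha \). A Python sketch (the helpers `renyi` and `chernoff` and the grid search are illustrative, not part of the formalization):

```python
import numpy as np

def renyi(alpha, p, q):
    """Rényi divergence of order alpha between discrete distributions."""
    if alpha == 1:  # KL divergence, the limiting case alpha -> 1
        return np.sum(p * np.log(p / q))
    return np.log(np.sum(p**alpha * q**(1 - alpha))) / (alpha - 1)

def chernoff(alpha, p, q, grid=10_000):
    """Brute-force C_alpha = inf over xi of max(R_alpha(xi,p), R_alpha(xi,q))."""
    best = np.inf
    for t in np.linspace(1e-6, 1 - 1e-6, grid):
        xi = np.array([t, 1 - t])
        best = min(best, max(renyi(alpha, xi, p), renyi(alpha, xi, q)))
    return best

mu = np.array([0.2, 0.8])
nu = np.array([0.6, 0.4])

# Monotonicity: alpha -> C_alpha(mu, nu) is nondecreasing.
assert chernoff(0.5, mu, nu) <= chernoff(2.0, mu, nu) + 1e-9
# Taking xi = mu in the infimum shows C_1(mu, nu) <= KL(mu, nu).
assert chernoff(1, mu, nu) <= renyi(1, mu, nu) + 1e-6
```

The monotonicity assertion holds because the grid infima inherit the pointwise monotonicity of \(\alpha \mapsto R_\alpha (\xi , \cdot )\).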
2.9 Jensen-Shannon divergence
The Jensen-Shannon divergence indexed by \(\alpha \in (0,1)\) between two measures \(\mu \) and \(\nu \) is \(\operatorname{JS}_\alpha (\mu , \nu ) = \alpha \operatorname{KL}(\mu , \alpha \mu + (1 - \alpha ) \nu ) + (1 - \alpha ) \operatorname{KL}(\nu , \alpha \mu + (1 - \alpha ) \nu )\).
Let \(\mu , \nu \in \mathcal P(\mathcal X)\) and let \(\alpha \in (0, 1)\). Let \(\pi _\alpha = (\alpha , 1 - \alpha ) \in \mathcal P(\{ 0,1\} )\) and let \(P : \{ 0,1\} \rightsquigarrow \mathcal X\) be the kernel with \(P(0) = \mu \) and \(P(1) = \nu \). Then
\(\operatorname{JS}_\alpha \) is an \(f\)-divergence for \(f(x) = \alpha x \log (x) - (\alpha x + 1 - \alpha ) \log (\alpha x + 1 - \alpha )\) .
\((\mu , \nu ) \mapsto \operatorname{KL}(\mu , \alpha \mu + (1 - \alpha ) \nu )\) is an \(f\)-divergence for the function \(x \mapsto x\log \left(\frac{x}{\alpha x + 1 - \alpha }\right)\) (Lemma 2.3.12).
\((\mu , \nu ) \mapsto \operatorname{KL}(\nu , \alpha \mu + (1 - \alpha ) \nu )\) is an \(f\)-divergence for the function \(x \mapsto \log \left(\frac{1}{\alpha x + 1 - \alpha }\right)\) (Lemma 2.3.13).
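The \(f\)-divergence representation can be sanity-checked numerically on a finite space, assuming the convention \(D_f(\mu , \nu ) = \nu [f(\mathrm d\mu / \mathrm d\nu )]\) and the direct definition of \(\operatorname{JS}_\alpha \) as a weighted sum of KL divergences to the \(\alpha \)-mixture. A Python sketch with illustrative values:

```python
import numpy as np

def kl(p, q):
    """KL divergence between discrete distributions with full support."""
    return np.sum(p * np.log(p / q))

alpha = 0.3
mu = np.array([0.2, 0.5, 0.3])
nu = np.array([0.4, 0.4, 0.2])

def f(x):
    """f(x) = alpha x log x - (alpha x + 1 - alpha) log(alpha x + 1 - alpha)."""
    return alpha * x * np.log(x) - (alpha * x + 1 - alpha) * np.log(alpha * x + 1 - alpha)

# f-divergence D_f(mu, nu) = nu[f(dmu/dnu)], a weighted sum on a finite space.
d_f = np.sum(nu * f(mu / nu))

# Direct definition of JS_alpha via KL divergences to the alpha-mixture.
mix = alpha * mu + (1 - alpha) * nu
js = alpha * kl(mu, mix) + (1 - alpha) * kl(nu, mix)

assert np.isclose(d_f, js)
```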
For \(\alpha \in (0,1)\) and \(\mu , \nu \in \mathcal M(\mathcal X)\), \(\operatorname{JS}_\alpha (\mu , \nu ) = \operatorname{JS}_{1 - \alpha }(\nu , \mu )\).
Immediate from the definition.
Let \(\mu , \nu \in \mathcal P(\mathcal X)\) and let \(\alpha \in (0, 1)\). Then \(\operatorname{JS}_\alpha (\mu , \nu ) = \inf _{\xi \in \mathcal P(\mathcal X)} \left( \alpha \operatorname{KL}(\mu , \xi ) + (1 - \alpha ) \operatorname{KL}(\nu , \xi ) \right)\).
The infimum is attained at \(\xi = \alpha \mu + (1 - \alpha ) \nu \).
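The attainment of the infimum at the mixture can be observed numerically on a two-point space, assuming the variational formula \(\operatorname{JS}_\alpha (\mu , \nu ) = \inf _\xi \left( \alpha \operatorname{KL}(\mu , \xi ) + (1 - \alpha ) \operatorname{KL}(\nu , \xi ) \right)\). A Python sketch (grid search and values are illustrative):

```python
import numpy as np

def kl(p, q):
    return np.sum(p * np.log(p / q))

alpha = 0.3
mu = np.array([0.2, 0.8])
nu = np.array([0.6, 0.4])

# Candidate minimizer: the alpha-mixture.
mix = alpha * mu + (1 - alpha) * nu
js = alpha * kl(mu, mix) + (1 - alpha) * kl(nu, mix)

# Grid search over xi = (t, 1-t): the mixture should attain the infimum.
ts = np.linspace(1e-6, 1 - 1e-6, 10_000)
vals = [alpha * kl(mu, np.array([t, 1 - t]))
        + (1 - alpha) * kl(nu, np.array([t, 1 - t])) for t in ts]
assert np.isclose(js, min(vals), atol=1e-6)
```

The minimization is a cross-entropy problem in \(\xi \), which is why the minimizer is exactly \(\alpha \mu + (1 - \alpha ) \nu \).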
Let \(\mu , \nu \in \mathcal P(\mathcal X)\) and let \(\alpha \in (0, 1)\). Let \(\pi _\alpha = (\alpha , 1 - \alpha ) \in \mathcal P(\{ 0,1\} )\) and let \(P : \{ 0,1\} \rightsquigarrow \mathcal X\) be the kernel with \(P(0) = \mu \) and \(P(1) = \nu \). Then \(\operatorname{JS}_\alpha (\mu , \nu ) = \inf _{\xi \in \mathcal P(\mathcal X)} \operatorname{KL}(\pi _\alpha \otimes P, \pi _\alpha \times \xi )\).
The infimum is attained at \(\xi = \alpha \mu + (1 - \alpha ) \nu \).
Compare with Lemma 2.6.20 on the Rényi divergence.
Let \(\mu , \nu \) be two probability measures on \(\mathcal X\). Let \(n \in \mathbb {N}\) and write \(\mu ^{\otimes n}\) for the \(n\)-fold product measure of \(\mu \) on \(\mathcal X^n\). Then \(\operatorname{JS}_\alpha (\mu ^{\otimes n}, \nu ^{\otimes n}) \le n \operatorname{JS}_\alpha (\mu , \nu )\).
By Lemma 2.9.4, \(\operatorname{JS}_\alpha (\mu ^{\otimes n}, \nu ^{\otimes n}) = \inf _{\xi \in \mathcal P(\mathcal X^n)} \left( \alpha \operatorname{KL}(\mu ^{\otimes n}, \xi ) + (1 - \alpha ) \operatorname{KL}(\nu ^{\otimes n}, \xi ) \right)\). Taking \(\xi = \xi _0^{\otimes n}\) with \(\xi _0 = \alpha \mu + (1 - \alpha ) \nu \) and using the tensorization \(\operatorname{KL}(\mu ^{\otimes n}, \xi _0^{\otimes n}) = n \operatorname{KL}(\mu , \xi _0)\) bounds the infimum by \(n \operatorname{JS}_\alpha (\mu , \nu )\).
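The subadditivity of \(\operatorname{JS}_\alpha \) under products can be checked numerically for \(n = 2\) on a two-point space (a Python sketch with illustrative values, not part of the formalization):

```python
import numpy as np

def kl(p, q):
    return np.sum(p * np.log(p / q))

def js(alpha, p, q):
    """Jensen-Shannon divergence: weighted KLs to the alpha-mixture."""
    m = alpha * p + (1 - alpha) * q
    return alpha * kl(p, m) + (1 - alpha) * kl(q, m)

alpha = 0.3
mu = np.array([0.2, 0.8])
nu = np.array([0.6, 0.4])

# Product measures on X^2 via outer products of the mass functions.
mu2 = np.outer(mu, mu).ravel()
nu2 = np.outer(nu, nu).ravel()

# Subadditivity: JS_alpha(mu^{ox 2}, nu^{ox 2}) <= 2 JS_alpha(mu, nu).
assert js(alpha, mu2, nu2) <= 2 * js(alpha, mu, nu) + 1e-12
```

Unlike the Rényi divergence, equality does not hold in general: \(\operatorname{JS}_\alpha \) is bounded, so it cannot be additive over products.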
2.9.1 Comparisons with other divergences
For \(\mu , \nu \in \mathcal P(\mathcal X)\) and \(\alpha , \lambda \in (0,1)\), \(\operatorname{JS}_\alpha (\mu , \nu ) \le \frac{(1 - \alpha )^{1 - \lambda } \alpha ^\lambda }{\lambda ^{1 - \lambda } (1 - \lambda )^\lambda } \, (1 - \lambda ) \operatorname{H}_\lambda (\mu , \nu )\).
In particular,
The main tool is Lemma 2.3.66. \(\operatorname{JS}_\alpha \) is an \(f\)-divergence with function \(f(x) = \alpha x \log (x) - (\alpha x + 1 - \alpha ) \log (\alpha x + 1 - \alpha )\). On probability measures, \((1 - \lambda )\operatorname{H}_\lambda \) is an \(f\)-divergence with function \(g(x) = 1 - x^\lambda + \lambda (x-1)\). These two functions satisfy \(f(1) = g(1) = 0\) and \(f'(1) = g'(1) = 0\), and both are twice differentiable, hence Lemma 2.3.66 applies, and to obtain the inequality it suffices to compute \(\sup _x \frac{f''(x)}{g''(x)}\).
Let \(h(x) = \frac{f''(x)}{g''(x)} = \frac{\alpha (1 - \alpha )}{\lambda (1 - \lambda )} \frac{x^{1 - \lambda }}{\alpha x + 1 - \alpha }\) . Its maximum is attained at a point where \(h'(x) = 0\).
The optimum for \(h\) is attained at \(x = \frac{(1 - \alpha )(1 - \lambda )}{\alpha \lambda }\) and the maximal value of \(h\) is \(\frac{(1 - \alpha )^{1 - \lambda } \alpha ^\lambda }{\lambda ^{1 - \lambda } (1 - \lambda )^\lambda }\) .
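The closed-form maximizer and maximal value of \(h\) can be double-checked numerically (a Python sketch; the grid and parameter values are illustrative):

```python
import numpy as np

alpha, lam = 0.3, 0.4

# h(x) = f''(x) / g''(x) for the functions f, g in the proof above.
xs = np.linspace(1e-3, 100, 200_000)
h = (alpha * (1 - alpha)) / (lam * (1 - lam)) * xs**(1 - lam) / (alpha * xs + 1 - alpha)

# Claimed optimizer and optimal value.
x_star = (1 - alpha) * (1 - lam) / (alpha * lam)
h_star = (1 - alpha)**(1 - lam) * alpha**lam / (lam**(1 - lam) * (1 - lam)**lam)

assert np.isclose(h.max(), h_star, rtol=1e-4)
assert abs(xs[h.argmax()] - x_star) < 0.1
```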
For \(\mu , \nu \in \mathcal P(\mathcal X)\) and \(\alpha \in (0,1)\), \(\lambda \le 1/2\) ,
By Lemma 2.9.7 for \(\alpha \) and \(1 - \lambda \),
To get the result, use \(1 - \alpha \le 1\) and \(\frac{\lambda }{1 - \lambda } \le 1\), which holds because \(\lambda \mapsto \frac{\lambda }{1 - \lambda }\) is increasing and equals \(1\) at \(\lambda = 1/2\).