Lower bounds for hypothesis testing based on information theory

2 Information divergences

A divergence between measures is a function which maps two measures on a common measurable space \(\mathcal X\) to a value in \(\mathbb {R} \cup \{ +\infty \} \). For two probability measures \(\mu \) and \(\nu \), we will want the divergence \(D(\mu , \nu )\) to be non-negative and to be zero if \(\mu = \nu \): divergences are meant to be notions of dissimilarity between measures.

We will pay particular attention to how the divergence behaves with respect to transformations of the measures. For a Markov kernel \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) , a fundamental property of all divergences we will consider is the data-processing inequality, which states that \(D(\kappa \circ \mu , \kappa \circ \nu ) \le D(\mu , \nu )\). That is, processing data with a (possibly random) common transformation \(\kappa \) can only lose information about the difference between the two measures.

Another important concept is the conditional divergence of two kernels \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) given a measure \(\mu \in \mathcal M(\mathcal X)\). This is the value \(D(\mu \otimes \kappa , \mu \otimes \eta )\).

TODO: two possible concepts of conditional divergence are \(D(\mu \otimes \kappa , \mu \otimes \eta )\) and \(\mu \left[x \mapsto D(\kappa (x), \eta (x))\right]\). They are equal for \(f\)-divergences (which also gives convexity of the divergence) but not in general.

2.1 Generic divergences

Definition 10 Divergence
#

A divergence between measures is a function \(D\) which for any measurable space \(\mathcal X\) and any two measures \(\mu , \nu \in \mathcal M(\mathcal X)\), returns a value \(D(\mu , \nu ) \in \mathbb {R} \cup \{ +\infty \} \).

Lean Remark 1
#

In Lean, such a definition for a divergence would have to restrict the possible measurable spaces to be in a common universe. We don’t implement that definition, but will only implement particular divergences.

Definition 11 Data-processing inequality

A divergence \(D\) is said to satisfy the data-processing inequality (DPI) if for all measurable spaces \(\mathcal X, \mathcal Y\), all \(\mu , \nu \in \mathcal M(\mathcal X)\) and all Markov kernels \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\),

\begin{align*} D(\kappa \circ \mu , \kappa \circ \nu ) \le D(\mu , \nu ) \: . \end{align*}
Lemma 2.1.1 Second marginal

Let \(D\) be a divergence that satisfies the DPI. Let \(\mu , \nu \in \mathcal M(\mathcal X)\) and let \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\). Then

\begin{align*} D(\kappa \circ \mu , \eta \circ \nu ) \le D(\mu \otimes \kappa , \nu \otimes \eta ) \: . \end{align*}
Proof

Use the DPI for the deterministic kernel of the function \((x,y) \mapsto y\).

Lemma 2.1.2 First marginal

Let \(D\) be a divergence that satisfies the DPI. Let \(\mu , \nu \in \mathcal M(\mathcal X)\) and let \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) be Markov kernels. Then

\begin{align*} D(\mu , \nu ) \le D(\mu \otimes \kappa , \nu \otimes \eta ) \: . \end{align*}
Proof

Use the DPI for the deterministic kernel of the function \((x,y) \mapsto x\).

Lemma 2.1.3

Let \(D\) be a divergence that satisfies the DPI. Let \(\mu , \nu \in \mathcal M(\mathcal X)\) and let \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) be a Markov kernel. Then

\begin{align*} D(\mu \otimes \kappa , \nu \otimes \kappa ) = D(\mu , \nu ) \: . \end{align*}
Proof

Lemma 2.1.2 gives one inequality. The other inequality is the DPI: \(\mu \otimes \kappa = (\mathrm{id} \times \kappa ) \circ \mu \).

Definition 12 Conditional divergence

Let \(D\) be a divergence. The conditional divergence of kernels \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) with respect to a measure \(\mu \in \mathcal M(\mathcal X)\) is \(\mu \left[x \mapsto D(\kappa (x), \eta (x))\right]\). It is denoted by \(D(\kappa , \eta \mid \mu )\).

For the important class of \(f\)-divergences that we will introduce later, \(D(\kappa , \eta \mid \mu ) = D(\mu \otimes \kappa , \mu \otimes \eta )\).

Lemma 2.1.4 Conditioning increases divergence

Let \(D\) be a divergence that satisfies the DPI and for which \(D(\kappa , \eta \mid \mu ) = D(\mu \otimes \kappa , \mu \otimes \eta )\). Let \(\mu \in \mathcal M(\mathcal X)\) and let \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) be Markov kernels. Then

\begin{align*} D(\kappa \circ \mu , \eta \circ \mu ) \le D(\kappa , \eta \mid \mu ) \: . \end{align*}
Proof

This is a special case of Lemma 2.1.1.

2.2 Statistical divergences

We define a family of divergences based on the risk of a binary estimation task.

2.2.1 Statistical information

Definition 13

The statistical information between measures \(\mu \) and \(\nu \) with respect to prior \(\xi \in \mathcal M(\{ 0,1\} )\) is \(\mathcal I_\xi (\mu , \nu ) = \min \{ \xi _0 \mu (\mathcal X), \xi _1 \nu (\mathcal X)\} - \mathcal B_\xi (\mu , \nu )\). This is the risk increase \(I_\xi ^P(d_{\mathcal X})\) in the binary hypothesis testing problem for \(d_{\mathcal X} : \mathcal X \rightsquigarrow *\) the Markov kernel to the point space.

The statistical information is the difference between the minimal risk of an estimator that does not depend on the data and that of a Bayes estimator. This is a simple generalization of both the DeGroot statistical information and the hockey-stick (or \(E_\gamma \)) divergence.

Lemma 2.2.1

For \(\mu \in \mathcal M(\mathcal X)\), \(\mathcal I_\xi (\mu , \mu ) = 0\) .

Proof

Use Lemma 1.2.6.

Lemma 2.2.2

For \(\mu , \nu \in \mathcal M(\mathcal X)\), \(\mathcal I_\xi (\mu , \nu ) \ge 0\) .

Proof

Use Lemma 1.2.8.

Lemma 2.2.3

For \(\mu , \nu \in \mathcal M(\mathcal X)\), \(\mathcal I_\xi (\mu , \nu ) \le \min \{ \xi _0 \mu (\mathcal X), \xi _1 \nu (\mathcal X)\} \) .

Proof

Use Lemma 1.2.7.

Lemma 2.2.4

For \(\mu , \nu \in \mathcal M(\mathcal X)\) and \(\xi \in \mathcal M(\{ 0,1\} )\), \(\mathcal I_\xi (\mu , \nu ) = \mathcal I_{\xi _\leftrightarrow }(\nu , \mu )\) .

Proof

Use Lemma 1.2.9.

Theorem 2.2.5 Data-processing inequality

For \(\mu , \nu \in \mathcal M(\mathcal X)\), \(\xi \in \mathcal M(\{ 0,1\} )\) and \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) a Markov kernel, \(\mathcal I_\xi (\kappa \circ \mu , \kappa \circ \nu ) \le \mathcal I_\xi (\mu , \nu )\) .

Proof

Since \(\kappa \) is a Markov kernel,

\begin{align*} \mathcal I_\xi (\kappa \circ \mu , \kappa \circ \nu ) & = \min \{ \xi _0(\kappa \circ \mu )(\mathcal Y), \xi _1(\kappa \circ \nu )(\mathcal Y)\} - \mathcal B_\xi (\kappa \circ \mu , \kappa \circ \nu ) \\ & = \min \{ \xi _0 \mu (\mathcal X), \xi _1 \nu (\mathcal X)\} - \mathcal B_\xi (\kappa \circ \mu , \kappa \circ \nu ) \end{align*}

It suffices to prove \(\mathcal B_\xi (\kappa \circ \mu , \kappa \circ \nu ) \ge \mathcal B_\xi (\mu , \nu )\) since \(\mathcal I_\xi (\mu , \nu ) = \min \{ \xi _0\mu (\mathcal X), \xi _1\nu (\mathcal X)\} - \mathcal B_\xi (\mu , \nu )\). The inequality was proved in Theorem 1.2.5.

Alternatively, we can use Lemma 1.1.19.

Corollary 2.2.6

Let \(\mu , \nu \) be two measures on \(\mathcal X\), \(\xi \in \mathcal M(\{ 0,1\} )\) and let \(E\) be an event on \(\mathcal X\). Let \(\mu _E\) and \(\nu _E\) be the two Bernoulli distributions with respective means \(\mu (E)\) and \(\nu (E)\). Then \(\mathcal I_\xi (\mu , \nu ) \ge \mathcal I_\xi (\mu _E, \nu _E)\).

Proof

Apply Theorem 2.2.5 to the deterministic kernel \(\mathcal X \rightsquigarrow \{ 0,1\} \) given by the function \(x \mapsto \mathbb {I}\{ x \in E\} \).

Lemma 2.2.7

For finite measures \(\mu , \nu \) and \(\xi \in \mathcal M(\{ 0,1\} )\), for any measure \(\zeta \) with \(\mu \ll \zeta \) and \(\nu \ll \zeta \),

\begin{align*} \mathcal I_\xi (\mu , \nu ) & = \min \{ \xi _0\mu (\mathcal X), \xi _1\nu (\mathcal X)\} - \zeta \left[x \mapsto \min \left\{ \xi _0\frac{d \mu }{d\zeta }(x), \xi _1\frac{d \nu }{d\zeta }(x)\right\} \right] \: . \end{align*}

This holds in particular for \(\zeta = P \circ \xi \).

Proof

Apply Theorem 1.2.12 to get that result for \(\zeta = P \circ \xi \). Then for other \(\zeta \) with \(\mu \ll \zeta \) and \(\nu \ll \zeta \) , use \(\frac{d \mu }{d\zeta } = \frac{d \mu }{d(P \circ \xi )} / \frac{d (P \circ \xi )}{d\zeta }\) .
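As a concrete illustration (not part of the formal development), the following Python sketch evaluates this closed form for measures on a finite set, taking \(\zeta \) to be the counting measure. The array encoding of \(\mu , \nu \) and the helper name are assumptions made for this example only.

```python
import numpy as np

def stat_info(xi0, xi1, mu, nu):
    """Statistical information I_xi(mu, nu) for measures on a finite set.

    mu and nu are arrays of point masses, so with zeta the counting measure
    the densities d(mu)/d(zeta) and d(nu)/d(zeta) are the arrays themselves.
    """
    bayes_risk = np.minimum(xi0 * mu, xi1 * nu).sum()   # zeta[min(xi0 dmu/dzeta, xi1 dnu/dzeta)]
    prior_risk = min(xi0 * mu.sum(), xi1 * nu.sum())    # best estimator that ignores the data
    return prior_risk - bayes_risk

mu = np.array([0.5, 0.3, 0.2])
nu = np.array([0.2, 0.2, 0.6])
print(stat_info(0.4, 0.6, mu, nu))   # non-negative and at most min(0.4, 0.6), cf. Lemmas 2.2.2 and 2.2.3
```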

Lemma 2.2.8

For finite measures \(\mu , \nu \) and \(\xi \in \mathcal M(\{ 0,1\} )\), for any measure \(\zeta \) with \(\mu \ll \zeta \) and \(\nu \ll \zeta \),

\begin{align*} \mathcal I_\xi (\mu , \nu ) & = - \frac{1}{2} \left\vert \xi _0\mu (\mathcal X) - \xi _1\nu (\mathcal X) \right\vert + \frac{1}{2} \zeta \left[x \mapsto \left\vert \xi _0\frac{d \mu }{d\zeta }(x) - \xi _1\frac{d \nu }{d\zeta }(x)\right\vert \right] \: . \end{align*}

This holds in particular for \(\zeta = P \circ \xi \).

Proof

Apply Corollary 1.2.13 and write \(\min \{ \xi _0\mu (\mathcal X), \xi _1\nu (\mathcal X)\} = \frac{1}{2}\left((P \circ \xi )(\mathcal X) - \left\vert \xi _0\mu (\mathcal X) - \xi _1\nu (\mathcal X) \right\vert \right)\) to get the result for \(\zeta = P \circ \xi \). The result for other \(\zeta \) follows from simple manipulations of Radon-Nikodym derivatives.

Theorem 2.2.9 Integral form of the statistical information

For finite measures \(\mu , \nu \) and \(\xi \in \mathcal M(\{ 0,1\} )\),

\begin{align*} \mathcal I_\xi (\mu , \nu ) & = \nu \left[ x \mapsto \max \left\{ 0 , \xi _0\frac{d \mu }{d\nu }(x) - \xi _1 \right\} \right] + \xi _0 \mu _{\perp \nu }(\mathcal X) & \text{ if } \xi _0 \mu (\mathcal X) \le \xi _1 \nu (\mathcal X) \: , \\ \mathcal I_\xi (\mu , \nu ) & = \nu \left[ x \mapsto \max \left\{ 0 , \xi _1 - \xi _0\frac{d \mu }{d\nu }(x) \right\} \right] & \text{ if } \xi _0 \mu (\mathcal X) \ge \xi _1 \nu (\mathcal X) \: . \end{align*}

For probability measures, this theorem makes the statistical information an \(f\)-divergence (which is defined later in this document).

Proof

By Theorem 1.2.12,

\begin{align*} \mathcal I_\xi (\mu , \nu ) & = \min \{ \xi _0\mu (\mathcal X), \xi _1\nu (\mathcal X)\} - (P \circ \xi )\left[x \mapsto \min \left\{ \xi _0\frac{d \mu }{d(P \circ \xi )}(x), \xi _1\frac{d \nu }{d(P \circ \xi )}(x)\right\} \right] \: . \end{align*}

Suppose that \(\xi _0\mu (\mathcal X) \le \xi _1\nu (\mathcal X)\) . Then

\begin{align*} \mathcal I_\xi (\mu , \nu ) & = \xi _0\mu (\mathcal X) - (P \circ \xi )\left[x \mapsto \xi _0\frac{d \mu }{d(P \circ \xi )}(x) + \min \left\{ 0 , \xi _1\frac{d \nu }{d(P \circ \xi )}(x) - \xi _0\frac{d \mu }{d(P \circ \xi )}(x)\right\} \right] \\ & = - (P \circ \xi )\left[x \mapsto \min \left\{ 0 , \xi _1\frac{d \nu }{d(P \circ \xi )}(x) - \xi _0\frac{d \mu }{d(P \circ \xi )}(x) \right\} \right] \\ & = (P \circ \xi )\left[x \mapsto \max \left\{ 0 , \xi _0\frac{d \mu }{d(P \circ \xi )}(x) - \xi _1\frac{d \nu }{d(P \circ \xi )}(x) \right\} \right] \\ & = \nu \left[ x \mapsto \max \left\{ 0 , \xi _0\frac{d \mu }{d\nu }(x) - \xi _1 \right\} \right] + \xi _0 \mu _{\perp \nu }(\mathcal X) \: . \end{align*}

If on the other hand \(\xi _0\mu (\mathcal X) \ge \xi _1\nu (\mathcal X)\) , with similar computations,

\begin{align*} \mathcal I_\xi (\mu , \nu ) & = (P \circ \xi )\left[x \mapsto \max \left\{ 0 , \xi _1\frac{d \nu }{d(P \circ \xi )}(x) - \xi _0\frac{d \mu }{d(P \circ \xi )}(x) \right\} \right] \\ & = \nu \left[ x \mapsto \max \left\{ 0 , \xi _1 - \xi _0\frac{d \mu }{d\nu }(x) \right\} \right] \: . \end{align*}
Corollary 2.2.10

For finite measures \(\mu , \nu \) and \(\xi \in \mathcal M(\{ 0,1\} )\),

\begin{align*} \mathcal I_\xi (\mu , \nu ) & = -\frac{1}{2} \left\vert \xi _0 \mu (\mathcal X) - \xi _1 \nu (\mathcal X)\right\vert + \frac{1}{2}\left( \nu \left[ x \mapsto \left\vert \xi _0\frac{d \mu }{d\nu }(x) - \xi _1 \right\vert \right] + \xi _0 \mu _{\perp \nu }(\mathcal X)\right) \: . \end{align*}
Proof

Start from Theorem 2.2.9, then use \(\max \{ a,b\} = \frac{1}{2}\left( a + b + \vert a - b \vert \right)\) . The two cases of Theorem 2.2.9 lead to the same expression.

Lemma 2.2.11

For finite measures \(\mu , \nu \) and \(\xi \in \mathcal M(\{ 0,1\} )\),

\begin{align*} \mathcal I_\xi (\mu , \nu ) & = \min \{ \xi _0 \mu (\mathcal X), \xi _1 \nu (\mathcal X)\} - \inf _{E \text{ event}} \left( \xi _0 \mu (E) + \xi _1 \nu (E^c) \right) \: . \end{align*}
Proof

This is a direct application of Lemma 1.2.11.

Lemma 2.2.12

For finite measures \(\mu , \nu \) and \(\xi \in \mathcal M(\{ 0,1\} )\),

\begin{align*} \mathcal I_\xi (\mu , \nu ) & = - \max \{ 0, \xi _1 \nu (\mathcal X) - \xi _0 \mu (\mathcal X) \} + \sup _{E \text{ event}} \left( \xi _1 \nu (E) - \xi _0 \mu (E) \right) \\ & = - \max \{ 0, \xi _0 \mu (\mathcal X) - \xi _1 \nu (\mathcal X) \} + \sup _{E \text{ event}} \left( \xi _0 \mu (E) - \xi _1 \nu (E) \right) \: . \end{align*}
Proof

TODO: find a way to hide the following dummy lemma

Lemma 2.2.13 Dummy lemma: statistical information properties

Dummy node to summarize properties of the statistical information.

Proof
Definition 14
#

The DeGroot statistical information between finite measures \(\mu \) and \(\nu \) for \(\pi \in [0,1]\) is \(I_\pi (\mu , \nu ) = \mathcal I_{(\pi , 1 - \pi )}(\mu , \nu )\) .

Definition 15
#

The \(E_\gamma \) or hockey-stick divergence between finite measures \(\mu \) and \(\nu \) for \(\gamma \in (0,+\infty )\) is \(E_\gamma (\mu , \nu ) = \mathcal I_{(1,\gamma )}(\mu , \nu )\) .

Note that our definition of the \(E_\gamma \) divergence extends the one from the literature: the divergence is usually defined for \(\gamma \ge 1\) only. The extension makes sense in light of Theorem 2.3.34. It is also extended from probability measures to finite measures.

2.2.2 Total variation distance

Definition 16
#

The total variation distance between finite measures \(\mu \) and \(\nu \) is \(\operatorname{TV}(\mu , \nu ) = \mathcal I_{(1,1)}(\mu , \nu )\) .

This is also equal to \(2 I_{1/2}(\mu , \nu )\) and to \(E_1(\mu , \nu )\).
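The following illustrative sketch checks these identities on a finite set, reusing the closed form of Lemma 2.2.7 with \(\zeta \) the counting measure (the helper and the example vectors are specific to this sketch).

```python
import numpy as np

def stat_info(xi0, xi1, mu, nu):
    # I_xi(mu, nu) on a finite set, with zeta the counting measure (Lemma 2.2.7)
    return min(xi0 * mu.sum(), xi1 * nu.sum()) - np.minimum(xi0 * mu, xi1 * nu).sum()

mu = np.array([0.5, 0.3, 0.2])   # two probability measures (equal total mass)
nu = np.array([0.2, 0.2, 0.6])

print(stat_info(1.0, 1.0, mu, nu))      # TV(mu, nu) = I_{(1,1)} = E_1
print(2 * stat_info(0.5, 0.5, mu, nu))  # = 2 I_{1/2}
print(0.5 * np.abs(mu - nu).sum())      # = (1/2) sum |p - q|, cf. Lemma 2.2.20 with equal masses
```

All three printed values coincide (here they equal 0.4).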

Lemma 2.2.14
#

For \(\mu \in \mathcal M(\mathcal X)\), \(\operatorname{TV}(\mu , \mu ) = 0\) .

Proof

Use Lemma 2.2.1.

Lemma 2.2.15
#

For \(\mu , \nu \in \mathcal M(\mathcal X)\), \(\operatorname{TV}(\mu , \nu ) \ge 0\) .

Proof

Use Lemma 2.2.2.

Lemma 2.2.16
#

For \(\mu , \nu \in \mathcal M(\mathcal X)\), \(\operatorname{TV}(\mu , \nu ) \le \min \{ \mu (\mathcal X), \nu (\mathcal X)\} \) .

Proof

Use Lemma 2.2.3.

Theorem 2.2.17 Data-processing inequality

For \(\mu , \nu \in \mathcal M(\mathcal X)\) and \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) a Markov kernel, \(\operatorname{TV}(\kappa \circ \mu , \kappa \circ \nu ) \le \operatorname{TV}(\mu , \nu )\) .

Proof

Apply Theorem 2.2.5.

TODO: find a way to hide the following dummy lemma

Lemma 2.2.18 Dummy lemma: TV properties

Dummy node to summarize properties of the total variation distance.

Proof
Lemma 2.2.19

For finite measures \(\mu , \nu \), for any measure \(\zeta \) with \(\mu \ll \zeta \) and \(\nu \ll \zeta \) ,

\begin{align*} \operatorname{TV}(\mu , \nu ) & = \min \{ \mu (\mathcal X), \nu (\mathcal X)\} - \zeta \left[x \mapsto \min \left\{ \frac{d \mu }{d\zeta }(x), \frac{d \nu }{d\zeta }(x)\right\} \right] \: . \end{align*}

This holds in particular for \(\zeta = \mu + \nu \).

Proof

Apply Lemma 2.2.7.

Lemma 2.2.20

For finite measures \(\mu , \nu \), for any measure \(\zeta \) with \(\mu \ll \zeta \) and \(\nu \ll \zeta \) ,

\begin{align*} \operatorname{TV}(\mu , \nu ) & = - \frac{1}{2} \left\vert \mu (\mathcal X) - \nu (\mathcal X) \right\vert + \frac{1}{2} \zeta \left[x \mapsto \left\vert \frac{d \mu }{d\zeta }(x) - \frac{d \nu }{d\zeta }(x)\right\vert \right] \: . \end{align*}

This holds in particular for \(\zeta = \mu + \nu \).

Proof

Apply Lemma 2.2.8.

Lemma 2.2.21 Integral form of the total variation distance

For finite measures \(\mu , \nu \),

\begin{align*} \operatorname{TV}(\mu , \nu ) & = \nu \left[ x \mapsto \max \left\{ 0 , \frac{d \mu }{d\nu }(x) - 1 \right\} \right] + \mu _{\perp \nu }(\mathcal X) & \text{ if } \mu (\mathcal X) \le \nu (\mathcal X) \: , \\ \operatorname{TV}(\mu , \nu ) & = \nu \left[ x \mapsto \max \left\{ 0 , 1 - \frac{d \mu }{d\nu }(x) \right\} \right] & \text{ if } \mu (\mathcal X) \ge \nu (\mathcal X) \: . \end{align*}
Proof

Apply Theorem 2.2.9.

Lemma 2.2.22

For finite measures \(\mu , \nu \),

\begin{align*} \operatorname{TV}(\mu , \nu ) & = -\frac{1}{2} \left\vert \mu (\mathcal X) - \nu (\mathcal X)\right\vert + \frac{1}{2}\left( \nu \left[ x \mapsto \left\vert \frac{d \mu }{d\nu }(x) - 1 \right\vert \right] + \mu _{\perp \nu }(\mathcal X)\right) \: . \end{align*}
Proof

Apply Corollary 2.2.10.

Lemma 2.2.23

For finite measures \(\mu , \nu \),

\begin{align*} \operatorname{TV}(\mu , \nu ) & = \min \{ \mu (\mathcal X), \nu (\mathcal X)\} - \inf _{E \text{ event}} \left( \mu (E) + \nu (E^c) \right) \: . \end{align*}
Proof

Apply Lemma 2.2.11.

Theorem 2.2.24

Let \(\mu , \nu \) be two finite measures on \(\mathcal X\) with \(\mu (\mathcal X) \leq \nu (\mathcal X)\).
Then \(\operatorname{TV}(\mu , \nu ) = \sup _{E \text{ event}} \left( \mu (E) - \nu (E) \right)\).

Proof

Starting from Lemma 2.2.23,

\begin{align*} \operatorname{TV}(\mu , \nu ) & = \min \{ \mu (\mathcal X), \nu (\mathcal X)\} - \inf _{E \text{ event}} \left( \mu (E) + \nu (E^c) \right) \\ & = \mu (\mathcal X) - \inf _{E \text{ event}} \left( \mu (E^c) + \nu (E) \right) \\ & = \mu (\mathcal X) - \mu (\mathcal X) - \inf _{E \text{ event}} \left( -\mu (E) + \nu (E) \right) \\ & = \sup _{E \text{ event}} \left( \mu (E) - \nu (E) \right) \: . \end{align*}
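On a small finite space this variational form can also be checked by brute force over all \(2^n\) events; the sketch below (illustrative only) enumerates them.

```python
import itertools
import numpy as np

mu = np.array([0.5, 0.3, 0.2])   # equal total masses, so mu(X) <= nu(X) holds
nu = np.array([0.2, 0.2, 0.6])

# sup over all events E of mu(E) - nu(E), by exhaustive enumeration
best = max(
    sum(mu[i] - nu[i] for i in event)
    for k in range(len(mu) + 1)
    for event in itertools.combinations(range(len(mu)), k)
)
print(best)                          # 0.4, attained at E = {0, 1}
print(0.5 * np.abs(mu - nu).sum())   # TV(mu, nu), same value (Lemma 2.2.22)
```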
Lemma 2.2.25

Let \(\mathcal F = \{ f : \mathcal X \to \mathbb {R} \mid \Vert f \Vert _\infty \le 1\} \). Then for \(\mu , \nu \) finite measures with \(\mu (\mathcal X) = \nu (\mathcal X)\), \(\frac{1}{2} \sup _{f \in \mathcal F} \left( \mu [f] - \nu [f] \right) \le \operatorname{TV}(\mu , \nu )\).

Proof

Let \(p,q\) be the respective densities of \(\mu , \nu \) with respect to \(\xi =\mu +\nu \). For any \(f \in \mathcal F\),

\begin{align*} \mu [f] - \nu [f] = \int _x f(x)(p(x) - q(x)) \partial \xi & \le \int _x \vert f(x) \vert \, \vert p(x) - q(x) \vert \partial \xi \\ & \le \int _x \vert p(x) - q(x) \vert \partial \xi = 2 \operatorname{TV}(\mu , \nu ) \: . \end{align*}

The last equality is Lemma 2.2.20.

Lemma 2.2.26

Let \(\mathcal F = \{ f : \mathcal X \to \mathbb {R} \mid \Vert f \Vert _\infty \le 1\} \). Then for \(\mu , \nu \) finite measures with \(\mu (\mathcal X) = \nu (\mathcal X)\), \(\operatorname{TV}(\mu , \nu ) = \frac{1}{2} \sup _{f \in \mathcal F} \left( \mu [f] - \nu [f] \right)\).

Proof

Lemma 2.2.25 gives \(\operatorname{TV}(\mu , \nu ) \ge \frac{1}{2} \sup _{f \in \mathcal F} \left( \mu [f] - \nu [f] \right)\).

The Bayes estimator is a function of the form \(x \mapsto \mathbb {I}\{ x \in E\} \) for some event \(E\), hence it belongs to \(\mathcal F\). This gives equality.

Lemma 2.2.27

Let \(\mu , \nu \) be two measures on \(\mathcal X\) and let \(E\) be an event. Let \(\mu _E\) and \(\nu _E\) be the two Bernoulli distributions with respective means \(\mu (E)\) and \(\nu (E)\). Then \(\operatorname{TV}(\mu , \nu ) \ge \operatorname{TV}(\mu _E, \nu _E)\).

Proof

Apply Corollary 2.2.6.

2.3 f-divergences

2.3.1 Definition and basic properties

Everywhere in this document, the functions used in divergences are assumed measurable, continuous and convex on \([0, +\infty )\). For such a function \(f : \mathbb {R} \to \mathbb {R}\), we write \(f'(\infty ) = \lim _{x \to +\infty } f(x)/x\) (which can be infinite).

For \(\mu , \nu \) two measures, we denote by \(\mu _{\perp \nu }\) the singular part of \(\mu \) with respect to \(\nu \).

Definition 17 f-divergence
#

Let \(f : \mathbb {R} \to \mathbb {R}\) and let \(\mu , \nu \) be two measures on a measurable space \(\mathcal X\). The f-divergence between \(\mu \) and \(\nu \) is

\begin{align*} D_f(\mu , \nu ) = \nu \left[x \mapsto f\left(\frac{d \mu }{d \nu }(x)\right)\right] + f’(\infty ) \mu _{\perp \nu }(\mathcal X) \end{align*}

if \(x \mapsto f\left(\frac{d \mu }{d \nu }(x)\right)\) is \(\nu \)-integrable and \(+\infty \) otherwise.
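For measures on a finite set, this definition can be written out directly. The sketch below is an illustration only; the helper names and the choices of \(f\) are specific to this example.

```python
import numpy as np

def f_div(f, f_prime_inf, mu, nu):
    """D_f(mu, nu) for measures on a finite set (Definition 17).

    On points where nu > 0, integrate f(d mu / d nu) against nu; the mass of mu
    carried by points where nu = 0 is the singular part mu_{perp nu}.
    """
    ac = nu > 0
    integral = np.sum(nu[ac] * f(mu[ac] / nu[ac]))
    singular_mass = mu[~ac].sum()
    if singular_mass == 0:
        return integral
    return integral + f_prime_inf * singular_mass

def kl_f(x):
    # f(x) = x log x, with the convention f(0) = 0
    return np.where(x > 0, x * np.log(np.where(x > 0, x, 1.0)), 0.0)

mu = np.array([0.4, 0.4, 0.2, 0.0])
nu = np.array([0.3, 0.3, 0.3, 0.1])
print(f_div(kl_f, np.inf, mu, nu))                        # f(x) = x log x; finite here since mu << nu
print(f_div(lambda x: 0.5 * np.abs(x - 1), 0.5, mu, nu))  # f(x) = |x-1|/2 gives TV for probability measures
```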

Lemma 2.3.1
#

For \(\mu \) and \(\nu \) two finite measures, \(D_f(\mu , \nu )\) is finite if and only if \(x \mapsto f\left(\frac{d \mu }{d \nu }(x)\right)\) is \(\nu \)-integrable and either \(f'(\infty ) {\lt} \infty \) or \(\mu \ll \nu \).

Proof

Divergences associated with basic (transformations of) functions

Lemma 2.3.2
#

For \(\nu \) a finite measure, for all \(a \in \mathbb {R}\), \(D_{x \mapsto a}(\mu , \nu ) = a \nu (\mathcal X)\).

Proof

Compute the integral.

Lemma 2.3.3
#

If \(f(1) = 0\) then \(D_{f}(\mu , \mu ) = 0\).

Proof

\(\frac{d \mu }{d \mu }(x) = 1\) almost everywhere and \(f(1) = 0\).

Lemma 2.3.4

Let \(\mu , \nu \) be two probability measures. Assume \(f'(\infty ) = + \infty \) and \(f(1) = 0\). Then \(D_f(\mu , \nu ) = 0\) if and only if \(\mu = \nu \).

Proof

If \(\mu = \nu \) then \(D_f(\mu , \nu ) = 0\) by Lemma 2.3.3. For the other direction use the strict version of Jensen’s inequality.

Lemma 2.3.5
#

\(D_{x \mapsto x}(\mu , \nu ) = \mu (\mathcal X)\).

Proof

Compute the integral: its value is \((\frac{d\mu }{d\nu }\cdot \nu )(\mathcal X)\). Then \(D_{x\mapsto x}(\mu , \nu ) = (\frac{d\mu }{d\nu }\cdot \nu )(\mathcal X) + \mu _{\perp \nu }(\mathcal X) = \mu (\mathcal X)\).

Lemma 2.3.6
#

For all \(a \ge 0\), \(D_{a f}(\mu , \nu ) = a D_{f}(\mu , \nu )\).

Proof

Linearity of the integral.

Lemma 2.3.7
#

\(D_{f + g}(\mu , \nu ) = D_f(\mu , \nu ) + D_g(\mu , \nu )\).

Proof

Linearity of the integral.

Lemma 2.3.8

For finite measures \(\mu \) and \(\nu \) with \(\mu (\mathcal X) = \nu (\mathcal X)\), for all \(a \in \mathbb {R}\), \(D_{f + a(x - 1)}(\mu , \nu ) = D_{f}(\mu , \nu )\).

Proof

Linearity (Lemmas 2.3.7 and 2.3.6), then Lemma 2.3.2 and 2.3.5.

Lemma 2.3.9

Let \(\mu \) and \(\nu \) be two measures on \(\mathcal X\) and let \(g : \mathcal X \to \mathcal Y\) be a measurable embedding. Then \(D_f(g_* \mu , g_* \nu ) = D_f(\mu , \nu )\).

Proof
Lemma 2.3.10

\(D_f(\mu , \nu ) = D_{x \mapsto xf(1/x)}(\nu , \mu )\) .

Proof
Lemma 2.3.11

\((\mu , \nu ) \mapsto D_f(a \mu + b \nu , \nu )\) is an \(f\)-divergence for the function \(x \mapsto f(ax + b)\).

Proof

Lemma 2.3.12

\((\mu , \nu ) \mapsto D_f(\mu , a \mu + b \nu )\) is an \(f\)-divergence for the function \(x \mapsto (ax+b)f\left(\frac{x}{ax+b}\right)\).

Proof

We use Lemma 2.3.10, then Lemma 2.3.11 and finally Lemma 2.3.10 again:

\begin{align*} D_f(\mu , a \mu + b \nu ) & = D_{xf(1/x)}(a \mu + b \nu , \mu ) \\ & = D_{(bx+a)f(1/(bx+a))}(\nu , \mu ) \\ & = D_{(ax+b)f(x/(ax+b))}(\mu , \nu ) \: . \end{align*}
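A numerical spot-check of this identity (illustrative only) for strictly positive probability vectors, so that no singular part appears; the helpers are specific to this sketch.

```python
import numpy as np

def f_div(f, p, q):
    # D_f(p, q) for strictly positive measures on a finite set (no singular part)
    return np.sum(q * f(p / q))

def kl_f(x):
    return x * np.log(x)      # f(x) = x log x

a, b = 0.3, 0.7
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.2, 0.6])

lhs = f_div(kl_f, p, a * p + b * q)                 # D_f(mu, a mu + b nu)
g = lambda x: (a * x + b) * kl_f(x / (a * x + b))   # the function of Lemma 2.3.12
rhs = f_div(g, p, q)                                # D_g(mu, nu)
print(lhs, rhs)                                     # equal up to rounding
```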

Lemma 2.3.13

\((\mu , \nu ) \mapsto D_f(\nu , a \mu + b \nu )\) is an \(f\)-divergence for the function \(x \mapsto (ax+b)f\left(\frac{1}{ax+b}\right)\).

Proof

We use Lemma 2.3.12, then Lemma 2.3.10.

\begin{align*} D_f(\nu , a \mu + b \nu ) & = D_{(bx+a)f\left(\frac{x}{bx+a}\right)}(\nu , \mu ) = D_{(ax+b)f\left(\frac{1}{ax+b}\right)}(\mu , \nu ) \: . \end{align*}

Upper and lower bounds on \(D_f\)

Lemma 2.3.14

Let \(\mu _1, \mu _2\) and \(\nu \) be finite measures on \(\mathcal X\), with \(\mu _1 \ll \nu \) and \(\mu _2 \perp \nu \). Then \(D_f(\mu _1 + \mu _2, \nu ) = D_f(\mu _1, \nu ) + \mu _2(\mathcal X) f'(\infty )\).

Proof

\(\frac{d(\mu _1 + \mu _2)}{d \nu } = \frac{d \mu _1}{d \nu }\) a.e. and \((\mu _1 + \mu _2)_{\perp \nu } = \mu _2\).

Lemma 2.3.15 Superseded by Lemma 2.3.16

Let \(\mu _1, \mu _2, \nu \) be three finite measures on \(\mathcal X\) with \(\mu _1 \ll \nu \) and \(\mu _2 \ll \nu \). Then \(D_f(\mu _1 + \mu _2, \nu ) \le D_f(\mu _1, \nu ) + \mu _2(\mathcal X) f'(\infty )\).

Proof
\begin{align*} D_f(\mu _1 + \mu _2, \nu ) & = \int _x f \left( \frac{d \mu _1}{d\nu }(x) + \frac{d\mu _2}{d\nu }(x) \right) \partial \nu \\ & \le \int _x f \left( \frac{d \mu _1}{d\nu }(x) \right) + \frac{d\mu _2}{d\nu }(x) f’(\infty ) \partial \nu \\ & = D_f(\mu _1, \nu ) + \mu _2(\mathcal X) f’(\infty ) \: . \end{align*}
Lemma 2.3.16

Let \(\mu _1, \mu _2, \nu \) be three finite measures on \(\mathcal X\). Then \(D_f(\mu _1 + \mu _2, \nu ) \le D_f(\mu _1, \nu ) + \mu _2(\mathcal X) f'(\infty )\).

Proof

From Lemma 2.3.14, then Lemma 2.3.15,

\begin{align*} D_f(\mu _1 + \mu _2, \nu ) & = D_f(\frac{d\mu _1}{d \nu }\cdot \nu + \frac{d\mu _2}{d \nu }\cdot \nu , \nu ) + (\mu _1)_{\perp \nu }(\mathcal X) f’(\infty ) + (\mu _2)_{\perp \nu }(\mathcal X) f’(\infty ) \\ & \le D_f(\frac{d\mu _1}{d \nu }\cdot \nu , \nu ) + (\frac{d\mu _2}{d \nu }\cdot \nu )(\mathcal X) f’(\infty ) + (\mu _1)_{\perp \nu }(\mathcal X) f’(\infty ) + (\mu _2)_{\perp \nu }(\mathcal X) f’(\infty ) \\ & = D_f(\mu _1, \nu ) + \mu _2(\mathcal X) f’(\infty ) \: . \end{align*}
Lemma 2.3.17

Let \(\mu \) and \(\nu \) be two finite measures on \(\mathcal X\). Then \(D_f(\mu , \nu ) = D_f(\frac{d\mu }{d\nu }\cdot \nu , \nu ) + f'(\infty ) \mu _{\perp \nu }(\mathcal X)\).

Proof

Apply Lemma 2.3.14 to \(\frac{d\mu }{d\nu }\cdot \nu \) and \(\mu _{\perp \nu }\).

Lemma 2.3.18 Superseded by Lemma 2.3.19
#

Let \(\mu \) be a finite measure and \(\nu \) be a probability measure on the same space \(\mathcal X\), such that \(\mu \ll \nu \). Then \(f(\mu (\mathcal X)) \le D_f(\mu , \nu )\).

Proof

Since \(\mu \ll \nu \) the \(f\)-divergence is only the integral part. Then by Jensen’s inequality,

\begin{align*} D_f(\mu , \nu ) & = \nu \left[ f\left( \frac{d\mu }{d\nu } \right) \right] \ge f\left( \nu \left[\frac{d\mu }{d\nu } \right] \right) = f(\mu (\mathcal X)) \: . \end{align*}
Lemma 2.3.19

Let \(\mu \) be a finite measure and \(\nu \) be a probability measure on the same space \(\mathcal X\). Then \(f(\mu (\mathcal X)) \le D_f(\mu , \nu )\).

Proof

By convexity (using \(f(a + b) \le f(a) + b f'(\infty )\) for \(b \ge 0\)), then Lemma 2.3.18 and finally Lemma 2.3.17,

\begin{align*} f(\mu (\mathcal X)) & \le f(\frac{d\mu }{d\nu }\cdot \nu (\mathcal X)) + f’(\infty )\mu _{\perp }(\mathcal X) \\ & \le D_f(\frac{d\mu }{d\nu }\cdot \nu , \nu ) + f’(\infty )\mu _{\perp }(\mathcal X) \\ & = D_f(\mu , \nu ) \: . \end{align*}
Lemma 2.3.20

Let \(\mu , \nu \) be two probability measures. If \(f(1) = 0\) then \(D_f(\mu , \nu ) \ge 0\).

Proof

Apply Lemma 2.3.19 and use \(f(\mu (\mathcal X)) = f(1) = 0\).

2.3.2 Conditional f-divergence

TODO: replace the definition by \(D(\mu \otimes \kappa , \mu \otimes \eta )\), which allows for a more general proof of the “conditioning increases divergence” inequality.

Definition 18 Conditional f-divergence
#

Let \(f : \mathbb {R} \to \mathbb {R}\), \(\mu \) a measure on \(\mathcal X\) and \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) two Markov kernels from \(\mathcal X\) to \(\mathcal Y\). The conditional f-divergence between \(\kappa \) and \(\eta \) with respect to \(\mu \) is

\begin{align*} D_f(\kappa , \eta \mid \mu ) = \mu \left[x \mapsto D_f(\kappa (x), \eta (x))\right] \end{align*}

if \(x \mapsto D_f(\kappa (x), \eta (x))\) is \(\mu \)-integrable and \(+\infty \) otherwise.

Lemma 2.3.21

Let \(\mu \) be a measure on \(\mathcal X\) and \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) two Markov kernels. If \(f(1) = 0\) then \(D_f(\kappa , \eta \mid \mu ) \ge 0\).

Proof

Apply Lemma 2.3.20.

Lemma 2.3.22

Let \(\mu , \nu \) be measures on \(\mathcal X\), where \(\mu \) is finite, and let \(\xi \) be a finite measure on \(\mathcal Y\). Then \(D_f(x \mapsto \mu , x \mapsto \nu \mid \xi ) = D_f(\mu , \nu ) \xi (\mathcal Y)\).

Proof
\[ D_f(x \mapsto \mu , x \mapsto \nu \mid \xi ) = \xi \left[x \mapsto D_f(\mu , \nu )\right] = D_f(\mu , \nu ) \xi \left[1\right] = D_f(\mu , \nu ) \xi (\mathcal Y) \]
Lemma 2.3.23

Let \(\mu \) be a finite measure on \(\mathcal X\) and \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) be two finite kernels from \(\mathcal X\) to \(\mathcal Y\). Then \(p \mapsto f \left(\frac{d(\mu \otimes \kappa )}{d(\mu \otimes \eta )}(p)\right)\) is \((\mu \otimes \eta )\)-integrable iff

  • \(x \mapsto D_f(\kappa (x), \eta (x))\) is \(\mu \)-integrable and

  • for \(\mu \)-almost all \(x\), \(y \mapsto f \left( \frac{d\kappa (x)}{d\eta (x)}(y) \right)\) is \(\eta (x)\)-integrable.

Proof

Since \(p \mapsto f \left(\frac{d(\mu \otimes \kappa )}{d(\mu \otimes \eta )}(p) \right)\) is measurable, the integrability condition w.r.t. \(\mu \otimes \eta \) is equivalent to

  • \(x \mapsto \int _y \left\Vert f \left( \frac{d\kappa (x)}{d\eta (x)}(y) \right) \right\Vert \partial \eta (x)\) is \(\mu \)-integrable and

  • for \(\mu \)-almost all \(x\), \(y \mapsto f \left( \frac{d\kappa (x)}{d\eta (x)}(y) \right)\) is \(\eta (x)\)-integrable.

It suffices to use convexity to show that integrability of \(x \mapsto \int _y f \left( \frac{d\kappa (x)}{d\eta (x)}(y) \right) \partial \eta (x)\) implies integrability of \(x \mapsto \int _y \left\Vert f \left( \frac{d\kappa (x)}{d\eta (x)}(y) \right) \right\Vert \partial \eta (x)\).

TODO

Lemma 2.3.24

Let \(\mu \) be a finite measure on \(\mathcal X\) and let \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) be two finite kernels, where \(\kappa \) is a Markov kernel. Then \(D_f(\kappa , \eta \mid \mu ) \ne \infty \) if and only if

  • for \(\mu \)-almost all \(x\), \(y \mapsto f \left( \frac{d\kappa (x)}{d\eta (x)}(y) \right)\) is \(\eta (x)\)-integrable,

  • \(x \mapsto \int _y f \left( \frac{d\kappa (x)}{d\eta (x)}(y) \right) \partial \eta (x)\) is \(\mu \)-integrable,

  • either \(f'(\infty ) {\lt} \infty \) or for \(\mu \)-almost all \(x\), \(\kappa (x) \ll \eta (x)\).

Proof

Lemma 2.3.25

Let \(\mu \) be a finite measure on \(\mathcal X\) and let \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) be two finite kernels, such that \(\kappa (x) \ne 0\) for all \(x\). Then \(D_f(\mu \otimes \kappa , \mu \otimes \eta ) \ne \infty \iff D_f(\kappa , \eta \mid \mu ) \ne \infty \).

Proof

Lemma 2.3.26

Let \(\mu \) be a finite measure on \(\mathcal X\) and let \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) be two finite kernels, such that \(\kappa (x) \ne 0\) for all \(x\). Then \(D_f(\mu \otimes \kappa , \mu \otimes \eta ) = D_f(\kappa , \eta \mid \mu )\).

Proof

By Lemma 2.3.25, the conditions under which the two divergences are finite are the same. We then assume integrability properties such that both are finite (given by Lemma 2.3.24).

TODO: the following proof assumes \(\kappa (x) \ll \eta (x)\) for \(\mu \)-almost all \(x\). Describe the general case.

By Lemma B.1.6 and Corollary B.0.2,

\begin{align*} D_f(\mu \otimes \kappa , \mu \otimes \eta ) & = \int _{p} f\left(\frac{d (\mu \otimes \kappa )}{d (\mu \otimes \eta )}(p)\right) \partial (\mu \otimes \eta ) \\ & = \int _{p} f\left(\frac{d \kappa }{d \eta }(p)\right) \partial (\mu \otimes \eta ) \\ & = \int _x \int _y f\left(\frac{d \kappa }{d \eta }(x, y)\right) \partial \eta (x) \partial \mu \\ & = \int _x \int _y f\left(\frac{d \kappa (x)}{d \eta (x)}(y)\right) \partial \eta (x) \partial \mu \\ & = \mu \left[D_f(\kappa (x), \eta (x))\right] = D_f(\kappa , \eta \mid \mu ) \: . \end{align*}
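The equality can be checked numerically on finite spaces, with rows of stochastic matrices playing the role of the kernels; the sketch below is illustrative only and all names in it are specific to this example.

```python
import numpy as np

def f_div(f, p, q):
    # D_f for strictly positive measures on a finite set (no singular part)
    return np.sum(q * f(p / q))

def kl_f(x):
    return x * np.log(x)

rng = np.random.default_rng(0)
n_x, n_y = 3, 4
mu = rng.dirichlet(np.ones(n_x))                   # measure on X
kappa = rng.dirichlet(np.ones(n_y), size=n_x)      # Markov kernel X -> Y, one row per x
eta = rng.dirichlet(np.ones(n_y), size=n_x)        # another Markov kernel X -> Y

# conditional divergence: mu[x -> D_f(kappa(x), eta(x))]
cond = sum(mu[x] * f_div(kl_f, kappa[x], eta[x]) for x in range(n_x))

# divergence of the composition-products, both measures on X x Y
joint = f_div(kl_f, (mu[:, None] * kappa).ravel(), (mu[:, None] * eta).ravel())

print(cond, joint)   # equal up to rounding
```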

TODO: find a way to hide the following dummy lemma

Lemma 2.3.27 Dummy lemma: conditional f-divergence properties

Dummy node to summarize properties of the conditional f-divergence.

Proof

2.3.3 Integral representation of \(f\)-divergences

We prove that any \(f\)-divergence is an integral over priors of statistical informations, in which the measure depends on \(f\). The main references for this part are [LV06] and [Lie12].

Definition 19 Curvature measure
#

Let \(f: \mathbb {R} \to \mathbb {R}\) be a convex function. Then its right derivative \(f'_+(x) \coloneqq \lim _{y \downarrow x}\frac{f(y) - f(x)}{y - x}\) is a Stieltjes function (a monotone right-continuous function) and it defines a measure \(\gamma _f\) on \(\mathbb {R}\) by \(\gamma _f((x,y]) \coloneqq f'_+(y) - f'_+(x)\). [Lie12] calls \(\gamma _f\) the curvature measure of \(f\).

Lemma 2.3.28

For \(a \ge 0\) and \(f: \mathbb {R} \to \mathbb {R}\) a convex function, the curvature measure of \(af\) is \(\gamma _{af} = a \gamma _f\) .

Proof
Lemma 2.3.29

For \(f,g: \mathbb {R} \to \mathbb {R}\) two convex functions, the curvature measure of \(f+g\) is \(\gamma _{f+g} = \gamma _f + \gamma _g\) .

Proof
Theorem 2.3.30

If \(f\) and \(g\) are two Stieltjes functions with associated measures \(\mu _f\) and \(\mu _g\) and \(f\) is continuous on \([a, b]\), then

\begin{align*} \int _x f(x) \partial \mu _g = f(b)g(b) - f(a)g(a) - \int _x g(x) \partial \mu _f \: . \end{align*}
Proof
Lemma 2.3.31
#

For \(f: \mathbb {R} \to \mathbb {R}\) a convex function and \(x,y \in \mathbb {R}\),

\begin{align*} f(y) - f(x) - (y - x)f’_+(x) & = \int _{z \in (x,y]} (y - z) \partial \gamma _f & \text{ if } x \le y \: , \\ f(y) - f(x) - (y - x)f’_+(x) & = \int _{z \in (y,x]} (z - y) \partial \gamma _f & \text{ if } y \le x \: . \end{align*}
Proof

Let \(\Lambda \) be the Lebesgue measure and let \(x {\lt} y\). Since \(f\) has right derivative \(f'_+\) in \((x,y)\) and \(f'_+\) is integrable on that interval,

\begin{align*} f(y) - f(x) = \int _{z \in (x, y]} f’_+(z) \partial \Lambda \: . \end{align*}

We now integrate by parts, using Theorem 2.3.30. \(\Lambda \) is the measure obtained from the Stieltjes function \(z \mapsto z - y\).

\begin{align*} \int _{z \in (x, y]} f’_+(z) \partial \Lambda & = - \int _{z \in (x,y]} (z - y)\partial \gamma _f + f’_+(y)(y - y) - f’_+(x)(x - y) \\ & = \int _{z \in (x,y]} (y - z)\partial \gamma _f + f’_+(x)(y - x) \: . \end{align*}

Putting the two equalities together, we get the result for \(x {\lt} y\). The proof for \(x {\gt} y\) is similar, and for \(x = y\) both sides of both equations are zero.

Definition 20
#

For \(a,b \in (0, +\infty )\) let \(\phi _{a,b} : \mathbb {R} \to \mathbb {R}\) be the function defined by

\begin{align*} \phi _{a,b}(x) & = \max \left\{ 0, a x - b \right\} & \text{ for } a \le b \: , \\ \phi _{a,b}(x) & = \max \left\{ 0, b - a x \right\} & \text{ for } a {\gt} b \: . \end{align*}

Corollary 2.3.32

For \(f: \mathbb {R} \to \mathbb {R}\) a convex function, for all \(x \in \mathbb {R}\),

\begin{align*} f(x) & = f(1) + f’_+(1) (x - 1) + \int _{y} \phi _{1,y}(x) \partial \gamma _f \: . \end{align*}
Proof

By Lemma 2.3.31, for \(x {\gt} 1\),

\begin{align*} f(x) - f(1) - f’_+(1) (x - 1) & = \int _{y \in (1, x]} (x - y) \partial \gamma _f \\ & = \int _{y \in (1, +\infty )} \max \{ 0, x - y\} \partial \gamma _f \\ & = \int _{y \in (1, +\infty )} \phi _{1,y}(x) \partial \gamma _f \: . \end{align*}

That final integral is equal to \(\int _y \phi _{1,y}(x) \partial \gamma _f\) (without the restriction to \((1, +\infty )\)) because \(\phi _{1,y}(x)\) is zero outside of \((1, +\infty )\) for \(x \ge 1\).

The case of \(x {\lt} 1\) is similar.

Lemma 2.3.33

The curvature measure of the function \(\phi _{a,b}\) is \(\gamma _{\phi _{a,b}} = a\delta _{b/a}\) , where \(\delta _x\) is the Dirac measure at \(x\).

Proof

The following theorem is inspired by [LV06, Lie12] but gives a different integral representation.

Theorem 2.3.34

For two finite measures \(\mu , \nu \in \mathcal M(\mathcal X)\),

\begin{align*} D_f(\mu , \nu ) = f(1) \nu (\mathcal X) + f’_+(1)(\mu (\mathcal X) - \nu (\mathcal X)) + \int _y D_{\phi _{1,y}}(\mu , \nu ) \partial \gamma _f \: . \end{align*}

For probability measures, \(D_{\phi _{1,y}}(\mu , \nu ) = E_y(\mu , \nu )\) , but these functions can differ if \(\mu \) and \(\nu \) are finite measures with different masses.

Proof

We prove the result for \(f(1) = 0\) and \(f'_+(1) = 0\) for simplicity of exposition.

From Corollary 2.3.32 we get that \(f(x) = \int _y \phi _{1,y}(x) \partial \gamma _f\). From this and Fubini’s theorem,

\begin{align*} \int _x f\left( \frac{d \mu }{d \nu }(x) \right) \partial \nu = \int _{x} \int _{y} \phi _{1,y} \left(\frac{d \mu }{d \nu }(x) \right) \partial \gamma _f \partial \nu = \int _{y} \int _{x} \phi _{1,y} \left(\frac{d \mu }{d \nu }(x) \right) \partial \nu \partial \gamma _f \: . \end{align*}

For the mutually singular part of the divergence \(D_f\),

\begin{align*} f’(\infty ) = f’_+(\infty ) - f’_+(1) = \gamma _f((1,+\infty )) = \int _{y {\gt} 1} 1 \partial \gamma _f = \int _{y {\gt} 1} \phi _{1,y}’(\infty ) \partial \gamma _f \: . \end{align*}

The last integral is equal to \(\int _{y \in \mathbb {R}} \phi _{1,y}'(\infty ) \partial \gamma _f\) since \(\phi _{1,y}'(\infty ) = 0\) for \(y \le 1\). Finally,

\begin{align*} D_f(\mu , \nu ) & = \int _x f\left( \frac{d \mu }{d \nu }(x) \right) \partial \nu + f'(\infty ) \mu _{\perp \nu }(\mathcal X) \\ & = \int _{y} \left(\int _{x} \phi _{1,y} \left(\frac{d \mu }{d \nu }(x) \right) \partial \nu + \phi _{1,y}'(\infty ) \mu _{\perp \nu }(\mathcal X) \right) \partial \gamma _f \\ & = \int _y D_{\phi _{1,y}}(\mu , \nu ) \partial \gamma _f \: . \end{align*}
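For \(f(x) = x \log x\) and probability measures on a finite set, the curvature measure \(\gamma _f\) has density \(f''(y) = 1/y\) on \((0, +\infty )\), \(f(1) = 0\), and the total masses are equal, so only the integral term remains. The rough numerical sketch below (illustrative only; the discretization of the integral is an assumption of this example) checks the representation.

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])   # probability measures with q > 0, so mu << nu
q = np.array([0.2, 0.2, 0.6])

kl = np.sum(p * np.log(p / q))  # D_f(mu, nu) for f(x) = x log x

def phi_div(y):
    # D_{phi_{1,y}}(mu, nu) on a finite set, following Definition 20
    ratio = p / q
    if y >= 1:
        return np.sum(q * np.maximum(0.0, ratio - y))
    return np.sum(q * np.maximum(0.0, y - ratio))

# integrate D_{phi_{1,y}}(mu, nu) against gamma_f(dy) = dy / y by a trapezoidal rule
ys = np.linspace(1e-4, 10.0, 200_001)
vals = np.array([phi_div(y) / y for y in ys])
integral = np.sum(0.5 * (vals[1:] + vals[:-1]) * np.diff(ys))

print(kl, integral)             # agree up to discretization error
```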

TODO: define \(\Gamma _f\).

TODO: in the lemma below, \(I_\pi (\mu , \nu )\) is the correct value for probability measures, but in general it should be replaced by \(D_{\phi _{\pi , 1 - \pi }}(\mu , \nu )\) . These functions can differ if \(\mu \) and \(\nu \) are finite measures with different masses.

Lemma 2.3.35

\begin{align*} D_f(\mu , \nu ) = f(1) \nu (\mathcal X) + f'_+(1)(\mu (\mathcal X) - \nu (\mathcal X)) + \int _\pi I_\pi (\mu , \nu ) \partial \Gamma _f \: . \end{align*}
Proof

2.3.4 Data-processing inequality

This section contains three proofs of the data-processing inequality for \(f\)-divergences. That inequality states that for \(\mu , \nu \) two measures on \(\mathcal X\) and \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) a Markov kernel, \(D_f(\kappa \circ \mu , \kappa \circ \nu ) \le D_f(\mu , \nu )\).

  1. A first proof uses the convexity of \(f\) through Jensen’s inequality for conditional expectations. We first prove the DPI for measurable functions (or deterministic kernels) then use it to extend to the general case.

  2. A second proof deals with the general case and uses Radon-Nikodym derivatives of kernels. This restricts its applicability to measurable spaces that satisfy some particular countability assumptions.

  3. The last proof uses the representation of \(f\)-divergence as an integral over statistical (risk-based) divergences, then uses the data-processing inequality for those divergences.

Lean Remark 2
#

Currently we don’t have Lean proofs of Jensen’s inequality for conditional expectations (used in proof 1) or of an integration by parts lemma that is used in proof 3. Proof 2 gives the weakest results since it is restricted to spaces that satisfy countability assumptions, but it’s the only one for which our Lean proof is sorry-free.

We first remark that it is sufficient to prove the DPI under an absolute continuity assumption on the measures.

Lemma 2.3.36

Let \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) be a Markov kernel and \(\nu \in \mathcal M(\mathcal X)\) be a finite measure. Suppose that for all finite measures \(\mu \in \mathcal M(\mathcal X)\) with \(\mu \ll \nu \), \(D_f(\kappa \circ \mu , \kappa \circ \nu ) \le D_f(\mu , \nu )\). Then the same is true without the absolute continuity hypothesis.

Proof

We use \(\kappa \circ \mu = \kappa \circ (\frac{d\mu }{d\nu }\cdot \nu ) + \kappa \circ \mu _{\perp \nu }\) . By Lemma 2.3.16,

\begin{align*} D_f(\kappa \circ \mu , \kappa \circ \nu ) \le D_f\left(\kappa \circ (\frac{d\mu }{d\nu }\cdot \nu ), \kappa \circ \nu \right) + (\kappa \circ \mu _{\perp \nu })(\mathcal Y) f’(\infty ) \: . \end{align*}

Since \(\kappa \) is a Markov kernel, \((\kappa \circ \mu _{\perp \nu })(\mathcal Y) = \mu _{\perp \nu }(\mathcal X)\).

By hypothesis, since \(\frac{d\mu }{d\nu }\cdot \nu \ll \nu \), \(D_f\left(\kappa \circ (\frac{d\mu }{d\nu }\cdot \nu ), \kappa \circ \nu \right) \le D_f\left(\frac{d\mu }{d\nu }\cdot \nu , \nu \right)\). We get

\begin{align*} D_f(\kappa \circ \mu , \kappa \circ \nu ) & \le D_f\left(\frac{d\mu }{d\nu }\cdot \nu ,\nu \right) + \mu _{\perp \nu }(\mathcal X) f'(\infty ) \\ & = D_f(\mu , \nu ) \: . \end{align*}

The next Lemma implies that in order to prove the DPI, it is sufficient to prove \(D_f(\kappa \circ \mu , \kappa \circ \nu ) \le D_f(\mu \otimes \kappa , \nu \otimes \kappa )\).

Lemma 2.3.37 Composition-product with a kernel

Let \(\mu , \nu \) be two measures on \(\mathcal X\) and let \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) be a Markov kernel. Then \(D_f(\mu \otimes \kappa , \nu \otimes \kappa ) = D_f(\mu , \nu )\).

Proof

By Lemma B.1.4 and the fact that \((\mu \otimes \kappa )_{\perp \nu \otimes \kappa } = \mu _{\perp \nu } \otimes \kappa \) (Lemma TODO),

\begin{align*} D_f(\mu \otimes \kappa , \nu \otimes \kappa ) & = \int _{p} f \left(\frac{d (\mu \otimes \kappa )}{d (\nu \otimes \kappa )}(p)\right) \partial (\nu \otimes \kappa ) + f'(\infty ) (\mu \otimes \kappa )_{\perp \nu \otimes \kappa }(\mathcal X \times \mathcal Y) \\ & = \int _{p} f \left(\frac{d \mu }{d \nu }(p_X)\right) \partial (\nu \otimes \kappa ) + f'(\infty ) (\mu _{\perp \nu } \otimes \kappa )(\mathcal X \times \mathcal Y) \\ & = \int _x \int _y f \left(\frac{d \mu }{d \nu }(x)\right) \partial \kappa (x) \partial \nu + f'(\infty ) \mu _{\perp \nu }(\mathcal X) \\ & = \int _x f \left(\frac{d \mu }{d \nu }(x)\right) \partial \nu + f'(\infty ) \mu _{\perp \nu }(\mathcal X) \\ & = D_f(\mu , \nu ) \: . \end{align*}
Corollary 2.3.38

Let \(\mu , \nu \) be two measures on \(\mathcal X\) and let \(\xi \) be a measure on \(\mathcal Y\). Then \(D_f(\mu \times \xi , \nu \times \xi ) = D_f(\mu , \nu )\).

Proof

Apply Lemma 2.3.37 with \(\kappa \) the constant kernel with value \(\xi \).

DPI: proof using Jensen’s inequality for the conditional expectation

This proof lifts the DPI on deterministic kernels to a DPI on all kernels. We first prove the DPI for the special cases of a composition with a deterministic kernel, which is the same as taking the map of the measures by a function.

Theorem 2.3.39

Let \(\mu , \nu \in \mathcal M (\mathcal X)\) be two finite measures and let \(g : \mathcal X \to \mathcal Y\) be a measurable function. Then \(D_f(g_* \mu , g_* \nu ) \le D_f(\mu , \nu )\).

Proof

By Lemma 2.3.36, it is sufficient to prove the inequality with the additional hypothesis \(\mu \ll \nu \) (hence also \(g_* \mu \ll g_*\nu \)). In that case,

\begin{align*} D_f(g_* \mu , g_* \nu ) & = \int _y f \left( \frac{d g_* \mu }{d g_* \nu }(y) \right) \partial (g_* \nu ) \\ & = \int _{x} f \left( \frac{d g_* \mu }{d g_* \nu }(g(x)) \right) \partial \nu \: . \end{align*}

By Lemma B.2.1, \(\nu \)-almost everywhere,

\begin{align*} \frac{d g_* \mu }{d g_* \nu }(g(x)) & = \nu \left[\frac{d \mu }{d \nu } \mid g^* \mathcal Y \right](x) \: . \end{align*}

Using that expression, then Jensen’s inequality for the conditional expectation (Theorem C.0.2), we get

\begin{align*} D_f(g_* \mu , g_* \nu ) & = \int _{x} f \left( \nu \left[\frac{d \mu }{d \nu } \mid g^* \mathcal Y \right](x) \right) \partial \nu \\ & \le \int _{x} \nu \left[ f \left( \frac{d \mu }{d \nu } \right) \mid g^* \mathcal Y \right](x) \partial \nu \: . \end{align*}

The integral of a conditional expectation is equal to the integral of the function, hence

\begin{align*} D_f(g_* \mu , g_* \nu ) & \le \int _{x} f \left( \frac{d \mu }{d \nu }(x) \right) \partial \nu \: . \end{align*}

This is equal to \(D_f(\mu , \nu )\).

Theorem 2.3.40 Data-processing

Let \(\mu , \nu \) be two finite measures on \(\mathcal X\) and let \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) be a Markov kernel. Then \(D_f(\kappa \circ \mu , \kappa \circ \nu ) \le D_f(\mu , \nu )\).

Proof

Let \(\pi _Y : \mathcal X \times \mathcal Y \to \mathcal Y\) be the projection \(\pi _Y((x,y)) = y\). Then \(\kappa \circ \mu = \pi _{Y*}(\mu \otimes \kappa )\) and similarly for \(\nu \). We can use the DPI for that function (Theorem 2.3.39), then Lemma 2.3.37,

\begin{align*} D_f(\kappa \circ \mu , \kappa \circ \nu ) & = D_f(\pi _{Y*}(\mu \otimes \kappa ), \pi _{Y*}(\nu \otimes \kappa )) \\ & \le D_f(\mu \otimes \kappa , \nu \otimes \kappa ) \\ & = D_f(\mu , \nu ) \: . \end{align*}
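As a quick numerical illustration of the inequality (not a substitute for the proof; the random instances below are arbitrary choices for this sketch), composing two distributions with the same stochastic matrix can only decrease the divergence.

```python
import numpy as np

def kl(p, q):
    # D_f for f(x) = x log x, strictly positive probability vectors
    return np.sum(p * np.log(p / q))

rng = np.random.default_rng(1)
mu = rng.dirichlet(np.ones(5))
nu = rng.dirichlet(np.ones(5))
kernel = rng.dirichlet(np.ones(4), size=5)   # Markov kernel X -> Y: rows sum to 1

# kappa o mu is the Y-marginal of mu (x) kappa, i.e. the row vector mu times the kernel matrix
print(kl(mu @ kernel, nu @ kernel), "<=", kl(mu, nu))
```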

DPI: proof using derivatives of kernels

This proof does not use Jensen’s inequality for the conditional expectation, but only for the ordinary expectation. It starts by establishing \(D_f(\mu , \nu ) \le D_f(\mu \otimes \kappa , \nu \otimes \eta )\) for Markov kernels \(\kappa \) and \(\eta \), then deduces the DPI. Indeed, if the measurable spaces are standard Borel, we can write \(\mu \otimes \kappa = ((\kappa \circ \mu ) \otimes \kappa ^\dagger _\mu )_\leftrightarrow \) for a Markov kernel \(\kappa ^\dagger _\mu \) (see Definition 48; \((...)_\leftrightarrow \) is notation for composition with the map that swaps the coordinates) and then use the inequality to get

\begin{align*} D(\kappa \circ \mu , \kappa \circ \nu ) & \le D((\kappa \circ \mu ) \otimes \kappa ^\dagger _\mu , (\kappa \circ \nu ) \otimes \kappa ^\dagger _\nu ) \\ & = D(((\kappa \circ \mu ) \otimes \kappa ^\dagger _\mu )_{\leftrightarrow }, ((\kappa \circ \nu ) \otimes \kappa ^\dagger _\nu )_{\leftrightarrow }) \\ & = D(\mu \otimes \kappa , \nu \otimes \kappa ) \\ & = D(\mu , \nu ) \: . \end{align*}

The main drawback of this approach is that it requires the Bayesian inverse to exist, which is not always the case unless, for example, the measurable spaces are standard Borel.

The inequality \(D_f(\mu , \nu ) \le D_f(\mu \otimes \kappa , \nu \otimes \eta )\) is obtained by describing the Radon-Nikodym derivative \(\frac{d(\mu \otimes \kappa )}{d(\nu \otimes \eta )}\) and using convexity properties of the function \(f\).

Theorem 2.3.41

Let \(\mu , \nu \) be two finite measures on \(\mathcal X\) and let \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) be two Markov kernels, with either \(\mathcal X\) countable or \(\mathcal{Y}\) countably generated. Then \(D_f(\mu , \nu ) \le D_f(\mu \otimes \kappa , \nu \otimes \eta )\).

Proof

TODO: the following proof assumes \(\mu \ll \nu \) and \(\kappa (x) \ll \eta (x)\) for \(\nu \)-almost all \(x\). Describe the general case.

Using Lemma B.1.5 and B.0.2,

\begin{align*} D_f(\mu \otimes \kappa , \nu \otimes \eta ) & = \int _x \int _y f \left( \frac{d \mu }{d \nu }(x) \frac{d \kappa (x)}{d \eta (x)}(y) \right) \partial \eta (x) \partial \nu \: . \end{align*}

Since \(f\) is convex, by Jensen’s inequality,

\begin{align*} \int _y f \left( \frac{d \mu }{d \nu }(x) \frac{d \kappa (x)}{d \eta (x)}(y) \right) \partial \eta (x) & \ge f \left( \int _y \frac{d \mu }{d \nu }(x) \frac{d \kappa (x)}{d \eta (x)}(y) \partial \eta (x) \right) \\ & = f \left( \frac{d \mu }{d \nu }(x) \int _y \frac{d \kappa (x)}{d \eta (x)}(y) \partial \eta (x) \right) \: . \end{align*}

Since \(\kappa (x) \ll \eta (x)\) for \(\nu \)-almost all \(x\), \(\int _y \frac{d \kappa (x)}{d \eta (x)}(y) \partial \eta (x) = \int _y 1 \partial \kappa (x) = 1\) a.e. We have obtained

\begin{align*} D_f(\mu \otimes \kappa , \nu \otimes \eta ) \ge \int _x f \left( \frac{d \mu }{d \nu }(x)\right) \partial \nu = D_f(\mu , \nu ) \: . \end{align*}
Theorem 2.3.42 Marginals

Let \(\mu \) and \(\nu \) be two measures on \(\mathcal X \times \mathcal Y\) where \(\mathcal Y\) is standard Borel, and let \(\mu _X, \nu _X\) be their marginals on \(\mathcal X\). Then \(D_f(\mu _X, \nu _X) \le D_f(\mu , \nu )\). Similarly, for \(\mathcal X\) standard Borel and \(\mu _Y, \nu _Y\) the marginals on \(\mathcal Y\), \(D_f(\mu _Y, \nu _Y) \le D_f(\mu , \nu )\).

Proof

We introduce conditional kernels and write \(D_f(\mu , \nu ) = D_f(\mu _X \otimes \mu _{Y|X}, \nu _X \otimes \nu _{Y|X})\). Then apply Theorem 2.3.41. For the second statement, we need Lemma 2.3.9 to swap the roles of the two coordinates.

Lemma 2.3.43

Let \(\mu , \nu \) be two finite measures on a standard Borel space \(\mathcal X\) and let \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) be two finite kernels. Then \(D_f(\kappa \circ \mu , \eta \circ \nu ) \le D_f(\mu \otimes \kappa , \nu \otimes \eta )\).

Proof

By definition, \(\kappa \circ \mu \) is the marginal of \(\mu \otimes \kappa \) (a measure on \(\mathcal X \times \mathcal Y\)) on \(\mathcal Y\), and similarly for the other measure. Hence by Theorem 2.3.42, \(D_f(\kappa \circ \mu , \eta \circ \nu ) \le D_f(\mu \otimes \kappa , \nu \otimes \eta )\).

Theorem 2.3.44 Conditioning increases f-divergence

Let \(\mu \) be a measure on a standard Borel space \(\mathcal X\) and let \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) be two finite kernels, such that \(\kappa (x) \ne 0\) for all \(x\). Then \(D_f(\kappa \circ \mu , \eta \circ \mu ) \le D_f(\mu \otimes \kappa , \mu \otimes \eta ) = D_f(\kappa , \eta \mid \mu )\).

Proof

By Lemma 2.3.43, \(D_f(\kappa \circ \mu , \eta \circ \mu ) \le D_f(\mu \otimes \kappa , \mu \otimes \eta )\). This is equal to \(D_f(\kappa , \eta \mid \mu )\) by Lemma 2.3.26.

Theorem 2.3.45 Data-processing

Let \(\mu , \nu \) be two measures on \(\mathcal X\) and let \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) be a Markov kernel, where both \(\mathcal X\) and \(\mathcal Y\) are standard Borel. Then \(D_f(\kappa \circ \mu , \kappa \circ \nu ) \le D_f(\mu , \nu )\).

Proof

By Lemma 2.3.43, \(D_f(\kappa \circ \mu , \kappa \circ \nu ) \le D_f(\mu \otimes \kappa , \nu \otimes \kappa )\). Then the latter is equal to \(D_f(\mu , \nu )\) by Lemma 2.3.37.

DPI: proof using the integral representation

This proof uses the fact that every \(f\)-divergence is an integral over \(\gamma \ge 0\) of \(f\)-divergences for the functions \(\phi _{1,\gamma }\) of Definition 20. It then suffices to prove the DPI for those particular \(f\)-divergences. These are equal to the \(E_\gamma \) divergence, up to terms that are invariant under composition with a Markov kernel. Finally, we already know that the \(E_\gamma \) divergence satisfies the DPI, which completes the proof.

Lemma 2.3.46

Let \(a,b \in [0, +\infty )\) and let \(\mu , \nu \) be two measures on \(\mathcal X\).

\begin{align*} D_{\phi _{a,b}}(\mu , \nu ) & = \text{sign}(b-a)\frac{1}{2}(a \mu (\mathcal X) - b \nu (\mathcal X)) + \frac{1}{2}\nu \left[ \left\vert a \frac{d\mu }{d\nu } - b \right\vert \right] + \frac{1}{2}a \mu _{\perp \nu }(\mathcal X) \: , \end{align*}

in which \(\text{sign}(b-a)\) is \(1\) if \(b-a {\gt} 0\) and \(-1\) if \(b-a \le 0\).

Proof

If \(a \le b\),

\begin{align*} D_{\phi _{a,b}}(\mu , \nu ) & = \nu \left[ \max \{ 0, a \frac{d\mu }{d\nu } - b\} \right] + a \mu _{\perp \nu }(\mathcal X) \\ & = \frac{1}{2}(a \mu _{\parallel \nu }(\mathcal X) - b \nu (\mathcal X)) + \frac{1}{2}\nu \left[ \left\vert a \frac{d\mu }{d\nu } - b \right\vert \right] + a \mu _{\perp \nu }(\mathcal X) \\ & = \frac{1}{2}(a \mu (\mathcal X) - b \nu (\mathcal X)) + \frac{1}{2}\nu \left[ \left\vert a \frac{d\mu }{d\nu } - b \right\vert \right] + \frac{1}{2}a \mu _{\perp \nu }(\mathcal X) \: . \end{align*}

Similar computations lead to the result for \(b \le a\).

Corollary 2.3.47

Let \(a,b \in [0, +\infty )\) and let \(\mu , \nu \) be two measures on \(\mathcal X\).

\begin{align*} D_{\phi _{a,b}}(\mu , \nu ) = \mathcal I_{(a,b)}(\mu , \nu ) + \frac{1}{2} \left\vert a \mu (\mathcal X) - b \nu (\mathcal X) \right\vert + \text{sign}(b-a)\frac{1}{2}(a \mu (\mathcal X) - b \nu (\mathcal X)) \: . \end{align*}
Proof

Combine Lemma 2.3.46 and Corollary 2.2.10.

Lemma 2.3.48

Let \(a,b \in [0, +\infty )\). Let \(\mu , \nu \) be two finite measures on \(\mathcal X\) and let \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) be a Markov kernel. Then \(D_{\phi _{a,b}}(\kappa \circ \mu , \kappa \circ \nu ) \le D_{\phi _{a,b}}(\mu , \nu )\).

Proof

By Corollary 2.3.47,

\begin{align*} D_{\phi _{a,b}}(\mu , \nu ) = \mathcal I_{(a,b)}(\mu , \nu ) + \frac{1}{2} \left\vert a \mu (\mathcal X) - b \nu (\mathcal X) \right\vert + \text{sign}(b-a)\frac{1}{2}(a \mu (\mathcal X) - b \nu (\mathcal X)) \: . \end{align*}

The statistical information satisfies the data-processing inequality (Theorem 2.2.5) and the other terms are invariant by composition with a Markov kernel, hence \(D_{\phi _{a,b}}\) also satisfies the data-processing inequality.

Theorem 2.3.49 Data-processing

Let \(\mu , \nu \) be two finite measures on \(\mathcal X\) and let \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) be a Markov kernel. Then \(D_f(\kappa \circ \mu , \kappa \circ \nu ) \le D_f(\mu , \nu )\).

Proof

By Theorem 2.3.34,

\begin{align*} D_f(\mu , \nu ) = f(1) \nu (\mathcal X) + f’_+(1) (\mu (\mathcal X) - \nu (\mathcal X)) + \int _y D_{\phi _{1,y}}(\mu , \nu ) \partial \gamma _f \: . \end{align*}

The first two terms are unchanged by composition with a Markov kernel. It thus suffices to prove the data processing inequality for the divergences \(D_{\phi _{1, \gamma }}\). This was done in Lemma 2.3.48.

2.3.5 Sigma-algebras

For \(\mathcal A\) a sub-\(\sigma \)-algebra and \(\mu \) a measure, we write \(\mu _{| \mathcal A}\) for the measure restricted to the \(\sigma \)-algebra. The measure \(\mu _{| \mathcal A}\) is equal to the map of \(\mu \) by the identity, seen as a function to \(\mathcal X\) with the \(\sigma \)-algebra \(\mathcal A\). We can thus get a DPI for sub-\(\sigma \)-algebras by using the DPI for maps.

Theorem 2.3.50

Let \(\mu , \nu \) be two finite measures on \(\mathcal X\) and let \(\mathcal A\) be a sub-\(\sigma \)-algebra of \(\mathcal X\). Then \(D_f(\mu _{| \mathcal A}, \nu _{| \mathcal A}) \le D_f(\mu , \nu )\).

Proof

The measure \(\mu _{| \mathcal A}\) is equal to the map of \(\mu \) by the identity, seen as a function to \(\mathcal X\) with the \(\sigma \)-algebra \(\mathcal A\). That function is measurable since \(\mathcal A\) is a sub-\(\sigma \)-algebra. We can apply Theorem 2.3.39.

Theorem 2.3.51

Let \(\mu , \nu \) be two finite measures on \(\mathcal X\). Then \(\sup _{\mathcal A \text{ finite}} D_f(\mu _{| \mathcal A}, \nu _{| \mathcal A}) = D_f(\mu , \nu )\).

Proof
Lemma 2.3.52

Let \(\mu , \nu \in \mathcal M(\mathcal X)\) be finite measures with \(\mu \ll \nu \) and let \(g : \mathcal X \to \mathcal Y\) be a measurable function. Denote by \(g^* \mathcal Y\) the comap of the \(\sigma \)-algebra on \(\mathcal Y\) by \(g\). Then \(D_f(g_* \mu , g_* \nu ) = D_f(\mu _{| g^* \mathcal Y}, \nu _{| g^* \mathcal Y})\) .

Proof

Using Lemma B.3.2,

\begin{align*} D_f(g_* \mu , g_* \nu ) & = \int _y f \left( \frac{d g_*\mu }{d g_*\nu }(y)\right) \partial (g_*\nu ) \\ & = \int _x f \left( \frac{d g_*\mu }{d g_*\nu }(g(x))\right) \partial \nu \\ & = \int _x f \left(\frac{d \mu _{| g^* \mathcal Y}}{d \nu _{| g^* \mathcal Y}}(x) \right) \partial \nu \\ & = \int _x f \left(\frac{d \mu _{| g^* \mathcal Y}}{d \nu _{| g^* \mathcal Y}}(x) \right) \partial \nu _{| g^* \mathcal Y} \\ & = D_f(\mu _{| g^* \mathcal Y}, \nu _{| g^* \mathcal Y}) \: . \end{align*}

2.3.6 Consequences of the data-processing inequality

Direct applications

Theorem 2.3.53

Let \(\mu , \nu \) be two finite measures on \(\mathcal X\) and let \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) be two Markov kernels. Then \(D_f(\mu , \nu ) \le D_f(\mu \otimes \kappa , \nu \otimes \eta )\).

Proof

Since \(\kappa \) is a Markov kernel, \(\mu \) is the composition of \(\mu \otimes \kappa \) and the deterministic kernel for the function \((x,y) \mapsto x\). The same applies for \(\nu \) and \(\nu \otimes \eta \). The result is then an application of the data-processing inequality, Theorem 2.3.49.

Theorem 2.3.54

Let \(\mu , \nu \) be two finite measures on \(\mathcal X\) and let \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) be two finite kernels. Then \(D_f(\kappa \circ \mu , \eta \circ \nu ) \le D_f(\mu \otimes \kappa , \nu \otimes \eta )\).

Proof

\(\kappa \circ \mu \) is the composition of \(\mu \otimes \kappa \) and the deterministic kernel for the function \((x,y) \mapsto y\). The same applies for \(\eta \circ \nu \) and \(\nu \otimes \eta \). The result is then an application of the data-processing inequality, Theorem 2.3.49.

Theorem 2.3.55 Conditioning increases f-divergence

Let \(\mu \) be a finite measure on \(\mathcal X\) and let \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) be two finite kernels. Then \(D_f(\kappa \circ \mu , \eta \circ \mu ) \le D_f(\mu \otimes \kappa , \mu \otimes \eta ) = D_f(\kappa , \eta \mid \mu )\).

Proof

This is a particular case of Theorem 2.3.54.

Theorem 2.3.56 Marginals

Let \(\mu \) and \(\nu \) be two measures on \(\mathcal X \times \mathcal Y\), and let \(\mu _X, \nu _X\) be their marginals on \(\mathcal X\). Then \(D_f(\mu _X, \nu _X) \le D_f(\mu , \nu )\). Similarly, for \(\mu _Y, \nu _Y\) the marginals on \(\mathcal Y\), \(D_f(\mu _Y, \nu _Y) \le D_f(\mu , \nu )\).

Proof

The measure \(\mu _X\) is the composition of \(\mu \) and the deterministic kernel for the function \((x,y) \mapsto x\), and similarly \(\mu _Y\) is the composition of \(\mu \) and a deterministic kernel. The results are both applications of the data-processing inequality, Theorem 2.3.49.

Lemma 2.3.57

Let \(\mu , \nu \) be two measures on \(\mathcal X\) and let \(\kappa : \mathcal X \rightsquigarrow (\mathcal X \times \mathcal Y)\) be a Markov kernel such that for all \(x\), \((\kappa (x))_X = \delta _x\). Then \(D_f(\kappa \circ \mu , \kappa \circ \nu ) = D_f(\mu , \nu )\).

Proof

\(D_f(\kappa \circ \mu , \kappa \circ \nu ) \le D_f(\mu , \nu )\) by Theorem 2.3.49. For the other inequality, remark that \(\mu = (\kappa \circ \mu )_X\) (and similarly for \(\nu \)). Hence by Theorem 2.3.56 \(D_f(\mu , \nu ) \le D_f(\kappa \circ \mu , \kappa \circ \nu )\).

Data-processing and mean values

In this section \(d_f(p, q)\) denotes the \(f\)-divergence between two Bernoulli distributions with means \(p\) and \(q\).

Corollary 2.3.58

Let \(\mu , \nu \) be two measures on \(\mathcal X\) and let \(E\) be an event. Then \(D_f(\mu , \nu ) \ge d_f(\mu (E), \nu (E))\).

Proof

Use the deterministic kernel \(\kappa : \mathcal X \rightsquigarrow \{ 0, 1\} \) with \(\kappa (x) = \delta _1 \mathbb {I}\{ x \in E\} + \delta _0 \mathbb {I}\{ x \notin E\} \) in Theorem 2.3.49.
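
As a numerical sanity check of this corollary, the following minimal Python sketch verifies the inequality on a three-point space for \(f(x) = x \log x\), so that \(D_f\) is the Kullback-Leibler divergence. The masses, the event and the helper names are illustrative choices, not part of the formal development.

```python
# Check D_f(mu, nu) >= d_f(mu(E), nu(E)) for f(x) = x log x on a finite space.
import math

def kl(p, q):
    # KL divergence between finite discrete distributions given as lists of masses,
    # assuming q puts positive mass everywhere (mutual absolute continuity).
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def d_f(p, q):
    # f-divergence between Bernoulli(p) and Bernoulli(q), here the binary KL.
    return kl([1 - p, p], [1 - q, q])

mu = [0.5, 0.3, 0.2]
nu = [0.2, 0.3, 0.5]
E = {0, 1}                                 # an arbitrary event
mu_E = sum(mu[i] for i in E)
nu_E = sum(nu[i] for i in E)

assert kl(mu, nu) >= d_f(mu_E, nu_E) - 1e-12
print(kl(mu, nu), d_f(mu_E, nu_E))
```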

Lemma 2.3.59

For all \(y \in [0,1]\), \(x \mapsto d_f(x, y)\) is convex and attains a minimum at \(x = y\).

Proof
\begin{align*} d_f(x, y) & = y f(\frac{x}{y}) + (1 - y) f(\frac{1 - x}{1 - y}) \: , \\ (x \mapsto d_f(x, y))'_+(x) & = f'_+(\frac{x}{y}) - f'_+(\frac{1 - x}{1 - y}) \: . \end{align*}

\(f'_+\) is non-decreasing, hence we get that \((x \mapsto d_f(x, y))'_+\) is also non-decreasing and \(x \mapsto d_f(x, y)\) is convex.

For \(x \le y\), \(x/y \le 1 \le (1-x)/(1-y)\), hence \(f'_+(\frac{x}{y}) \le f'_+(\frac{1 - x}{1 - y})\) and \((x \mapsto d_f(x, y))'_+(x) \le 0\). The opposite holds for \(x \ge y\).

Lemma 2.3.60

Let \(\mu , \nu \in \mathcal P([0,1])\). Then

\begin{align*} D_f(\mu , \nu ) \ge d_f(\mu [X], \nu [X]) \: . \end{align*}
Proof

Let \(u\) be the uniform distribution on \([0,1]\). Then \(D_f(\mu , \nu ) = D_f(\mu \times u, \nu \times u)\) (by Lemma 2.3.57, the lemma about a kernel with the identity as first coordinate, applied to the kernel \(x \mapsto \delta _x \times u\)). Let \(g : [0,1] \times [0,1] \to \{ 0,1\} \) be the measurable function with \(g(x, y) = \mathbb {I}\{ y \le x\} \). We also write \(g\) for the corresponding deterministic kernel. By the data-processing inequality (Theorem 2.3.49),

\begin{align*} D_f(\mu \times u, \nu \times u) \ge D_f(g \circ (\mu \times u), g \circ (\nu \times u)) \: . \end{align*}

The last two measures are Bernoulli distributions with means \(\mu [X]\) and \(\nu [X]\).

Corollary 2.3.61

Let \(\mu , \nu \in \mathcal P(\mathcal X)\) and let \(\kappa : \mathcal X \rightsquigarrow [0,1]\). Then

\begin{align*} D_f(\mu , \nu ) \ge d_f((\kappa \circ \mu )[X], (\kappa \circ \nu )[X]) \: . \end{align*}
Proof

First, by the data-processing inequality (Theorem 2.3.49), \(D_f(\mu , \nu ) \ge D_f(\kappa \circ \mu , \kappa \circ \nu )\) . Then use Lemma 2.3.60.

Let \(\pi , \xi \in \mathcal P(\Theta )\) and \(P, Q : \Theta \rightsquigarrow \mathcal X\). Suppose that the loss \(\ell '\) takes values in \([0,1]\). Then

\begin{align*} D_f(\pi \otimes Q, \xi \otimes P) & \ge d_f(\mathcal R_\pi ^Q, \mathcal R_\xi ^P) \: . \end{align*}
Proof

Suppose first that \(\mathcal R_\pi ^Q \ge \mathcal R_\xi ^P\). Let \(\hat{y}_B\) be a Bayes estimator for the estimation task \((P, y, \ell ')\) with prior \(\xi \). If it does not exist, the proof can be adapted by taking an estimator with risk \(\varepsilon \)-close to the Bayes risk for any \(\varepsilon {\gt} 0\). By Corollary 2.3.61,

\begin{align*} D_f(\pi \otimes Q, \xi \otimes P) & \ge d_f(\ell ' \circ (y \parallel \hat{y}_B) \circ (\pi \otimes Q) [X], \ell ' \circ (y \parallel \hat{y}_B) \circ (\xi \otimes P) [X]) \\ & = d_f(\ell ' \circ (y \parallel \hat{y}_B) \circ (\pi \otimes Q) [X], \mathcal R_\xi ^P) \: . \end{align*}

By definition of the Bayes risk as an infimum,

\begin{align*} \ell ' \circ (y \parallel \hat{y}_B) \circ (\pi \otimes Q) [X] \ge \mathcal R_\pi ^Q \ge \mathcal R_\xi ^P \: . \end{align*}

Since \(x \mapsto d_f(x, y)\) is nondecreasing on \([y, 1]\) (by Lemma 2.3.59),

\begin{align*} D_f(\pi \otimes Q, \xi \otimes P) & \ge d_f(\ell ' \circ (y \parallel \hat{y}_B) \circ (\pi \otimes Q) [X], \mathcal R_\xi ^P) \\ & \ge d_f(\mathcal R_\pi ^Q, \mathcal R_\xi ^P) \: . \end{align*}

This concludes the proof for \(\mathcal R_\pi ^Q \ge \mathcal R_\xi ^P\). If that inequality is reversed, we proceed similarly but take the Bayes estimator for \(Q\).

2.3.7 Convexity

Theorem 2.3.63 Joint convexity

The function \((\mu , \nu ) \mapsto D_f(\mu , \nu )\) is convex.

Proof

Let \(\mu _0, \mu _1, \nu _0, \nu _1\) be four measures. Let \(\lambda \in [0,1]\). Let \(\xi \) be the probability measure on \(\{ 0,1\} \) with \(\xi (\{ 1\} ) = \lambda \). Let \(\kappa \) be the kernel \(\{ 0,1\} \rightsquigarrow \mathcal X\) defined by \(\kappa (0) = \mu _0\) and \(\kappa (1) = \mu _1\). Let \(\eta \) be the kernel \(\{ 0,1\} \rightsquigarrow \mathcal X\) defined by \(\eta (0) = \nu _0\) and \(\eta (1) = \nu _1\).

\begin{align*} D_f(\xi \otimes \kappa , \xi \otimes \eta ) & = D_f(\kappa , \eta \mid \xi ) = (1 - \lambda ) D_f(\mu _0, \nu _0) + \lambda D_f(\mu _1, \nu _1) \: , \\ D_f(\kappa \circ \xi , \eta \circ \xi ) & = D_f((1 - \lambda )\mu _0 + \lambda \mu _1, (1 - \lambda )\nu _0 + \lambda \nu _1) \: . \end{align*}

By Theorem 2.3.55, \(D_f(\kappa \circ \xi , \eta \circ \xi ) \le D_f(\xi \otimes \kappa , \xi \otimes \eta )\).

2.3.8 Variational representations

TODO

2.3.9 Comparisons between divergences

We collate here general tools that can be used to compare \(D_f\) and \(D_g\) for two functions \(f\) and \(g\). Specific divergences are discussed in the next sections.

Lemma 2.3.64

If \(f \le g\), then \(D_f(\mu , \nu ) \le D_g(\mu , \nu )\).

Proof

Monotonicity of the integral and of \(f \mapsto f'(\infty )\).

Lemma 2.3.65

If \(f(1) = 0\), \(g(1) = 0\), \(f'(1) = 0\), \(g'(1) = 0\), and both \(f\) and \(g\) have a second derivative, then

\begin{align*} D_f(\mu , \nu ) \le \sup _x \frac{f''(x)}{g''(x)} D_g(\mu , \nu ) \end{align*}
Proof

By Taylor with integral remainder, if \(f'' \le \beta g''\),

\begin{align*} f(x) = \int _1^x f''(t) (x - t) dt \le \beta \int _1^x g''(t) (x - t) dt = \beta g(x) \: . \end{align*}

Then use Lemma 2.3.64.

Theorem 2.3.66 [ SV16 ]

Suppose that \(f(1) = g(1) = 0\) and that \(f'(1) = g'(1) = 0\), and that \(g{\gt}0\) on \((0,1) \cup (1, +\infty )\). Suppose that the space \(\mathcal X\) has at least two disjoint non-empty measurable sets. Then

\begin{align*} \sup _{\mu , \nu \in \mathcal P(\mathcal X)} \frac{D_f(\mu , \nu )}{D_g(\mu , \nu )} = \sup _{t {\gt} 0, t \ne 1} \frac{f(t)}{g(t)} \: . \end{align*}

Remark: if \(\mathcal X\) does not have two disjoint non-empty measurable sets then its measurable sets are only \(\{ \emptyset , \mathcal X\} \) and the ratio on the left is \(0/0\).

Proof

TODO: proof done for measurable singletons?

By Lemma 2.3.64, \(\sup _{\mu , \nu \in \mathcal P(\mathcal X)} \frac{D_f(\mu , \nu )}{D_g(\mu , \nu )} \le \sup _{t {\gt} 0, t \ne 1} \frac{f(t)}{g(t)}\). Let’s prove the other inequality.

Let \(x\) and \(y\) be two points in disjoint non-empty measurable sets. For \(t \in (0,1) \cup (1, +\infty )\) and \(\varepsilon \in (0, \min \{ 1, 1/t\} )\) let \(\mu _{t, \varepsilon } = t \varepsilon \delta _x + (1 - t \varepsilon ) \delta _y\) and \(\nu _\varepsilon = \varepsilon \delta _x + (1 - \varepsilon ) \delta _y\) .

\begin{align*} D_f(\mu _{t, \varepsilon }, \nu _\varepsilon ) & = \varepsilon f(t) + (1 - \varepsilon ) f(\frac{1 - t \varepsilon }{1 - \varepsilon }) \\ & = \varepsilon f(t) + \varepsilon (1 - t) \frac{1 - \varepsilon }{\varepsilon (1 - t)} f(1 + \frac{\varepsilon (1 - t)}{1 - \varepsilon }) \end{align*}

Let \(\alpha = \frac{\varepsilon (1 - t)}{1 - \varepsilon }\). Then since \(f(1) = g(1) = 0\),

\begin{align*} \frac{D_f(\mu _{t, \varepsilon }, \nu _\varepsilon )}{D_g(\mu _{t, \varepsilon }, \nu _\varepsilon )} & = \frac{f(t) + (1 - t)\frac{f(1 + \alpha ) - f(1)}{\alpha }}{g(t) + (1 - t)\frac{g(1 + \alpha ) - g(1)}{\alpha }} \: . \end{align*}

As \(\varepsilon \to 0\), \(\alpha \) also tends to 0 and

\begin{align*} \frac{D_f(\mu _{t, \varepsilon }, \nu _\varepsilon )}{D_g(\mu _{t, \varepsilon }, \nu _\varepsilon )} \to \frac{f(t) + (1 - t)f'(1)}{g(t) + (1 - t)g'(1)} = \frac{f(t)}{g(t)} \: . \end{align*}

Bretagnolle-Huber inequalities

Theorem 2.3.67

Let \(\mu , \nu \in \mathcal P(\mathcal X)\). If \(f(1) = 0\),

\begin{align*} D_f(\mu , \nu ) & \ge f(1 + \operatorname{TV}(\mu , \nu )) + f(1 - \operatorname{TV}(\mu , \nu )) \: , \\ \text{ and } D_f(\mu , \nu ) & \ge (1 + \operatorname{TV}(\mu , \nu ))f\left(\frac{1}{1 + \operatorname{TV}(\mu , \nu )}\right) + (1 - \operatorname{TV}(\mu , \nu ))f\left(\frac{1}{1 - \operatorname{TV}(\mu , \nu )}\right) \: . \end{align*}
Proof

Let \(V: x \mapsto \max \left\{ 0, \frac{d\mu }{d\nu }(x) - 1\right\} \) and \(W: x \mapsto \max \left\{ 0, 1 - \frac{d\mu }{d\nu }(x)\right\} \) . Then one of \(1+V\) and \(1-W\) is equal to \(\frac{d\mu }{d\nu }\) and the other is equal to 1. Since \(f(1) = 0\), we obtain \(f(\frac{d\mu }{d\nu }) = f(1+V) + f(1-W)\) . The \(f\)-divergence is then

\begin{align*} D_f(\mu , \nu ) & = \nu [f(\frac{d\mu }{d\nu })] + f'(\infty )\mu _{\perp \nu }(\mathcal X) \\ & = \nu \left[f(1+V) + f'(\infty )\mu _{\perp \nu }(\mathcal X)\right] + \nu \left[f(1 - W)\right] \: . \end{align*}

By convexity of \(f\), \(f(1+V) + f'(\infty )\mu _{\perp \nu }(\mathcal X) \ge f(1+V + \mu _{\perp \nu }(\mathcal X))\) . Using convexity again, we get

\begin{align*} D_f(\mu , \nu ) & \ge f(1+ \nu [V] + \mu _{\perp \nu }(\mathcal X)) + f(1 - \nu [W]) \: . \end{align*}

By Lemma 2.2.21, \(\nu [V] + \mu _{\perp \nu }(\mathcal X) = \nu [W] = \operatorname{TV}(\mu , \nu )\). We obtain the first inequality.

To get the second inequality, remark that \(D_f(\mu , \nu ) = D_{x\mapsto xf(1/x)}(\nu , \mu )\) (Theorem TODO), apply the first inequality to that function and use the symmetry of \(\operatorname{TV}\).

Corollary 2.3.68 Bretagnolle-Huber inequality

TODO: move this somewhere after the definition of \(KL\).

For \(\mu , \nu \in \mathcal P(\mathcal X)\),

\begin{align*} \operatorname{KL}(\mu , \nu ) & \ge - \log (1 - \operatorname{TV}^2(\mu , \nu )) \: , \\ \text{or equivalently, } \operatorname{TV}(\mu , \nu ) & \le \sqrt{1 - \exp \left(-\operatorname{KL}(\mu , \nu ) \right)} \: . \end{align*}
Proof

Use the second inequality of Theorem 2.3.67, for \(f: x \mapsto x \log x\) .
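
The following short Python sketch checks the Bretagnolle-Huber inequality (in both forms above) on an arbitrary pair of discrete distributions; it is only an illustration, not part of the formal development.

```python
# Check KL(mu, nu) >= -log(1 - TV(mu, nu)^2) on a three-point space.
import math

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def tv(p, q):
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

mu = [0.6, 0.3, 0.1]
nu = [0.2, 0.4, 0.4]

lhs = kl(mu, nu)
rhs = -math.log(1 - tv(mu, nu) ** 2)
assert lhs >= rhs - 1e-12
assert tv(mu, nu) <= math.sqrt(1 - math.exp(-lhs)) + 1e-12
print(lhs, rhs)
```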

TODO: move this somewhere after the definition of \(\operatorname{H}_\alpha \).

Let \(\mu , \nu \in \mathcal P(\mathcal X)\). For \(\alpha {\gt} 0\),

\begin{align*} \operatorname{H}_{1+\alpha }(\mu , \nu ) & \ge \frac{1}{\alpha } \left( (1 - \operatorname{TV}(\mu , \nu ))^{-\alpha } - 2 \right) \: , \\ \text{or equivalently, } 1 - \operatorname{TV}(\mu , \nu ) & \ge \exp \left(-\frac{1}{\alpha } \log \left( 2 + \alpha \operatorname{H}_{1+\alpha }(\mu , \nu ) \right)\right) \: . \end{align*}
Proof

Use the second inequality of Theorem 2.3.67, for \(f: x \mapsto \frac{x^{1+\alpha } - 1}{\alpha }\) to get

\begin{align*} \operatorname{H}_{1+\alpha }(\mu , \nu ) \ge \frac{1}{\alpha } \left((1 + \operatorname{TV}(\mu , \nu ))^{-\alpha } + (1 - \operatorname{TV}(\mu , \nu ))^{-\alpha } - 2 \right) \: . \end{align*}

We then neglect one term: \((1 + \operatorname{TV}(\mu , \nu ))^{-\alpha } \ge 0\).

Joint range

TODO [ HV11 ]

2.3.10 Statistical divergences

On probability measures, the statistical information \(\mathcal I_\xi \) is an \(f\)-divergence for the function \(\phi _{\xi _0, \xi _1}\) of Definition 20.

Proof

This is a reformulation of Theorem 2.2.9.

On probability measures, the total variation distance \(\operatorname{TV}\) is an \(f\)-divergence for the function \(x \mapsto \frac{1}{2}\vert x - 1 \vert \).

Proof

This is a reformulation of Lemma 2.2.22.

2.4 Kullback-Leibler divergence

Definition 21 Kullback-Leibler divergence
#

Let \(\mu , \nu \) be two measures on \(\mathcal X\). The Kullback-Leibler divergence between \(\mu \) and \(\nu \) is

\begin{align*} \operatorname{KL}(\mu , \nu ) = \mu \left[\log \frac{d \mu }{d \nu }\right] \end{align*}

if \(\mu \ll \nu \) and \(x \mapsto \log \frac{d \mu }{d \nu }(x)\) is \(\mu \)-integrable and \(+\infty \) otherwise.

Lemma 2.4.1

\(\operatorname{KL}(\mu , \nu ) = D_f(\mu , \nu )\) for \(f: x \mapsto x \log x\).

Proof

Simple computation.
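
On a finite space with \(\mu \ll \nu \) the computation is a one-line identity: \(\sum _x \nu (x) \frac{\mu (x)}{\nu (x)} \log \frac{\mu (x)}{\nu (x)} = \sum _x \mu (x) \log \frac{\mu (x)}{\nu (x)}\). The Python sketch below (with arbitrary masses, illustrative only) checks that the two expressions agree.

```python
# KL from Definition 21 versus the f-divergence with f(x) = x log x.
import math

mu = [0.5, 0.25, 0.25]
nu = [0.25, 0.25, 0.5]

# Definition 21: integral of log(dmu/dnu) against mu.
kl_def = sum(m * math.log(m / n) for m, n in zip(mu, nu) if m > 0)

# D_f with f(x) = x log x: integral of f(dmu/dnu) against nu.
def f(x):
    return x * math.log(x) if x > 0 else 0.0

fdiv = sum(n * f(m / n) for m, n in zip(mu, nu))

assert abs(kl_def - fdiv) < 1e-12
print(kl_def, fdiv)
```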

Definition 22 Conditional Kullback-Leibler divergence
#

Let \(\mu \) be a measure on \(\mathcal X\) and \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) be two kernels. The conditional Kullback-Leibler divergence between \(\kappa \) and \(\eta \) with respect to \(\mu \) is

\begin{align*} \operatorname{KL}(\kappa , \eta \mid \mu ) = \mu \left[x \mapsto \operatorname{KL}(\kappa (x), \eta (x))\right] \end{align*}

if \(x \mapsto \operatorname{KL}(\kappa (x), \eta (x))\) is \(\mu \)-integrable and \(+\infty \) otherwise.

Lemma 2.4.2

\(\operatorname{KL}(\kappa , \eta \mid \mu ) = D_f(\kappa , \eta \mid \mu )\) for \(f: x \mapsto x \log x\).

Proof

Simple computation.

2.4.1 Properties inherited from f-divergences

Since \(\operatorname{KL}\) is an f-divergence, every inequality for f-divergences can be translated to \(\operatorname{KL}\).

Lemma 2.4.3

\(\operatorname{KL}(\mu , \mu ) = 0\).

Proof

KL is an f-divergence thanks to Lemma 2.4.1, then apply Lemma 2.3.3.

Lemma 2.4.4

Let \(\mu , \nu \) be finite measures on \(\mathcal X\) and let \(\xi \) be a finite measure on \(\mathcal Y\). Then \(\operatorname{KL}(x \mapsto \mu , x \mapsto \nu \mid \xi ) = \operatorname{KL}(\mu , \nu ) \, \xi (\mathcal Y)\).

Proof

KL is an f-divergence thanks to Lemma 2.4.1 and Lemma 2.4.2, then apply Lemma 2.3.22.

Theorem 2.4.5 Marginals

Let \(\mu \) and \(\nu \) be two measures on \(\mathcal X \times \mathcal Y\), and let \(\mu _X, \nu _X\) be their marginals on \(\mathcal X\). Then \(\operatorname{KL}(\mu _X, \nu _X) \le \operatorname{KL}(\mu , \nu )\). Similarly, for \(\mu _Y, \nu _Y\) the marginals on \(\mathcal Y\), \(\operatorname{KL}(\mu _Y, \nu _Y) \le \operatorname{KL}(\mu , \nu )\).

Proof

Apply Theorem 2.3.56.

Theorem 2.4.6 Data-processing

Let \(\mu , \nu \) be two measures on \(\mathcal X\) and let \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) be a Markov kernel. Then \(\operatorname{KL}(\kappa \circ \mu , \kappa \circ \nu ) \le \operatorname{KL}(\mu , \nu )\).

Proof

Apply Theorem 2.3.49.

Lemma 2.4.7

Let \(\mu , \nu \) be two finite measures on \(\mathcal X\). Then \(\operatorname{KL}(\mu , \nu ) \ge \mu (\mathcal X) \log \frac{\mu (\mathcal X)}{\nu (\mathcal X)}\).

Proof
Lemma 2.4.8 Gibbs’ inequality
#

Let \(\mu , \nu \) be two probability measures. Then \(\operatorname{KL}(\mu , \nu ) \ge 0\).

Proof

Apply Lemma 2.4.7 and use \(\mu (\mathcal X) = \nu (\mathcal X)\).

Lemma 2.4.9 Converse Gibbs’ inequality

Let \(\mu , \nu \) be two probability measures. Then \(\operatorname{KL}(\mu , \nu ) = 0\) if and only if \(\mu = \nu \).

Proof

KL is an f-divergence thanks to Lemma 2.4.1, then apply Lemma 2.3.4.

\((\mu , \nu ) \mapsto \operatorname{KL}(\mu , \nu )\) is convex.

Proof

Apply Theorem 2.3.63.

Lemma 2.4.11

Let \(\mu , \nu \) be two finite measures on \(\mathcal X\) and \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) be two Markov kernels such that \(\mu \otimes \kappa \ll \nu \otimes \eta \). Then \(p \mapsto \log \frac{d \mu \otimes \kappa }{d \nu \otimes \eta }(p)\) is \(\mu \otimes \kappa \)-integrable if and only if the following hold:

  1. \(x \mapsto \log \frac{d \mu }{d \nu }(x)\) is \(\mu \)-integrable

  2. \(x \mapsto \int _y \log \frac{d \kappa (x)}{d \eta (x)}(y) \partial \kappa (x)\) is \(\mu \)-integrable

  3. for \(\mu \)-almost all \(x\), \(y \mapsto \log \frac{d \kappa (x)}{d \eta (x)}(y)\) is \(\kappa (x)\)-integrable

Proof

Note that from the hypothesis it easily follows that \(\mu \ll \nu \) and \(\mu \)-a.e. \(\kappa (x) \ll \eta (x)\).

Applying Lemma 2.3.23, we know that \(p \mapsto \log \frac{d \mu \otimes \kappa }{d \nu \otimes \eta }(p)\) is \(\mu \otimes \kappa \)-integrable iff the following hold:

  1. for \(\nu \)-almost all \(x\), \(y \mapsto \frac{d \mu }{d \nu }(x) \frac{d \kappa (x)}{d \eta (x)}(y) \log \left(\frac{d \mu }{d \nu }(x) \frac{d \kappa (x)}{d \eta (x)}(y) \right) \) is \(\eta (x)\)-integrable

  2. \(x \mapsto \int _y \frac{d \mu }{d \nu }(x) \frac{d \kappa (x)}{d \eta (x)}(y) \log \left(\frac{d \mu }{d \nu }(x) \frac{d \kappa (x)}{d \eta (x)}(y) \right) \partial \eta (x)\) is \(\nu \)-integrable

(i.\(\Leftarrow \)) 3. implies that \(\nu \)-a.e. \(y \mapsto \frac{d \mu }{d \nu }(x) \log \frac{d \kappa (x)}{d \eta (x)}(y)\) is \(\kappa (x)\)-integrable, hence \(y \mapsto \frac{d \mu }{d \nu }(x) \frac{d \kappa (x)}{d \eta (x)}(y) \log \frac{d \kappa (x)}{d \eta (x)}(y)\) is \(\eta (x)\)-integrable \(\nu \)-a.e. Moreover \(y \mapsto \frac{d \kappa (x)}{d \eta (x)}(y)\) is \(\eta (x)\)-integrable, then, since \(\frac{d \mu }{d \nu }(x) \log \frac{d \mu }{d \nu }(x)\) is constant in \(y\), \(y \mapsto \frac{d \mu }{d \nu }(x) \frac{d \kappa (x)}{d \eta (x)}(y) \log \frac{d \mu }{d \nu }(x)\) is \(\eta (x)\)-integrable. This allows us to conclude that \(y \mapsto \frac{d \mu }{d \nu }(x) \frac{d \kappa (x)}{d \eta (x)}(y) \log \frac{d \kappa (x)}{d \eta (x)}(y) + \frac{d \mu }{d \nu }(x) \frac{d \kappa (x)}{d \eta (x)}(y) \log \frac{d \mu }{d \nu }(x) = \frac{d \mu }{d \nu }(x) \frac{d \kappa (x)}{d \eta (x)}(y) \log \left(\frac{d \mu }{d \nu }(x) \frac{d \kappa (x)}{d \eta (x)}(y) \right)\) is \(\eta (x)\)-integrable \(\nu \)-a.e.

(ii.\(\Leftarrow \)) 1. implies that \(x \mapsto \frac{d \mu }{d \nu }(x) \log \frac{d \mu }{d \nu }(x)\) is \(\nu \)-integrable. 2. implies that \(x \mapsto \frac{d \mu }{d \nu }(x) \int _y \frac{d \kappa (x)}{d \eta (x)}(y) \log \frac{d \kappa (x)}{d \eta (x)}(y) \partial \eta (x)\) is \(\nu \)-integrable. Therefore

\[ x \mapsto \frac{d \mu }{d \nu }(x) \log \frac{d \mu }{d \nu }(x) + \frac{d \mu }{d \nu }(x) \int _y \frac{d \kappa (x)}{d \eta (x)}(y) \log \frac{d \kappa (x)}{d \eta (x)}(y) \partial \eta (x) = \int _y \frac{d \mu }{d \nu }(x) \frac{d \kappa (x)}{d \eta (x)}(y) \log \left(\frac{d \mu }{d \nu }(x) \frac{d \kappa (x)}{d \eta (x)}(y) \right) \partial \eta (x) \]

is \(\nu \)-integrable.

(\(\Rightarrow \)1.) This follows from the hypothesis by applying a more general lemma for \(f\)-divergences, see ProbabilityTheory.integrable_f_rnDeriv_of_integrable_compProd' in the Lean code.

(\(\Rightarrow \)3.) i. implies that \(\mu \)-a.e. \(y \mapsto \frac{d \kappa (x)}{d \eta (x)}(y) \log \left(\frac{d \mu }{d \nu }(x) \frac{d \kappa (x)}{d \eta (x)}(y) \right)\) is \(\eta (x)\)-integrable, hence \(y \mapsto \log \left(\frac{d \mu }{d \nu }(x) \frac{d \kappa (x)}{d \eta (x)}(y) \right) = \log \frac{d \mu }{d \nu }(x) + \log \frac{d \kappa (x)}{d \eta (x)}(y)\) is \(\kappa (x)\)-integrable. Then, since \(\log \frac{d \mu }{d \nu }(x)\) is constant in \(y\) and \(\kappa (x)\) is a finite measure, \(y \mapsto \log \frac{d \kappa (x)}{d \eta (x)}(y)\) is \(\kappa (x)\)-integrable.

(\(\Rightarrow \)2.) ii. implies that \(x \mapsto \int _y \frac{d \kappa (x)}{d \eta (x)}(y) \log \left(\frac{d \mu }{d \nu }(x) \frac{d \kappa (x)}{d \eta (x)}(y) \right) \partial \eta (x) = \int _y \log \left(\frac{d \mu }{d \nu }(x) \frac{d \kappa (x)}{d \eta (x)}(y) \right) \partial \kappa (x)\) is \(\mu \)-integrable. Thanks to 3. we can split the integral:

\[ \int _y \log \left(\frac{d \mu }{d \nu }(x) \frac{d \kappa (x)}{d \eta (x)}(y) \right) \partial \kappa (x) = \int _y \log \frac{d \mu }{d \nu }(x) \partial \kappa (x) + \int _y \log \frac{d \kappa (x)}{d \eta (x)}(y) \partial \kappa (x) = \log \frac{d \mu }{d \nu }(x) + \int _y \log \frac{d \kappa (x)}{d \eta (x)}(y) \partial \kappa (x) \]

Then using 1. we can conclude that \(x \mapsto \int _y \log \frac{d \kappa (x)}{d \eta (x)}(y) \partial \kappa (x)\) is \(\mu \)-integrable.

2.4.2 Chain rule and tensorization

Theorem 2.4.12 Chain rule, kernel version

Let \(\mu , \nu \) be two finite measures on \(\mathcal X\) and \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) two Markov kernels. Then \(\operatorname{KL}(\mu \otimes \kappa , \nu \otimes \eta ) = \operatorname{KL}(\mu , \nu ) + \operatorname{KL}(\kappa , \eta \mid \mu )\).

Proof

Handle the case where \(\mu \otimes \kappa \) is not absolutely continuous with respect to \(\nu \otimes \eta \) first, then the case where \(p \mapsto \log \frac{d \mu \otimes \kappa }{d \nu \otimes \eta }(p)\) is not \(\mu \otimes \kappa \)-integrable using Lemma 2.4.11.

For the case where \(\mu \otimes \kappa \ll \nu \otimes \eta \) and \(p \mapsto \log \frac{d \mu \otimes \kappa }{d \nu \otimes \eta }(p)\) is \(\mu \otimes \kappa \)-integrable use Lemma B.1.5 and Corollary B.0.2 in a computation:

\begin{align*} \operatorname{KL}(\mu \otimes \kappa , \nu \otimes \eta ) & = \int _p \log \frac{d(\mu \otimes \kappa )}{d(\nu \otimes \eta )}(p) \partial (\mu \otimes \kappa ) \\ & = \int _x \int _y\log \left(\frac{d\mu }{d \nu }(x)\frac{d\kappa }{d \eta }(x,y)\right) \partial \kappa (x) \partial \mu \\ & = \int _x \int _y\log \left(\frac{d\mu }{d \nu }(x)\right) + \log \left(\frac{d\kappa }{d \eta }(x,y)\right) \partial \kappa (x) \partial \mu \\ & = \int _x \log \left(\frac{d\mu }{d \nu }(x)\right)\partial \mu + \int _x \int _y\log \left(\frac{d\kappa }{d \eta }(x,y)\right) \partial \kappa (x) \partial \mu \\ & = \int _x \log \left(\frac{d\mu }{d \nu }(x)\right)\partial \mu + \int _x \int _y\log \left(\frac{d\kappa (x)}{d \eta (x)}(y)\right) \partial \kappa (x) \partial \mu \\ & = \operatorname{KL}(\mu , \nu ) + \operatorname{KL}(\kappa , \eta \mid \mu ) \: . \end{align*}
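
The computation above can be reproduced numerically on a finite space. The sketch below is a minimal check of the chain rule with arbitrary probability tables on \(\{ 0,1\} \); it is illustrative only.

```python
# Check KL(mu ⊗ kappa, nu ⊗ eta) = KL(mu, nu) + KL(kappa, eta | mu) on {0,1} x {0,1}.
import math

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

mu, nu = [0.3, 0.7], [0.6, 0.4]
kappa = [[0.2, 0.8], [0.5, 0.5]]   # kappa[x][y]: a Markov kernel {0,1} ~> {0,1}
eta   = [[0.4, 0.6], [0.9, 0.1]]

# mu ⊗ kappa and nu ⊗ eta as distributions on the product space.
joint_mu = [mu[x] * kappa[x][y] for x in range(2) for y in range(2)]
joint_nu = [nu[x] * eta[x][y] for x in range(2) for y in range(2)]

# KL(kappa, eta | mu) as in Definition 22.
cond = sum(mu[x] * kl(kappa[x], eta[x]) for x in range(2))

assert abs(kl(joint_mu, joint_nu) - (kl(mu, nu) + cond)) < 1e-12
print(kl(joint_mu, joint_nu), kl(mu, nu) + cond)
```
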
Theorem 2.4.13 Chain rule, with Bayesian inverse

Let \(\mu , \nu \) be two finite measures on \(\mathcal X\) and \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) two Markov kernels with Bayesian inverses \(\kappa _\mu ^\dagger \) and \(\eta _\nu ^\dagger \). Then

\begin{align*} \operatorname{KL}(\mu \otimes \kappa , \nu \otimes \eta ) = \operatorname{KL}(\kappa \circ \mu , \eta \circ \nu ) + \operatorname{KL}(\kappa _\mu ^\dagger , \eta _\nu ^\dagger \mid \kappa \circ \mu ) \: . \end{align*}
Proof
Theorem 2.4.14 Chain rule, product version

Let \(\mu , \nu \) be two finite measures on \(\mathcal X \times \mathcal Y\), where \(\mathcal Y\) is standard Borel. Then \(\operatorname{KL}(\mu , \nu ) = \operatorname{KL}(\mu _X, \nu _X) + \operatorname{KL}(\mu _{Y|X}, \nu _{Y|X} \mid \mu _X)\).

Proof

Write \(\mu = \mu _X \otimes \mu _{Y|X}\) and \(\nu = \nu _X \otimes \nu _{Y|X}\), then use Theorem 2.4.12.

Lemma 2.4.15

Let \(\mu _1, \nu _1\) be finite measures on \(\mathcal X\) and \(\mu _2, \nu _2\) probability measures on \(\mathcal{Y}\). Then

\[ \operatorname{KL}(\mu _1\times \mu _2, \nu _1 \times \nu _2) = \operatorname{KL}(\mu _1, \nu _1) + \operatorname{KL}(\mu _2, \nu _2) \mu _1 (\mathcal X) \: . \]
Proof

Write \(\mu _1 \times \mu _2\) and \(\nu _1 \times \nu _2\) as composition products of a measure and a constant kernel, then apply the chain rule (Theorem 2.4.12) and Lemma 2.4.4.

Theorem 2.4.16 Tensorization

For \(\mu _1\) a probability measure on \(\mathcal X\), \(\nu _1\) a finite measure on \(\mathcal{X}\) and \(\mu _2, \nu _2\) two probability measures on \(\mathcal Y\),

\begin{align*} \operatorname{KL}(\mu _1 \times \mu _2, \nu _1 \times \nu _2) = \operatorname{KL}(\mu _1, \nu _1) + \operatorname{KL}(\mu _2, \nu _2) \: . \end{align*}
Proof

This is a particular case of Lemma 2.4.15.

Theorem 2.4.17 Tensorization - finite product
#

Let \(I\) be a finite index set. Let \((\mu _i)_{i \in I}, (\nu _i)_{i \in I}\) be probability measures on spaces \((\mathcal X_i)_{i \in I}\). Then

\begin{align*} \operatorname{KL}(\prod _{i \in I} \mu _i, \prod _{i \in I} \nu _i) = \sum _{i \in I} \operatorname{KL}(\mu _i, \nu _i) \: . \end{align*}
Proof

Induction on the size of \(I\), using Theorem 2.4.16.
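
A quick numerical illustration of the tensorization identity, for two Bernoulli factors with arbitrary parameters (illustrative only):

```python
# Check KL(mu1 x mu2, nu1 x nu2) = KL(mu1, nu1) + KL(mu2, nu2).
import math

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def bern(p):
    return [1 - p, p]

mu1, nu1 = bern(0.3), bern(0.5)
mu2, nu2 = bern(0.8), bern(0.6)

# Product measures on {0,1} x {0,1}.
prod_mu = [a * b for a in mu1 for b in mu2]
prod_nu = [a * b for a in nu1 for b in nu2]

assert abs(kl(prod_mu, prod_nu) - (kl(mu1, nu1) + kl(mu2, nu2))) < 1e-12
```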

Let \(\mu , \nu \) be two measures and \(E\) an event. Let \(\mu _{|E}\) be the measure defined by \(\mu _{|E}(A) = \frac{\mu (A \cap E)}{\mu (E)}\) and define \(\nu _{|E}\), \(\mu _{| E^c}\) and \(\nu _{| E^c}\) similarly. Let \(\operatorname{kl}(p,q) = p\log \frac{p}{q} + (1-p)\log \frac{1-p}{1-q}\) be the Kullback-Leibler divergence between Bernoulli distributions with means \(p\) and \(q\). Then

\begin{align*} \operatorname{KL}(\mu , \nu ) \ge \operatorname{kl}(\mu (E), \nu (E)) + \mu (E) \operatorname{KL}(\mu _{|E}, \nu _{|E}) + \mu (E^c) \operatorname{KL}(\mu _{|E^c}, \nu _{|E^c}) \: . \end{align*}
Proof

Let \(\mu _E\) be the Bernoulli distribution with \(\mu _E(\{ 1\} ) = \mu (E)\), and define \(\nu _E\) similarly. Let \(\kappa : \{ 0,1\} \rightsquigarrow \mathcal X\) be the kernel defined by \(\kappa (1) = \mu _{|E}\) and \(\kappa (0) = \mu _{|E^c}\). Define a kernel \(\eta \) similarly for \(\nu \). Then \(\mu = \kappa \circ \mu _E\) and \(\nu = \eta \circ \nu _E\).

\(\mu _E \otimes \kappa \) is equal to the composition of \(\mu \) and the deterministic kernel given by the function \(f : x \mapsto (\mathbb {I}_E(x), x)\). By Lemma 2.3.57, that gives \(\operatorname{KL}(\mu _E \otimes \kappa , \nu _E \otimes \eta ) = \operatorname{KL}(\mu , \nu )\).

Using the chain rule (Theorem 2.4.12),

\begin{align*} \operatorname{KL}(\mu _E \otimes \kappa , \nu _E \otimes \eta ) & = \operatorname{KL}(\mu _E, \nu _E) + \operatorname{KL}(\kappa , \eta \mid \mu _E) \\ & = \operatorname{KL}(\mu _E, \nu _E) + \mu (E) \operatorname{KL}(\mu _{|E}, \nu _{|E}) + \mu (E^c) \operatorname{KL}(\mu _{|E^c}, \nu _{|E^c}) \: . \end{align*}
Lemma 2.4.19

Let \(\mu , \nu \) be two measures and \(E\) an event. Then \(\mu (E)\log \frac{\mu (E)}{\nu (E)} \le \mu \left[\mathbb {I}(E)\log \frac{d \mu }{d \nu }\right]\) .

Proof

Apply Lemma 2.4.7 to the restrictions of \(\mu \) and \(\nu \) to \(E\) (without normalization). \(\mu \left[\mathbb {I}(E)\log \frac{d \mu }{d \nu }\right]\) is the Kullback-Leibler divergence between those two restricted measures.

TODO: find a way to hide the following dummy lemma

Proof

2.5 Hellinger alpha-divergences

Definition 23 Hellinger \(\alpha \)-divergence

Let \(\mu , \nu \) be two measures on \(\mathcal X\). The Hellinger divergence of order \(\alpha \in [0,+\infty )\) between \(\mu \) and \(\nu \) is

\begin{align*} \operatorname{H}_\alpha (\mu , \nu ) = \left\{ \begin{array}{ll} \nu \{ x \mid \frac{d\mu }{d\nu } (x) = 0\} & \text{for } \alpha = 0 \\ \operatorname{KL}(\mu , \nu ) & \text{for } \alpha = 1 \\ D_{f_\alpha }(\mu , \nu ) & \text{for } \alpha \in (0,+\infty ) \backslash \{ 1\} \end{array}\right. \end{align*}

with \(f_\alpha : x \mapsto \frac{x^{\alpha } - 1}{\alpha - 1}\).

Lemma 2.5.1

For \(\alpha \in [0, 1)\) and finite measures \(\mu , \nu \), \(\operatorname{H}_\alpha (\mu , \nu ) {\lt} \infty \).

Proof
Lemma 2.5.2

For \(\alpha \in (0,1)\cup (1, \infty )\), \(\mu \) a finite measure and \(\nu \) a probability measure, if \(\operatorname{H}_\alpha (\mu , \nu ) {\lt} \infty \) then

\begin{align*} \operatorname{H}_\alpha (\mu , \nu ) = \frac{1}{\alpha - 1} \left( \int _x \left(\frac{d \mu }{d \nu }(x)\right)^\alpha \partial \nu - 1 \right) \: . \end{align*}
Proof
Lemma 2.5.3
#

For \(\alpha \in (0,1)\cup (1, \infty )\), \(\mu , \nu \) two sigma-finite measures,

\begin{align*} \int _x \left(\frac{d \mu }{d \nu }(x)\right)^\alpha \partial \nu = \int _x \left(\frac{d \nu }{d \mu }(x)\right)^{1 - \alpha } \partial \mu \: . \end{align*}
Proof
\begin{align*} \int _x \left(\frac{d \mu }{d \nu }(x)\right)^\alpha \partial \nu & = \int _x \left(\frac{d \mu }{d (\mu + \nu )}(x) \left(\frac{d \nu }{d (\mu + \nu )}(x)\right)^{-1} \right)^\alpha \partial \nu \\ & = \int _x \left(\frac{d \mu }{d (\mu + \nu )}(x) \left(\frac{d \nu }{d (\mu + \nu )}(x)\right)^{-1} \right)^\alpha \frac{d \nu }{d (\mu + \nu )}(x) \partial (\mu + \nu ) \\ & = \int _x \left(\frac{d \mu }{d (\mu + \nu )}(x) \left(\frac{d \nu }{d (\mu + \nu )}(x)\right)^{-1} \right)^{\alpha - 1} \frac{d \mu }{d (\mu + \nu )}(x) \partial (\mu + \nu ) \\ & = \int _x \left(\frac{d \mu }{d (\mu + \nu )}(x) \left(\frac{d \nu }{d (\mu + \nu )}(x)\right)^{-1} \right)^{\alpha - 1} \partial \mu \\ & = \int _x \left(\frac{d \nu }{d \mu }(x)\right)^{1 - \alpha } \partial \mu \: . \end{align*}
Lemma 2.5.4

For \(\alpha \in (0, 1)\) and finite measures \(\mu , \nu \) with \(\mu (\mathcal X) = \nu (\mathcal X)\), \((1 - \alpha ) \operatorname{H}_\alpha (\mu , \nu ) = \alpha \operatorname{H}_{1 - \alpha }(\nu , \mu )\).

Proof

Use Lemma 2.5.3.
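
A discrete numerical check of this skew-symmetry, using the expression of Lemma 2.5.2 for probability measures (the masses and the order \(\alpha \) below are arbitrary):

```python
# Check (1 - a) H_a(mu, nu) = a H_{1-a}(nu, mu) for discrete probability measures.
import math

def hellinger_alpha(a, p, q):
    # Formula of Lemma 2.5.2 for discrete probability measures with p << q.
    return (sum(qi * (pi / qi) ** a for pi, qi in zip(p, q)) - 1) / (a - 1)

mu = [0.5, 0.3, 0.2]
nu = [0.2, 0.5, 0.3]
a = 0.3

lhs = (1 - a) * hellinger_alpha(a, mu, nu)
rhs = a * hellinger_alpha(1 - a, nu, mu)
assert abs(lhs - rhs) < 1e-12
print(lhs, rhs)
```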

Definition 24 Conditional Hellinger \(\alpha \)-divergence

Let \(\mu \) be a measure on \(\mathcal X\) and \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) be two kernels. The conditional Hellinger divergence of order \(\alpha \in (0,+\infty ) \backslash \{ 1\} \) between \(\kappa \) and \(\eta \) conditionally to \(\mu \) is

\begin{align*} \operatorname{H}_\alpha (\kappa , \eta \mid \mu ) = D_{f_\alpha }(\kappa , \eta \mid \mu ) \: , \end{align*}

for \(f_\alpha : x \mapsto \frac{x^{\alpha } - 1}{\alpha - 1}\).

2.5.1 Properties inherited from f-divergences

Since \(\operatorname{H}_\alpha \) is an f-divergence, every inequality for f-divergences can be translated to \(\operatorname{H}_\alpha \).

Theorem 2.5.5 Data-processing

Let \(\alpha {\gt} 0\), \(\mu , \nu \) be two finite measures on \(\mathcal X\) and let \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) be a Markov kernel. Then \(\operatorname{H}_\alpha (\kappa \circ \mu , \kappa \circ \nu ) \le \operatorname{H}_\alpha (\mu , \nu )\).

Proof

Apply Theorem 2.3.49.

Lemma 2.5.6

Let \(\mu , \nu \) be two probability measures. Then \(\operatorname{H}_\alpha (\mu , \nu ) \ge 0\).

Proof

Apply Lemma 2.3.20.

Lemma 2.5.7

\((\mu , \nu ) \mapsto \operatorname{H}_\alpha (\mu , \nu )\) is convex.

Proof

Apply Theorem 2.3.63.

TODO: find a way to hide the following dummy lemma

Lemma 2.5.8 Dummy lemma: hellingerAlpha properties

Dummy node to summarize properties of the Hellinger \(\alpha \)-divergence.

Proof

2.6 Rényi divergences

Definition 25 Rényi divergence

Let \(\mu , \nu \) be two measures on \(\mathcal X\). The Rényi divergence of order \(\alpha \in \mathbb {R}\) between \(\mu \) and \(\nu \) is

\begin{align*} R_\alpha (\mu , \nu ) = \left\{ \begin{array}{ll} \operatorname{KL}(\mu , \nu ) & \text{for } \alpha = 1 \\ \frac{1}{\alpha - 1} \log (\nu (\mathcal X) + (\alpha - 1) \operatorname{H}_\alpha (\mu , \nu )) & \text{for } \alpha \neq 1 \end{array}\right. \end{align*}
Lemma 2.6.1

For \(\mu \) a sigma-finite measure and \(\nu \) a finite measure

\[ R_0(\mu , \nu ) = - \log (\nu \{ x \mid \frac{d \mu }{d \nu }(x) {\gt} 0\} ) \: . \]
Proof

For \(\alpha \in [0, 1)\) and finite measures \(\mu , \nu \),

\[ R_\alpha (\mu , \nu ) = \infty \iff \mu \perp \nu \: . \]
Proof
Lemma 2.6.3

For \(\alpha \in (0,1)\cup (1, \infty )\) and finite measures \(\mu , \nu \), if \(\left(\frac{d \mu }{d \nu }\right)^\alpha \) is integrable with respect to \(\nu \) and \(\mu \ll \nu \) then

\begin{align*} R_\alpha (\mu , \nu ) = \frac{1}{\alpha - 1} \log \int _x \left(\frac{d \mu }{d \nu }(x)\right)^\alpha \partial \nu \: . \end{align*}
Proof

Lemma 2.6.4

For \(\alpha \in (0,1)\cup (1, \infty )\) and finite measures \(\mu , \nu \), if \(\left(\frac{d \mu }{d \nu }\right)^\alpha \) is integrable with respect to \(\nu \) and \(\mu \ll \nu \) then

\begin{align*} R_\alpha (\mu , \nu ) = \frac{1}{\alpha - 1} \log \int _x \left(\frac{d \mu }{d \nu }(x)\right)^{\alpha - 1} \partial \mu \: . \end{align*}
Proof
Lemma 2.6.5

For \(\alpha \in (0, 1)\) and finite measures \(\mu , \nu \) with \(\mu (\mathcal X) = \nu (\mathcal X)\),

\[ (1 - \alpha ) R_\alpha (\mu , \nu ) = \alpha R_{1 - \alpha }(\nu , \mu ) \: . \]
Proof

Use Lemma 2.5.4.

Lemma 2.6.6

The cumulant generating function of \(\log \frac{d\mu }{d\nu }\) under \(\nu \) is \(\alpha \mapsto (\alpha - 1) R_\alpha (\mu , \nu )\).

Proof

Unfold the definitions, using Lemma 2.6.3 for the Rényi divergence.
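
A discrete illustration of this identity, using the formula of Lemma 2.6.3 for \(R_\alpha \) (all numbers below are arbitrary):

```python
# Check that the CGF of log(dmu/dnu) under nu at a equals (a - 1) R_a(mu, nu).
import math

mu = [0.5, 0.3, 0.2]
nu = [0.25, 0.25, 0.5]
a = 1.7

# R_a(mu, nu) from Lemma 2.6.3.
renyi = math.log(sum(ni * (mi / ni) ** a for mi, ni in zip(mu, nu))) / (a - 1)

# Cumulant generating function of log(dmu/dnu) under nu, evaluated at a.
cgf_at_a = math.log(sum(ni * math.exp(a * math.log(mi / ni)) for mi, ni in zip(mu, nu)))

assert abs(cgf_at_a - (a - 1) * renyi) < 1e-12
print(renyi, cgf_at_a)
```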

Lemma 2.6.7
#

Set \(\alpha {\gt} 0\). If \(R_{1+\alpha }(\mu , \nu ) {\lt} \infty \), the cumulant generating function of \(\log \frac{d\mu }{d\nu }\) under \(\mu \) has value \(\alpha R_{1+\alpha }(\mu , \nu )\) at \(\alpha \).

Proof

Unfold the definitions, using Lemma 2.6.4 for the Rényi divergence.

Definition 26 Conditional Rényi divergence
#

Let \(\mu \) be a measure on \(\mathcal X\) and \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) be two kernels. The conditional Rényi divergence of order \(\alpha \in (0,+\infty ) \backslash \{ 1\} \) between \(\kappa \) and \(\eta \) conditionally to \(\mu \) is

\begin{align*} R_\alpha (\kappa , \eta \mid \mu ) =\frac{1}{\alpha - 1} \log (1 + (\alpha - 1) \operatorname{H}_\alpha (\kappa , \eta \mid \mu )) \: . \end{align*}

TODO: the reason for this definition is that it is equal to \(R_\alpha (\mu \otimes \kappa , \mu \otimes \eta )\), which is the generic conditional divergence definition, and also that this is the definition for which we have a chain rule.

2.6.1 Properties inherited from f-divergences

Since Rényi divergences are monotone transformations of f-divergences, every inequality for f-divergences can be translated to Rényi divergences.

Theorem 2.6.8 Data-processing

Let \(\mu , \nu \) be two finite measures on \(\mathcal X\) and let \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) be a Markov kernel. Then \(R_\alpha (\kappa \circ \mu , \kappa \circ \nu ) \le R_\alpha (\mu , \nu )\).

Proof

The function \(x \mapsto \frac{1}{\alpha - 1}\log (1 + (\alpha - 1)x)\) is non-decreasing and \(D_f\) satisfies the DPI (Theorem 2.3.49), hence we get the DPI for \(R_\alpha \).

Let \(\mu , \nu \) be two measures on \(\mathcal X\) and let \(E\) be an event. Let \(\mu _E\) and \(\nu _E\) be the two Bernoulli distributions with respective means \(\mu (E)\) and \(\nu (E)\). Then \(R_\alpha (\mu , \nu ) \ge R_\alpha (\mu _E, \nu _E)\).

Proof

By Corollary 2.3.58, \(D_f(\mu , \nu ) \ge D_f(\mu _E, \nu _E)\), hence \(R_\alpha (\mu , \nu ) \ge R_\alpha (\mu _E, \nu _E)\).

2.6.2 Comparisons between orders

Lemma 2.6.10

Let \(\mu , \nu \) be two finite measures. \(R_0(\mu , \nu ) = \lim _{\alpha \downarrow 0} R_\alpha (\mu , \nu )\).

Proof
Lemma 2.6.11

Let \(\mu , \nu \) be two finite measures. \(R_1(\mu , \nu ) = \lim _{\alpha \uparrow 1} R_\alpha (\mu , \nu )\).

Proof
Lemma 2.6.12

Let \(\mu , \nu \) be two finite measures such that there exists \(\alpha {\gt} 1\) with \(R_\alpha (\mu , \nu )\) finite. Then \(R_1(\mu , \nu ) = \lim _{\alpha \downarrow 1} R_\alpha (\mu , \nu )\).

Proof
Lemma 2.6.13

Let \(\mu , \nu \) be two finite measures. Then \(\alpha \mapsto R_\alpha (\mu , \nu )\) is nondecreasing on \([0, + \infty )\).

Proof
Lemma 2.6.14

Let \(\mu , \nu \) be two finite measures. Then \(\alpha \mapsto R_\alpha (\mu , \nu )\) is continuous on \([0, 1]\) and on \([0, \sup \{ \alpha \mid R_\alpha (\mu , \nu ) {\lt} \infty \} )\).

Proof

2.6.3 The tilted measure

Definition 27
#

Let \(\mu , \nu \) be two measures on \(\mathcal X\) and let \(\alpha \in (0, +\infty ) \backslash \{ 1\} \). Let \(p = \frac{d \mu }{d (\mu + \nu )}\) and \(q = \frac{d \nu }{d (\mu + \nu )}\). We define a measure \(\mu ^{(\alpha , \nu )}\), absolutely continuous with respect to \(\mu + \nu \) with density

\begin{align*} \frac{d \mu ^{(\alpha , \nu )}}{d (\mu + \nu )} = p^\alpha q^{1 - \alpha } e^{-(\alpha - 1)R_\alpha (\mu , \nu )} \: . \end{align*}
Lemma 2.6.15

For \(\alpha \in (0,1)\), \(\mu ^{(\alpha , \nu )} \ll \mu \) and \(\mu ^{(\alpha , \nu )} \ll \nu \).

Proof
Lemma 2.6.16

\(\frac{d \mu ^{(\alpha , \nu )}}{d \nu } = \left(\frac{d\mu }{d\nu }\right)^\alpha e^{-(\alpha - 1) R_\alpha (\mu , \nu )}\) , \(\nu \)-a.e., and \(\frac{d \mu ^{(\alpha , \nu )}}{d \mu } = \left(\frac{d\nu }{d\mu }\right)^{1 - \alpha } e^{-(\alpha - 1) R_\alpha (\mu , \nu )}\) , \(\mu \)-a.e..

Proof

Lemma 2.6.17

Let \(\mu , \nu , \xi \) be three measures on \(\mathcal X\) and let \(\alpha \in (0, 1)\). Then

\begin{align*} \operatorname{KL}(\xi , \mu ^{(\alpha , \nu )}) = \alpha \operatorname{KL}(\xi , \mu ) + (1 - \alpha )\operatorname{KL}(\xi , \nu ) - (1 - \alpha ) R_\alpha (\mu , \nu ) \: . \end{align*}
Proof

Unfold definitions and compute?

Corollary 2.6.18

Let \(\mu , \nu \) be two measures on \(\mathcal X\) and let \(\alpha \in (0, 1)\). Then

\begin{align*} (1 - \alpha ) R_\alpha (\mu , \nu ) = \alpha \operatorname{KL}(\mu ^{(\alpha , \nu )}, \mu ) + (1 - \alpha )\operatorname{KL}(\mu ^{(\alpha , \nu )}, \nu ) \: . \end{align*}
Proof

Apply Lemma 2.6.17 to \(\xi = \mu ^{(\alpha , \nu )}\).

Lemma 2.6.19

Let \(\mu , \nu \) be two probability measures on \(\mathcal X\) and let \(\alpha \in (0, 1)\). Then

\begin{align*} (1 - \alpha ) R_\alpha (\mu , \nu ) = \inf _{\xi \in \mathcal P(\mathcal X)}\left( \alpha \operatorname{KL}(\xi , \mu ) + (1 - \alpha )\operatorname{KL}(\xi , \nu ) \right) \: . \end{align*}

The infimum is attained at \(\xi = \mu ^{(\alpha , \nu )}\).

Proof

By Lemma 2.6.17 and Lemma 2.4.8, for all \(\xi \),

\begin{align*} \alpha \operatorname{KL}(\xi , \mu ) + (1 - \alpha )\operatorname{KL}(\xi , \nu ) - (1 - \alpha ) R_\alpha (\mu , \nu ) = \operatorname{KL}(\xi , \mu ^{(\alpha , \nu )}) \ge 0 \: . \end{align*}

We have then \((1 - \alpha ) R_\alpha (\mu , \nu ) \le \inf _{\xi \in \mathcal P(\mathcal X)}\left( \alpha \operatorname{KL}(\xi , \mu ) + (1 - \alpha )\operatorname{KL}(\xi , \nu ) \right)\). Corollary 2.6.18 proves the other inequality.
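
On a finite space, the tilted measure is the normalized geometric mixture \(x \mapsto \mu (x)^\alpha \nu (x)^{1 - \alpha }\), and the identity can be checked directly. The Python sketch below (arbitrary masses and order, illustrative only) verifies that the tilted measure attains the infimum and that a competitor does no better.

```python
# Check (1 - a) R_a(mu, nu) = a KL(xi, mu) + (1 - a) KL(xi, nu) at xi = mu^(a, nu).
import math

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

mu = [0.5, 0.3, 0.2]
nu = [0.2, 0.5, 0.3]
a = 0.4

Z = sum(m ** a * n ** (1 - a) for m, n in zip(mu, nu))
renyi = math.log(Z) / (a - 1)                                  # R_a(mu, nu), Lemma 2.6.3
# Tilted measure of Definition 27: for probability measures, 1/Z = exp(-(a-1) R_a).
tilted = [m ** a * n ** (1 - a) / Z for m, n in zip(mu, nu)]

def objective(xi):
    return a * kl(xi, mu) + (1 - a) * kl(xi, nu)

assert abs((1 - a) * renyi - objective(tilted)) < 1e-12
assert objective([1 / 3, 1 / 3, 1 / 3]) >= objective(tilted) - 1e-12
```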

Lemma 2.6.20

Let \(\mu , \nu \in \mathcal P(\mathcal X)\) and let \(\alpha \in (0, 1)\). Let \(\pi _\alpha = (\alpha , 1 - \alpha ) \in \mathcal P(\{ 0,1\} )\) and let \(P : \{ 0,1\} \rightsquigarrow \mathcal X\) be the kernel with \(P(0) = \mu \) and \(P(1) = \nu \). Then

\begin{align*} (1 - \alpha ) R_\alpha (\mu , \nu ) = \inf _{\xi \in \mathcal P(\mathcal X)} \operatorname{KL}\left( \pi _\alpha \times \xi , \pi _\alpha \otimes P \right) \: . \end{align*}

The infimum is attained at \(\xi = \mu ^{(\alpha , \nu )}\).

Compare with Lemma 2.9.5 on the Jensen-Shannon divergence.

Proof
\begin{align*} \operatorname{KL}\left( \pi _\alpha \times \xi , \pi _\alpha \otimes P \right) & = \alpha \operatorname{KL}(\xi , \mu ) + (1 - \alpha ) \operatorname{KL}(\xi , \nu ) \: . \end{align*}

The result is a rephrasing of Lemma 2.6.19.

2.6.4 Chain rule and tensorization

Theorem 2.6.21 Chain rule

Let \(\mu , \nu \) be two measures on \(\mathcal X\) and let \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) be two Markov kernels. Then \(R_\alpha (\mu \otimes \kappa , \nu \otimes \eta ) = R_\alpha (\mu , \nu ) + R_\alpha (\kappa , \eta \mid \mu ^{(\alpha , \nu )})\).

Proof
Theorem 2.6.22 Chain rule, with Bayesian inverse

Let \(\mu , \nu \) be two finite measures on \(\mathcal X\) and \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) two Markov kernels with Bayesian inverses \(\kappa _\mu ^\dagger \) and \(\eta _\nu ^\dagger \). Then

\begin{align*} R_\alpha (\mu \otimes \kappa , \nu \otimes \eta ) = R_\alpha (\kappa \circ \mu , \eta \circ \nu ) + R_\alpha (\kappa _\mu ^\dagger , \eta _\nu ^\dagger \mid (\kappa \circ \mu )^{(\alpha , \eta \circ \nu )}) \: . \end{align*}
Proof
Corollary 2.6.23

Let \(\mu , \nu \in \mathcal P(\mathcal X)\) and \(\xi , \lambda \in \mathcal P(\mathcal Y)\). Then \(R_\alpha (\mu \times \xi , \nu \times \lambda ) = R_\alpha (\mu , \nu ) + R_\alpha (\xi , \lambda )\).

Proof

Apply Theorem 2.6.21 (the product of measures is a special case of \(\otimes \)). TODO: more details.

Theorem 2.6.24 Tensorization - finite product

Let \(I\) be a finite index set. Let \((\mu _i)_{i \in I}, (\nu _i)_{i \in I}\) be probability measures on measurable spaces \((\mathcal X_i)_{i \in I}\). Then \(R_\alpha (\prod _{i \in I} \mu _i, \prod _{i \in I} \nu _i) = \sum _{i \in I} R_\alpha (\mu _i, \nu _i)\).

Proof

Induction over the finite index set, using Corollary 2.6.23.

Corollary 2.6.25

Let \(\mu , \nu \) be two probability measures on \(\mathcal X\). Let \(n \in \mathbb {N}\) and write \(\mu ^{\otimes n}\) for the product measure on \(\mathcal X^n\) of \(n\) times \(\mu \). Then \(R_\alpha (\mu ^{\otimes n}, \nu ^{\otimes n}) = n R_\alpha (\mu , \nu )\).

Proof

Apply Theorem 2.6.24.
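
A brute-force numerical check of this corollary for products of two Bernoulli distributions (parameters, order and \(n\) arbitrary; the product measures are enumerated explicitly):

```python
# Check R_a(mu^{⊗n}, nu^{⊗n}) = n R_a(mu, nu) for Bernoulli products.
import math
from itertools import product

def renyi(a, p, q):
    # Lemma 2.6.3 for discrete distributions with p << q.
    return math.log(sum(qi * (pi / qi) ** a for pi, qi in zip(p, q))) / (a - 1)

p, q, a, n = 0.3, 0.6, 2.5, 4
mu, nu = [1 - p, p], [1 - q, q]

# n-fold product measures on {0,1}^n, enumerated explicitly.
mu_n = [math.prod(mu[b] for b in bits) for bits in product(range(2), repeat=n)]
nu_n = [math.prod(nu[b] for b in bits) for bits in product(range(2), repeat=n)]

assert abs(renyi(a, mu_n, nu_n) - n * renyi(a, mu, nu)) < 1e-10
```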

Theorem 2.6.26 Tensorization - countable product

Let \(I\) be a countable index set. Let \((\mu _i)_{i \in I}, (\nu _i)_{i \in I}\) be probability measures on measurable spaces \((\mathcal X_i)_{i \in I}\). Then \(R_\alpha (\prod _{i \in I} \mu _i, \prod _{i \in I} \nu _i) = \sum _{i \in I} R_\alpha (\mu _i, \nu _i)\).

Proof

TODO: find a way to hide the following dummy lemma

Proof

2.7 Hellinger distance

TODO: refactor this section to link it to the Hellinger \(\alpha \)-divergence.

Definition 28 Squared Hellinger distance
#

Let \(\mu , \nu \) be two measures. The squared Hellinger distance between \(\mu \) and \(\nu \) is

\begin{align*} \operatorname{H^2}(\mu , \nu ) = D_f(\mu , \nu ) \quad \text{with } f: x \mapsto \frac{1}{2}\left( 1 - \sqrt{x} \right)^2 \: . \end{align*}
Lemma 2.7.1 Hellinger and Rényi

Let \(\mu , \nu \) be two probability measures. Then \(R_{1/2}(\mu , \nu ) = -2\log (1 - \operatorname{H^2}(\mu , \nu ))\).

Proof

\(R_{1/2}(\mu , \nu ) = -2 \log (1 - \frac{1}{2} D_f(\nu , \mu ))\) for \(f : x \mapsto -2 (\sqrt{x} - 1)\). Using Lemma 2.3.6, \(R_{1/2}(\mu , \nu ) = -2 \log (1 - D_g(\nu , \mu ))\) for \(g(x) = 1 - \sqrt{x}\).

It suffices then to show that \(\operatorname{H^2}(\mu , \nu ) = D_g(\mu , \nu )\), which is true by an application of Lemma 2.3.8.

Let \(\mu , \nu \) be two probability measures. Then \(2 \operatorname{H^2}(\mu , \nu ) \le R_{1/2}(\mu , \nu )\).

Proof

Use Lemma 2.7.1, then \(-\log (1 - x) \ge x\).

Lemma 2.7.3

Let \(\mu , \nu \) be two probability measures. Then \(\operatorname{H^2}(\mu , \nu ) \le \operatorname{TV}(\mu , \nu )\).

Proof
Lemma 2.7.4

Let \(\mu , \nu \) be two probability measures. Then \(\operatorname{TV}(\mu , \nu ) \le \sqrt{\operatorname{H^2}(\mu , \nu )(2 - \operatorname{H^2}(\mu , \nu ))}\).

Proof

Let \(\mu , \nu \) be two probability measures. Then

\begin{align*} \frac{1}{2}(1 - \operatorname{H^2}(\mu , \nu ))^2 \le 1 - \operatorname{TV}(\mu , \nu ) - \frac{1}{2}(1 - \operatorname{TV}(\mu , \nu ))^2 \: . \end{align*}
Proof

We use Lemma 2.7.4.

\begin{align*} 1 - \operatorname{TV}(\mu , \nu ) - \frac{1}{2}(1 - \operatorname{H^2}(\mu , \nu ))^2 & = \frac{1}{2} - \operatorname{TV}(\mu , \nu ) + \frac{1}{2}(\operatorname{H^2}(\mu , \nu )(2 - \operatorname{H^2}(\mu , \nu ))) \\ & \ge \frac{1}{2} - \operatorname{TV}(\mu , \nu ) + \frac{1}{2}\operatorname{TV}^2(\mu , \nu ) \\ & = \frac{1}{2}(1 - \operatorname{TV}(\mu , \nu ))^2 \: . \end{align*}

Corollary 2.7.6

Let \(\mu , \nu \) be two probability measures. Then

\begin{align*} 1 - \sqrt{1 - (1 - \operatorname{H^2}(\mu , \nu ))^2} \le 1 - \operatorname{TV}(\mu , \nu ) \le 1 - \operatorname{H^2}(\mu , \nu ) \: . \end{align*}
Proof

Use Lemma 2.7.4 and 2.7.3.

Let \(\mu , \nu \) be two probability measures on \(\mathcal X\) and let \(n \in \mathbb {N}\), and \(\mu ^{\otimes n}, \nu ^{\otimes n}\) be their product measures on \(\mathcal X^n\). Then

\begin{align*} \frac{1}{2}e^{-n R_{1/2}(\mu , \nu )} \le 1 - \sqrt{1 - e^{-n R_{1/2}(\mu , \nu )}} \le 1 - \operatorname{TV}(\mu ^{\otimes n}, \nu ^{\otimes n}) \le e^{-\frac{1}{2} n R_{1/2}(\mu , \nu )} \: . \end{align*}
Proof

Use Corollary 2.7.6 and Lemma 2.7.1 to reduce the statement to a comparison of \(\operatorname{TV}(\mu ^{\otimes n}, \nu ^{\otimes n})\) and \(R_{1/2}(\mu ^{\otimes n}, \nu ^{\otimes n})\). Finally \(R_{1/2}(\mu ^{\otimes n}, \nu ^{\otimes n}) = n R_{1/2}(\mu , \nu )\) by Corollary 2.6.25.
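
The bounds can be checked numerically for products of Bernoulli distributions, enumerating \(\{ 0,1\} ^n\) explicitly to compute the total variation distance (parameters arbitrary, illustrative only):

```python
# Check 0.5 exp(-n R) <= 1 - TV(mu^{⊗n}, nu^{⊗n}) <= exp(-n R / 2) with R = R_{1/2}(mu, nu).
import math
from itertools import product

def tv(p, q):
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def renyi_half(p, q):
    # R_{1/2}(mu, nu) = -2 log sum_x sqrt(mu(x) nu(x))  (Lemma 2.6.3 with alpha = 1/2).
    return -2 * math.log(sum(math.sqrt(pi * qi) for pi, qi in zip(p, q)))

p, q, n = 0.4, 0.6, 5
mu, nu = [1 - p, p], [1 - q, q]
mu_n = [math.prod(mu[b] for b in bits) for bits in product(range(2), repeat=n)]
nu_n = [math.prod(nu[b] for b in bits) for bits in product(range(2), repeat=n)]

R = renyi_half(mu, nu)
lower = 0.5 * math.exp(-n * R)
upper = math.exp(-0.5 * n * R)
assert lower - 1e-12 <= 1 - tv(mu_n, nu_n) <= upper + 1e-12
print(lower, 1 - tv(mu_n, nu_n), upper)
```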

2.8 Chernoff divergence

Definition 29 Chernoff divergence
#

The Chernoff divergence of order \(\alpha {\gt} 0\) between two measures \(\mu \) and \(\nu \) on \(\mathcal X\) is

\begin{align*} C_\alpha (\mu , \nu ) = \inf _{\xi \in \mathcal P(\mathcal X)}\max \{ R_\alpha (\xi , \mu ), R_\alpha (\xi , \nu )\} \: . \end{align*}

Note: the name Chernoff divergence is usually reserved for \(C_1\), and any reference to \(C\) without subscript in this document refers to \(C_1\). The extension to other orders \(\alpha \) is motivated by the appearance of \(C_\alpha \) in a result further down. This is not a well-established notion and it remains to be seen whether this extension is meaningful.

Lemma 2.8.1

\(C_1(\mu , \nu ) = \inf _{\xi \in \mathcal P(\mathcal X)}\max \{ \operatorname{KL}(\xi , \mu ), \operatorname{KL}(\xi , \nu )\} \) .

Proof

This is \(R_1 = \operatorname{KL}\) (by definition).

Lemma 2.8.2 Symmetry

\(C_\alpha (\mu , \nu ) = C_\alpha (\nu , \mu )\).

Proof

Immediate from the definition.

Lemma 2.8.3 Monotonicity

The function \(\alpha \mapsto C_\alpha (\mu , \nu )\) is nondecreasing.

Proof

Consequence of Lemma 2.6.13.

Lemma 2.8.4
\begin{align*} C_1(\mu , \nu ) = \max _{\alpha \in [0,1]} (1 - \alpha )R_\alpha (\mu , \nu ) \: . \end{align*}
Proof

2.9 Jensen-Shannon divergence

Definition 30 Jensen-Shannon divergence

The Jensen-Shannon divergence indexed by \(\alpha \in (0,1)\) between two measures \(\mu \) and \(\nu \) is

\begin{align*} \operatorname{JS}_\alpha (\mu , \nu ) = \alpha \operatorname{KL}(\mu , \alpha \mu + (1 - \alpha )\nu ) + (1 - \alpha ) \operatorname{KL}(\nu , \alpha \mu + (1 - \alpha )\nu ) \: . \end{align*}
Lemma 2.9.1

Let \(\mu , \nu \in \mathcal P(\mathcal X)\) and let \(\alpha \in (0, 1)\). Let \(\pi _\alpha = (\alpha , 1 - \alpha ) \in \mathcal P(\{ 0,1\} )\) and let \(P : \{ 0,1\} \rightsquigarrow \mathcal X\) be the kernel with \(P(0) = \mu \) and \(P(1) = \nu \). Then

\begin{align*} \operatorname{JS}_\alpha (\mu , \nu ) = \operatorname{KL}(\pi _\alpha \otimes P, \pi _\alpha \times (P \circ \pi _\alpha )) \: . \end{align*}
Proof

\(\operatorname{JS}_\alpha \) is an \(f\)-divergence for \(f(x) = \alpha x \log (x) - (\alpha x + 1 - \alpha ) \log (\alpha x + 1 - \alpha )\) .

Proof

\((\mu , \nu ) \mapsto \operatorname{KL}(\mu , \alpha \mu + (1 - \alpha ) \nu )\) is an \(f\)-divergence for the function \(x \mapsto x\log \left(\frac{x}{\alpha x + 1 - \alpha }\right)\) (Lemma 2.3.12).

\((\mu , \nu ) \mapsto \operatorname{KL}(\nu , \alpha \mu + (1 - \alpha ) \nu )\) is an \(f\)-divergence for the function \(x \mapsto \log \left(\frac{1}{\alpha x + 1 - \alpha }\right)\) (Lemma 2.3.13).

By Lemmas 2.3.6 and 2.3.7, we obtain the result.
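
A finite-space check that the two expressions define the same quantity (arbitrary masses and weight \(\alpha \), illustrative only):

```python
# Check Definition 30 against the f-divergence with
# f(x) = a x log x - (a x + 1 - a) log(a x + 1 - a).
import math

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

mu = [0.7, 0.2, 0.1]
nu = [0.2, 0.3, 0.5]
a = 0.35

mix = [a * m + (1 - a) * n for m, n in zip(mu, nu)]
js_def = a * kl(mu, mix) + (1 - a) * kl(nu, mix)       # Definition 30

def f(x):
    # Valid for x > 0; all likelihood ratios below are positive.
    return a * x * math.log(x) - (a * x + 1 - a) * math.log(a * x + 1 - a)

js_fdiv = sum(n * f(m / n) for m, n in zip(mu, nu))

assert abs(js_def - js_fdiv) < 1e-12
```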

Lemma 2.9.3

For \(\alpha \in (0,1)\) and \(\mu , \nu \in \mathcal M(\mathcal X)\),

\begin{align*} \operatorname{JS}_\alpha (\mu , \nu ) = \operatorname{JS}_{1 - \alpha }(\nu , \mu ) \: . \end{align*}
Proof

Immediate from the definition.

Lemma 2.9.4

Let \(\mu , \nu \in \mathcal P(\mathcal X)\) and let \(\alpha \in (0, 1)\). Then

\begin{align*} \operatorname{JS}_\alpha (\mu , \nu ) = \inf _{\xi \in \mathcal P(\mathcal X)}\left( \alpha \operatorname{KL}(\mu , \xi ) + (1 - \alpha )\operatorname{KL}(\nu , \xi ) \right) \: . \end{align*}

The infimum is attained at \(\xi = \alpha \mu + (1 - \alpha ) \nu \).

Proof

Lemma 2.9.5

Let \(\mu , \nu \in \mathcal P(\mathcal X)\) and let \(\alpha \in (0, 1)\). Let \(\pi _\alpha = (\alpha , 1 - \alpha ) \in \mathcal P(\{ 0,1\} )\) and let \(P : \{ 0,1\} \rightsquigarrow \mathcal X\) be the kernel with \(P(0) = \mu \) and \(P(1) = \nu \). Then

\begin{align*} \operatorname{JS}_\alpha (\mu , \nu ) = \inf _{\xi \in \mathcal P(\mathcal X)} \operatorname{KL}\left( \pi _\alpha \otimes P, \pi _\alpha \times \xi \right) \: . \end{align*}

The infimum is attained at \(\xi = \alpha \mu + (1 - \alpha ) \nu \).

Compare with Lemma 2.6.20 on the Rényi divergence.

Proof

Let \(\mu , \nu \) be two probability measures on \(\mathcal X\). Let \(n \in \mathbb {N}\) and write \(\mu ^{\otimes n}\) for the product measure on \(\mathcal X^n\) of \(n\) times \(\mu \). Then \(\operatorname{JS}_\alpha (\mu ^{\otimes n}, \nu ^{\otimes n}) \le n \operatorname{JS}_\alpha (\mu , \nu )\).

Proof

By Lemma 2.9.4,

\begin{align*} \operatorname{JS}_\alpha (\mu ^{\otimes n}, \nu ^{\otimes n}) & = \inf _{\xi \in \mathcal P(\mathcal X^n)}\left( \alpha \operatorname{KL}(\mu ^{\otimes n}, \xi ) + (1 - \alpha )\operatorname{KL}(\nu ^{\otimes n}, \xi ) \right) \\ & \le \inf _{\zeta \in \mathcal P(\mathcal X)}\left( \alpha \operatorname{KL}(\mu ^{\otimes n}, \zeta ^{\otimes n}) + (1 - \alpha )\operatorname{KL}(\nu ^{\otimes n}, \zeta ^{\otimes n}) \right) \\ & = n \inf _{\zeta \in \mathcal P(\mathcal X)}\left( \alpha \operatorname{KL}(\mu , \zeta ) + (1 - \alpha )\operatorname{KL}(\nu , \zeta ) \right) \\ & = n \operatorname{JS}_\alpha (\mu , \nu ) \: . \end{align*}

2.9.1 Comparisons with other divergences

Lemma 2.9.7

For \(\mu , \nu \in \mathcal P(\mathcal X)\) and \(\alpha , \lambda \in (0,1)\) ,

\begin{align*} \operatorname{JS}_\alpha (\mu , \nu ) \le \frac{(1 - \alpha )^{1 - \lambda } \alpha ^\lambda }{\lambda ^{1 - \lambda } (1 - \lambda )^\lambda } (1 - \lambda )\operatorname{H}_\lambda (\mu , \nu ) \: . \end{align*}

In particular,

\begin{align*} \operatorname{JS}_\alpha (\mu , \nu ) & \le \left(\frac{1 - \alpha }{\alpha }\right)^{1 - 2 \alpha } (1 - \alpha )\operatorname{H}_\alpha (\mu , \nu ) \: , \\ \operatorname{JS}_\alpha (\mu , \nu ) & \le \alpha \operatorname{H}_{1 - \alpha }(\mu , \nu ) \: . \end{align*}
Proof

The main tool is Lemma 2.3.65. \(\operatorname{JS}_\alpha \) is an \(f\)-divergence with function \(f(x) = \alpha x \log (x) - (\alpha x + 1 - \alpha ) \log (\alpha x + 1 - \alpha )\). On probability measures, \((1 - \lambda )\operatorname{H}_\lambda \) is an \(f\)-divergence with function \(g(x) = 1 - x^\lambda + \lambda (x-1)\). These two functions satisfy \(f(1) = 0\), \(g(1) = 0\), \(f'(1) = 0\), \(g'(1) = 0\) and have second derivatives, hence Lemma 2.3.65 applies and to obtain the inequality it suffices to compute \(\sup _x\frac{f''(x)}{g''(x)}\) .

\begin{align*} f''(x) & = \frac{\alpha (1 - \alpha )}{x(\alpha x + 1 - \alpha )} \: , \\ g''(x) & = \lambda (1 - \lambda ) x^{\lambda - 2} \: . \end{align*}

Let \(h(x) = \frac{f''(x)}{g''(x)} = \frac{\alpha (1 - \alpha )}{\lambda (1 - \lambda )} \frac{x^{1 - \lambda }}{\alpha x + 1 - \alpha }\) . Its maximum is attained at a point where \(h'(x) = 0\).

\begin{align*} h'(x) = \frac{\alpha (1 - \alpha )}{\lambda (1 - \lambda )} \frac{(1 - \alpha )(1 - \lambda )x^{-\lambda } - \alpha \lambda x^{1 - \lambda }}{(\alpha x + 1 - \alpha )^2} \: . \end{align*}

The optimum for \(h\) is attained at \(x = \frac{(1 - \alpha )(1 - \lambda )}{\alpha \lambda }\) and the maximal value of \(h\) is \(\frac{(1 - \alpha )^{1 - \lambda } \alpha ^\lambda }{\lambda ^{1 - \lambda } (1 - \lambda )^\lambda }\) .
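
A quick numerical check of the second "in particular" bound, \(\operatorname{JS}_\alpha (\mu , \nu ) \le \alpha \operatorname{H}_{1-\alpha }(\mu , \nu )\), on arbitrary discrete distributions (illustrative only):

```python
# Check JS_a(mu, nu) <= a H_{1-a}(mu, nu) on a three-point space.
import math

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def hellinger_alpha(a, p, q):
    # Formula of Lemma 2.5.2 for discrete probability measures with p << q.
    return (sum(qi * (pi / qi) ** a for pi, qi in zip(p, q)) - 1) / (a - 1)

mu = [0.6, 0.3, 0.1]
nu = [0.1, 0.4, 0.5]
a = 0.25

mix = [a * m + (1 - a) * n for m, n in zip(mu, nu)]
js = a * kl(mu, mix) + (1 - a) * kl(nu, mix)

assert js <= a * hellinger_alpha(1 - a, mu, nu) + 1e-12
print(js, a * hellinger_alpha(1 - a, mu, nu))
```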

For \(\mu , \nu \in \mathcal P(\mathcal X)\) and \(\alpha \in (0,1)\), \(\lambda \le 1/2\) ,

\begin{align*} \operatorname{JS}_\alpha (\mu , \nu ) & \le \alpha ^{1 - \lambda } \operatorname{H}_{1 - \lambda }(\mu , \nu ) \: . \end{align*}
Proof

By Lemma 2.9.7 for \(\alpha \) and \(1 - \lambda \),

\begin{align*} \operatorname{JS}_\alpha (\mu , \nu ) & \le \frac{(1 - \alpha )^{\lambda } \alpha ^{1 - \lambda }}{\lambda ^{1 - \lambda } (1 - \lambda )^\lambda } \lambda \operatorname{H}_{1 - \lambda }(\mu , \nu ) \\ & = \alpha ^{1 - \lambda } (1 - \alpha )^\lambda \left(\frac{\lambda }{1 - \lambda }\right)^\lambda \operatorname{H}_{1 - \lambda }(\mu , \nu ) \: . \end{align*}

To get the result, use \(1 - \alpha \le 1\) and \(\frac{\lambda }{1 - \lambda } \le 1\) since it is increasing in \(\lambda \) with value 1 at \(\lambda = 1/2\).