2 Information divergences
A divergence between measures is a function which maps two measures on a common space \(\mathcal X\) to a value in \([-\infty , +\infty ]\). For two probability measures \(\mu \) and \(\nu \), we will want the divergence \(D(\mu , \nu )\) to be non-negative, and to be zero if \(\mu = \nu \): divergences are meant to be notions of dissimilarity between measures.
We will pay particular attention to how the divergence behaves with respect to transformations of the measures. For a Markov kernel \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) , a fundamental property of all divergences we will consider is the data-processing inequality, which states that \(D(\kappa \circ \mu , \kappa \circ \nu ) \le D(\mu , \nu )\). That is, processing data with a (possibly random) common transformation \(\kappa \) can only lose information about the difference between the two measures.
Another important concept is the conditional divergence of two kernels \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) given a measure \(\mu \in \mathcal M(\mathcal X)\). This is the value \(D(\mu \otimes \kappa , \mu \otimes \eta )\).
TODO: two possible concepts of conditional divergence are \(D(\mu \otimes \kappa , \mu \otimes \eta )\) and \(\mu \left[x \mapsto D(\kappa (x), \eta (x))\right]\). They are equal for \(f\)-divergences (which also gives convexity of the divergence) but not in general.
2.1 Generic divergences
A divergence between measures is a function \(D\) which for any measurable space \(\mathcal X\) and any two measures \(\mu , \nu \in \mathcal M(\mathcal X)\), returns a value \(D(\mu , \nu ) \in \mathbb {R} \cup \{ +\infty \} \).
In Lean, such a definition for a divergence would have to restrict the possible measurable spaces to be in a common universe. We don’t implement that definition, but will only implement particular divergences.
A divergence \(D\) is said to satisfy the data-processing inequality (DPI) if for all measurable spaces \(\mathcal X, \mathcal Y\), all \(\mu , \nu \in \mathcal M(\mathcal X)\) and all Markov kernels \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\),
\[D(\kappa \circ \mu , \kappa \circ \nu ) \le D(\mu , \nu ) .\]
Let \(D\) be a divergence that satisfies the DPI. Let \(\mu , \nu \in \mathcal M(\mathcal X)\) and let \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\). Then
\[D(\kappa \circ \mu , \eta \circ \nu ) \le D(\mu \otimes \kappa , \nu \otimes \eta ) .\]
Use the DPI for the deterministic kernel of the function \((x,y) \mapsto y\).
Let \(D\) be a divergence that satisfies the DPI. Let \(\mu , \nu \in \mathcal M(\mathcal X)\) and let \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) be Markov kernels. Then
\[D(\mu , \nu ) \le D(\mu \otimes \kappa , \nu \otimes \eta ) .\]
Use the DPI for the deterministic kernel of the function \((x,y) \mapsto x\).
Let \(D\) be a divergence that satisfies the DPI. Let \(\mu , \nu \in \mathcal M(\mathcal X)\) and let \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) be a Markov kernel. Then
\[D(\mu \otimes \kappa , \nu \otimes \kappa ) = D(\mu , \nu ) .\]
Lemma 2.1.2 gives one inequality. The other inequality is the DPI: \(\mu \otimes \kappa = (\mathrm{id} \times \kappa ) \circ \mu \).
Let \(D\) be a divergence. The conditional divergence of kernels \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) with respect to a measure \(\mu \in \mathcal M(\mathcal X)\) is \(\mu [x \mapsto D(\kappa (x), \eta (x))]\). It is denoted by \(D(\kappa , \eta \mid \mu )\).
For the important class of \(f\)-divergences that we will introduce later, \(D(\kappa , \eta \mid \mu ) = D(\mu \otimes \kappa , \mu \otimes \eta )\).
Let \(D\) be a divergence that satisfies the DPI and for which \(D(\kappa , \eta \mid \mu ) = D(\mu \otimes \kappa , \mu \otimes \eta )\). Let \(\mu \in \mathcal M(\mathcal X)\) and let \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) be Markov kernels. Then
\[D(\kappa \circ \mu , \eta \circ \mu ) \le D(\kappa , \eta \mid \mu ) .\]
This is a special case of Lemma 2.1.1.
2.2 Statistical divergences
We define a family of divergences based on the risk of a binary estimation task.
2.2.1 Statistical information
The statistical information between measures \(\mu \) and \(\nu \) with respect to prior \(\xi \in \mathcal M(\{ 0,1\} )\) is \(\mathcal I_\xi (\mu , \nu ) = \min \{ \xi _0 \mu (\mathcal X), \xi _1 \nu (\mathcal X)\} - \mathcal B_\xi (\mu , \nu )\). This is the risk increase \(I_\xi ^P(d_{\mathcal X})\) in the binary hypothesis testing problem for \(d_{\mathcal X} : \mathcal X \rightsquigarrow *\) the Markov kernel to the point space.
The statistical information is the difference between the minimal risk of an estimator that does not depend on the data and that of a Bayes estimator. This is a simple generalization of both the DeGroot statistical information and the hockey-stick (or \(E_\gamma \)) divergence.
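As an illustrative sanity check (not part of the formal development), the definition can be evaluated for finite discrete measures. The sketch below assumes the discrete form of the Bayes risk, \(\mathcal B_\xi (\mu , \nu ) = \sum _x \min \{ \xi _0 \mu (\{ x\} ), \xi _1 \nu (\{ x\} )\} \); the helper names are illustrative only.

```python
# Numerical sketch of the statistical information for finite discrete measures,
# assuming the discrete form of the Bayes risk (names are illustrative).

def bayes_risk(xi, mu, nu):
    """Discrete Bayes risk: sum over the common support of the pointwise min."""
    xi0, xi1 = xi
    return sum(min(xi0 * mu.get(x, 0.0), xi1 * nu.get(x, 0.0)) for x in set(mu) | set(nu))

def stat_info(xi, mu, nu):
    """I_xi(mu, nu) = min(xi0*mu(X), xi1*nu(X)) - B_xi(mu, nu)."""
    xi0, xi1 = xi
    return min(xi0 * sum(mu.values()), xi1 * sum(nu.values())) - bayes_risk(xi, mu, nu)

mu = {"a": 0.5, "b": 0.3, "c": 0.2}
nu = {"a": 0.2, "b": 0.2, "c": 0.6}
xi = (0.4, 0.6)

assert abs(stat_info(xi, mu, mu)) < 1e-12             # I_xi(mu, mu) = 0
assert 0.0 <= stat_info(xi, mu, nu) <= min(0.4, 0.6)  # 0 <= I_xi <= min{xi0 mu(X), xi1 nu(X)}
```

The three assertions mirror the first three lemmas below: the divergence of a measure against itself vanishes, and the statistical information sits between \(0\) and \(\min \{ \xi _0 \mu (\mathcal X), \xi _1 \nu (\mathcal X)\} \).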
For \(\mu \in \mathcal M(\mathcal X)\), \(\mathcal I_\xi (\mu , \mu ) = 0\) .
Use Lemma 1.2.6.
For \(\mu , \nu \in \mathcal M(\mathcal X)\), \(\mathcal I_\xi (\mu , \nu ) \ge 0\) .
Use Lemma 1.2.8.
For \(\mu , \nu \in \mathcal M(\mathcal X)\), \(\mathcal I_\xi (\mu , \nu ) \le \min \{ \xi _0 \mu (\mathcal X), \xi _1 \nu (\mathcal X)\} \) .
Use Lemma 1.2.7.
For \(\mu , \nu \in \mathcal M(\mathcal X)\) and \(\xi \in \mathcal M(\{ 0,1\} )\), \(\mathcal I_\xi (\mu , \nu ) = \mathcal I_{\xi _\leftrightarrow }(\nu , \mu )\) .
Use Lemma 1.2.9.
For \(\mu , \nu \in \mathcal M(\mathcal X)\), \(\xi \in \mathcal M(\{ 0,1\} )\) and \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) a Markov kernel, \(\mathcal I_\xi (\kappa \circ \mu , \kappa \circ \nu ) \le \mathcal I_\xi (\mu , \nu )\) .
Since \(\kappa \) is a Markov kernel, \((\kappa \circ \mu )(\mathcal Y) = \mu (\mathcal X)\) and \((\kappa \circ \nu )(\mathcal Y) = \nu (\mathcal X)\).
It suffices to prove \(\mathcal B_\xi (\kappa \circ \mu , \kappa \circ \nu ) \ge \mathcal B_\xi (\mu , \nu )\) since \(\mathcal I_\xi (\mu , \nu ) = \min \{ \xi _0\mu (\mathcal X), \xi _1\nu (\mathcal X)\} - \mathcal B_\xi (\mu , \nu )\). The inequality was proved in Theorem 1.2.5.
Alternatively, we can use Lemma 1.1.19.
Let \(\mu , \nu \) be two measures on \(\mathcal X\), \(\xi \in \mathcal M(\{ 0,1\} )\) and let \(E\) be an event on \(\mathcal X\). Let \(\mu _E\) and \(\nu _E\) be the two Bernoulli distributions with respective means \(\mu (E)\) and \(\nu (E)\). Then \(\mathcal I_\xi (\mu , \nu ) \ge \mathcal I_\xi (\mu _E, \nu _E)\).
Apply Theorem 2.2.5 to the deterministic kernel \(\mathcal X \rightsquigarrow \{ 0,1\} \) given by the function \(x \mapsto \mathbb {I}\{ x \in E\} \).
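A discrete numerical sketch of this corollary (assuming the discrete form of the Bayes risk; the names are illustrative): restricting to the Bernoulli distributions induced by an event can only lose statistical information.

```python
# Comparing via an event E can only lose statistical information (discrete sketch).

def stat_info(xi, mu, nu):
    xi0, xi1 = xi
    bayes = sum(min(xi0 * mu.get(x, 0.0), xi1 * nu.get(x, 0.0)) for x in set(mu) | set(nu))
    return min(xi0 * sum(mu.values()), xi1 * sum(nu.values())) - bayes

mu = {1: 0.5, 2: 0.3, 3: 0.2}
nu = {1: 0.1, 2: 0.3, 3: 0.6}
xi = (0.5, 0.5)
E = {2}

# Bernoulli measures with means mu(E) and nu(E)
mu_E = {1: sum(mu[x] for x in E), 0: sum(m for x, m in mu.items() if x not in E)}
nu_E = {1: sum(nu[x] for x in E), 0: sum(m for x, m in nu.items() if x not in E)}

assert stat_info(xi, mu, nu) >= stat_info(xi, mu_E, nu_E) - 1e-12
```

Here \(E = \{ 2\} \) is a poorly chosen event (\(\mu (E) = \nu (E)\)), so the inequality is strict: the induced Bernoulli pair carries no statistical information while the original pair does.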
For finite measures \(\mu , \nu \) and \(\xi \in \mathcal M(\{ 0,1\} )\), for any measure \(\zeta \) with \(\mu \ll \zeta \) and \(\nu \ll \zeta \),
\[\mathcal I_\xi (\mu , \nu ) = \min \{ \xi _0 \mu (\mathcal X), \xi _1 \nu (\mathcal X)\} - \int _x \min \left\{ \xi _0 \frac{d \mu }{d \zeta }(x), \xi _1 \frac{d \nu }{d \zeta }(x)\right\} \, \partial \zeta .\]
This holds in particular for \(\zeta = P \circ \xi \).
Apply Theorem 1.2.12 to get that result for \(\zeta = P \circ \xi \). Then for other \(\zeta \) with \(\mu \ll \zeta \) and \(\nu \ll \zeta \) , use \(\frac{d \mu }{d\zeta } = \frac{d \mu }{d(P \circ \xi )} / \frac{d (P \circ \xi )}{d\zeta }\) .
For finite measures \(\mu , \nu \) and \(\xi \in \mathcal M(\{ 0,1\} )\), for any measure \(\zeta \) with \(\mu \ll \zeta \) and \(\nu \ll \zeta \),
\[\mathcal I_\xi (\mu , \nu ) = \frac{1}{2}\left(\int _x \left\vert \xi _0 \frac{d \mu }{d \zeta }(x) - \xi _1 \frac{d \nu }{d \zeta }(x)\right\vert \, \partial \zeta - \left\vert \xi _0 \mu (\mathcal X) - \xi _1 \nu (\mathcal X)\right\vert \right) .\]
This holds in particular for \(\zeta = P \circ \xi \).
Apply Corollary 1.2.13 and write \(\min \{ \xi _0\mu (\mathcal X), \xi _1\nu (\mathcal X)\} = \frac{1}{2}\left((P \circ \xi )(\mathcal X) - \left\vert \xi _0\mu (\mathcal X) - \xi _1\nu (\mathcal X) \right\vert \right)\) to get the result for \(\zeta = P \circ \xi \). The result for other \(\zeta \) follows from simple manipulations of Radon-Nikodym derivatives.
For finite measures \(\mu , \nu \) and \(\xi \in \mathcal M(\{ 0,1\} )\),
For probability measures, this theorem makes the statistical information an \(f\)-divergence (which is defined later in this document).
By Theorem 1.2.12,
Suppose that \(\xi _0\mu (\mathcal X) \le \xi _1\nu (\mathcal X)\) . Then
If on the other hand \(\xi _0\mu (\mathcal X) \ge \xi _1\nu (\mathcal X)\) , with similar computations,
For finite measures \(\mu , \nu \) and \(\xi \in \mathcal M(\{ 0,1\} )\),
For finite measures \(\mu , \nu \) and \(\xi \in \mathcal M(\{ 0,1\} )\),
This is a direct application of Lemma 1.2.11.
For finite measures \(\mu , \nu \) and \(\xi \in \mathcal M(\{ 0,1\} )\),
TODO: find a way to hide the following dummy lemma
Dummy node to summarize properties of the statistical information.
The DeGroot statistical information between finite measures \(\mu \) and \(\nu \) for \(\pi \in [0,1]\) is \(I_\pi (\mu , \nu ) = \mathcal I_{(\pi , 1 - \pi )}(\mu , \nu )\) .
The \(E_\gamma \) or hockey-stick divergence between finite measures \(\mu \) and \(\nu \) for \(\gamma \in (0,+\infty )\) is \(E_\gamma (\mu , \nu ) = \mathcal I_{(1,\gamma )}(\mu , \nu )\) .
Note that our definition of the \(E_\gamma \) divergence extends the one from the literature: the divergence is usually defined for \(\gamma \ge 1\) only. The extension makes sense in light of Theorem 2.3.35. It is also extended from probability measures to finite measures.
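As a quick consistency check, for probability measures and \(\gamma \ge 1\) the definition \(E_\gamma = \mathcal I_{(1,\gamma )}\) should recover the usual hockey-stick expression \(\sum _x \max \{ \mu (\{ x\} ) - \gamma \nu (\{ x\} ), 0\} \). A discrete numerical sketch (assuming the discrete form of the Bayes risk; names are illustrative):

```python
# E_gamma via the statistical-information definition vs. the hockey-stick form.

def E_gamma(gamma, mu, nu):
    support = set(mu) | set(nu)
    bayes = sum(min(mu.get(x, 0.0), gamma * nu.get(x, 0.0)) for x in support)
    return min(sum(mu.values()), gamma * sum(nu.values())) - bayes

def hockey_stick(gamma, mu, nu):
    support = set(mu) | set(nu)
    return sum(max(mu.get(x, 0.0) - gamma * nu.get(x, 0.0), 0.0) for x in support)

mu = {0: 0.7, 1: 0.3}
nu = {0: 0.4, 1: 0.6}
for gamma in [1.0, 1.25, 1.5, 2.0]:
    assert abs(E_gamma(gamma, mu, nu) - hockey_stick(gamma, mu, nu)) < 1e-12
```

The identity follows from \(\min \{ a, b\} = a - \max \{ a - b, 0\} \) applied pointwise, together with \(\min \{ 1, \gamma \} = 1\) for \(\gamma \ge 1\).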
2.2.2 Total variation distance
The total variation distance between finite measures \(\mu \) and \(\nu \) is \(\operatorname{TV}(\mu , \nu ) = \mathcal I_{(1,1)}(\mu , \nu )\) .
This is also equal to \(2 I_{1/2}(\mu , \nu )\) and to \(E_1(\mu , \nu )\).
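For discrete probability measures the definition \(\operatorname{TV} = \mathcal I_{(1,1)}\) should reduce to the familiar half-\(L^1\) distance. A numerical sketch (assuming the discrete form of the Bayes risk; names are illustrative):

```python
# TV as I_{(1,1)} vs. the half-L1 distance, for discrete measures.

def tv(mu, nu):
    support = set(mu) | set(nu)
    bayes = sum(min(mu.get(x, 0.0), nu.get(x, 0.0)) for x in support)
    return min(sum(mu.values()), sum(nu.values())) - bayes

mu = {"a": 0.5, "b": 0.5}
nu = {"a": 0.8, "b": 0.1, "c": 0.1}
half_l1 = 0.5 * sum(abs(mu.get(x, 0.0) - nu.get(x, 0.0)) for x in set(mu) | set(nu))
assert abs(tv(mu, nu) - half_l1) < 1e-12
```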
For \(\mu \in \mathcal M(\mathcal X)\), \(\operatorname{TV}(\mu , \mu ) = 0\) .
Use Lemma 2.2.1.
For \(\mu , \nu \in \mathcal M(\mathcal X)\), \(\operatorname{TV}(\mu , \nu ) \ge 0\) .
Use Lemma 2.2.2.
For \(\mu , \nu \in \mathcal M(\mathcal X)\), \(\operatorname{TV}(\mu , \nu ) \le \min \{ \mu (\mathcal X), \nu (\mathcal X)\} \) .
Use Lemma 2.2.3.
For \(\mu , \nu \in \mathcal M(\mathcal X)\) and \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) a Markov kernel, \(\operatorname{TV}(\kappa \circ \mu , \kappa \circ \nu ) \le \operatorname{TV}(\mu , \nu )\) .
Apply Theorem 2.2.5.
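In the finite discrete case a Markov kernel is a row-stochastic matrix, and the DPI for TV can be observed numerically (an illustrative sketch; the helper names are not from the formal development):

```python
# DPI sketch: pushing mu and nu through a common Markov kernel
# (a row-stochastic matrix) cannot increase their TV distance.

def tv(p, q):
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))  # probability vectors

def push(kernel, p):
    """(kernel . p)(y) = sum_x p(x) * kernel[x][y]."""
    return [sum(p[x] * kernel[x][y] for x in range(len(p))) for y in range(len(kernel[0]))]

mu = [0.6, 0.3, 0.1]
nu = [0.2, 0.2, 0.6]
kernel = [  # rows sum to 1: a Markov kernel from {0,1,2} to {0,1}
    [0.9, 0.1],
    [0.5, 0.5],
    [0.2, 0.8],
]
assert tv(push(kernel, mu), push(kernel, nu)) <= tv(mu, nu) + 1e-12
```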
TODO: find a way to hide the following dummy lemma
Dummy node to summarize properties of the total variation distance.
For finite measures \(\mu , \nu \), for any measure \(\zeta \) with \(\mu \ll \zeta \) and \(\nu \ll \zeta \),
\[\operatorname{TV}(\mu , \nu ) = \min \{ \mu (\mathcal X), \nu (\mathcal X)\} - \int _x \min \left\{ \frac{d \mu }{d \zeta }(x), \frac{d \nu }{d \zeta }(x)\right\} \, \partial \zeta .\]
This holds in particular for \(\zeta = \mu + \nu \).
Apply Lemma 2.2.7.
For finite measures \(\mu , \nu \), for any measure \(\zeta \) with \(\mu \ll \zeta \) and \(\nu \ll \zeta \),
\[\operatorname{TV}(\mu , \nu ) = \frac{1}{2}\left(\int _x \left\vert \frac{d \mu }{d \zeta }(x) - \frac{d \nu }{d \zeta }(x)\right\vert \, \partial \zeta - \left\vert \mu (\mathcal X) - \nu (\mathcal X)\right\vert \right) .\]
This holds in particular for \(\zeta = \mu + \nu \).
Apply Lemma 2.2.8.
For finite measures \(\mu , \nu \),
Apply Theorem 2.2.9.
For finite measures \(\mu , \nu \),
Apply Corollary 2.2.10.
For finite measures \(\mu , \nu \),
Apply Lemma 2.2.11.
Let \(\mu , \nu \) be two finite measures on \(\mathcal X\) with \(\mu (\mathcal X) \leq \nu (\mathcal X)\).
Then \(\operatorname{TV}(\mu , \nu ) = \sup _{E \text{ event}} \left( \mu (E) - \nu (E) \right)\).
Starting from Lemma 2.2.23,
Let \(\mathcal F = \{ f : \mathcal X \to \mathbb {R} \mid \Vert f \Vert _\infty \le 1\} \). Then for \(\mu , \nu \) finite measures with \(\mu (\mathcal X) = \nu (\mathcal X)\), \(\frac{1}{2} \sup _{f \in \mathcal F} \left( \mu [f] - \nu [f] \right) \le \operatorname{TV}(\mu , \nu )\).
Let \(p,q\) be the respective densities of \(\mu , \nu \) with respect to \(\xi =\mu +\nu \). For any \(f \in \mathcal F\),
The last equality is Lemma 2.2.20.
Let \(\mathcal F = \{ f : \mathcal X \to \mathbb {R} \mid \Vert f \Vert _\infty \le 1\} \). Then for \(\mu , \nu \) finite measures with \(\mu (\mathcal X) = \nu (\mathcal X)\), \(\operatorname{TV}(\mu , \nu ) = \frac{1}{2} \sup _{f \in \mathcal F} \left( \mu [f] - \nu [f] \right)\).
Lemma 2.2.25 gives \(\operatorname{TV}(\mu , \nu ) \ge \frac{1}{2} \sup _{f \in \mathcal F} \left( \mu [f] - \nu [f] \right)\).
The Bayes estimator is a function of the form \(x \mapsto \mathbb {I}\{ x \in E\} \) for some event \(E\), hence it belongs to \(\mathcal F\). This gives equality.
Let \(\mu , \nu \) be two measures on \(\mathcal X\) and let \(E\) be an event. Let \(\mu _E\) and \(\nu _E\) be the two Bernoulli distributions with respective means \(\mu (E)\) and \(\nu (E)\). Then \(\operatorname{TV}(\mu , \nu ) \ge \operatorname{TV}(\mu _E, \nu _E)\).
Apply Corollary 2.2.6.
2.3 f-divergences
2.3.1 Definition and basic properties
Everywhere in this document, the functions used in divergences are assumed measurable, continuous and convex on \([0, +\infty )\). For such a function \(f : \mathbb {R} \to \mathbb {R}\), we write \(f'(\infty ) = \lim _{x \to +\infty } f(x)/x\) (which can be infinite).
For \(\mu , \nu \) two measures, we denote by \(\mu _{\perp \nu }\) the singular part of \(\mu \) with respect to \(\nu \).
Let \(f : \mathbb {R} \to \mathbb {R}\) and let \(\mu , \nu \) be two measures on a measurable space \(\mathcal X\). The f-divergence between \(\mu \) and \(\nu \) is
\[D_f(\mu , \nu ) = \int _x f\left(\frac{d \mu }{d \nu }(x)\right) \, \partial \nu + f'(\infty ) \, \mu _{\perp \nu }(\mathcal X)\]
if \(x \mapsto f\left(\frac{d \mu }{d \nu }(x)\right)\) is \(\nu \)-integrable and \(+\infty \) otherwise.
For \(\mu \) and \(\nu \) two finite measures, \(D_f(\mu , \nu )\) is finite if and only if \(x \mapsto f\left(\frac{d \mu }{d \nu }(x)\right)\) is \(\nu \)-integrable and either \(f'(\infty ) {\lt} \infty \) or \(\mu \ll \nu \).
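In the discrete case the definition, including its singular part, reads \(D_f(\mu , \nu ) = \sum _{\nu (x) > 0} \nu (x) f(\mu (x)/\nu (x)) + f'(\infty ) \, \mu (\{ \nu = 0\} )\). An illustrative sketch with \(f(x) = \vert x - 1\vert \) (convex, continuous, \(f'(\infty ) = 1\)), which recovers \(\sum _x \vert \mu (x) - \nu (x)\vert \); the helper names are not from the formal development:

```python
# Discrete f-divergence with its singular part:
# D_f(mu, nu) = sum_{nu(x)>0} nu(x) f(mu(x)/nu(x)) + f'(inf) * mu({nu = 0}).

def f_div(f, f_prime_inf, mu, nu):
    support = set(mu) | set(nu)
    integral = sum(nu[x] * f(mu.get(x, 0.0) / nu[x]) for x in support if nu.get(x, 0.0) > 0)
    singular = sum(mu[x] for x in mu if nu.get(x, 0.0) == 0)
    # guard so that f'(inf) = inf with zero singular mass contributes 0
    return integral + (f_prime_inf * singular if singular > 0 else 0.0)

mu = {"a": 0.5, "b": 0.5}
nu = {"a": 0.25, "c": 0.75}   # nu puts no mass on "b": mu is not << nu
f = lambda x: abs(x - 1.0)    # f'(inf) = 1

value = f_div(f, 1.0, mu, nu)
l1 = sum(abs(mu.get(x, 0.0) - nu.get(x, 0.0)) for x in set(mu) | set(nu))
assert abs(value - l1) < 1e-12
```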
Divergences associated with basic (transformations of) functions
For \(\nu \) a finite measure, for all \(a \in \mathbb {R}\), \(D_{x \mapsto a}(\mu , \nu ) = a \nu (\mathcal X)\).
Compute the integral.
If \(f(1) = 0\) then \(D_{f}(\mu , \mu ) = 0\).
\(\frac{d \mu }{d \mu }(x) = 1\) almost everywhere and \(f(1) = 0\).
Let \(\mu , \nu \) be two probability measures. Assume \(f'(\infty ) = + \infty \) and \(f(1) = 0\). Then \(D_f(\mu , \nu ) = 0\) if and only if \(\mu = \nu \).
If \(\mu = \nu \) then \(D_f(\mu , \nu ) = 0\) by Lemma 2.3.3. For the other direction use the strict version of Jensen’s inequality.
\(D_{x \mapsto x}(\mu , \nu ) = \mu (\mathcal X)\).
Compute the integral: its value is \((\frac{d\mu }{d\nu }\cdot \nu )(\mathcal X)\). Then \(D_{x\mapsto x}(\mu , \nu ) = (\frac{d\mu }{d\nu }\cdot \nu )(\mathcal X) + \mu _{\perp \nu }(\mathcal X) = \mu (\mathcal X)\).
For all \(a \ge 0\), \(D_{a f}(\mu , \nu ) = a D_{f}(\mu , \nu )\).
Linearity of the integral.
\(D_{f + g}(\mu , \nu ) = D_f(\mu , \nu ) + D_g(\mu , \nu )\).
Linearity of the integral.
For finite measures \(\mu \) and \(\nu \) with \(\mu (\mathcal X) = \nu (\mathcal X)\), for all \(a \in \mathbb {R}\), \(D_{f + a(x - 1)}(\mu , \nu ) = D_{f}(\mu , \nu )\).
Let \(\mu \) and \(\nu \) be two measures on \(\mathcal X\) and let \(g : \mathcal X \to \mathcal Y\) be a measurable embedding. Then \(D_f(g_* \mu , g_* \nu ) = D_f(\mu , \nu )\).
\(D_f(\mu , \nu ) = D_{x \mapsto xf(1/x)}(\nu , \mu )\) .
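This duality can be checked numerically in the discrete case with full common support. With \(f(x) = x \log x\) (the KL case), \(x f(1/x) = -\log x\); the sketch below is illustrative only:

```python
# Duality D_f(mu, nu) = D_{x f(1/x)}(nu, mu), discrete full-support sketch.
import math

def f_div(f, mu, nu):  # assumes mu << nu with nu(x) > 0 everywhere
    return sum(nu[x] * f(mu[x] / nu[x]) for x in nu)

mu = {0: 0.3, 1: 0.7}
nu = {0: 0.6, 1: 0.4}

f = lambda x: x * math.log(x)
g = lambda x: x * f(1.0 / x)   # = -log(x)

assert abs(f_div(f, mu, nu) - f_div(g, nu, mu)) < 1e-12
```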
\((\mu , \nu ) \mapsto D_f(a \mu + b \nu , \nu )\) is an \(f\)-divergence for the function \(x \mapsto f(ax + b)\).
\((\mu , \nu ) \mapsto D_f(\mu , a \mu + b \nu )\) is an \(f\)-divergence for the function \(x \mapsto (ax+b)f\left(\frac{x}{ax+b}\right)\) .
\((\mu , \nu ) \mapsto D_f(\nu , a \mu + b \nu )\) is an \(f\)-divergence for the function \(x \mapsto (ax+b)f\left(\frac{1}{ax+b}\right)\) .
Upper and lower bounds on \(D_f\)
Let \(\mu _1, \mu _2\) and \(\nu \) be finite measures on \(\mathcal X\), with \(\mu _1 \ll \nu \) and \(\mu _2 \perp \nu \). Then \(D_f(\mu _1 + \mu _2, \nu ) = D_f(\mu _1, \nu ) + \mu _2(\mathcal X) f'(\infty )\).
\(\frac{d(\mu _1 + \mu _2)}{d \nu } = \frac{d \mu _1}{d \nu }\) a.e. and \((\mu _1 + \mu _2)_{\perp \nu } = \mu _2\).
Let \(\mu _1, \mu _2, \nu \) be three finite measures on \(\mathcal X\) with \(\mu _1 \ll \nu \) and \(\mu _2 \ll \nu \). Then \(D_f(\mu _1 + \mu _2, \nu ) \le D_f(\mu _1, \nu ) + \mu _2(\mathcal X) f'(\infty )\).
Let \(\mu _1, \mu _2, \nu \) be three finite measures on \(\mathcal X\). Then \(D_f(\mu _1 + \mu _2, \nu ) \le D_f(\mu _1, \nu ) + \mu _2(\mathcal X) f'(\infty )\).
From Lemma 2.3.14, then Lemma 2.3.15,
Let \(\mu \) and \(\nu \) be two finite measures on \(\mathcal X\). Then \(D_f(\mu , \nu ) = D_f(\frac{d\mu }{d\nu }\cdot \nu , \nu ) + f'(\infty ) \mu _{\perp \nu }(\mathcal X)\).
Apply Lemma 2.3.14 to \(\frac{d\mu }{d\nu }\cdot \nu \) and \(\mu _{\perp \nu }\).
Let \(\mu \) be a finite measure and \(\nu \) be a probability measure on the same space \(\mathcal X\), such that \(\mu \ll \nu \). Then \(f(\mu (\mathcal X)) \le D_f(\mu , \nu )\).
Since \(\mu \ll \nu \), the \(f\)-divergence reduces to the integral part. Then by Jensen's inequality, since \(\nu \) is a probability measure,
\[f(\mu (\mathcal X)) = f\left(\int _x \frac{d \mu }{d \nu }(x) \, \partial \nu \right) \le \int _x f\left(\frac{d \mu }{d \nu }(x)\right) \partial \nu = D_f(\mu , \nu ) .\]
Let \(\mu \) be a finite measure and \(\nu \) be a probability measure on the same space \(\mathcal X\). Then \(f(\mu (\mathcal X)) \le D_f(\mu , \nu )\).
By convexity, then Lemma 2.3.18 and finally Lemma 2.3.17,
Let \(\mu , \nu \) be two probability measures. If \(f(1) = 0\) then \(D_f(\mu , \nu ) \ge 0\).
Apply Lemma 2.3.19 and use \(f(\mu (\mathcal X)) = f(1) = 0\).
TODO: find a way to hide the following dummy lemma
Dummy node to summarize properties of \(f\)-divergences.
2.3.2 Conditional f-divergence
TODO: replace the definition by \(D(\mu \otimes \kappa , \mu \otimes \eta )\), which allows for a more general proof of the “conditioning increases divergence” inequality.
Let \(f : \mathbb {R} \to \mathbb {R}\), \(\mu \) a measure on \(\mathcal X\) and \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) two Markov kernels from \(\mathcal X\) to \(\mathcal Y\). The conditional f-divergence between \(\kappa \) and \(\eta \) with respect to \(\mu \) is
\[D_f(\kappa , \eta \mid \mu ) = \int _x D_f(\kappa (x), \eta (x)) \, \partial \mu \]
if \(x \mapsto D_f(\kappa (x), \eta (x))\) is \(\mu \)-integrable and \(+\infty \) otherwise.
Let \(\mu \) be a measure on \(\mathcal X\) and \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) two Markov kernels. If \(f(1) = 0\) then \(D_f(\kappa , \eta \mid \mu ) \ge 0\).
Apply Lemma 2.3.20.
Let \(\mu , \nu \) be measures on \(\mathcal X\), where \(\mu \) is finite, and let \(\xi \) be a finite measure on \(\mathcal Y\). Then \(D_f(x \mapsto \mu , x \mapsto \nu \mid \xi ) = D_f(\mu , \nu ) \xi (\mathcal X)\).
Let \(\mu \) be a finite measure on \(\mathcal X\) and \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) be two finite kernels from \(\mathcal X\) to \(\mathcal Y\). Then \(p \mapsto f \left(\frac{d(\mu \otimes \kappa )}{d(\mu \otimes \eta )}(p)\right)\) is \((\mu \otimes \eta )\)-integrable iff
\(x \mapsto D_f(\kappa (x), \eta (x))\) is \(\mu \)-integrable and
for \(\mu \)-almost all \(x\), \(y \mapsto f \left( \frac{d\kappa (x)}{d\eta (x)}(y) \right)\) is \(\eta (x)\)-integrable.
Since \(p \mapsto f \left(\frac{d(\mu \otimes \kappa )}{d(\mu \otimes \eta )}(p) \right)\) is measurable, the integrability condition w.r.t. \(\mu \otimes \eta \) is equivalent to
\(x \mapsto \int _y \left\Vert f \left( \frac{d\kappa (x)}{d\eta (x)}(y) \right) \right\Vert \partial \eta (x)\) is \(\mu \)-integrable and
for \(\mu \)-almost all \(x\), \(y \mapsto f \left( \frac{d\kappa (x)}{d\eta (x)}(y) \right)\) is \(\eta (x)\)-integrable.
It suffices to use convexity to show that integrability of \(x \mapsto \int _y f \left( \frac{d\kappa (x)}{d\eta (x)}(y) \right) \partial \eta (x)\) implies integrability of \(x \mapsto \int _y \left\Vert f \left( \frac{d\kappa (x)}{d\eta (x)}(y) \right) \right\Vert \partial \eta (x)\).
TODO
Let \(\mu \) be a finite measure on \(\mathcal X\) and let \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) be two finite kernels, where \(\kappa \) is a Markov kernel. Then \(D_f(\kappa , \eta \mid \mu ) \ne \infty \) if and only if
for \(\mu \)-almost all \(x\), \(y \mapsto f \left( \frac{d\kappa (x)}{d\eta (x)}(y) \right)\) is \(\eta (x)\)-integrable,
\(x \mapsto \int _y f \left( \frac{d\kappa (x)}{d\eta (x)}(y) \right) \partial \eta (x)\) is \(\mu \)-integrable,
either \(f'(\infty ) {\lt} \infty \) or for \(\mu \)-almost all \(x\), \(\kappa (x) \ll \eta (x)\).
Let \(\mu \) be a finite measure on \(\mathcal X\) and let \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) be two finite kernels, such that \(\kappa (x) \ne 0\) for all \(x\). Then \(D_f(\mu \otimes \kappa , \mu \otimes \eta ) \ne \infty \iff D_f(\kappa , \eta \mid \mu ) \ne \infty \).
Let \(\mu \) be a finite measure on \(\mathcal X\) and let \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) be two finite kernels, such that \(\kappa (x) \ne 0\) for all \(x\). Then \(D_f(\mu \otimes \kappa , \mu \otimes \eta ) = D_f(\kappa , \eta \mid \mu )\).
By Lemma 2.3.26, the two divergences are finite under exactly the same conditions. We may thus assume the integrability properties of Lemma 2.3.25, under which both are finite.
TODO: the following proof assumes \(\kappa (x) \ll \eta (x)\) for \(\mu \)-almost all \(x\). Describe the general case.
By Lemma B.2.9 and Corollary B.1.2,
TODO: find a way to hide the following dummy lemma
Dummy node to summarize properties of conditional \(f\)-divergences.
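The equality between the joint and conditional \(f\)-divergences can be observed numerically in the discrete, full-support case (an illustrative sketch; the names are not from the formal development):

```python
# D_f(mu ⊗ kappa, mu ⊗ eta) = D_f(kappa, eta | mu), discrete full-support sketch.
import math

def f_div(f, p, q):  # densities as dicts with identical support
    return sum(q[x] * f(p[x] / q[x]) for x in q)

f = lambda x: x * math.log(x)  # KL-type convex function

mu = {0: 0.4, 1: 0.6}
kappa = {0: {"u": 0.5, "v": 0.5}, 1: {"u": 0.2, "v": 0.8}}  # Markov kernels
eta = {0: {"u": 0.9, "v": 0.1}, 1: {"u": 0.5, "v": 0.5}}

joint_k = {(x, y): mu[x] * kappa[x][y] for x in mu for y in kappa[x]}
joint_e = {(x, y): mu[x] * eta[x][y] for x in mu for y in eta[x]}

lhs = f_div(f, joint_k, joint_e)                         # D_f(mu ⊗ kappa, mu ⊗ eta)
rhs = sum(mu[x] * f_div(f, kappa[x], eta[x]) for x in mu)  # D_f(kappa, eta | mu)
assert abs(lhs - rhs) < 1e-12
```

The identity is immediate here because the density of \(\mu \otimes \kappa \) with respect to \(\mu \otimes \eta \) is \((x, y) \mapsto \frac{d\kappa (x)}{d\eta (x)}(y)\), and the outer sum is exactly the \(\mu \)-average.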
2.3.3 Integral representation of \(f\)-divergences
We prove that any \(f\)-divergence is an integral over priors of statistical informations, in which the measure depends on \(f\). The main references for this part are [LV06] and [Lie12].
Let \(f: \mathbb {R} \to \mathbb {R}\) be a convex function. Then its right derivative \(f'_+(x) \coloneqq \lim _{y \downarrow x}\frac{f(y) - f(x)}{y - x}\) is a Stieltjes function (a monotone, right-continuous function) and it defines a measure \(\gamma _f\) on \(\mathbb {R}\) by \(\gamma _f((x,y]) \coloneqq f'_+(y) - f'_+(x)\). [Lie12] calls \(\gamma _f\) the curvature measure of \(f\).
For \(a \ge 0\) and \(f: \mathbb {R} \to \mathbb {R}\) a convex function, the curvature measure of \(af\) is \(\gamma _{af} = a \gamma _f\) .
For \(f,g: \mathbb {R} \to \mathbb {R}\) two convex functions, the curvature measure of \(f+g\) is \(\gamma _{f+g} = \gamma _f + \gamma _g\) .
If \(f\) and \(g\) are two Stieltjes functions with associated measures \(\mu _f\) and \(\mu _g\), and \(f\) is continuous on \([a, b]\), then
\[\int _{(a,b]} f \, \partial \mu _g + \int _{(a,b]} g \, \partial \mu _f = f(b)g(b) - f(a)g(a) .\]
For \(f: \mathbb {R} \to \mathbb {R}\) a convex function and \(x,y \in \mathbb {R}\),
\[f(y) = f(x) + f'_+(x)(y - x) + \int _{(x,y]} (y - t) \, \partial \gamma _f(t) \quad \text{if } x \le y \: ,\]
\[f(y) = f(x) + f'_+(x)(y - x) + \int _{(y,x]} (t - y) \, \partial \gamma _f(t) \quad \text{if } y \le x \: .\]
Let \(\Lambda \) be the Lebesgue measure and let \(x {\lt} y\). Since \(f\) has right derivative \(f'_+\) in \((x,y)\) and \(f'_+\) is integrable on that interval,
We now integrate by parts, using Theorem 2.3.31. \(\Lambda \) is the measure obtained from the Stieltjes function \(z \mapsto z - y\).
Putting the two equalities together, we get the result for \(x {\lt} y\). The proof for \(x {\gt} y\) is similar, and for \(x = y\) both sides of both equations are zero.
For \(a,b \in (0, +\infty )\) let \(\phi _{a,b} : \mathbb {R} \to \mathbb {R}\) be the function defined by
\[\phi _{a,b}(x) = \begin{cases} \max \{ a x - b, 0\} & \text{if } b > a \: , \\ \max \{ b - a x, 0\} & \text{if } b \le a \: . \end{cases}\]
For \(f: \mathbb {R} \to \mathbb {R}\) a convex function, for all \(x \in \mathbb {R}\),
\[f(x) = f(1) + f'_+(1)(x - 1) + \int _y \phi _{1,y}(x) \, \partial \gamma _f .\]
By Lemma 2.3.32, for \(x {\gt} 1\),
That final integral is equal to \(\int _y \phi _{1,y}(x) \partial \gamma _f\) (without the restriction to \((1, +\infty )\)) because \(\phi _{1,y}(x)\) is zero outside of \((1, +\infty )\) for \(x \ge 1\).
The case of \(x {\lt} 1\) is similar.
The curvature measure of the function \(\phi _{a,b}\) is \(\gamma _{\phi _{a,b}} = a\delta _{b/a}\) , where \(\delta _x\) is the Dirac measure at \(x\).
The following theorem is inspired by [LV06, Lie12] but gives a different integral representation.
For two finite measures \(\mu , \nu \in \mathcal M(\mathcal X)\),
\[D_f(\mu , \nu ) = f(1) \, \nu (\mathcal X) + f'_+(1) \left(\mu (\mathcal X) - \nu (\mathcal X)\right) + \int _y D_{\phi _{1,y}}(\mu , \nu ) \, \partial \gamma _f .\]
For probability measures, \(D_{\phi _{1,y}}(\mu , \nu ) = E_y(\mu , \nu )\) , but these functions can differ if \(\mu \) and \(\nu \) are finite measures with different masses.
We prove the result for \(f(1) = 0\) and \(f'_+(1) = 0\) for simplicity of exposition.
From Corollary 2.3.33 we get that \(f(x) = \int _y \phi _{1,y}(x) \partial \gamma _f\) . From this and the theorem of Fubini,
For the mutually singular part of the divergence \(D_f\),
The last integral is equal to \(\int _{y \in \mathbb {R}} \phi _{1,y}'(\infty ) \partial \gamma _f\) since \(\phi _{1,y}'(\infty ) = 0\) for \(y \le 1\). Finally,
TODO: define \(\Gamma _f\).
TODO: in the lemma below, \(I_\pi (\mu , \nu )\) is the correct value for probability measures, but in general it should be replaced by \(D_{\phi _{\pi , 1 - \pi }}(\mu , \nu )\) . These functions can differ if \(\mu \) and \(\nu \) are finite measures with different masses.
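For probability measures and a twice-differentiable \(f\) normalized so that \(f(1) = 0\) and \(f'_+(1) = 0\), the curvature measure has density \(f''\) and the representation reduces to \(D_f(\mu , \nu ) = \int _0^\infty E_\gamma (\mu , \nu ) f''(\gamma ) \, d\gamma \). The sketch below checks this numerically for \(f(x) = x \log x - x + 1\) (so \(f'' (x) = 1/x\) and \(D_f\) is the KL divergence); it is an illustration under those smoothness assumptions, not part of the formal development.

```python
# Numerical check of D_f = ∫ E_gamma f''(gamma) dgamma for the KL-normalized
# f(x) = x log x - x + 1, on a pair of two-point probability measures.
import math

mu = [0.3, 0.7]
nu = [0.6, 0.4]

def E(gamma):  # extended hockey-stick divergence E_gamma = I_{(1, gamma)}
    if gamma >= 1:
        return sum(max(p - gamma * q, 0.0) for p, q in zip(mu, nu))
    return sum(max(gamma * q - p, 0.0) for p, q in zip(mu, nu))

kl = sum(p * math.log(p / q) for p, q in zip(mu, nu))

# Trapezoidal integration of E(gamma) * f''(gamma) with f'' = 1/gamma; the
# integrand vanishes outside [min_x p/q, max_x p/q] = [0.5, 1.75].
a, b, n = 0.5, 1.75, 20_000
h = (b - a) / n
integral = sum(E(a + i * h) / (a + i * h) * (h if 0 < i < n else h / 2)
               for i in range(n + 1))

assert abs(integral - kl) < 1e-6
```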
2.3.4 Data-processing inequality
This section contains three proofs of the data-processing inequality for \(f\)-divergences. That inequality states that for \(\mu , \nu \) two measures on \(\mathcal X\) and \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) a Markov kernel, \(D_f(\kappa \circ \mu , \kappa \circ \nu ) \le D_f(\mu , \nu )\).
A first proof uses the convexity of \(f\) through Jensen’s inequality for conditional expectations. We first prove the DPI for measurable functions (or deterministic kernels) then use it to extend to the general case.
A second proof deals with the general case and uses Radon-Nikodym derivatives of kernels. This restricts its applicability to measurable spaces that satisfy particular countability assumptions.
The last proof uses the representation of \(f\)-divergence as an integral over statistical (risk-based) divergences, then uses the data-processing inequality for those divergences.
Currently we don’t have Lean proofs of Jensen’s inequality for conditional expectations (used in proof 1) or of an integration by parts lemma that is used in proof 3. Proof 2 gives the weakest result since it is restricted to spaces that satisfy countability assumptions, but it is the only one for which our Lean proof is sorry-free.
We first remark that it is sufficient to prove the DPI under an absolute continuity assumption on the measures.
Let \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) be a Markov kernel and \(\nu \in \mathcal M(\mathcal X)\) be a finite measure. Suppose that for all finite measures \(\mu \in \mathcal M(\mathcal X)\) with \(\mu \ll \nu \), \(D_f(\kappa \circ \mu , \kappa \circ \nu ) \le D_f(\mu , \nu )\). Then the same is true without the absolute continuity hypothesis.
We use \(\kappa \circ \mu = \kappa \circ (\frac{d\mu }{d\nu }\cdot \nu ) + \kappa \circ \mu _{\perp \nu }\). By Lemma 2.3.16,
\[D_f(\kappa \circ \mu , \kappa \circ \nu ) \le D_f\left(\kappa \circ (\tfrac{d\mu }{d\nu }\cdot \nu ), \kappa \circ \nu \right) + (\kappa \circ \mu _{\perp \nu })(\mathcal Y) \, f'(\infty ) .\]
Since \(\kappa \) is a Markov kernel, \((\kappa \circ \mu _{\perp \nu })(\mathcal Y) = \mu _{\perp \nu }(\mathcal X)\).
By hypothesis, since \(\frac{d\mu }{d\nu }\cdot \nu \ll \nu \), \(D_f\left(\kappa \circ (\frac{d\mu }{d\nu }\cdot \nu ), \kappa \circ \nu \right) \le D_f\left(\frac{d\mu }{d\nu }\cdot \nu , \nu \right)\). We get
\[D_f(\kappa \circ \mu , \kappa \circ \nu ) \le D_f\left(\tfrac{d\mu }{d\nu }\cdot \nu , \nu \right) + \mu _{\perp \nu }(\mathcal X) \, f'(\infty ) = D_f(\mu , \nu ) \: ,\]
where the last equality is Lemma 2.3.17.
The next lemma implies that in order to prove the DPI, it is sufficient to prove \(D_f(\kappa \circ \mu , \kappa \circ \nu ) \le D_f(\mu \otimes \kappa , \nu \otimes \kappa )\).
Let \(\mu , \nu \) be two measures on \(\mathcal X\) and let \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) be a Markov kernel. Then \(D_f(\mu \otimes \kappa , \nu \otimes \kappa ) = D_f(\mu , \nu )\).
By Lemma B.2.5 and the fact that \((\mu \otimes \kappa )_{\perp \nu \otimes \kappa } = \mu _{\perp \nu } \otimes \kappa \) (Lemma TODO),
Let \(\mu , \nu \) be two measures on \(\mathcal X\) and let \(\xi \) be a measure on \(\mathcal Y\). Then \(D_f(\mu \times \xi , \nu \times \xi ) = D_f(\mu , \nu )\).
Apply Lemma 2.3.38 with \(\kappa \) the constant kernel with value \(\xi \).
DPI: proof using Jensen’s inequality for the conditional expectation
This proof lifts the DPI on deterministic kernels to a DPI on all kernels. We first prove the DPI in the special case of composition with a deterministic kernel, which is the same as taking the map of the measures by a function.
Let \(\mu , \nu \in \mathcal M (\mathcal X)\) be two finite measures and let \(g : \mathcal X \to \mathcal Y\) be a measurable function. Then \(D_f(g_* \mu , g_* \nu ) \le D_f(\mu , \nu )\).
By Lemma 2.3.37, it is sufficient to prove the inequality with the additional hypothesis \(\mu \ll \nu \) (hence also \(g_* \mu \ll g_*\nu \)). In that case,
By Lemma B.3.1, \(\nu \)-almost everywhere,
Using that expression, then Jensen’s inequality for the conditional expectation (Theorem C.0.2), we get
The integral of a conditional expectation is equal to the integral of the function, hence
This is equal to \(D_f(\mu , \nu )\).
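A discrete numerical illustration of this map DPI (the helper names are illustrative only): pushing both measures forward through a non-injective function cannot increase their KL divergence.

```python
# Map DPI sketch: D_f(g_* mu, g_* nu) <= D_f(mu, nu) for the KL divergence.
import math

def kl(p, q):  # discrete distributions as dicts
    return sum(m * math.log(m / q[x]) for x, m in p.items() if m > 0)

def pushforward(g, p):
    out = {}
    for x, m in p.items():
        out[g(x)] = out.get(g(x), 0.0) + m
    return out

mu = {0: 0.5, 1: 0.25, 2: 0.25}
nu = {0: 0.1, 1: 0.6, 2: 0.3}
g = lambda x: min(x, 1)  # merges the points 1 and 2

assert kl(pushforward(g, mu), pushforward(g, nu)) <= kl(mu, nu) + 1e-12
```

Here the inequality is strict, since merging the points \(1\) and \(2\) discards information about the likelihood ratio inside the merged set.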
Let \(\mu , \nu \) be two finite measures on \(\mathcal X\) and let \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) be a Markov kernel. Then \(D_f(\kappa \circ \mu , \kappa \circ \nu ) \le D_f(\mu , \nu )\).
Let \(\pi _Y : \mathcal X \times \mathcal Y \to \mathcal Y\) be the projection \(\pi _Y((x,y)) = y\). Then \(\kappa \circ \mu = \pi _{Y*}(\mu \otimes \kappa )\) and similarly for \(\nu \). We can use the DPI for that function (Theorem 2.3.40), then Lemma 2.3.38,
DPI: proof using derivatives of kernels
This proof does not use Jensen’s inequality for the conditional expectation, but only for the ordinary expectation. It starts by establishing \(D_f(\mu , \nu ) \le D_f(\mu \otimes \kappa , \nu \otimes \eta )\) for Markov kernels \(\kappa \) and \(\eta \), then deduces the DPI. Indeed, if the measurable spaces are standard Borel, we can write \(\mu \otimes \kappa = ((\kappa \circ \mu ) \otimes \kappa ^\dagger _\mu )_\leftrightarrow \) for a Markov kernel \(\kappa ^\dagger _\mu \) (see Definition 48; \((...)_\leftrightarrow \) is notation for composition with the map that swaps the coordinates) and then use the inequality to get
The main drawback of this approach is that it requires the Bayesian inverse to exist, which is not always the case; it does exist, for example, when the measurable spaces are standard Borel.
The inequality \(D_f(\mu , \nu ) \le D_f(\mu \otimes \kappa , \nu \otimes \eta )\) is obtained by describing the Radon-Nikodym derivative \(\frac{d(\mu \otimes \kappa )}{d(\nu \otimes \eta )}\) and using convexity properties of the function \(f\).
Let \(\mu , \nu \) be two finite measures on \(\mathcal X\) and let \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) be two Markov kernels, with either \(\mathcal X\) countable or \(\mathcal{Y}\) countably generated. Then \(D_f(\mu , \nu ) \le D_f(\mu \otimes \kappa , \nu \otimes \eta )\).
TODO: the following proof assumes \(\mu \ll \nu \) and \(\kappa (x) \ll \eta (x)\) for \(\nu \)-almost all \(x\). Describe the general case.
Since \(f\) is convex, by Jensen’s inequality,
Since \(\kappa (x) \ll \eta (x)\) for \(\nu \)-almost all \(x\), \(\int _y \frac{d \kappa (x)}{d \eta (x)}(y) \, \partial \eta (x) = \int _y 1 \, \partial \kappa (x) = 1\) almost everywhere. We have obtained
Let \(\mu \) and \(\nu \) be two measures on \(\mathcal X \times \mathcal Y\) where \(\mathcal Y\) is standard Borel, and let \(\mu _X, \nu _X\) be their marginals on \(\mathcal X\). Then \(D_f(\mu _X, \nu _X) \le D_f(\mu , \nu )\). Similarly, for \(\mathcal X\) standard Borel and \(\mu _Y, \nu _Y\) the marginals on \(\mathcal Y\), \(D_f(\mu _Y, \nu _Y) \le D_f(\mu , \nu )\).
Let \(\mu , \nu \) be two finite measures on a standard Borel space \(\mathcal X\) and let \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) be two finite kernels. Then \(D_f(\kappa \circ \mu , \eta \circ \nu ) \le D_f(\mu \otimes \kappa , \nu \otimes \eta )\).
By definition, \(\kappa \circ \mu \) is the marginal of \(\mu \otimes \kappa \) (a measure on \(\mathcal X \times \mathcal Y\)) on \(\mathcal Y\), and similarly for the other measure. Hence by Theorem 2.3.43, \(D_f(\kappa \circ \mu , \eta \circ \nu ) \le D_f(\mu \otimes \kappa , \nu \otimes \eta )\).
Let \(\mu \) be a measure on a standard Borel space \(\mathcal X\) and let \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) be two finite kernels, such that \(\kappa (x) \ne 0\) for all \(x\). Then \(D_f(\kappa \circ \mu , \eta \circ \mu ) \le D_f(\mu \otimes \kappa , \mu \otimes \eta ) = D_f(\kappa , \eta \mid \mu )\).
Let \(\mu , \nu \) be two measures on \(\mathcal X\) and let \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) be a Markov kernel, where both \(\mathcal X\) and \(\mathcal Y\) are standard Borel. Then \(D_f(\kappa \circ \mu , \kappa \circ \nu ) \le D_f(\mu , \nu )\).
DPI: proof using the integral representation
This proof uses the fact that every \(f\)-divergence is an integral over \(\gamma \ge 0\) of \(f\)-divergences for the functions \(\phi _{1,\gamma }\) of Definition 20. It then suffices to prove the DPI for those particular \(f\)-divergences. Those are equal to the \(E_\gamma \) divergence, up to a term that is invariant by composition with a Markov kernel. Finally, we already know that the \(E_\gamma \) divergence satisfies the DPI, which completes the proof.
Let \(a,b \in [0, +\infty )\) and let \(\mu , \nu \) be two measures on \(\mathcal X\). Then
\[D_{\phi _{a,b}}(\mu , \nu ) = \mathcal I_{(a,b)}(\mu , \nu ) + \max \{ \text{sign}(b-a)\left(a \mu (\mathcal X) - b \nu (\mathcal X)\right), 0\} \]
in which \(\text{sign}(b-a)\) is \(1\) if \(b-a {\gt} 0\) and \(-1\) if \(b-a \le 0\).
If \(a \le b\),
Similar computations lead to the result for \(b \le a\).
Let \(a,b \in [0, +\infty )\) and let \(\mu , \nu \) be two measures on \(\mathcal X\).
Let \(a,b \in [0, +\infty )\). Let \(\mu , \nu \) be two finite measures on \(\mathcal X\) and let \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) be a Markov kernel. Then \(D_{\phi _{a,b}}(\kappa \circ \mu , \kappa \circ \nu ) \le D_{\phi _{a,b}}(\mu , \nu )\).
By Corollary 2.3.48,
The statistical information satisfies the data-processing inequality (Theorem 2.2.5) and the other terms are invariant by composition with a Markov kernel, hence \(D_{\phi _{a,b}}\) also satisfies the data-processing inequality.
Let \(\mu , \nu \) be two finite measures on \(\mathcal X\) and let \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) be a Markov kernel. Then \(D_f(\kappa \circ \mu , \kappa \circ \nu ) \le D_f(\mu , \nu )\).
By Theorem 2.3.35,
The first two terms are unchanged by composition with a Markov kernel. It thus suffices to prove the data-processing inequality for the divergences \(D_{\phi _{1, \gamma }}\). This was done in Lemma 2.3.49.
2.3.5 Sigma-algebras
For \(\mathcal A\) a sub-\(\sigma \)-algebra and \(\mu \) a measure, we write \(\mu _{| \mathcal A}\) for the restriction of \(\mu \) to that \(\sigma \)-algebra. The measure \(\mu _{| \mathcal A}\) is equal to the map of \(\mu \) by the identity, seen as a function to \(\mathcal X\) equipped with the \(\sigma \)-algebra \(\mathcal A\). We can thus get a DPI for sub-\(\sigma \)-algebras from the DPI for maps.
Let \(\mu , \nu \) be two finite measures on \(\mathcal X\) and let \(\mathcal A\) be a sub-\(\sigma \)-algebra of \(\mathcal X\). Then \(D_f(\mu _{| \mathcal A}, \nu _{| \mathcal A}) \le D_f(\mu , \nu )\).
The measure \(\mu _{| \mathcal A}\) is equal to the map of \(\mu \) by the identity, seen as a function to \(\mathcal X\) equipped with the \(\sigma \)-algebra \(\mathcal A\). That function is measurable since \(\mathcal A\) is a sub-\(\sigma \)-algebra. We can apply Theorem 2.3.40.
Let \(\mu , \nu \) be two finite measures on \(\mathcal X\). Then \(\sup _{\mathcal A \text{ finite}} D_f(\mu _{| \mathcal A}, \nu _{| \mathcal A}) = D_f(\mu , \nu )\).
Let \(\mu , \nu \in \mathcal M(\mathcal X)\) be finite measures with \(\mu \ll \nu \) and let \(g : \mathcal X \to \mathcal Y\) be a measurable function. Denote by \(g^* \mathcal Y\) the comap of the \(\sigma \)-algebra on \(\mathcal Y\) by \(g\). Then \(D_f(g_* \mu , g_* \nu ) = D_f(\mu _{| g^* \mathcal Y}, \nu _{| g^* \mathcal Y})\) .
Using Lemma B.4.2,
2.3.6 Consequences of the data-processing inequality
Direct applications
Let \(\mu , \nu \) be two finite measures on \(\mathcal X\) and let \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) be two Markov kernels. Then \(D_f(\mu , \nu ) \le D_f(\mu \otimes \kappa , \nu \otimes \eta )\).
Since \(\kappa \) is a Markov kernel, \(\mu \) is the composition of \(\mu \otimes \kappa \) and the deterministic kernel for the function \((x,y) \mapsto x\). The same applies for \(\nu \) and \(\nu \otimes \eta \). The result is then an application of the data-processing inequality, Theorem 2.3.50.
Let \(\mu , \nu \) be two finite measures on \(\mathcal X\) and let \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) be two finite kernels. Then \(D_f(\kappa \circ \mu , \eta \circ \nu ) \le D_f(\mu \otimes \kappa , \nu \otimes \eta )\).
\(\kappa \circ \mu \) is the composition of \(\mu \otimes \kappa \) and the deterministic kernel for the function \((x,y) \mapsto y\). The same applies for \(\eta \circ \nu \) and \(\nu \otimes \eta \). The result is then an application of the data-processing inequality, Theorem 2.3.50.
Let \(\mu \) be a finite measure on \(\mathcal X\) and let \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) be two finite kernels. Then \(D_f(\kappa \circ \mu , \eta \circ \mu ) \le D_f(\mu \otimes \kappa , \mu \otimes \eta ) = D_f(\kappa , \eta \mid \mu )\).
This is a particular case of Theorem 2.3.55.
Let \(\mu \) and \(\nu \) be two measures on \(\mathcal X \times \mathcal Y\), and let \(\mu _X, \nu _X\) be their marginals on \(\mathcal X\). Then \(D_f(\mu _X, \nu _X) \le D_f(\mu , \nu )\). Similarly, for \(\mu _Y, \nu _Y\) the marginals on \(\mathcal Y\), \(D_f(\mu _Y, \nu _Y) \le D_f(\mu , \nu )\).
The measure \(\mu _X\) is the composition of \(\mu \) and the deterministic kernel for the function \((x,y) \mapsto x\), and similarly \(\mu _Y\) is the composition of \(\mu \) and a deterministic kernel. The results are both applications of the data-processing inequality, Theorem 2.3.50.
Let \(\mu , \nu \) be two measures on \(\mathcal X\) and let \(\kappa : \mathcal X \rightsquigarrow (\mathcal X \times \mathcal Y)\) be a Markov kernel such that for all \(x\), \((\kappa (x))_X = \delta _x\). Then \(D_f(\kappa \circ \mu , \kappa \circ \nu ) = D_f(\mu , \nu )\).
Data-processing and mean values
In this section \(d_f(p, q)\) denotes the \(f\)-divergence between two Bernoulli distributions with means \(p\) and \(q\).
Let \(\mu , \nu \) be two measures on \(\mathcal X\) and let \(E\) be an event. Then \(D_f(\mu , \nu ) \ge d_f(\mu (E), \nu (E))\).
Use the deterministic kernel \(\kappa : \mathcal X \rightsquigarrow \{ 0, 1\} \) with \(\kappa (x) = \delta _1 \mathbb {I}\{ x \in E\} + \delta _0 \mathbb {I}\{ x \notin E\} \) in Theorem 2.3.50.
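As a quick numerical sanity check (not part of the formal development), the lemma can be illustrated for the KL divergence (the \(f\)-divergence with \(f(x) = x \log x\)) on a three-point space; the distributions and the event below are arbitrary choices.

```python
import math

def kl(p, q):
    # KL divergence between discrete distributions (assumes support(p) within support(q))
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

mu = [0.5, 0.3, 0.2]
nu = [0.2, 0.3, 0.5]
E = {0, 1}  # an arbitrary event

# Push both measures through the deterministic kernel x -> indicator(x in E):
mu_E = sum(p for i, p in enumerate(mu) if i in E)
nu_E = sum(q for i, q in enumerate(nu) if i in E)
d_f = kl([mu_E, 1 - mu_E], [nu_E, 1 - nu_E])  # KL between the two Bernoulli images

# Data processing can only lose information:
assert d_f <= kl(mu, nu)
```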
For all \(y \in [0,1]\), \(x \mapsto d_f(x, y)\) is convex and attains a minimum at \(x = y\).
\(f'_+\) is non-decreasing, hence we get that \((x \mapsto d_f(x, y))'_+\) is also non-decreasing and \(x \mapsto d_f(x, y)\) is convex.
For \(x \le y\), \(x/y \le 1 \le (1-x)/(1-y)\), hence \(f'_+(\frac{x}{y}) \le f'_+(\frac{1 - x}{1 - y})\) and \((x \mapsto d_f(x, y))'_+(x) \le 0\). The opposite holds for \(x \ge y\).
Let \(\mu , \nu \in \mathcal P([0,1])\). Then
Let \(u\) be the uniform distribution on \([0,1]\). Then \(D_f(\mu , \nu ) = D_f(\mu \times u, \nu \times u)\) (by Lemma 2.3.58). Let \(g : [0,1] \times [0,1] \to \{ 0,1\} \) be the measurable function with \(g(x, y) = \mathbb {I}\{ y \le x\} \). We also write \(g\) for the corresponding deterministic kernel. By the data-processing inequality (Theorem 2.3.50),
Those two last measures are Bernoulli distributions with means \(\mu [X]\) and \(\nu [X]\).
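The resulting comparison between \(D_f\) of measures on \([0,1]\) and \(d_f\) of their means can be checked numerically; this sketch again uses the KL divergence and two arbitrary distributions supported on \(\{ 0.1, 0.9\} \subset [0,1]\).

```python
import math

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Two distributions on the points {0.1, 0.9} of [0,1] (arbitrary choice)
points = [0.1, 0.9]
mu = [0.5, 0.5]
nu = [0.8, 0.2]

mean_mu = sum(x * p for x, p in zip(points, mu))  # mean of X under mu
mean_nu = sum(x * q for x, q in zip(points, nu))  # mean of X under nu

# d_f between the Bernoulli distributions with those means:
d_f = kl([mean_mu, 1 - mean_mu], [mean_nu, 1 - mean_nu])
assert d_f <= kl(mu, nu)
```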
Let \(\mu , \nu \in \mathcal P(\mathcal X)\) and let \(\kappa : \mathcal X \rightsquigarrow [0,1]\). Then
Let \(\pi , \xi \in \mathcal P(\Theta )\) and \(P, Q : \Theta \rightsquigarrow \mathcal X\). Suppose that the loss \(\ell '\) takes values in \([0,1]\). Then
Suppose first that \(\mathcal R_\pi ^Q \ge \mathcal R_\xi ^P\). Let \(\hat{y}_B\) be a Bayes estimator for the estimation task \((P, y, \ell ')\) with prior \(\xi \). If it does not exist, the proof can be adapted by taking an estimator with risk \(\varepsilon \)-close to the Bayes risk for any \(\varepsilon {\gt} 0\). By Corollary 2.3.62,
By definition of the Bayes risk as an infimum,
Since \(x \mapsto d_f(x, y)\) is monotone on \([y, 1]\),
This concludes the proof for \(\mathcal R_\pi ^Q \ge \mathcal R_\xi ^P\). If that inequality is reversed, we proceed similarly but take the Bayes estimator for \(Q\).
2.3.7 Convexity
The function \((\mu , \nu ) \mapsto D_f(\mu , \nu )\) is convex.
Let \(\mu _0, \mu _1, \nu _0, \nu _1\) be four measures. Let \(\lambda \in [0,1]\). Let \(\xi \) be the probability measure on \(\{ 0,1\} \) with \(\xi (\{ 1\} ) = \lambda \). Let \(\kappa \) be the kernel \(\{ 0,1\} \rightsquigarrow \mathcal X\) defined by \(\kappa (0) = \mu _0\) and \(\kappa (1) = \mu _1\). Let \(\eta \) be the kernel \(\{ 0,1\} \rightsquigarrow \mathcal X\) defined by \(\eta (0) = \nu _0\) and \(\eta (1) = \nu _1\).
By Theorem 2.3.56, \(D_f(\kappa \circ \xi , \eta \circ \xi ) \le D_f(\xi \otimes \kappa , \xi \otimes \eta )\).
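The joint convexity obtained this way can be observed numerically; the following sketch checks the convexity inequality for the KL divergence with arbitrary two-point measures and mixing weight.

```python
import math

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

mu0, nu0 = [0.7, 0.3], [0.5, 0.5]
mu1, nu1 = [0.2, 0.8], [0.4, 0.6]
lam = 0.3  # mixing weight, corresponding to xi({1}) = lam

mu_mix = [lam * a + (1 - lam) * b for a, b in zip(mu1, mu0)]
nu_mix = [lam * a + (1 - lam) * b for a, b in zip(nu1, nu0)]

lhs = kl(mu_mix, nu_mix)
rhs = lam * kl(mu1, nu1) + (1 - lam) * kl(mu0, nu0)
assert lhs <= rhs  # joint convexity of (mu, nu) -> KL(mu, nu)
```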
2.3.8 Variational representations
TODO
2.3.9 Comparisons between divergences
We collate here general tools that can be used to compare \(D_f\) and \(D_g\) for two functions \(f\) and \(g\). Specific divergences are discussed in the next sections.
If \(f \le g\), then \(D_f(\mu , \nu ) \le D_g(\mu , \nu )\).
Monotonicity of the integral and of \(f \mapsto f'(\infty )\).
If \(f(1) = 0\), \(g(1) = 0\), \(f'(1) = 0\), \(g'(1) = 0\), and both \(f\) and \(g\) have a second derivative, then
By Taylor with integral remainder, if \(f'' \le \beta g''\),
Then use Lemma 2.3.65.
Suppose that \(f(1) = g(1) = 0\) and that \(f'(1) = g'(1) = 0\), and that \(g{\gt}0\) on \((0,1) \cup (1, +\infty )\). Suppose that the space \(\mathcal X\) has at least two disjoint non-empty measurable sets. Then
Remark: if \(\mathcal X\) does not have two disjoint non-empty measurable sets then its measurable sets are only \(\{ \emptyset , \mathcal X\} \) and the ratio on the left is \(0/0\).
TODO: proof done for measurable singletons?
By Lemma 2.3.65, \(\sup _{\mu , \nu \in \mathcal P(\mathcal X)} \frac{D_f(\mu , \nu )}{D_g(\mu , \nu )} \le \sup _{t {\gt} 0, t \ne 1} \frac{f(t)}{g(t)}\). Let’s prove the other inequality.
Let \(x\) and \(y\) be two points in disjoint non-empty measurable sets. For \(t \in (0,1) \cup (1, +\infty )\) and \(\varepsilon \in (0, 1/t]\) let \(\mu _{t, \varepsilon } = t \varepsilon \delta _x + (1 - t \varepsilon ) \delta _y\) and \(\nu _\varepsilon = \varepsilon \delta _x + (1 - \varepsilon ) \delta _y\) .
Let \(\alpha = \frac{\varepsilon (1 - t)}{1 - \varepsilon }\). Then since \(f(1) = g(1) = 0\),
As \(\varepsilon \to 0\), \(\alpha \) also tends to 0 and
Bretagnolle-Huber inequalities
Let \(\mu , \nu \in \mathcal P(\mathcal X)\). If \(f(1) = 0\),
Let \(V: x \mapsto \max \left\{ 0, \frac{d\mu }{d\nu }(x) - 1\right\} \) and \(W: x \mapsto \max \left\{ 0, 1 - \frac{d\mu }{d\nu }(x)\right\} \) . At each point, one of \(1+V\) and \(1-W\) is equal to \(\frac{d\mu }{d\nu }\) and the other is equal to 1. Since \(f(1) = 0\), we obtain \(f(\frac{d\mu }{d\nu }) = f(1+V) + f(1-W)\) . The \(f\)-divergence is then
By convexity of \(f\), \(f(1+V) + f'(\infty )\mu _{\perp \nu }(\mathcal X) \ge f(1+V + \mu _{\perp \nu }(\mathcal X))\) . Using convexity again, we get
By Lemma 2.2.21, \(\nu [V] + \mu _{\perp \nu }(\mathcal X) = \nu [W] = \operatorname{TV}(\mu , \nu )\). We obtain the first equality.
To get the second equality, remark that \(D_f(\mu , \nu ) = D_{x\mapsto xf(1/x)}(\nu , \mu )\) (Theorem TODO), apply the first inequality to that function and use the symmetry of \(\operatorname{TV}\).
TODO: move this somewhere after the definition of \(KL\).
For \(\mu , \nu \in \mathcal P(\mathcal X)\),
Use the second inequality of Theorem 2.3.68, for \(f: x \mapsto x \log x\) .
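For probability measures, the resulting bound for KL can be put in the classical Bretagnolle–Huber form \(\operatorname{TV}(\mu , \nu ) \le \sqrt{1 - e^{-\operatorname{KL}(\mu , \nu )}}\). A numerical sanity check (the specific constants in the document's statement are elided above, so this sketch checks the classical form only):

```python
import math

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def tv(p, q):
    # total variation distance, with the normalization TV in [0, 1]
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

# Bretagnolle-Huber: TV(mu, nu) <= sqrt(1 - exp(-KL(mu, nu)))
for mu, nu in [([0.9, 0.1], [0.1, 0.9]), ([0.6, 0.4], [0.4, 0.6])]:
    assert tv(mu, nu) <= math.sqrt(1 - math.exp(-kl(mu, nu)))
```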
TODO: move this somewhere after the definition of \(\operatorname{H}_\alpha \).
Let \(\mu , \nu \in \mathcal P(\mathcal X)\). For \(\alpha {\gt} 0\),
Use the second inequality of Theorem 2.3.68, for \(f: x \mapsto \frac{x^{1+\alpha } - 1}{\alpha }\) to get
We then neglect one term: \((1 + \operatorname{TV}(\mu , \nu ))^{-\alpha } \ge 0\).
Joint range
TODO [ HV11 ]
2.3.10 Statistical divergences
On probability measures, the statistical information \(\mathcal I_\xi \) is an \(f\)-divergence for the function \(\phi _{\xi _0, \xi _1}\) of Definition 20.
This is a reformulation of Theorem 2.2.9.
On probability measures, the total variation distance \(\operatorname{TV}\) is an \(f\)-divergence for the function \(x \mapsto \frac{1}{2}\vert x - 1 \vert \).
This is a reformulation of Lemma 2.2.22.
2.4 Kullback-Leibler divergence
Let \(\mu , \nu \) be two measures on \(\mathcal X\). The Kullback-Leibler divergence between \(\mu \) and \(\nu \) is
if \(\mu \ll \nu \) and \(x \mapsto \log \frac{d \mu }{d \nu }(x)\) is \(\mu \)-integrable and \(+\infty \) otherwise.
\(\operatorname{KL}(\mu , \nu ) = D_f(\mu , \nu )\) for \(f: x \mapsto x \log x\).
Simple computation.
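The computation in question is the pointwise identity \(\nu \left[f\left(\frac{d\mu }{d\nu }\right)\right] = \mu \left[\log \frac{d\mu }{d\nu }\right]\) for \(f : x \mapsto x \log x\), which can be checked numerically for discrete measures with \(\mu \ll \nu \):

```python
import math

mu = [0.5, 0.3, 0.2]
nu = [0.25, 0.25, 0.5]

f = lambda x: x * math.log(x) if x > 0 else 0.0

# D_f(mu, nu) = nu[f(dmu/dnu)] with f(x) = x log x ...
D_f = sum(q * f(p / q) for p, q in zip(mu, nu))
# ... equals the defining expression KL(mu, nu) = mu[log dmu/dnu]
KL = sum(p * math.log(p / q) for p, q in zip(mu, nu) if p > 0)
assert abs(D_f - KL) < 1e-12
```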
Let \(\mu \) be a measure on \(\mathcal X\) and \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) be two kernels. The conditional Kullback-Leibler divergence between \(\kappa \) and \(\eta \) with respect to \(\mu \) is
if \(x \mapsto \operatorname{KL}(\kappa (x), \eta (x))\) is \(\mu \)-integrable and \(+\infty \) otherwise.
\(\operatorname{KL}(\kappa , \eta \mid \mu ) = D_f(\kappa , \eta \mid \mu )\) for \(f: x \mapsto x \log x\).
Simple computation.
2.4.1 Properties inherited from f-divergences
Since \(\operatorname{KL}\) is an f-divergence, every inequality for f-divergences can be translated to \(\operatorname{KL}\).
\(\operatorname{KL}(\mu , \mu ) = 0\).
Let \(\mu , \nu \) be finite measures on \(\mathcal X\), \(\xi \) be a finite measure on \(\mathcal Y\). Then \(\operatorname{KL}(x \mapsto \mu , x \mapsto \nu \mid \xi ) = \operatorname{KL}(\mu , \nu ) \xi (\mathcal X)\).
Let \(\mu \) and \(\nu \) be two measures on \(\mathcal X \times \mathcal Y\), and let \(\mu _X, \nu _X\) be their marginals on \(\mathcal X\). Then \(\operatorname{KL}(\mu _X, \nu _X) \le \operatorname{KL}(\mu , \nu )\). Similarly, for \(\mu _Y, \nu _Y\) the marginals on \(\mathcal Y\), \(\operatorname{KL}(\mu _Y, \nu _Y) \le \operatorname{KL}(\mu , \nu )\).
Apply Theorem 2.3.57.
Let \(\mu , \nu \) be two measures on \(\mathcal X\) and let \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) be a Markov kernel. Then \(\operatorname{KL}(\kappa \circ \mu , \kappa \circ \nu ) \le \operatorname{KL}(\mu , \nu )\).
Apply Theorem 2.3.50.
Let \(\mu , \nu \) be two finite measures on \(\mathcal X\). Then \(\operatorname{KL}(\mu , \nu ) \ge \mu (\mathcal X) \log \frac{\mu (\mathcal X)}{\nu (\mathcal X)}\).
Let \(\mu , \nu \) be two probability measures. Then \(\operatorname{KL}(\mu , \nu ) \ge 0\).
Apply Lemma 2.4.7 and use \(\mu (\mathcal X) = \nu (\mathcal X)\).
Let \(\mu , \nu \) be two probability measures. Then \(\operatorname{KL}(\mu , \nu ) = 0\) if and only if \(\mu = \nu \).
\((\mu , \nu ) \mapsto \operatorname{KL}(\mu , \nu )\) is convex.
Apply Theorem 2.3.64
Let \(\mu , \nu \) be two finite measures on \(\mathcal X\) and \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) be two Markov kernels such that \(\mu \otimes \kappa \ll \nu \otimes \eta \). Then \(p \mapsto \log \frac{d \mu \otimes \kappa }{d \nu \otimes \eta }(p)\) is \(\mu \otimes \kappa \)-integrable if and only if the following hold:
\(x \mapsto \log \frac{d \mu }{d \nu }(x)\) is \(\mu \)-integrable
\(x \mapsto \int _y \log \frac{d \kappa (x)}{d \eta (x)}(y) \partial \kappa (x)\) is \(\mu \)-integrable
for \(\mu \)-almost all \(x\), \(y \mapsto \log \frac{d \kappa (x)}{d \eta (x)}(y)\) is \(\kappa (x)\)-integrable
Note that from the hypothesis it easily follows that \(\mu \ll \nu \) and \(\mu \)-a.e. \(\kappa (x) \ll \eta (x)\).
Applying Lemma 2.3.24 we know that \(\log \frac{d \mu \otimes \kappa }{d \nu \otimes \eta }\) is \(\nu \otimes \eta \)-integrable iff the following hold:
for \(\nu \)-almost all \(x\), \(y \mapsto \frac{d \mu }{d \nu }(x) \frac{d \kappa (x)}{d \eta (x)}(y) \log \left(\frac{d \mu }{d \nu }(x) \frac{d \kappa (x)}{d \eta (x)}(y) \right) \) is \(\eta (x)\)-integrable
\(x \mapsto \int _y \frac{d \mu }{d \nu }(x) \frac{d \kappa (x)}{d \eta (x)}(y) \log \left(\frac{d \mu }{d \nu }(x) \frac{d \kappa (x)}{d \eta (x)}(y) \right) \partial \eta (x)\) is \(\nu \)-integrable
(i.\(\Leftarrow \)) 3. implies that \(\nu \)-a.e. \(y \mapsto \frac{d \mu }{d \nu }(x) \log \frac{d \kappa (x)}{d \eta (x)}(y)\) is \(\kappa (x)\)-integrable, hence \(y \mapsto \frac{d \mu }{d \nu }(x) \frac{d \kappa (x)}{d \eta (x)}(y) \log \frac{d \kappa (x)}{d \eta (x)}(y)\) is \(\eta (x)\)-integrable \(\nu \)-a.e. Moreover \(y \mapsto \frac{d \kappa (x)}{d \eta (x)}(y)\) is \(\eta (x)\)-integrable, then, since \(\frac{d \mu }{d \nu }(x) \log \frac{d \mu }{d \nu }(x)\) is constant in \(y\), \(y \mapsto \frac{d \mu }{d \nu }(x) \frac{d \kappa (x)}{d \eta (x)}(y) \log \frac{d \mu }{d \nu }(x)\) is \(\eta (x)\)-integrable. This allows us to conclude that \(y \mapsto \frac{d \mu }{d \nu }(x) \frac{d \kappa (x)}{d \eta (x)}(y) \log \frac{d \kappa (x)}{d \eta (x)}(y) + \frac{d \mu }{d \nu }(x) \frac{d \kappa (x)}{d \eta (x)}(y) \log \frac{d \mu }{d \nu }(x) = \frac{d \mu }{d \nu }(x) \frac{d \kappa (x)}{d \eta (x)}(y) \log \left(\frac{d \mu }{d \nu }(x) \frac{d \kappa (x)}{d \eta (x)}(y) \right)\) is \(\eta (x)\)-integrable \(\nu \)-a.e.
(ii.\(\Leftarrow \)) 1. implies that \(x \mapsto \frac{d \mu }{d \nu }(x) \log \frac{d \mu }{d \nu }(x)\) is \(\nu \)-integrable. 2. implies that \(x \mapsto \frac{d \mu }{d \nu }(x) \int _y \frac{d \kappa (x)}{d \eta (x)}(y) \log \frac{d \kappa (x)}{d \eta (x)}(y) \partial \eta (x)\) is \(\nu \)-integrable. Therefore
is \(\nu \)-integrable.
(\(\Rightarrow \)1.) This follows from the hypothesis by applying a more general lemma for \(f\)-divergences; see ProbabilityTheory.integrable_f_rnDeriv_of_integrable_compProd’ in the Lean code.
(\(\Rightarrow \)3.) i. implies that \(\mu \)-a.e. \(y \mapsto \frac{d \kappa (x)}{d \eta (x)}(y) \log \left(\frac{d \mu }{d \nu }(x) \frac{d \kappa (x)}{d \eta (x)}(y) \right)\) is \(\eta (x)\)-integrable, hence \(y \mapsto \log \left(\frac{d \mu }{d \nu }(x) \frac{d \kappa (x)}{d \eta (x)}(y) \right) = \log \frac{d \mu }{d \nu }(x) + \log \frac{d \kappa (x)}{d \eta (x)}(y)\) is \(\kappa (x)\)-integrable. Then, since \(\log \frac{d \mu }{d \nu }(x)\) is constant in \(y\) and \(\kappa (x)\) is a finite measure, \(y \mapsto \log \frac{d \kappa (x)}{d \eta (x)}(y)\) is \(\kappa (x)\)-integrable.
(\(\Rightarrow \)2.) ii. implies that \(x \mapsto \int _y \frac{d \kappa (x)}{d \eta (x)}(y) \log \left(\frac{d \mu }{d \nu }(x) \frac{d \kappa (x)}{d \eta (x)}(y) \right) \partial \eta (x) = \int _y \log \left(\frac{d \mu }{d \nu }(x) \frac{d \kappa (x)}{d \eta (x)}(y) \right) \partial \kappa (x)\) is \(\mu \)-integrable. Thanks to 3. we can split the integral:
Then using 1. we can conclude that \(x \mapsto \int _y \log \frac{d \kappa (x)}{d \eta (x)}(y) \partial \kappa (x)\) is \(\mu \)-integrable.
2.4.2 Chain rule and tensorization
Let \(\mu , \nu \) be two finite measures on \(\mathcal X\) and \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) two Markov kernels. Then \(\operatorname{KL}(\mu \otimes \kappa , \nu \otimes \eta ) = \operatorname{KL}(\mu , \nu ) + \operatorname{KL}(\mu \otimes \kappa , \mu \otimes \eta )\).
By Lemma B.2.1, \(\mu \otimes \kappa \ll \nu \otimes \eta \iff \left( \mu \ll \nu \ \wedge \ \mu \otimes \kappa \ll \mu \otimes \eta \right)\). Hence if any of the two sides of the equality is infinite due to a lack of absolute continuity, the other is as well. We now suppose that \(\mu \otimes \kappa \ll \nu \otimes \eta \), \(\mu \ll \nu \) and \(\mu \otimes \kappa \ll \mu \otimes \eta \).
TODO: deal with the non-integrable case, without using Lemma 2.4.11 that requires countably generated sigma-algebras.
Suppose then that all three log-likelihood ratios are integrable. We use the chain rule for Radon-Nikodym derivatives (Theorem B.2.7).
Let \(\mu , \nu \) be two finite measures on \(\mathcal X\) and \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) two Markov kernels, with either \(\mathcal X\) countable or \(\mathcal{Y}\) countably generated. Then \(\operatorname{KL}(\mu \otimes \kappa , \nu \otimes \eta ) = \operatorname{KL}(\mu , \nu ) + \operatorname{KL}(\kappa , \eta \mid \mu )\).
Handle the case where \(\mu \otimes \kappa \) is not absolutely continuous with respect to \(\nu \otimes \eta \) first, then the case where \(p \mapsto \log \frac{d \mu \otimes \kappa }{d \nu \otimes \eta }(p)\) is not \(\mu \otimes \kappa \)-integrable using Lemma 2.4.11.
For the case where \(\mu \otimes \kappa \ll \nu \otimes \eta \) and \(p \mapsto \log \frac{d \mu \otimes \kappa }{d \nu \otimes \eta }(p)\) is \(\mu \otimes \kappa \)-integrable use Lemma B.2.11 and Corollary B.1.2 in a computation:
Let \(\mu , \nu \) be two finite measures on \(\mathcal X\) and \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) two Markov kernels with Bayesian inverses \(\kappa _\mu ^\dagger \) and \(\eta _\nu ^\dagger \). Then
Let \(\mu , \nu \) be two finite measures on \(\mathcal X \times \mathcal Y\), where \(\mathcal Y\) is standard Borel. Then \(\operatorname{KL}(\mu , \nu ) = \operatorname{KL}(\mu _X, \nu _X) + \operatorname{KL}(\mu _{Y|X}, \nu _{Y|X} \mid \mu _X)\).
Write \(\mu = \mu _X \otimes \mu _{Y|X}\) and \(\nu = \nu _X \otimes \nu _{Y|X}\), then use Theorem 2.4.13.
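For discrete joint measures the chain rule can be verified directly; in the sketch below the joint distributions on \(\{ 0,1\} \times \{ 0,1\} \) are arbitrary, and the conditional KL term is the integral form valid in this discrete setting.

```python
import math

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Joint distributions on {0,1} x {0,1}, as nested lists mu[x][y]
mu = [[0.2, 0.3], [0.1, 0.4]]
nu = [[0.25, 0.25], [0.25, 0.25]]

mu_X = [sum(row) for row in mu]  # marginal on X
nu_X = [sum(row) for row in nu]
mu_cond = [[v / sum(row) for v in row] for row in mu]  # mu_{Y|X}
nu_cond = [[v / sum(row) for v in row] for row in nu]  # nu_{Y|X}

joint = kl([v for row in mu for v in row], [v for row in nu for v in row])
chain = kl(mu_X, nu_X) + sum(
    mu_X[x] * kl(mu_cond[x], nu_cond[x]) for x in range(len(mu)))
assert abs(joint - chain) < 1e-12  # KL(mu, nu) = KL(mu_X, nu_X) + KL(mu_{Y|X}, nu_{Y|X} | mu_X)
```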
Let \(\mu _1, \nu _1\) be finite measures on \(\mathcal X\) and \(\mu _2, \nu _2\) probability measures on \(\mathcal{Y}\). Then
For \(\mu _1\) a probability measure on \(\mathcal X\), \(\nu _1\) a finite measure on \(\mathcal{X}\) and \(\mu _2, \nu _2\) two probability measures on \(\mathcal Y\),
This is a particular case of Lemma 2.4.16.
Let \(I\) be a finite index set. Let \((\mu _i)_{i \in I}, (\nu _i)_{i \in I}\) be probability measures on spaces \((\mathcal X_i)_{i \in I}\). Then
Induction on the size of \(I\), using Theorem 2.4.17.
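The tensorization identity admits a direct numerical check for a product of two discrete pairs (arbitrary choices below):

```python
import math

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

mu1, nu1 = [0.7, 0.3], [0.5, 0.5]
mu2, nu2 = [0.4, 0.6], [0.9, 0.1]

prod = lambda a, b: [x * y for x in a for y in b]  # product measure
lhs = kl(prod(mu1, mu2), prod(nu1, nu2))
rhs = kl(mu1, nu1) + kl(mu2, nu2)
assert abs(lhs - rhs) < 1e-12  # KL of products is the sum of the KLs
```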
Let \(\mu , \nu \) be two measures and \(E\) an event. Let \(\mu _{|E}\) be the measure defined by \(\mu _{|E}(A) = \frac{\mu (A \cap E)}{\mu (E)}\) and define \(\nu _{|E}\), \(\mu _{| E^c}\) and \(\nu _{| E^c}\) similarly. Let \(\operatorname{kl}(p,q) = p\log \frac{p}{q} + (1-p)\log \frac{1-p}{1-q}\) be the Kullback-Leibler divergence between Bernoulli distributions with means \(p\) and \(q\). Then
Let \(\mu _E\) be the Bernoulli distribution with \(\mu _E(\{ 1\} ) = \mu (E)\), and define \(\nu _E\) similarly. Let \(\kappa : \{ 0,1\} \rightsquigarrow \mathcal X\) be the kernel defined by \(\kappa (1) = \mu _{|E}\) and \(\kappa (0) = \mu _{|E^c}\). Define a kernel \(\eta \) similarly for \(\nu \). Then \(\mu = \kappa \circ \mu _E\) and \(\nu = \eta \circ \nu _E\).
\(\mu _E \otimes \kappa \) is equal to the composition of \(\mu \) and the deterministic kernel given by the function \(f : x \mapsto (\mathbb {I}_E(x), x)\). By Lemma 2.3.58, that gives \(\operatorname{KL}(\mu _E \otimes \kappa , \nu _E \otimes \eta ) = \operatorname{KL}(\mu , \nu )\).
Using the chain rule (Theorem 2.4.13),
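For probability measures the resulting decomposition reads \(\operatorname{KL}(\mu , \nu ) = \operatorname{kl}(\mu (E), \nu (E)) + \mu (E) \operatorname{KL}(\mu _{|E}, \nu _{|E}) + \mu (E^c) \operatorname{KL}(\mu _{|E^c}, \nu _{|E^c})\); this standard form can be checked numerically on a discrete space (arbitrary distributions and event):

```python
import math

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

mu = [0.4, 0.1, 0.3, 0.2]
nu = [0.1, 0.2, 0.3, 0.4]
E = {0, 1}
Ec = set(range(4)) - E

def restrict(p, S):
    # conditional distribution given S, and the mass of S
    mass = sum(p[i] for i in S)
    return [p[i] / mass for i in sorted(S)], mass

(mu_E, pE), (nu_E, qE) = restrict(mu, E), restrict(nu, E)
(mu_Ec, _), (nu_Ec, _) = restrict(mu, Ec), restrict(nu, Ec)

bern = kl([pE, 1 - pE], [qE, 1 - qE])  # kl(mu(E), nu(E))
decomposed = bern + pE * kl(mu_E, nu_E) + (1 - pE) * kl(mu_Ec, nu_Ec)
assert abs(decomposed - kl(mu, nu)) < 1e-12
```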
Let \(\mu , \nu \) be two measures and \(E\) an event. Then \(\mu (E)\log \frac{\mu (E)}{\nu (E)} \le \mu \left[\mathbb {I}(E)\log \frac{d \mu }{d \nu }\right]\) .
Apply Lemma 2.4.7 to the measures \(\mu \) and \(\nu \) restricted to \(E\). \(\mu \left[\mathbb {I}(E)\log \frac{d \mu }{d \nu }\right]\) is the KL between those two distributions.
TODO: find a way to hide the following dummy lemma
Dummy node to summarize properties of the Kullback-Leibler divergence.
2.5 Hellinger alpha-divergences
Let \(\mu , \nu \) be two measures on \(\mathcal X\). The Hellinger divergence of order \(\alpha \in [0,+\infty )\) between \(\mu \) and \(\nu \) is
with \(f_\alpha : x \mapsto \frac{x^{\alpha } - 1}{\alpha - 1}\).
For \(\alpha \in [0, 1)\) and finite measures \(\mu , \nu \), \(\operatorname{H}_\alpha (\mu , \nu ) {\lt} \infty \).
For \(\alpha \in (0,1)\cup (1, \infty )\), \(\mu \) a finite measure and \(\nu \) a probability measure, if \(\operatorname{H}_\alpha (\mu , \nu ) {\lt} \infty \) then
For \(\alpha \in (0,1)\cup (1, \infty )\), \(\mu , \nu \) two sigma-finite measures,
For \(\alpha \in (0, 1)\) and finite measures \(\mu , \nu \) with \(\mu (\mathcal X) = \nu (\mathcal X)\), \((1 - \alpha ) \operatorname{H}_\alpha (\mu , \nu ) = \alpha \operatorname{H}_{1 - \alpha }(\nu , \mu )\).
Use Lemma 2.5.3.
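The skew-symmetry relation can be checked numerically for discrete probability measures, using the closed form \(\operatorname{H}_\alpha (\mu , \nu ) = \frac{1}{\alpha - 1}\left(\sum _i p_i^\alpha q_i^{1-\alpha } - 1\right)\) valid in this setting:

```python
import math

def hellinger(alpha, p, q):
    # H_alpha(mu, nu) = nu[f_alpha(dmu/dnu)] with f_alpha(x) = (x^alpha - 1)/(alpha - 1)
    return (sum(pi**alpha * qi**(1 - alpha) for pi, qi in zip(p, q)) - 1) / (alpha - 1)

mu = [0.7, 0.3]
nu = [0.4, 0.6]
alpha = 0.3

lhs = (1 - alpha) * hellinger(alpha, mu, nu)
rhs = alpha * hellinger(1 - alpha, nu, mu)
assert abs(lhs - rhs) < 1e-12  # (1 - alpha) H_alpha(mu, nu) = alpha H_{1-alpha}(nu, mu)
```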
Let \(\mu \) be a measure on \(\mathcal X\) and \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) be two kernels. The conditional Hellinger divergence of order \(\alpha \in (0,+\infty ) \backslash \{ 1\} \) between \(\kappa \) and \(\eta \) conditionally to \(\mu \) is
for \(f_\alpha : x \mapsto \frac{x^{\alpha } - 1}{\alpha - 1}\).
2.5.1 Properties inherited from f-divergences
Since \(\operatorname{H}_\alpha \) is an f-divergence, every inequality for f-divergences can be translated to \(\operatorname{H}_\alpha \).
Let \(\alpha {\gt} 0\), \(\mu , \nu \) be two finite measures on \(\mathcal X\) and let \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) be a Markov kernel. Then \(\operatorname{H}_\alpha (\kappa \circ \mu , \kappa \circ \nu ) \le \operatorname{H}_\alpha (\mu , \nu )\).
Apply Theorem 2.3.50.
Let \(\mu , \nu \) be two probability measures. Then \(\operatorname{H}_\alpha (\mu , \nu ) \ge 0\).
Apply Lemma 2.3.20.
\((\mu , \nu ) \mapsto \operatorname{H}_\alpha (\mu , \nu )\) is convex.
Apply Theorem 2.3.64
TODO: find a way to hide the following dummy lemma
Dummy node to summarize properties of the Hellinger \(\alpha \)-divergence.
2.6 Rényi divergences
Let \(\mu , \nu \) be two measures on \(\mathcal X\). The Rényi divergence of order \(\alpha \in \mathbb {R}\) between \(\mu \) and \(\nu \) is
For \(\mu \) a sigma-finite measure and \(\nu \) a finite measure
For \(\alpha \in [0, 1)\) and finite measures \(\mu , \nu \),
For \(\alpha \in (0,1)\cup (1, \infty )\) and finite measures \(\mu , \nu \), if \(\left(\frac{d \mu }{d \nu }\right)^\alpha \) is integrable with respect to \(\nu \) and \(\mu \ll \nu \) then
For \(\alpha \in (0,1)\cup (1, \infty )\) and finite measures \(\mu , \nu \), if \(\left(\frac{d \mu }{d \nu }\right)^\alpha \) is integrable with respect to \(\nu \) and \(\mu \ll \nu \) then
For \(\alpha \in (0, 1)\) and finite measures \(\mu , \nu \) with \(\mu (\mathcal X) = \nu (\mathcal X)\),
Use Lemma 2.5.4.
The cumulant generating function of \(\log \frac{d\mu }{d\nu }\) under \(\nu \) is \(\alpha \mapsto (\alpha - 1) R_\alpha (\mu , \nu )\).
Unfold the definitions, using Lemma 2.6.3 for the Rényi divergence.
Set \(\alpha {\gt} 0\). If \(R_{1+\alpha }(\mu , \nu ) {\lt} \infty \), the cumulant generating function of \(\log \frac{d\mu }{d\nu }\) under \(\mu \) has value \(\alpha R_{1+\alpha }(\mu , \nu )\) at \(\alpha \).
Unfold the definitions, using Lemma 2.6.4 for the Rényi divergence.
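The cumulant-generating-function identity can be checked numerically for discrete probability measures with full support, where \(R_\alpha (\mu , \nu ) = \frac{1}{\alpha - 1} \log \sum _i p_i^\alpha q_i^{1-\alpha }\):

```python
import math

def renyi(alpha, p, q):
    return math.log(sum(pi**alpha * qi**(1 - alpha) for pi, qi in zip(p, q))) / (alpha - 1)

mu = [0.7, 0.3]
nu = [0.4, 0.6]
alpha = 2.0

# CGF of log(dmu/dnu) under nu, evaluated at alpha:
cgf = math.log(sum(q * math.exp(alpha * math.log(p / q)) for p, q in zip(mu, nu)))
assert abs(cgf - (alpha - 1) * renyi(alpha, mu, nu)) < 1e-12
```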
Let \(\mu \) be a measure on \(\mathcal X\) and \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) be two kernels. The conditional Rényi divergence of order \(\alpha \in (0,+\infty ) \backslash \{ 1\} \) between \(\kappa \) and \(\eta \) conditionally to \(\mu \) is
TODO: the reason for this definition is that it is equal to \(R_\alpha (\mu \otimes \kappa , \mu \otimes \eta )\), which is the generic conditional divergence definition, and also that this is the definition for which we have a chain rule.
2.6.1 Properties inherited from f-divergences
Since Rényi divergences are monotone transformations of f-divergences, every inequality for f-divergences can be translated to Rényi divergences.
Let \(\mu , \nu \) be two finite measures on \(\mathcal X\) and let \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) be a Markov kernel. Then \(R_\alpha (\kappa \circ \mu , \kappa \circ \nu ) \le R_\alpha (\mu , \nu )\).
The function \(x \mapsto \frac{1}{\alpha - 1}\log (1 + (\alpha - 1)x)\) is non-decreasing and \(D_f\) satisfies the DPI (Theorem 2.3.50), hence we get the DPI for \(R_\alpha \).
Let \(\mu , \nu \) be two measures on \(\mathcal X\) and let \(E\) be an event. Let \(\mu _E\) and \(\nu _E\) be the two Bernoulli distributions with respective means \(\mu (E)\) and \(\nu (E)\). Then \(R_\alpha (\mu , \nu ) \ge R_\alpha (\mu _E, \nu _E)\).
By Corollary 2.3.59, \(D_f(\mu , \nu ) \ge D_f(\mu _E, \nu _E)\), hence \(R_\alpha (\mu , \nu ) \ge R_\alpha (\mu _E, \nu _E)\).
2.6.2 Comparisons between orders
Let \(\mu , \nu \) be two finite measures. \(R_0(\mu , \nu ) = \lim _{\alpha \downarrow 0} R_\alpha (\mu , \nu )\).
Let \(\mu , \nu \) be two finite measures. \(R_1(\mu , \nu ) = \lim _{\alpha \uparrow 1} R_\alpha (\mu , \nu )\).
Let \(\mu , \nu \) be two finite measures such that there exists \(\alpha {\gt} 1\) with \(R_\alpha (\mu , \nu )\) finite. Then \(R_1(\mu , \nu ) = \lim _{\alpha \downarrow 1} R_\alpha (\mu , \nu )\).
Let \(\mu , \nu \) be two finite measures. Then \(\alpha \mapsto R_\alpha (\mu , \nu )\) is nondecreasing on \([0, + \infty )\).
Let \(\mu , \nu \) be two finite measures. Then \(\alpha \mapsto R_\alpha (\mu , \nu )\) is continuous on \([0, 1]\) and on \([0, \sup \{ \alpha \mid R_\alpha (\mu , \nu ) {\lt} \infty \} )\).
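The monotonicity in the order \(\alpha \) can be observed numerically for discrete probability measures with full support; the value at \(\alpha = 1\) is the KL divergence.

```python
import math

def renyi(alpha, p, q):
    if abs(alpha - 1) < 1e-9:  # R_1 is the KL divergence
        return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return math.log(sum(pi**alpha * qi**(1 - alpha) for pi, qi in zip(p, q))) / (alpha - 1)

mu = [0.7, 0.3]
nu = [0.4, 0.6]
values = [renyi(a, mu, nu) for a in [0.25, 0.5, 0.75, 1.0, 1.5, 2.0, 3.0]]
assert all(x <= y + 1e-12 for x, y in zip(values, values[1:]))  # nondecreasing in alpha
```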
2.6.3 The tilted measure
Let \(\mu , \nu \) be two measures on \(\mathcal X\) and let \(\alpha \in (0, +\infty ) \backslash \{ 1\} \). Let \(p = \frac{d \mu }{d (\mu + \nu )}\) and \(q = \frac{d \nu }{d (\mu + \nu )}\). We define a measure \(\mu ^{(\alpha , \nu )}\), absolutely continuous with respect to \(\mu + \nu \) with density
For \(\alpha \in (0,1)\), \(\mu ^{(\alpha , \nu )} \ll \mu \) and \(\mu ^{(\alpha , \nu )} \ll \nu \).
\(\frac{d \mu ^{(\alpha , \nu )}}{d \nu } = \left(\frac{d\mu }{d\nu }\right)^\alpha e^{-(\alpha - 1) R_\alpha (\mu , \nu )}\) , \(\nu \)-a.e., and \(\frac{d \mu ^{(\alpha , \nu )}}{d \mu } = \left(\frac{d\nu }{d\mu }\right)^{1 - \alpha } e^{-(\alpha - 1) R_\alpha (\mu , \nu )}\) , \(\mu \)-a.e..
Let \(\mu , \nu , \xi \) be three measures on \(\mathcal X\) and let \(\alpha \in (0, 1)\). Then
Unfold definitions and compute?
Let \(\mu , \nu , \xi \) be three measures on \(\mathcal X\) and let \(\alpha \in (0, 1)\). Then
Apply Lemma 2.6.17 to \(\xi = \mu ^{(\alpha , \nu )}\).
Let \(\mu , \nu \) be two probability measures on \(\mathcal X\) and let \(\alpha \in (0, 1)\). Then
The infimum is attained at \(\xi = \mu ^{(\alpha , \nu )}\).
By Lemma 2.6.17 and Lemma 2.4.8, for all \(\xi \),
We have then \((1 - \alpha ) R_\alpha (\mu , \nu ) \le \inf _{\xi \in \mathcal P(\mathcal X)}\left( \alpha \operatorname{KL}(\xi , \mu ) + (1 - \alpha )\operatorname{KL}(\xi , \nu ) \right)\). Corollary 2.6.18 proves the other inequality.
Let \(\mu , \nu \in \mathcal P(\mathcal X)\) and let \(\alpha \in (0, 1)\). Let \(\pi _\alpha = (\alpha , 1 - \alpha ) \in \mathcal P(\{ 0,1\} )\) and let \(P : \{ 0,1\} \rightsquigarrow \mathcal X\) be the kernel with \(P(0) = \mu \) and \(P(1) = \nu \). Then
The infimum is attained at \(\xi = \mu ^{(\alpha , \nu )}\).
Compare with Lemma 2.9.5 on the Jensen-Shannon divergence.
The result is a rephrasing of Lemma 2.6.19.
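The variational identity \((1 - \alpha ) R_\alpha (\mu , \nu ) = \inf _{\xi } \left( \alpha \operatorname{KL}(\xi , \mu ) + (1 - \alpha ) \operatorname{KL}(\xi , \nu ) \right)\) and the fact that the infimum is attained at the tilted measure can be checked numerically for discrete probability measures; here the tilted measure is the normalized geometric mixture with density proportional to \(p^\alpha q^{1-\alpha }\).

```python
import math

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def renyi(alpha, p, q):
    return math.log(sum(pi**alpha * qi**(1 - alpha) for pi, qi in zip(p, q))) / (alpha - 1)

mu = [0.7, 0.3]
nu = [0.4, 0.6]
alpha = 0.3

# Tilted measure: density proportional to p^alpha * q^(1 - alpha)
w = [p**alpha * q**(1 - alpha) for p, q in zip(mu, nu)]
xi_star = [x / sum(w) for x in w]

target = (1 - alpha) * renyi(alpha, mu, nu)
at_star = alpha * kl(xi_star, mu) + (1 - alpha) * kl(xi_star, nu)
assert abs(target - at_star) < 1e-12  # the infimum is attained at the tilted measure

# Any other xi gives a value at least as large:
xi = [0.5, 0.5]
assert alpha * kl(xi, mu) + (1 - alpha) * kl(xi, nu) >= target - 1e-12
```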
2.6.4 Chain rule and tensorization
Let \(\mu , \nu \) be two measures on \(\mathcal X\) and let \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) be two Markov kernels. Then \(R_\alpha (\mu \otimes \kappa , \nu \otimes \eta ) = R_\alpha (\mu , \nu ) + R_\alpha (\kappa , \eta \mid \mu ^{(\alpha , \nu )})\).
Let \(\mu , \nu \) be two finite measures on \(\mathcal X\) and \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) two Markov kernels with Bayesian inverses \(\kappa _\mu ^\dagger \) and \(\eta _\nu ^\dagger \). Then
Let \(\mu , \nu \in \mathcal P(\mathcal X)\) and \(\xi , \lambda \in \mathcal P(\mathcal Y)\). Then \(R_\alpha (\mu \times \xi , \nu \times \lambda ) = R_\alpha (\mu , \nu ) + R_\alpha (\xi , \lambda )\).
Apply Theorem 2.6.21 (the product of measures is a special case of \(\otimes \)). TODO: more details.
Let \(I\) be a finite index set. Let \((\mu _i)_{i \in I}, (\nu _i)_{i \in I}\) be probability measures on measurable spaces \((\mathcal X_i)_{i \in I}\). Then \(R_\alpha (\prod _{i \in I} \mu _i, \prod _{i \in I} \nu _i) = \sum _{i \in I} R_\alpha (\mu _i, \nu _i)\).
Induction over the finite index set, using Corollary 2.6.23.
Let \(\mu , \nu \) be two probability measures on \(\mathcal X\). Let \(n \in \mathbb {N}\) and write \(\mu ^{\otimes n}\) for the \(n\)-fold product measure of \(\mu \) on \(\mathcal X^n\). Then \(R_\alpha (\mu ^{\otimes n}, \nu ^{\otimes n}) = n R_\alpha (\mu , \nu )\).
Apply Theorem 2.6.24.
Let \(I\) be a countable index set. Let \((\mu _i)_{i \in I}, (\nu _i)_{i \in I}\) be probability measures on measurable spaces \((\mathcal X_i)_{i \in I}\). Then \(R_\alpha (\prod _{i \in I} \mu _i, \prod _{i \in I} \nu _i) = \sum _{i \in I} R_\alpha (\mu _i, \nu _i)\).
TODO: find a way to hide the following dummy lemma
Dummy node to summarize properties of the Rényi divergence.
2.7 Hellinger distance
TODO: refactor this section to link it to the Hellinger \(\alpha \)-divergence.
Let \(\mu , \nu \) be two measures. The squared Hellinger distance between \(\mu \) and \(\nu \) is
Let \(\mu , \nu \) be two probability measures. Then \(R_{1/2}(\mu , \nu ) = -2\log (1 - \operatorname{H^2}(\mu , \nu ))\).
\(R_{1/2}(\mu , \nu ) = -2 \log (1 - \frac{1}{2} D_f(\nu , \mu ))\) for \(f : x \mapsto -2 (\sqrt{x} - 1)\). Using Lemma 2.3.6, \(R_{1/2}(\mu , \nu ) = -2 \log (1 - D_g(\nu , \mu ))\) for \(g(x) = 1 - \sqrt{x}\).
It suffices then to show that \(\operatorname{H^2}(\mu , \nu ) = D_g(\mu , \nu )\), which is true by an application of Lemma 2.3.8.
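On a finite space the relation between the squared Hellinger distance and the Rényi divergence of order \(1/2\) can be checked directly, using \(\operatorname{H^2}(\mu , \nu ) = 1 - \sum _x \sqrt{\mu (x) \nu (x)}\) for probability measures. A small Python sketch (illustrative values, not part of the formalization):

```python
import numpy as np

mu = np.array([0.2, 0.5, 0.3])
nu = np.array([0.4, 0.4, 0.2])

bc = np.sum(np.sqrt(mu * nu))     # Bhattacharyya coefficient
H2 = 1 - bc                       # squared Hellinger distance (probability case)

# Rényi divergence of order 1/2: (1/(alpha-1)) log sum mu^alpha nu^(1-alpha).
R_half = np.log(np.sum(mu**0.5 * nu**0.5)) / (0.5 - 1)

# Lemma: R_{1/2}(mu, nu) = -2 log(1 - H^2(mu, nu)).
assert np.isclose(R_half, -2 * np.log(1 - H2))
```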
Let \(\mu , \nu \) be two probability measures. Then \(2 \operatorname{H^2}(\mu , \nu ) \le R_{1/2}(\mu , \nu )\).
Use Lemma 2.7.1, then \(-\log (1 - x) \ge x\).
Let \(\mu , \nu \) be two probability measures. Then \(\operatorname{H^2}(\mu , \nu ) \le \operatorname{TV}(\mu , \nu )\).
Let \(\mu , \nu \) be two probability measures. Then \(\operatorname{TV}(\mu , \nu ) \le \sqrt{\operatorname{H^2}(\mu , \nu )(2 - \operatorname{H^2}(\mu , \nu ))}\).
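The two-sided comparison between total variation and squared Hellinger distance can be illustrated numerically on a finite space, assuming the usual normalization \(\operatorname{TV}(\mu , \nu ) = \frac{1}{2} \sum _x |\mu (x) - \nu (x)|\) (a sketch with illustrative values, not part of the formalization):

```python
import numpy as np

mu = np.array([0.2, 0.5, 0.3])
nu = np.array([0.4, 0.4, 0.2])

tv = 0.5 * np.sum(np.abs(mu - nu))   # total variation distance
H2 = 1 - np.sum(np.sqrt(mu * nu))    # squared Hellinger distance

# Sandwich: H^2 <= TV <= sqrt(H^2 (2 - H^2)).
assert H2 <= tv <= np.sqrt(H2 * (2 - H2))
```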
Let \(\mu , \nu \) be two probability measures. Then
We use Lemma 2.7.4.
Let \(\mu , \nu \) be two probability measures. Then
Let \(\mu , \nu \) be two probability measures on \(\mathcal X\) and let \(n \in \mathbb {N}\), and \(\mu ^{\otimes n}, \nu ^{\otimes n}\) be their product measures on \(\mathcal X^n\). Then
2.8 Chernoff divergence
The Chernoff divergence of order \(\alpha > 0\) between two measures \(\mu \) and \(\nu \) on \(\mathcal X\) is \(C_\alpha (\mu , \nu ) = \inf _{\xi \in \mathcal P(\mathcal X)} \max \{ R_\alpha (\xi , \mu ), R_\alpha (\xi , \nu )\} \).
Note: the name Chernoff divergence is usually reserved for \(C_1\), and any reference to \(C\) without a subscript in this document refers to \(C_1\). The extension to other orders \(\alpha \) is motivated by the appearance of \(C_\alpha \) in a result further down. It is not a well-established notion, and it remains to be seen whether this extension is meaningful.
\(C_1(\mu , \nu ) = \inf _{\xi \in \mathcal P(\mathcal X)}\max \{ \operatorname{KL}(\xi , \mu ), \operatorname{KL}(\xi , \nu )\} \) .
This is the case \(\alpha = 1\) of the definition, since \(R_1 = \operatorname{KL}\).
\(C_\alpha (\mu , \nu ) = C_\alpha (\nu , \mu )\).
Immediate from the definition.
The function \(\alpha \mapsto C_\alpha (\mu , \nu )\) is nondecreasing.
Consequence of Lemma 2.6.13.
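On a two-point space the infimum defining \(C_\alpha \) can be approximated by brute force, which illustrates both the definition and the monotonicity in \(\alpha \). A Python sketch (the helpers `renyi` and `chernoff` and the grid search are illustrative, not part of the formalization):

```python
import numpy as np

def renyi(alpha, p, q):
    """Rényi divergence of order alpha between discrete distributions."""
    if alpha == 1:  # KL divergence, the limiting case alpha -> 1
        return np.sum(p * np.log(p / q))
    return np.log(np.sum(p**alpha * q**(1 - alpha))) / (alpha - 1)

def chernoff(alpha, p, q, grid=10_000):
    """Brute-force C_alpha = inf over xi of max(R_alpha(xi,p), R_alpha(xi,q))."""
    best = np.inf
    for t in np.linspace(1e-6, 1 - 1e-6, grid):
        xi = np.array([t, 1 - t])
        best = min(best, max(renyi(alpha, xi, p), renyi(alpha, xi, q)))
    return best

mu = np.array([0.2, 0.8])
nu = np.array([0.6, 0.4])

# Monotonicity: alpha -> C_alpha(mu, nu) is nondecreasing.
assert chernoff(0.5, mu, nu) <= chernoff(2.0, mu, nu) + 1e-9
# Taking xi = mu in the infimum shows C_1(mu, nu) <= KL(mu, nu).
assert chernoff(1, mu, nu) <= renyi(1, mu, nu) + 1e-6
```

The monotonicity assertion holds because the grid infima inherit the pointwise monotonicity of \(\alpha \mapsto R_\alpha (\xi , \cdot )\).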
2.9 Jensen-Shannon divergence
The Jensen-Shannon divergence indexed by \(\alpha \in (0,1)\) between two measures \(\mu \) and \(\nu \) is \(\operatorname{JS}_\alpha (\mu , \nu ) = \alpha \operatorname{KL}(\mu , \alpha \mu + (1 - \alpha ) \nu ) + (1 - \alpha ) \operatorname{KL}(\nu , \alpha \mu + (1 - \alpha ) \nu )\).
Let \(\mu , \nu \in \mathcal P(\mathcal X)\) and let \(\alpha \in (0, 1)\). Let \(\pi _\alpha = (\alpha , 1 - \alpha ) \in \mathcal P(\{ 0,1\} )\) and let \(P : \{ 0,1\} \rightsquigarrow \mathcal X\) be the kernel with \(P(0) = \mu \) and \(P(1) = \nu \). Then
\(\operatorname{JS}_\alpha \) is an \(f\)-divergence for \(f(x) = \alpha x \log (x) - (\alpha x + 1 - \alpha ) \log (\alpha x + 1 - \alpha )\) .
\((\mu , \nu ) \mapsto \operatorname{KL}(\mu , \alpha \mu + (1 - \alpha ) \nu )\) is an \(f\)-divergence for the function \(x \mapsto x\log \left(\frac{x}{\alpha x + 1 - \alpha }\right)\) (Lemma 2.3.12).
\((\mu , \nu ) \mapsto \operatorname{KL}(\nu , \alpha \mu + (1 - \alpha ) \nu )\) is an \(f\)-divergence for the function \(x \mapsto \log \left(\frac{1}{\alpha x + 1 - \alpha }\right)\) (Lemma 2.3.13).
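The \(f\)-divergence representation can be sanity-checked numerically on a finite space, assuming the convention \(D_f(\mu , \nu ) = \nu [f(\mathrm d\mu / \mathrm d\nu )]\) and the direct definition of \(\operatorname{JS}_\alpha \) as a weighted sum of KL divergences to the \(\alpha \)-mixture. A Python sketch with illustrative values:

```python
import numpy as np

def kl(p, q):
    """KL divergence between discrete distributions with full support."""
    return np.sum(p * np.log(p / q))

alpha = 0.3
mu = np.array([0.2, 0.5, 0.3])
nu = np.array([0.4, 0.4, 0.2])

def f(x):
    """f(x) = alpha x log x - (alpha x + 1 - alpha) log(alpha x + 1 - alpha)."""
    return alpha * x * np.log(x) - (alpha * x + 1 - alpha) * np.log(alpha * x + 1 - alpha)

# f-divergence D_f(mu, nu) = nu[f(dmu/dnu)], a weighted sum on a finite space.
d_f = np.sum(nu * f(mu / nu))

# Direct definition of JS_alpha via KL divergences to the alpha-mixture.
mix = alpha * mu + (1 - alpha) * nu
js = alpha * kl(mu, mix) + (1 - alpha) * kl(nu, mix)

assert np.isclose(d_f, js)
```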
For \(\alpha \in (0,1)\) and \(\mu , \nu \in \mathcal M(\mathcal X)\), \(\operatorname{JS}_\alpha (\mu , \nu ) = \operatorname{JS}_{1 - \alpha }(\nu , \mu )\).
Immediate from the definition.
Let \(\mu , \nu \in \mathcal P(\mathcal X)\) and let \(\alpha \in (0, 1)\). Then \(\operatorname{JS}_\alpha (\mu , \nu ) = \inf _{\xi \in \mathcal P(\mathcal X)} \left( \alpha \operatorname{KL}(\mu , \xi ) + (1 - \alpha ) \operatorname{KL}(\nu , \xi ) \right)\).
The infimum is attained at \(\xi = \alpha \mu + (1 - \alpha ) \nu \).
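The attainment of the infimum at the mixture can be observed numerically on a two-point space, assuming the variational formula \(\operatorname{JS}_\alpha (\mu , \nu ) = \inf _\xi \left( \alpha \operatorname{KL}(\mu , \xi ) + (1 - \alpha ) \operatorname{KL}(\nu , \xi ) \right)\). A Python sketch (grid search and values are illustrative):

```python
import numpy as np

def kl(p, q):
    return np.sum(p * np.log(p / q))

alpha = 0.3
mu = np.array([0.2, 0.8])
nu = np.array([0.6, 0.4])

# Candidate minimizer: the alpha-mixture.
mix = alpha * mu + (1 - alpha) * nu
js = alpha * kl(mu, mix) + (1 - alpha) * kl(nu, mix)

# Grid search over xi = (t, 1-t): the mixture should attain the infimum.
ts = np.linspace(1e-6, 1 - 1e-6, 10_000)
vals = [alpha * kl(mu, np.array([t, 1 - t]))
        + (1 - alpha) * kl(nu, np.array([t, 1 - t])) for t in ts]
assert np.isclose(js, min(vals), atol=1e-6)
```

The minimization is a cross-entropy problem in \(\xi \), which is why the minimizer is exactly \(\alpha \mu + (1 - \alpha ) \nu \).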
Let \(\mu , \nu \in \mathcal P(\mathcal X)\) and let \(\alpha \in (0, 1)\). Let \(\pi _\alpha = (\alpha , 1 - \alpha ) \in \mathcal P(\{ 0,1\} )\) and let \(P : \{ 0,1\} \rightsquigarrow \mathcal X\) be the kernel with \(P(0) = \mu \) and \(P(1) = \nu \). Then \(\operatorname{JS}_\alpha (\mu , \nu ) = \inf _{\xi \in \mathcal P(\mathcal X)} \operatorname{KL}(\pi _\alpha \otimes P, \pi _\alpha \times \xi )\).
The infimum is attained at \(\xi = \alpha \mu + (1 - \alpha ) \nu \).
Compare with Lemma 2.6.20 on the Rényi divergence.
Let \(\mu , \nu \) be two probability measures on \(\mathcal X\). Let \(n \in \mathbb {N}\) and write \(\mu ^{\otimes n}\) for the \(n\)-fold product measure of \(\mu \) on \(\mathcal X^n\). Then \(\operatorname{JS}_\alpha (\mu ^{\otimes n}, \nu ^{\otimes n}) \le n \operatorname{JS}_\alpha (\mu , \nu )\).
By Lemma 2.9.4, \(\operatorname{JS}_\alpha (\mu ^{\otimes n}, \nu ^{\otimes n}) = \inf _{\xi \in \mathcal P(\mathcal X^n)} \left( \alpha \operatorname{KL}(\mu ^{\otimes n}, \xi ) + (1 - \alpha ) \operatorname{KL}(\nu ^{\otimes n}, \xi ) \right)\). Taking \(\xi = \xi _0^{\otimes n}\) with \(\xi _0 = \alpha \mu + (1 - \alpha ) \nu \) and using the tensorization \(\operatorname{KL}(\mu ^{\otimes n}, \xi _0^{\otimes n}) = n \operatorname{KL}(\mu , \xi _0)\) bounds the infimum by \(n \operatorname{JS}_\alpha (\mu , \nu )\).
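The subadditivity of \(\operatorname{JS}_\alpha \) under products can be checked numerically for \(n = 2\) on a two-point space (a Python sketch with illustrative values, not part of the formalization):

```python
import numpy as np

def kl(p, q):
    return np.sum(p * np.log(p / q))

def js(alpha, p, q):
    """Jensen-Shannon divergence: weighted KLs to the alpha-mixture."""
    m = alpha * p + (1 - alpha) * q
    return alpha * kl(p, m) + (1 - alpha) * kl(q, m)

alpha = 0.3
mu = np.array([0.2, 0.8])
nu = np.array([0.6, 0.4])

# Product measures on X^2 via outer products of the mass functions.
mu2 = np.outer(mu, mu).ravel()
nu2 = np.outer(nu, nu).ravel()

# Subadditivity: JS_alpha(mu^{ox 2}, nu^{ox 2}) <= 2 JS_alpha(mu, nu).
assert js(alpha, mu2, nu2) <= 2 * js(alpha, mu, nu) + 1e-12
```

Unlike the Rényi divergence, equality does not hold in general: \(\operatorname{JS}_\alpha \) is bounded, so it cannot be additive over products.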
2.9.1 Comparisons with other divergences
For \(\mu , \nu \in \mathcal P(\mathcal X)\) and \(\alpha , \lambda \in (0,1)\), \(\operatorname{JS}_\alpha (\mu , \nu ) \le \frac{(1 - \alpha )^{1 - \lambda } \alpha ^\lambda }{\lambda ^{1 - \lambda } (1 - \lambda )^\lambda } \, (1 - \lambda ) \operatorname{H}_\lambda (\mu , \nu )\).
In particular,
The main tool is Lemma 2.3.66. \(\operatorname{JS}_\alpha \) is an \(f\)-divergence with function \(f(x) = \alpha x \log (x) - (\alpha x + 1 - \alpha ) \log (\alpha x + 1 - \alpha )\). On probability measures, \((1 - \lambda )\operatorname{H}_\lambda \) is an \(f\)-divergence with function \(g(x) = 1 - x^\lambda + \lambda (x-1)\). These two functions satisfy \(f(1) = g(1) = 0\) and \(f'(1) = g'(1) = 0\), and both are twice differentiable, hence Lemma 2.3.66 applies, and to obtain the inequality it suffices to compute \(\sup _x \frac{f''(x)}{g''(x)}\).
Let \(h(x) = \frac{f''(x)}{g''(x)} = \frac{\alpha (1 - \alpha )}{\lambda (1 - \lambda )} \frac{x^{1 - \lambda }}{\alpha x + 1 - \alpha }\) . Its maximum is attained at a point where \(h'(x) = 0\).
The optimum for \(h\) is attained at \(x = \frac{(1 - \alpha )(1 - \lambda )}{\alpha \lambda }\) and the maximal value of \(h\) is \(\frac{(1 - \alpha )^{1 - \lambda } \alpha ^\lambda }{\lambda ^{1 - \lambda } (1 - \lambda )^\lambda }\) .
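The closed-form maximizer and maximal value of \(h\) can be double-checked numerically (a Python sketch; the grid and parameter values are illustrative):

```python
import numpy as np

alpha, lam = 0.3, 0.4

# h(x) = f''(x) / g''(x) for the functions f, g in the proof above.
xs = np.linspace(1e-3, 100, 200_000)
h = (alpha * (1 - alpha)) / (lam * (1 - lam)) * xs**(1 - lam) / (alpha * xs + 1 - alpha)

# Claimed optimizer and optimal value.
x_star = (1 - alpha) * (1 - lam) / (alpha * lam)
h_star = (1 - alpha)**(1 - lam) * alpha**lam / (lam**(1 - lam) * (1 - lam)**lam)

assert np.isclose(h.max(), h_star, rtol=1e-4)
assert abs(xs[h.argmax()] - x_star) < 0.1
```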
For \(\mu , \nu \in \mathcal P(\mathcal X)\) and \(\alpha \in (0,1)\), \(\lambda \le 1/2\) ,
By Lemma 2.9.7 for \(\alpha \) and \(1 - \lambda \),
To get the result, use \(1 - \alpha \le 1\) and \(\frac{\lambda }{1 - \lambda } \le 1\), which holds because \(\lambda \mapsto \frac{\lambda }{1 - \lambda }\) is increasing and equals \(1\) at \(\lambda = 1/2\).