- Boxes: definitions
- Ellipses: theorems and lemmas
- Blue border: the statement of this result is ready to be formalized; all prerequisites are done
- Orange border: the statement of this result is not ready to be formalized; the blueprint needs more work
- Blue background: the proof of this result is ready to be formalized; all prerequisites are done
- Green border: the statement of this result is formalized
- Green background: the proof of this result is formalized
- Dark green background: the proof of this result and all its ancestors are formalized
For \(\alpha \in (0,1)\),
TODO: move this somewhere after the definition of \(KL\).
For \(\mu , \nu \in \mathcal P(\mathcal X)\),
TODO: move this somewhere after the definition of \(\operatorname{H}_\alpha \).
Let \(\mu , \nu \in \mathcal P(\mathcal X)\). For \(\alpha {\gt} 0\),
Let \(\mu , \nu \) be two measures on \(\mathcal X\) and let \(E\) be an event. Then \(D_f(\mu , \nu ) \ge d_f(\mu (E), \nu (E))\).
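For instance, assuming \(d_f(p, q)\) denotes the binary \(f\)-divergence \(D_f(\mathrm{Ber}(p), \mathrm{Ber}(q))\), taking \(f : x \mapsto \frac{1}{2} \vert x - 1 \vert \) (for which \(D_f = \operatorname{TV}\) on probability measures) gives \(d_f(p, q) = \vert p - q \vert \), and the lemma specializes to
\[ \operatorname{TV}(\mu , \nu ) \ge \vert \mu (E) - \nu (E) \vert \: . \]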
Let \(\mu , \nu \in \mathcal P(\mathcal X)\) and let \(\kappa : \mathcal X \rightsquigarrow [0,1]\). Then
Let \(\mu , \nu \) be two measures on \(\mathcal X\) and let \(\xi \) be a measure on \(\mathcal Y\). Then \(D_f(\mu \times \xi , \nu \times \xi ) = D_f(\mu , \nu )\).
Let \(a,b \in [0, +\infty )\) and let \(\mu , \nu \) be two measures on \(\mathcal X\).
For \(\mu , \nu \in \mathcal P(\mathcal X)\), \(\alpha \in (0,1)\) and \(\lambda \le 1/2\),
Let \(\mu , \nu \) be two measures on \(\mathcal X\) and let \(E\) be an event on \(\mathcal X\). Let \(\beta \in \mathbb {R}\). Then
Let \(\alpha , \beta \in (0, 1)\). Let \(P, Q : \{ 0,1\} \rightsquigarrow \mathcal X\). We write \(\pi _\alpha \) for the probability measure on \(\{ 0,1\} \) with \(\pi _\alpha (\{ 0\} ) = \alpha \). Then
Let \(\mu , \nu \in \mathcal P(\mathcal X)\) and let \(\alpha , \beta \in (0, 1)\). Let \(P, Q : \{ 0,1\} \rightsquigarrow \mathcal X\). We write \(\pi _\alpha \) for the probability measure on \(\{ 0,1\} \) with \(\pi _\alpha (\{ 0\} ) = \alpha \). Then
Let \(\pi , \xi \in \mathcal P(\Theta )\) and \(P, Q : \Theta \rightsquigarrow \mathcal X\). Suppose that the loss \(\ell '\) takes values in \([0,1]\). Then
For \(\mu \in \mathcal M(\mathcal X)\), \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) and \(\eta : \mathcal Y \rightsquigarrow \mathcal Z\) two Markov kernels,
For \(\mu \in \mathcal P(\{ 0,1\} )\) and \(\kappa : \{ 0,1\} \rightsquigarrow \mathcal Y\),
Let \(\mu , \nu \) be two probability measures. Then
Let \(\mu , \nu \) be two probability measures. Then
Let \(\mu , \nu \) be two probability measures on \(\mathcal X\) and let \(n \in \mathbb {N}\), and \(\mu ^{\otimes n}, \nu ^{\otimes n}\) be their product measures on \(\mathcal X^n\). Then
Let \(\mu , \nu , \xi \) be three measures on \(\mathcal X\) and let \(\alpha \in (0, 1)\). Then
Let \(\mu , \nu \in \mathcal P(\mathcal X)\) and \(\xi , \lambda \in \mathcal P(\mathcal Y)\). Then \(R_\alpha (\mu \times \xi , \nu \times \lambda ) = R_\alpha (\mu , \nu ) + R_\alpha (\xi , \lambda )\).
Let \(\mu \in \mathcal M(\mathcal X)\) be a finite measure and \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) be two finite kernels, with either \(\mathcal X\) countable or \(\mathcal{Y}\) countably generated. Then \((\mu \otimes \eta )\)-almost surely,
Let \(\mu , \nu \) be two measures on \(\mathcal X\), \(\xi \in \mathcal M(\{ 0,1\} )\) and let \(E\) be an event on \(\mathcal X\). Let \(\mu _E\) and \(\nu _E\) be the two Bernoulli distributions with respective means \(\mu (E)\) and \(\nu (E)\). Then \(\mathcal I_\xi (\mu , \nu ) \ge \mathcal I_\xi (\mu _E, \nu _E)\).
For finite measures \(\mu , \nu \) and \(\xi \in \mathcal M(\{ 0,1\} )\),
Let \(\mu , \nu \) be two probability measures on \(\mathcal X\) and \(E\) an event. Then
The Bayes binary risk between measures \(\mu \) and \(\nu \) with respect to prior \(\xi \in \mathcal M(\{ 0,1\} )\), denoted by \(\mathcal B_\xi (\mu , \nu )\), is the Bayes risk \(\mathcal R^P_\xi \) for \(\Theta = \mathcal Y = \mathcal Z = \{ 0,1\} \), \(\ell (y,z) = \mathbb {I}\{ y \ne z\} \), \(P\) the kernel sending 0 to \(\mu \) and 1 to \(\nu \) and prior \(\xi \). That is,
\[ \mathcal B_\xi (\mu , \nu ) = \inf _{\hat{y} : \mathcal X \rightsquigarrow \{ 0,1\} } (\xi \otimes (\hat{y} \circ P))\left[ (y, z) \mapsto \mathbb {I}\{ y \ne z\} \right] \: , \]
in which the infimum is over Markov kernels.
If the prior is a probability measure with weights \((\pi , 1 - \pi )\), we write \(B_\pi (\mu , \nu ) = \mathcal B_{(\pi , 1 - \pi )}(\mu , \nu )\) .
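For example, for probability measures and the uniform prior, restricting the infimum to deterministic tests \(\hat{y} = \mathbb {I}_E\) (no power is lost, since the risk is linear in \(\hat{y}\)) gives the standard identity
\[ B_{1/2}(\mu , \nu ) = \inf _E \frac{1}{2}\left( \mu (E) + \nu (E^c) \right) = \frac{1}{2}\left( 1 - \operatorname{TV}(\mu , \nu ) \right) \: , \]
using \(\operatorname{TV}(\mu , \nu ) = \sup _E (\nu (E) - \mu (E))\) for probability measures.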
The Bayesian risk of an estimator \(\hat{y}\) on \((P, y, \ell ')\) for a prior \(\pi \in \mathcal M(\Theta )\) is \(R^P_\pi (\hat{y}) = \pi \left[\theta \mapsto r^P_\theta (\hat{y})\right]\) . It can also be expanded as \(R^P_\pi (\hat{y}) = (\pi \otimes (\hat{y} \circ P))\left[ (\theta , z) \mapsto \ell '(y(\theta ), z) \right]\) .
For \(\mu \in \mathcal M(\mathcal X)\) and \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\), a Bayesian inverse of \(\kappa \) is a Markov kernel \(\kappa _\mu ^\dagger : \mathcal Y \rightsquigarrow \mathcal X\) such that \(\mu \otimes \kappa = ((\kappa \circ \mu ) \otimes \kappa _\mu ^\dagger )_\leftrightarrow \) in which \((\cdot )_\leftrightarrow \) denotes swapping the two coordinates. If such an inverse exists it is unique up to a \((\kappa \circ \mu )\)-null set, and we talk about the Bayesian inverse of \(\kappa \) with respect to \(\mu \).
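As a sketch in the discrete case, assuming \(\mathcal X\) and \(\mathcal Y\) countable: evaluating the defining equality \(\mu \otimes \kappa = ((\kappa \circ \mu ) \otimes \kappa _\mu ^\dagger )_\leftrightarrow \) at a pair \((x, y)\) forces Bayes' rule,
\[ \kappa _\mu ^\dagger (y)(\{ x\} ) = \frac{\mu (\{ x\} ) \, \kappa (x)(\{ y\} )}{(\kappa \circ \mu )(\{ y\} )} \quad \text{whenever } (\kappa \circ \mu )(\{ y\} ) \ne 0 \: , \]
in agreement with the explicit binary Bayesian inverse \(P_\xi ^\dagger \) given later in this section.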
The Bayes risk of \((P, y, \ell ')\) for prior \(\pi \in \mathcal M(\Theta )\) is \(\mathcal R^P_\pi = \inf _{\hat{y} : \mathcal X \rightsquigarrow \mathcal Z} R^P_\pi (\hat{y})\) , where the infimum is over Markov kernels.
The Bayes risk of \((P, y, \ell ')\) is \(\mathcal R^*_B = \sup _{\pi \in \mathcal P(\Theta )} \mathcal R^P_\pi \: .\)
The sample complexity of simple binary hypothesis testing with prior \((\pi , 1 - \pi ) \in \mathcal P(\{ 0, 1\} )\) at risk level \(\delta \in \mathbb {R}_{+, \infty }\) is
This is the sample complexity \(n_\xi ^P(\delta )\) of Definition 9 specialized to simple binary hypothesis testing.
We define a partial order on kernels as follows. Let \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) and \(\eta : \mathcal X \rightsquigarrow \mathcal Z\) (with the same domain as \(\kappa \)). Then \(\kappa \) is Blackwell sufficient for \(\eta \), denoted by \(\eta \le _B \kappa \), if there exists a Markov kernel \(\xi : \mathcal Y \rightsquigarrow \mathcal Z\) such that \(\eta = \xi \circ \kappa \).
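For example, for every Markov kernel \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\), the discard kernel \(d_{\mathcal X} : \mathcal X \rightsquigarrow *\) to the point space satisfies \(d_{\mathcal X} = d_{\mathcal Y} \circ \kappa \), hence \(d_{\mathcal X} \le _B \kappa \): every kernel is Blackwell sufficient for the kernel that discards the data.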
Let \(D\) be a divergence. The conditional divergence of kernels \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) with respect to a measure \(\mu \in \mathcal M(\mathcal X)\) is \(\mu [x \mapsto D(\kappa (x), \eta (x))]\). It is denoted by \(D(\kappa , \eta \mid \mu )\).
Let \(f : \mathbb {R} \to \mathbb {R}\), \(\mu \) a measure on \(\mathcal X\) and \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) two Markov kernels from \(\mathcal X\) to \(\mathcal Y\). The conditional f-divergence between \(\kappa \) and \(\eta \) with respect to \(\mu \) is
\[ D_f(\kappa , \eta \mid \mu ) = \mu \left[ x \mapsto D_f(\kappa (x), \eta (x)) \right] \]
if \(x \mapsto D_f(\kappa (x), \eta (x))\) is \(\mu \)-integrable and \(+\infty \) otherwise.
Let \(\mu \) be a measure on \(\mathcal X\) and \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) be two kernels. The conditional Hellinger divergence of order \(\alpha \in (0,+\infty ) \backslash \{ 1\} \) between \(\kappa \) and \(\eta \) given \(\mu \) is
\[ \operatorname{H}_\alpha (\kappa , \eta \mid \mu ) = D_{f_\alpha }(\kappa , \eta \mid \mu ) \]
for \(f_\alpha : x \mapsto \frac{x^{\alpha } - 1}{\alpha - 1}\).
Let \(\mu \) be a measure on \(\mathcal X\) and \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) be two kernels. The conditional Kullback-Leibler divergence between \(\kappa \) and \(\eta \) with respect to \(\mu \) is
\[ \operatorname{KL}(\kappa , \eta \mid \mu ) = \mu \left[ x \mapsto \operatorname{KL}(\kappa (x), \eta (x)) \right] \]
if \(x \mapsto \operatorname{KL}(\kappa (x), \eta (x))\) is \(\mu \)-integrable and \(+\infty \) otherwise.
Let \(\kappa : \mathcal Z \rightsquigarrow \mathcal X \times \mathcal Y\). The conditional mutual information of \(\kappa \) with respect to \(\nu \in \mathcal M(\mathcal Z)\) is
Let \(\mu \) be a measure on \(\mathcal X\) and \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) be two kernels. The conditional Rényi divergence of order \(\alpha \in (0,+\infty ) \backslash \{ 1\} \) between \(\kappa \) and \(\eta \) given \(\mu \) is
Let \(f: \mathbb {R} \to \mathbb {R}\) be a convex function. Then its right derivative \(f'_+(x) \coloneqq \lim _{y \downarrow x}\frac{f(y) - f(x)}{y - x}\) is a Stieltjes function (a monotone right-continuous function) and it defines a measure \(\gamma _f\) on \(\mathbb {R}\) by \(\gamma _f((x,y]) \coloneqq f'_+(y) - f'_+(x)\). [Lie12] calls \(\gamma _f\) the curvature measure of \(f\).
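For example, for \(f : x \mapsto \frac{1}{2} \vert x - 1 \vert \) (the function associated with the total variation distance below),
\[ f'_+(x) = \begin{cases} -1/2 & \text{if } x {\lt} 1 \\ 1/2 & \text{if } x \ge 1 \end{cases} \qquad \text{hence} \qquad \gamma _f = \delta _1 \: . \]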
The deterministic kernel defined by a measurable function \(f : \mathcal X \to \mathcal Y\) is the kernel \(d_f: \mathcal X \rightsquigarrow \mathcal Y\) defined by \(d_f(x) = \delta _{f(x)}\), where for any \(y \in \mathcal Y\), \(\delta _y\) is the Dirac probability measure at \(y\).
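For instance, deterministic kernels compose like the underlying functions: for measurable \(f : \mathcal X \to \mathcal Y\) and \(g : \mathcal Y \to \mathcal Z\), \(d_g \circ d_f = d_{g \circ f}\), since \((d_g \circ d_f)(x)[h] = d_g(f(x))[h] = h(g(f(x)))\) for all measurable \(h\) (with the composition of kernels defined below).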
A divergence between measures is a function \(D\) which for any measurable space \(\mathcal X\) and any two measures \(\mu , \nu \in \mathcal M(\mathcal X)\), returns a value \(D(\mu , \nu ) \in \mathbb {R} \cup \{ +\infty \} \).
A divergence \(D\) is said to satisfy the data-processing inequality (DPI) if for all measurable spaces \(\mathcal X, \mathcal Y\), all \(\mu , \nu \in \mathcal M(\mathcal X)\) and all Markov kernels \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\),
\[ D(\kappa \circ \mu , \kappa \circ \nu ) \le D(\mu , \nu ) \: . \]
Let \(f : \mathbb {R} \to \mathbb {R}\) and let \(\mu , \nu \) be two measures on a measurable space \(\mathcal X\). The f-divergence between \(\mu \) and \(\nu \) is
\[ D_f(\mu , \nu ) = \nu \left[ x \mapsto f\left(\frac{d \mu }{d \nu }(x)\right) \right] \]
if \(x \mapsto f\left(\frac{d \mu }{d \nu }(x)\right)\) is \(\nu \)-integrable and \(+\infty \) otherwise.
The generalized Bayes estimator for prior \(\pi \in \mathcal P(\Theta )\) on \((P, y, \ell ')\) is the deterministic estimator \(\mathcal X \to \mathcal Z\) given by
\[ x \mapsto \arg \min _{z \in \mathcal Z} P_\pi ^\dagger (x)\left[ \theta \mapsto \ell '(y(\theta ), z) \right] \: , \]
if there exists such a measurable argmin.
Let \(\mu , \nu \) be two measures on \(\mathcal X\). The Hellinger divergence of order \(\alpha \in [0,+\infty )\) between \(\mu \) and \(\nu \) is
\[ \operatorname{H}_\alpha (\mu , \nu ) = D_{f_\alpha }(\mu , \nu ) \]
with \(f_\alpha : x \mapsto \frac{x^{\alpha } - 1}{\alpha - 1}\).
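As a sanity check on conventions, assuming \(\mu \ll \nu \) are probability measures and that \(\operatorname{H^2}\) denotes the squared Hellinger distance \(\operatorname{H^2}(\mu , \nu ) = 1 - \nu \left[ \sqrt{d\mu /d\nu } \right]\) (as the identity \(R_{1/2}(\mu , \nu ) = -2\log (1 - \operatorname{H^2}(\mu , \nu ))\) later in this section suggests), \(f_{1/2}(x) = 2(1 - \sqrt{x})\) gives
\[ \operatorname{H}_{1/2}(\mu , \nu ) = \nu \left[ 2\left( 1 - \sqrt{\tfrac{d\mu }{d\nu }} \right) \right] = 2 \operatorname{H^2}(\mu , \nu ) \: . \]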
The Jensen-Shannon divergence indexed by \(\alpha \in (0,1)\) between two measures \(\mu \) and \(\nu \) is
\[ \operatorname{JS}_\alpha (\mu , \nu ) = \alpha \operatorname{KL}(\mu , \alpha \mu + (1 - \alpha ) \nu ) + (1 - \alpha ) \operatorname{KL}(\nu , \alpha \mu + (1 - \alpha ) \nu ) \: . \]
Let \(\mathcal X, \mathcal Y\) be two measurable spaces. A probability transition kernel (or simply kernel) from \(\mathcal X\) to \(\mathcal Y\) is a measurable map from \(\mathcal X\) to \(\mathcal M (\mathcal Y)\), the measurable space of measures on \(\mathcal Y\). We write \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) for a kernel \(\kappa \) from \(\mathcal X\) to \(\mathcal Y\).
Let \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) and \(\eta : \mathcal Y \rightsquigarrow \mathcal Z\) be two kernels. The composition of \(\kappa \) and \(\eta \) is the kernel \(\eta \circ \kappa : \mathcal X \rightsquigarrow \mathcal Z\) such that for all measurable functions \(f : \mathcal Z \to \mathbb {R}_{+,\infty }\) and all \(x \in \mathcal X\),
\[ (\eta \circ \kappa )(x)[f] = \kappa (x)\left[ y \mapsto \eta (y)[f] \right] \: . \]
Let \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) and \(\eta : (\mathcal X \times \mathcal Y) \rightsquigarrow \mathcal Z\) be two s-finite kernels. The composition-product of \(\kappa \) and \(\eta \) is a kernel \(\kappa \otimes \eta : \mathcal X \rightsquigarrow (\mathcal Y \times \mathcal Z)\) such that for all measurable functions \(f : \mathcal Y \times \mathcal Z \to \mathbb {R}_{+,\infty }\) and \(x \in \mathcal X\),
\[ (\kappa \otimes \eta )(x)[f] = \kappa (x)\left[ y \mapsto \eta (x, y)\left[ z \mapsto f(y, z) \right] \right] \: . \]
Let \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) and \(\eta : \mathcal X' \rightsquigarrow \mathcal Y'\) be two s-finite kernels. The parallel product of \(\kappa \) and \(\eta \) is the kernel \(\kappa \parallel \eta : \mathcal X \times \mathcal X' \rightsquigarrow \mathcal Y \times \mathcal Y'\) such that for all measurable functions \(f : \mathcal Y \times \mathcal Y' \to \mathbb {R}_{+,\infty }\) and all \(x = (x_1, x_2) \in \mathcal X \times \mathcal X'\),
\[ (\kappa \parallel \eta )(x)[f] = \kappa (x_1)\left[ y \mapsto \eta (x_2)\left[ y' \mapsto f(y, y') \right] \right] \: . \]
Let \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) and \(\eta : \mathcal X \rightsquigarrow \mathcal Z\) be two s-finite kernels. The product of \(\kappa \) and \(\eta \) is the kernel \(\kappa \times \eta : \mathcal X \rightsquigarrow \mathcal Y \times \mathcal Z\) such that for all measurable functions \(f : \mathcal Y \times \mathcal Z \to \mathbb {R}_{+,\infty }\) and all \(x \in \mathcal X\),
\[ (\kappa \times \eta )(x)[f] = \kappa (x)\left[ y \mapsto \eta (x)\left[ z \mapsto f(y, z) \right] \right] \: . \]
Let \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) be two finite kernels, with either \(\mathcal X\) countable or \(\mathcal{Y}\) countably generated. The Radon-Nikodym derivative of \(\kappa \) with respect to \(\eta \), denoted by \(\frac{d \kappa }{d \eta }\), is a measurable function \(\mathcal X \times \mathcal Y \to \mathbb {R}_{+, \infty }\) with \(\kappa = \frac{d \kappa }{d \eta } \cdot \eta + \kappa _{\perp \eta }\), where for all \(x\), \(\kappa _{\perp \eta }(x) \perp \eta (x)\).
Let \(\mu , \nu \) be two measures on \(\mathcal X\). The Kullback-Leibler divergence between \(\mu \) and \(\nu \) is
\[ \operatorname{KL}(\mu , \nu ) = \mu \left[ \log \frac{d \mu }{d \nu } \right] \]
if \(\mu \ll \nu \) and \(x \mapsto \log \frac{d \mu }{d \nu }(x)\) is \(\mu \)-integrable, and \(+\infty \) otherwise.
Let \(\mu \in \mathcal M(\mathcal X)\) be an s-finite measure and \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) be an s-finite kernel. Let \(\mathcal U\) be a measurable space with a unique element \(u\). Let \(\mu _k : \mathcal U \rightsquigarrow \mathcal X\) be the constant kernel with value \(\mu \). The composition-product of \(\mu \) and \(\kappa \) is the measure on \(\mathcal X \times \mathcal Y\) defined by \((\mu _k \otimes \kappa )(u)\).
Let \(\mathcal B\) be the Borel \(\sigma \)-algebra on \(\mathbb {R}_{+,\infty }\). Let \(\mathcal X\) be a measurable space. For a measurable set \(s\) of \(\mathcal X\), let \((\mu \mapsto \mu (s))^* \mathcal B\) be the \(\sigma \)-algebra on \(\mathcal M(\mathcal X)\) defined by the comap of the evaluation function at \(s\). Then \(\mathcal M(\mathcal X)\) is a measurable space with \(\sigma \)-algebra \(\bigsqcup _{s} (\mu \mapsto \mu (s))^* \mathcal B\) where the supremum is over all measurable sets \(s\).
The mutual information is, for \(\rho \in \mathcal M(\mathcal X \times \mathcal Y)\) ,
Let \(D\) be a divergence between measures. The left \(D\)-mutual information for a measure \(\mu \in \mathcal M(\mathcal X)\) and a kernel \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) is
Let \(D\) be a divergence between measures. The right \(D\)-mutual information for a measure \(\mu \in \mathcal M(\mathcal X)\) and a kernel \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) is
The sample complexity of Bayesian estimation with respect to a prior \(\pi \in \mathcal M(\Theta )\) at risk level \(\delta \in \mathbb {R}_{+,\infty }\) is
Let \(\mu , \nu \) be two measures on \(\mathcal X\). The Rényi divergence of order \(\alpha \in \mathbb {R}\) between \(\mu \) and \(\nu \) is
Let \(\mu , \nu \) be two measures on \(\mathcal X\) and let \(\alpha \in (0, +\infty ) \backslash \{ 1\} \). Let \(p = \frac{d \mu }{d (\mu + \nu )}\) and \(q = \frac{d \nu }{d (\mu + \nu )}\). We define a measure \(\mu ^{(\alpha , \nu )}\), absolutely continuous with respect to \(\mu + \nu \), with density
\[ \frac{d \mu ^{(\alpha , \nu )}}{d (\mu + \nu )} = p^\alpha q^{1 - \alpha } e^{-(\alpha - 1) R_\alpha (\mu , \nu )} \: . \]
The Bayes risk increase \(I^P_{\pi }(\kappa )\) of a kernel \(\kappa : \mathcal X \rightsquigarrow \mathcal X'\) with respect to the estimation problem \((P, y, \ell ')\) and the prior \(\pi \in \mathcal M(\Theta )\) is the difference of the Bayes risk of \((\kappa \circ P, y, \ell ')\) and that of \((P, y, \ell ')\). That is,
The statistical information between measures \(\mu \) and \(\nu \) with respect to prior \(\xi \in \mathcal M(\{ 0,1\} )\) is \(\mathcal I_\xi (\mu , \nu ) = \min \{ \xi _0 \mu (\mathcal X), \xi _1 \nu (\mathcal X)\} - \mathcal B_\xi (\mu , \nu )\). This is the risk increase \(I_\xi ^P(d_{\mathcal X})\) in the binary hypothesis testing problem for \(d_{\mathcal X} : \mathcal X \rightsquigarrow *\) the Markov kernel to the point space.
For \(a,b \in (0, +\infty )\) let \(\phi _{a,b} : \mathbb {R} \to \mathbb {R}\) be the function defined by
Let \(\mu , \nu \) be two \(\sigma \)-finite measures on \(\mathcal X\) and let \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) be two s-finite kernels. Then
- if \(\mu \otimes \kappa \ll \nu \otimes \eta \) then \(\mu \otimes \kappa \ll \mu \otimes \eta \),
- if \(\mu \otimes \kappa \ll \nu \otimes \eta \) and \(\kappa (x) \ne 0\) for all \(x\) then \(\mu \ll \nu \),
- if \(\mu \ll \nu \) and \(\mu \otimes \kappa \ll \mu \otimes \eta \) then \(\mu \otimes \kappa \ll \nu \otimes \eta \).
In particular,
- if \(\kappa (x) \ne 0\) for all \(x\) then \(\mu \otimes \kappa \ll \nu \otimes \eta \iff \left( \mu \ll \nu \ \wedge \ \mu \otimes \kappa \ll \mu \otimes \eta \right)\),
- if \(\mu \ll \nu \) then \(\mu \otimes \kappa \ll \nu \otimes \eta \iff \mu \otimes \kappa \ll \mu \otimes \eta \).
Let \(\hat{y}_B\) be the generalized Bayes estimator for simple binary hypothesis testing. The distribution \(\ell \circ (\mathrm{id} \parallel \hat{y}_B) \circ (\pi \otimes P)\) (in which \(\ell \) stands for the associated deterministic kernel) is a Bernoulli with mean \(\mathcal B_\pi (\mu , \nu )\).
Let \(\zeta \) be a measure such that \(\mu \ll \zeta \) and \(\nu \ll \zeta \). Let \(p = \frac{d \mu }{d\zeta }\) and \(q = \frac{d \nu }{d\zeta }\). For \(\alpha \in (0,1)\), for \(g_\alpha (x) = \min \{ (\alpha -1)x, \alpha x\} \),
Dummy node to summarize properties of the Bayes binary risk.
For \(\mu , \nu \in \mathcal M(\mathcal X)\) and \(\xi \in \mathcal M(\{ 0,1\} )\), \(\mathcal B_\xi (\mu , \nu ) = \mathcal B_{\xi _{\leftrightarrow }}(\nu , \mu )\) where \(\xi _{\leftrightarrow } \in \mathcal M(\{ 0,1\} )\) is such that \(\xi _{\leftrightarrow }(\{ 0\} ) = \xi _1\) and \(\xi _{\leftrightarrow }(\{ 1\} ) = \xi _0\). For \(\pi \in [0,1]\), \(B_\pi (\mu , \nu ) = B_{1 - \pi }(\nu , \mu )\) .
The Bayesian risk of a Markov kernel \(\hat{y} : \mathcal X \rightsquigarrow \mathcal Z\) with respect to a prior \(\pi \in \mathcal M(\Theta )\) on \((P, y, \ell ')\) satisfies
whenever the Bayesian inverse \(P_\pi ^\dagger \) of \(P\) with respect to \(\pi \) exists (Definition 48).
The Bayesian risk of a Markov kernel \(\hat{y} : \mathcal X \rightsquigarrow \mathcal Z\) with respect to a prior \(\pi \in \mathcal M(\Theta )\) on \((P, y, \ell ')\) satisfies
The Bayesian inverse of a kernel \(P : \{ 0,1\} \rightsquigarrow \mathcal X\) with respect to a prior \(\xi \in \mathcal M(\{ 0,1\} )\) is \(P_\xi ^\dagger (x) = \left(\xi _0\frac{d P(0)}{d(P \circ \xi )}(x), \xi _1\frac{d P(1)}{d(P \circ \xi )}(x)\right)\) (almost surely w.r.t. \(P \circ \xi = \xi _0 P(0) + \xi _1 P(1)\)).
Let \(\mu \in \mathcal M(\mathcal X)\), \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) and \(\eta : \mathcal Y \rightsquigarrow \mathcal Z\). Then \((\eta \circ \kappa \circ \mu )\)-a.e.,
For \(\mu \in \mathcal M(\mathcal X)\) s-finite, \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) a Markov kernel and \(\kappa _\mu ^\dagger \) the Bayesian inverse of \(\kappa \) with respect to \(\mu \), these objects satisfy the equality \(\kappa _\mu ^\dagger \circ \kappa \circ \mu = \mu \).
Dummy node to summarize properties of the Bayesian inverse.
For \(\Theta = \{ 0,1\} \), the Bayes risk of a prior \(\xi \in \mathcal M(\{ 0,1\} )\) is
When the generalized Bayes estimator is well defined, the Bayes risk with respect to the prior \(\pi \in \mathcal M(\Theta )\) for \(\mathcal Y = \mathcal Z = \Theta \), \(y = \mathrm{id}\) and \(\ell ' = \mathbb {I}\{ \theta \ne z\} \) is
When \(\pi \) is a probability measure and \(P\) is a Markov kernel, \((P \circ \pi )[1] = 1\).
When the generalized Bayes estimator is well defined, the Bayes risk with respect to the prior \(\pi \in \mathcal M(\Theta )\) for \(\mathcal Y = \mathcal Z = \Theta \), \(y = \mathrm{id}\) and \(\ell ' = \mathbb {I}\{ \theta \ne z\} \) is
Suppose that \(\Theta \) is finite and let \(\pi \in \mathcal P(\Theta )\). The Bayes risk with respect to the prior \(\pi \) for \(\mathcal Y = \mathcal Z = \Theta \), \(y = \mathrm{id}\), \(P\) a Markov kernel and \(\ell ' = \mathbb {I}\{ \theta \ne z\} \) satisfies
For \(P : \Theta \rightsquigarrow \mathcal X\) and \(\kappa : \Theta \times \mathcal X \rightsquigarrow \mathcal X'\) a Markov kernel, \(\mathcal R^{P \otimes \kappa }_\pi \le \mathcal R^{(P \otimes \kappa )_{\mathcal X'}}_\pi \), in which \((P \otimes \kappa )_{\mathcal X'} : \Theta \rightsquigarrow \mathcal X'\) is the kernel obtained by marginalizing over \(\mathcal X\) in the output of \(P \otimes \kappa \).
The Bayes risk \(\mathcal R_\pi ^P\) is concave in \(P : \Theta \rightsquigarrow \mathcal X\) .
The Bayes risk of a prior \(\pi \in \mathcal M(\Theta )\) on \((P, y, \ell ')\) with \(P\) a constant Markov kernel is
In particular, it does not depend on \(P\).
When the generalized Bayes estimator is well defined, the Bayes risk with respect to the prior \(\pi \in \mathcal M(\Theta )\) is
If \(n \le m\) then \(\mathcal R_\pi ^{P^{\otimes n}} \ge \mathcal R_\pi ^{P^{\otimes m}}\).
For \(\delta \ge \min \{ \pi , 1 - \pi \} \), the sample complexity of simple binary hypothesis testing is \(n(\mu , \nu , \pi , \delta ) = 0\) .
For \(\delta \le \min \{ \pi , 1 - \pi \} \), the sample complexity of simple binary hypothesis testing satisfies
in which \(h_2: x \mapsto x\log \frac{1}{x} + (1 - x)\log \frac{1}{1 - x}\) is the binary entropy function.
For \(\delta \le \pi \le 1/2\), the sample complexity of simple binary hypothesis testing satisfies
The sample complexity of simple binary hypothesis testing satisfies \(n(\mu , \nu , \pi , \delta ) \le n_0\) , with \(n_0\) the smallest natural number such that
Consider an estimation problem with loss \(\ell ' : \mathcal Y \times \mathcal Z \to [0,1]\). Let \(\pi , \zeta \in \mathcal P(\Theta )\) and \(P, Q : \Theta \rightsquigarrow \mathcal X\) be such that \(\zeta \otimes Q \ll \pi \otimes P\). Then for all \(\beta \in \mathbb {R}\),
For \(\alpha \in (0,1)\), let \(\pi _\alpha \in \mathcal P(\{ 0,1\} )\) be the measure \((\alpha , 1 - \alpha )\). Let \(\alpha , \gamma \in (0,1)\), \(\mu , \nu \in \mathcal P(\mathcal X)\) and let \(P : \{ 0,1\} \rightsquigarrow \mathcal X\) be the kernel with \(P(0) = \mu \) and \(P(1) = \nu \) . Then for all \(\beta {\gt} 0\) and \(\varepsilon {\gt}0\) ,
Consider an estimation problem with loss \(\ell ' : \mathcal Y \times \mathcal Z \to [0,1]\). Let \(\pi , \zeta \in \mathcal P(\Theta )\) and \(P : \Theta \rightsquigarrow \mathcal X\). Then for all \(\beta \in \mathbb {R}\),
in which the infimum over \(\xi \) is restricted to probability measures such that \(\zeta \times \xi \ll \pi \otimes P\) and \(d_{\mathcal X} : \Theta \rightsquigarrow *\) is the discard kernel.
For \(\alpha \in (0,1)\), let \(\pi _\alpha \in \mathcal P(\{ 0,1\} )\) be the measure \((\alpha , 1 - \alpha )\). Let \(\alpha , \gamma \in (0,1)\), \(\mu , \nu \in \mathcal P(\mathcal X)\) and let \(P : \{ 0,1\} \rightsquigarrow \mathcal X\) be the kernel with \(P(0) = \mu \) and \(P(1) = \nu \) . Then for all \(\beta \in \mathbb {R}\) ,
Let \(\mu , \nu , \xi \) be three probability measures on \(\mathcal X\) and let \(E\) be an event on \(\mathcal X\). For \(\beta {\gt} 0\) ,
Let \(\mu \) be a finite measure on \(\mathcal X\) and let \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) be two finite kernels, where \(\kappa \) is a Markov kernel. Then \(D_f(\kappa , \eta \mid \mu ) \ne \infty \) if and only if
- for \(\mu \)-almost all \(x\), \(y \mapsto f \left( \frac{d\kappa (x)}{d\eta (x)}(y) \right)\) is \(\eta (x)\)-integrable,
- \(x \mapsto \int _y f \left( \frac{d\kappa (x)}{d\eta (x)}(y) \right) \partial \eta (x)\) is \(\mu \)-integrable,
- either \(f'(\infty ) {\lt} \infty \) or for \(\mu \)-almost all \(x\), \(\kappa (x) \ll \eta (x)\).
Dummy node to summarize properties of conditional \(f\)-divergences.
For \(f: \mathbb {R} \to \mathbb {R}\) a convex function and \(x,y \in \mathbb {R}\),
For \(f,g: \mathbb {R} \to \mathbb {R}\) two convex functions, the curvature measure of \(f+g\) is \(\gamma _{f+g} = \gamma _f + \gamma _g\) .
For \(a \ge 0\) and \(f: \mathbb {R} \to \mathbb {R}\) a convex function, the curvature measure of \(af\) is \(\gamma _{af} = a \gamma _f\) .
The curvature measure of the function \(\phi _{a,b}\) is \(\gamma _{\phi _{a,b}} = a\delta _{b/a}\) , where \(\delta _x\) is the Dirac measure at \(x\).
Let \(D\) be a divergence that satisfies the DPI. Let \(\mu , \nu \in \mathcal M(\mathcal X)\) and let \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\). Then
Let \(D\) be a divergence that satisfies the DPI and for which \(D(\kappa , \eta \mid \mu ) = D(\mu \otimes \kappa , \mu \otimes \eta )\). Let \(\mu \in \mathcal M(\mathcal X)\) and let \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) be Markov kernels. Then
Let \(D\) be a divergence that satisfies the DPI. Let \(\mu , \nu \in \mathcal M(\mathcal X)\) and let \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) be a Markov kernel. Then
Let \(D\) be a divergence that satisfies the DPI. Let \(\mu , \nu \in \mathcal M(\mathcal X)\) and let \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) be Markov kernels. Then
Let \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) be a Markov kernel and \(\nu \in \mathcal M(\mathcal X)\) be a finite measure. Suppose that for all finite measures \(\mu \in \mathcal M(\mathcal X)\) with \(\mu \ll \nu \), \(D_f(\kappa \circ \mu , \kappa \circ \nu ) \le D_f(\mu , \nu )\). Then the same is true without the absolute continuity hypothesis.
Let \(\mu \in \mathcal M(\mathcal X)\) be a finite measure and let \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) be a finite kernel. If \(\eta : \mathcal Y \rightsquigarrow \mathcal X\) is such that \(\mu \otimes \kappa = ((\kappa \circ \mu ) \otimes \eta )_\leftrightarrow \), then \(\eta (y) = \kappa _\mu ^\dagger (y)\) for \((\kappa \circ \mu )\)-almost all \(y\).
For \(\mathcal X\) standard Borel, \(\mu \) and \(\kappa \) s-finite, the Bayesian inverse of \(\kappa \) with respect to \(\mu \) exists and is obtained by disintegration of the measure \(\mu \otimes \kappa \) on \(\mathcal X \times \mathcal Y\) into a measure \(\kappa \circ \mu \in \mathcal M(\mathcal Y)\) and a Markov kernel \(\kappa _\mu ^\dagger : \mathcal Y \rightsquigarrow \mathcal X\).
Let \(\mu , \nu \) be two measures and \(E\) an event. Then \(\mu (E)\log \frac{\mu (E)}{\nu (E)} \le \mu \left[\mathbb {I}(E)\log \frac{d \mu }{d \nu }\right]\) .
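Applying this to \(E\) and \(E^c\) and summing, for probability measures with \(\mu \ll \nu \),
\[ \operatorname{kl}(\mu (E), \nu (E)) \le \mu \left[ \log \frac{d \mu }{d \nu } \right] = \operatorname{KL}(\mu , \nu ) \: , \]
the classical data-processing bound for binary discretization (with \(\operatorname{kl}\) as in the decomposition lemma for \(\operatorname{KL}\) below).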
- \((\mu , \nu ) \mapsto D_f(a \mu + b \nu , \nu )\) is an \(f\)-divergence for the function \(x \mapsto f(ax + b)\),
- \((\mu , \nu ) \mapsto D_f(\mu , a \mu + b \nu )\) is an \(f\)-divergence for the function \(x \mapsto (ax+b)f\left(\frac{x}{ax+b}\right)\),
- \((\mu , \nu ) \mapsto D_f(\nu , a \mu + b \nu )\) is an \(f\)-divergence for the function \(x \mapsto (ax+b)f\left(\frac{1}{ax+b}\right)\).
For all \(y \in [0,1]\), \(x \mapsto d_f(x, y)\) is convex and attains a minimum at \(x = y\).
Let \(\mu , \nu \in \mathcal P([0,1])\). Then
Let \(\mu \) be a finite measure on \(\mathcal X\) and let \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) be two finite kernels, such that \(\kappa (x) \ne 0\) for all \(x\). Then \(D_f(\mu \otimes \kappa , \mu \otimes \eta ) \ne \infty \iff D_f(\kappa , \eta \mid \mu ) \ne \infty \).
Let \(\mu , \nu \) be two measures on \(\mathcal X\) and let \(\kappa : \mathcal X \rightsquigarrow (\mathcal X \times \mathcal Y)\) be a Markov kernel such that for all \(x\), \((\kappa (x))_X = \delta _x\). Then \(D_f(\kappa \circ \mu , \kappa \circ \nu ) = D_f(\mu , \nu )\).
Let \(\pi , \xi \in \mathcal P(\Theta )\) and \(P, Q : \Theta \rightsquigarrow \mathcal X\). Suppose that the loss \(\ell '\) takes values in \([0,1]\). Then
If \(f(1) = 0\), \(g(1) = 0\), \(f'(1) = 0\), \(g'(1) = 0\), and both \(f\) and \(g\) have a second derivative, then
Let \(\mu , \nu \in \mathcal M(\mathcal X)\) be finite measures with \(\mu \ll \nu \) and let \(g : \mathcal X \to \mathcal Y\) be a measurable function. Denote by \(g^* \mathcal Y\) the comap of the \(\sigma \)-algebra on \(\mathcal Y\) by \(g\). Then \(D_f(g_* \mu , g_* \nu ) = D_f(\mu _{| g^* \mathcal Y}, \nu _{| g^* \mathcal Y})\) .
Let \(a,b \in [0, +\infty )\) and let \(\mu , \nu \) be two measures on \(\mathcal X\).
in which \(\text{sign}(b-a)\) is \(1\) if \(b-a {\gt} 0\) and \(-1\) if \(b-a \le 0\).
The generalized Bayes estimator for the Bayes binary risk with prior \(\xi \in \mathcal M(\{ 0,1\} )\) is \(x \mapsto \text{if } \xi _1\frac{d \nu }{d(P \circ \xi )}(x) \le \xi _0\frac{d \mu }{d(P \circ \xi )}(x) \text{ then } 0 \text{ else } 1\), i.e. it is equal to \(\mathbb {I}_E\) for \(E = \{ x \mid \xi _1\frac{d \nu }{d(P \circ \xi )}(x) {\gt} \xi _0\frac{d \mu }{d(P \circ \xi )}(x)\} \) .
The generalized Bayes estimator for prior \(\pi \in \mathcal P(\Theta )\) on the estimation problem defined by \(\mathcal Y = \mathcal Z = \Theta \), \(y = \mathrm{id}\) and \(\ell ' = \mathbb {I}\{ \theta \ne z\} \) is
Let \(\mu , \nu \) be two probability measures. Then \(2 \operatorname{H^2}(\mu , \nu ) \le R_{1/2}(\mu , \nu )\).
Let \(\mu , \nu \) be two probability measures. Then \(\operatorname{H^2}(\mu , \nu ) \le \operatorname{TV}(\mu , \nu )\).
For \(\alpha \in (0,1)\cup (1, \infty )\), \(\mu \) a finite measure and \(\nu \) a probability measure, if \(\operatorname{H}_\alpha (\mu , \nu ) {\lt} \infty \) then
Dummy node to summarize properties of the Hellinger \(\alpha \)-divergence.
Let \(\mu \) be a finite measure on \(\mathcal X\) and \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) be two finite kernels from \(\mathcal X\) to \(\mathcal Y\). Then \(p \mapsto f \left(\frac{d(\mu \otimes \kappa )}{d(\mu \otimes \eta )}(p)\right)\) is \((\mu \otimes \eta )\)-integrable iff
- \(x \mapsto D_f(\kappa (x), \eta (x))\) is \(\mu \)-integrable, and
- for \(\mu \)-almost all \(x\), \(y \mapsto f \left( \frac{d\kappa (x)}{d\eta (x)}(y) \right)\) is \(\eta (x)\)-integrable.
Let \(\mu , \nu \) be two finite measures on \(\mathcal X\) and \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) be two Markov kernels such that \(\mu \otimes \kappa \ll \nu \otimes \eta \). Then \(p \mapsto \log \frac{d \mu \otimes \kappa }{d \nu \otimes \eta }(p)\) is \((\mu \otimes \kappa )\)-integrable if and only if the following hold:
- \(x \mapsto \log \frac{d \mu }{d \nu }(x)\) is \(\mu \)-integrable,
- \(x \mapsto \int _y \log \frac{d \kappa (x)}{d \eta (x)}(y) \partial \kappa (x)\) is \(\mu \)-integrable,
- for \(\mu \)-almost all \(x\), \(y \mapsto \log \frac{d \kappa (x)}{d \eta (x)}(y)\) is \(\kappa (x)\)-integrable.
\(\operatorname{JS}_\alpha \) is an \(f\)-divergence for \(f(x) = \alpha x \log (x) - (\alpha x + 1 - \alpha ) \log (\alpha x + 1 - \alpha )\) .
Let \(\mu , \nu \in \mathcal P(\mathcal X)\) and let \(\alpha \in (0, 1)\). Then
\[ \operatorname{JS}_\alpha (\mu , \nu ) = \inf _{\xi \in \mathcal P(\mathcal X)} \left( \alpha \operatorname{KL}(\mu , \xi ) + (1 - \alpha ) \operatorname{KL}(\nu , \xi ) \right) \: . \]
The infimum is attained at \(\xi = \alpha \mu + (1 - \alpha ) \nu \).
Let \(\mu , \nu \in \mathcal P(\mathcal X)\) and let \(\alpha \in (0, 1)\). Let \(\pi _\alpha = (\alpha , 1 - \alpha ) \in \mathcal P(\{ 0,1\} )\) and let \(P : \{ 0,1\} \rightsquigarrow \mathcal X\) be the kernel with \(P(0) = \mu \) and \(P(1) = \nu \). Then
\[ \operatorname{JS}_\alpha (\mu , \nu ) = \inf _{\xi \in \mathcal P(\mathcal X)} \operatorname{KL}(\pi _\alpha \otimes P, \pi _\alpha \times \xi ) \: . \]
The infimum is attained at \(\xi = \alpha \mu + (1 - \alpha ) \nu \).
Let \(\mu , \nu \in \mathcal P(\mathcal X)\) and let \(\alpha \in (0, 1)\). Let \(\pi _\alpha = (\alpha , 1 - \alpha ) \in \mathcal P(\{ 0,1\} )\) and let \(P : \{ 0,1\} \rightsquigarrow \mathcal X\) be the kernel with \(P(0) = \mu \) and \(P(1) = \nu \). Then
For \(\mu , \nu \in \mathcal P(\mathcal X)\) and \(\alpha , \lambda \in (0,1)\) ,
In particular,
Let \(\mu , \nu \) be two probability measures on \(\mathcal X\). Let \(n \in \mathbb {N}\) and write \(\mu ^{\otimes n}\) for the product measure on \(\mathcal X^n\) of \(n\) times \(\mu \). Then \(\operatorname{JS}_\alpha (\mu ^{\otimes n}, \nu ^{\otimes n}) \le n \operatorname{JS}_\alpha (\mu , \nu )\).
For \(\alpha \in (0,1)\) and \(\mu , \nu \in \mathcal M(\mathcal X)\),
Let \(\mu , \nu \) be two measures and \(E\) an event. Let \(\mu _{|E}\) be the measure defined by \(\mu _{|E}(A) = \frac{\mu (A \cap E)}{\mu (E)}\) and define \(\nu _{|E}\), \(\mu _{| E^c}\) and \(\nu _{| E^c}\) similarly. Let \(\operatorname{kl}(p,q) = p\log \frac{p}{q} + (1-p)\log \frac{1-p}{1-q}\) be the Kullback-Leibler divergence between Bernoulli distributions with means \(p\) and \(q\). Then
Let \(\pi , \xi \in \mathcal P(\Theta )\) and \(P, Q : \Theta \rightsquigarrow \mathcal X\). Suppose that the loss \(\ell '\) takes values in \([0,1]\). Then
Let \(\mu , \nu , \xi \in \mathcal P(\mathcal X)\) and let \(\alpha , \beta \in (0, 1)\). Let \(P : \{ 0,1\} \rightsquigarrow \mathcal X\) be the kernel with \(P(0) = \mu \) and \(P(1) = \nu \). We write \(\pi _\alpha \) for the probability measure on \(\{ 0,1\} \) with \(\pi _\alpha (\{ 0\} ) = \alpha \). Let \(\bar{\beta } = \min \{ \beta , 1 - \beta \} \). Then
Let \(\mu _1, \nu _1\) be finite measures on \(\mathcal X\) and \(\mu _2, \nu _2\) probability measures on \(\mathcal{Y}\). Then
Dummy node to summarize properties of the Kullback-Leibler divergence.
Let \(\mu , \nu , \xi \) be three measures on \(\mathcal X\) and let \(\alpha \in (0, 1)\). Then
Let \(\mu , \nu \) be two measures on \(\mathcal X\) with \(\mu \ll \nu \) and let \(E\) be an event on \(\mathcal X\). Let \(\beta \in \mathbb {R}\). Then
Let \(\mu , \nu , \xi \in \mathcal P(\mathcal X)\) and let \(E\) be an event on \(\mathcal X\). Let \(\beta _1, \beta _2 \in \mathbb {R}\). Then
Let \(\mu , \nu \) be two measures on \(\mathcal X\) with \(\mu \ll \nu \) and let \(f : \mathcal X \to [0,1]\) be a measurable function. Let \(\beta \in \mathbb {R}\). Then
Let \(\mu , \nu \) be two measures on \(\mathcal X\) such that \(\mu \left[\left(\log \frac{d \mu }{d \nu }\right)^2\right] {\lt} \infty \). Let \(E\) be an event on \(\mathcal X\) and let \(\beta {\gt} 0\). Then
For \(\mu , \nu \in \mathcal P(\mathcal X)\) with \(\mu \ll \nu \) and \(n \in \mathbb {N}\), \(\nu ^{\otimes \mathbb {N}}_{| \mathcal F_n}\)-almost surely, \(\frac{d \mu ^{\otimes \mathbb {N}}_{| \mathcal F_n}}{d \nu ^{\otimes \mathbb {N}}_{| \mathcal F_n}}(x) = \prod _{m=1}^n \frac{d \mu }{d \nu }(x_m)\).
For \(\mu , \nu \in \mathcal P(\mathcal X)\) with \(\mu \ll \nu \), \(\nu _\tau \)-almost surely, \(\frac{d \mu _\tau }{d \nu _\tau }(x) = \prod _{n=1}^\tau \frac{d \mu }{d \nu }(x_n)\).
For \(\mu \in \mathcal M(\mathcal X)\) and \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) a Markov kernel,
where in the conditional divergence the measure \(\kappa \circ \mu \) should be understood as the constant kernel from \(\mathcal X\) to \(\mathcal Y\) with that value.
Let \(\mu , \nu \in \mathcal P(\mathcal X)\) and let \(\alpha \in (0, 1)\). Let \(\pi _\alpha = (\alpha , 1 - \alpha ) \in \mathcal P(\{ 0,1\} )\) and let \(P : \{ 0,1\} \rightsquigarrow \mathcal X\) be the kernel with \(P(0) = \mu \) and \(P(1) = \nu \). Then
For \(\mu \in \mathcal P(\{ 0,1\} )\) and \(\kappa : \{ 0,1\} \rightsquigarrow \mathcal Y\),
Let \(\pi \in \mathcal P(\Theta )\) and \(P : \Theta \rightsquigarrow \mathcal X\). Suppose that the loss \(\ell '\) of an estimation task with kernel \(P\) takes values in \([0,1]\). Then
For \(\mu \in \mathcal P(\mathcal X)\) and \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\),
Let \(\mu , \nu \) be two \(\sigma \)-finite measures on \(\mathcal X\) and let \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) be two finite kernels. Let \(\mu \sqcap \nu \) denote the infimum of \(\mu \) and \(\nu \). Then
- if \(\mu \perp \nu \) then \(\mu \otimes \kappa \perp \nu \otimes \eta \),
- \(\mu \otimes \kappa \perp \nu \otimes \eta \iff (\mu \sqcap \nu ) \otimes \kappa \perp (\mu \sqcap \nu ) \otimes \eta \), and the same holds for any measure which is equivalent to \(\mu \sqcap \nu \), like \(\frac{d \mu }{d \nu } \cdot \nu \),
- if \(\mu \otimes \kappa \perp \nu \otimes \eta \) then for \((\mu \sqcap \nu )\)-almost every \(x\), \(\kappa (x) \perp \eta (x)\).
For probability measures,
Let \(\mu , \nu , \xi \) be three probability measures on \(\mathcal X\) and let \(E\) be an event on \(\mathcal X\). Let \(\alpha , \beta \ge 0\). Then
Let \(\mu , \nu \) be two finite measures. Then \(\alpha \mapsto R_\alpha (\mu , \nu )\) is continuous on \([0, 1]\) and on \([0, \sup \{ \alpha \mid R_\alpha (\mu , \nu ) {\lt} \infty \} )\).
Let \(\mu , \nu \) be two measures on \(\mathcal X\) and let \(E\) be an event. Let \(\mu _E\) and \(\nu _E\) be the two Bernoulli distributions with respective means \(\mu (E)\) and \(\nu (E)\). Then \(R_\alpha (\mu , \nu ) \ge R_\alpha (\mu _E, \nu _E)\).
Let \(\mu , \nu \) be two probability measures on \(\mathcal X\) and let \(\alpha \in (0, 1)\). Then
\[ (1 - \alpha ) R_\alpha (\mu , \nu ) = \inf _{\xi \in \mathcal P(\mathcal X)} \left( \alpha \operatorname{KL}(\xi , \mu ) + (1 - \alpha ) \operatorname{KL}(\xi , \nu ) \right) \: . \]
The infimum is attained at \(\xi = \mu ^{(\alpha , \nu )}\).
Let \(\mu , \nu \in \mathcal P(\mathcal X)\) and let \(\alpha \in (0, 1)\). Let \(\pi _\alpha = (\alpha , 1 - \alpha ) \in \mathcal P(\{ 0,1\} )\) and let \(P : \{ 0,1\} \rightsquigarrow \mathcal X\) be the kernel with \(P(0) = \mu \) and \(P(1) = \nu \). Then
\[ (1 - \alpha ) R_\alpha (\mu , \nu ) = \inf _{\xi \in \mathcal P(\mathcal X)} \operatorname{KL}(\pi _\alpha \times \xi , \pi _\alpha \otimes P) \: . \]
The infimum is attained at \(\xi = \mu ^{(\alpha , \nu )}\).
For \(\alpha \in (0,1)\cup (1, \infty )\) and finite measures \(\mu , \nu \), if \(\left(\frac{d \mu }{d \nu }\right)^\alpha \) is integrable with respect to \(\nu \) and \(\mu \ll \nu \) then
For \(\alpha \in (0,1)\cup (1, \infty )\) and finite measures \(\mu , \nu \), if \(\left(\frac{d \mu }{d \nu }\right)^\alpha \) is integrable with respect to \(\nu \) and \(\mu \ll \nu \) then
Let \(\mu , \nu \) be two probability measures. Then \(R_{1/2}(\mu , \nu ) = -2\log (1 - \operatorname{H^2}(\mu , \nu ))\).
Let \(\mu , \nu \) be two finite measures. Then \(\alpha \mapsto R_\alpha (\mu , \nu )\) is nondecreasing on \([0, + \infty )\).
Let \(\mu , \nu \) be two probability measures on \(\mathcal X\). Let \(n \in \mathbb {N}\) and write \(\mu ^{\otimes n}\) for the product measure on \(\mathcal X^n\) of \(n\) times \(\mu \). Then \(R_\alpha (\mu ^{\otimes n}, \nu ^{\otimes n}) = n R_\alpha (\mu , \nu )\).
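Combined with the identity \(R_{1/2}(\mu , \nu ) = -2 \log (1 - \operatorname{H^2}(\mu , \nu ))\) above, this tensorization gives the familiar closed form
\[ \operatorname{H^2}(\mu ^{\otimes n}, \nu ^{\otimes n}) = 1 - \left( 1 - \operatorname{H^2}(\mu , \nu ) \right)^n \: . \]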
Dummy node to summarize properties of the Rényi divergence.
Let \(\mu , \nu \) be two finite measures. \(R_1(\mu , \nu ) = \lim _{\alpha \uparrow 1} R_\alpha (\mu , \nu )\).
Let \(\mu , \nu \) be two finite measures such that there exists \(\alpha {\gt} 1\) with \(R_\alpha (\mu , \nu )\) finite. Then \(R_1(\mu , \nu ) = \lim _{\alpha \downarrow 1} R_\alpha (\mu , \nu )\).
Let \(\mu , \nu \) be two finite measures. \(R_0(\mu , \nu ) = \lim _{\alpha \downarrow 0} R_\alpha (\mu , \nu )\).
For \(\alpha \in (0,1)\), \(\mu ^{(\alpha , \nu )} \ll \mu \) and \(\mu ^{(\alpha , \nu )} \ll \nu \).
For \(\mu , \nu \) two probability measures on \(\mathcal X\) and \(\alpha \in (0,1)\), \((\mu ^{(\alpha , \nu )})_\tau = (\mu _\tau )^{(\alpha , \nu _\tau )}\).
For any measurable space \(\mathcal X\), let \(d_{\mathcal X} : \mathcal X \rightsquigarrow *\) be the Markov kernel to the point space. For all Markov kernels \(\kappa : \mathcal X \rightsquigarrow \mathcal X'\),
Let \(\mu \in \mathcal M(\mathcal X)\) and \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) be such that \(\kappa _\mu ^\dagger \) exists, and let \(\nu \in \mathcal M(\mathcal X)\) be such that \(\kappa _\nu ^\dagger \) exists. Then for \(\mu \)-almost all \(x\) and \((\kappa \circ \mu )\)-almost all \(y\),
Let \(\mu , \nu , \xi \) be \(\sigma \)-finite measures on \(\mathcal X\).
- If \(\mu \ll \nu \) then \(\xi \)-almost surely, \(\frac{d \mu }{d \xi } = \frac{d \mu }{d \nu } \frac{d \nu }{d \xi }\).
- If \(\nu \ll \xi \) then \(\nu \)-almost surely, \(\frac{d \mu }{d \xi } = \frac{d \mu }{d \nu } \frac{d \nu }{d \xi }\).
Let \(\mu , \nu \in \mathcal M(\mathcal X)\) with \(\mu \ll \nu \) and let \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) be finite kernels with \(\kappa (x) \ll \eta (x)\) \(\nu \)-a.e. Let \(\mathcal B\) be the \(\sigma \)-algebra on \(\mathcal X \times \mathcal Y\) obtained by taking the comap of the \(\sigma \)-algebra of \(\mathcal Y\) by the projection. Then for \((\nu \otimes \eta )\)-almost every \((x,y)\),
Let \(\mu , \nu \in \mathcal M(\mathcal X)\) with \(\mu \ll \nu \) and let \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) be a finite kernel. Let \(\mathcal B\) be the \(\sigma \)-algebra on \(\mathcal X \times \mathcal Y\) obtained by taking the comap of the \(\sigma \)-algebra of \(\mathcal Y\) by the projection. Then for \((\nu \otimes \kappa )\)-almost every \((x,y)\),
Let \(\mu , \nu \) be two finite measures on \(\mathcal X\) and let \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) be two finite kernels, with either \(\mathcal X\) countable or \(\mathcal{Y}\) countably generated. Then for \((\nu \otimes \eta )\)-almost all \((x, y)\),
This implies that the equality is true for \(\nu \)-almost all \(x\), for \(\eta (x)\)-almost all \(y\).
Let \(\mu , \nu \) be two measures on \(\mathcal X\) and let \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) be two finite kernels, with either \(\mathcal X\) countable or \(\mathcal{Y}\) countably generated. Let \(\mu ' = \left(\frac{\partial \mu }{\partial \nu }\right) \cdot \nu \) and \(\kappa ' = \left(\frac{\partial \kappa }{\partial \eta }\right) \cdot \eta \). Then for \((\nu \otimes \eta )\)-almost all \(z\), \(\frac{\partial (\mu ' \otimes \kappa ')}{\partial (\nu \otimes \eta )}(z) = \frac{\partial (\mu \otimes \kappa )}{\partial (\nu \otimes \eta )}(z)\).
Let \(\mu , \nu \) be two finite measures on \(\mathcal X\) and let \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) be two finite kernels. Let \(\mu _{\parallel \nu } = \left(\frac{\partial \mu }{\partial \nu }\right) \cdot \nu \). Then for \((\nu \otimes \eta )\)-almost all \(z\), \(\frac{d (\mu _{\parallel \nu } \otimes \kappa )}{d (\nu \otimes \eta )}(z) = \frac{d (\mu \otimes \kappa )}{d (\nu \otimes \eta )}(z)\).
Let \(\mu , \nu \in \mathcal M(\mathcal X)\) with \(\mu \ll \nu \), \(g : \mathcal X \to \mathcal Y\) a measurable function and denote by \(g^* \mathcal Y\) the comap of the \(\sigma \)-algebra on \(\mathcal Y\) by \(g\). Then \(\nu \)-almost everywhere,
Let \(\mu , \nu \in \mathcal M(\mathcal X)\) with \(\mu \ll \nu \), \(g : \mathcal X \to \mathcal Y\) a measurable function and denote by \(g^* \mathcal Y\) the comap of the \(\sigma \)-algebra on \(\mathcal Y\) by \(g\). Then \(\nu \)-almost everywhere,
\(\frac{d \mu ^{(\alpha , \nu )}}{d \nu } = \left(\frac{d\mu }{d\nu }\right)^\alpha e^{-(\alpha - 1) R_\alpha (\mu , \nu )}\), \(\nu \)-a.e., and \(\frac{d \mu ^{(\alpha , \nu )}}{d \mu } = \left(\frac{d\nu }{d\mu }\right)^{1 - \alpha } e^{-(\alpha - 1) R_\alpha (\mu , \nu )}\), \(\mu \)-a.e.
Let \(\mu , \nu \) be two finite measures on \(\mathcal X\) with \(\mu \ll \nu \) and let \(\mathcal A\) be a sub-\(\sigma \)-algebra of \(\mathcal X\). Then \(\frac{d \mu _{| \mathcal A}}{d \nu _{| \mathcal A}}\) is \(\nu _{| \mathcal A}\)-almost everywhere (hence also \(\nu \)-a.e.) equal to \(\nu \left[ \frac{d \mu }{d \nu } \mid \mathcal A\right]\).
Let \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) be two finite kernels, with either \(\mathcal X\) countable or \(\mathcal{Y}\) countably generated. If for some \(f\) and \(\xi \), \(\kappa = f \cdot \eta + \xi \) with \(\xi (x) \perp \eta (x)\) for all \(x\), then for all \(x\), \(f(x, y) = \frac{d \kappa (x)}{d \eta (x)}(y)\) for \(\eta (x)\)-almost all \(y \in \mathcal Y\).
Let \(\mu , \nu \) be two \(\sigma \)-finite measures on \(\mathcal X\) and let \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) be two s-finite kernels. We denote \(\frac{d\mu }{d\nu }\cdot \nu \) by \(\mu _{\parallel \nu }\). Then
For finite measures \(\mu , \nu \) and \(\xi \in \mathcal M(\{ 0,1\} )\), for any measure \(\zeta \) with \(\mu \ll \zeta \) and \(\nu \ll \zeta \) ,
This holds in particular for \(\zeta = P \circ \xi \).
For finite measures \(\mu , \nu \) and \(\xi \in \mathcal M(\{ 0,1\} )\), for any measure \(\zeta \) with \(\mu \ll \zeta \) and \(\nu \ll \zeta \) ,
This holds in particular for \(\zeta = P \circ \xi \).
For finite measures \(\mu , \nu \) and \(\xi \in \mathcal M(\{ 0,1\} )\),
Dummy node to summarize properties of the statistical information.
Let \(\mu , \nu \) be two probability measures on \(\mathcal X\) and \(E\) an event. Let \(\alpha \in (0,1)\). Then
Let \(\mu , \nu \) be two probability measures on \(\mathcal X\) and let \(E\) be an event on \(\mathcal X\). Let \(\alpha {\gt} 0\). Then
Let \(\mu , \nu \) be two probability measures on \(\mathcal X\), let \(n \in \mathbb {N}\) and let \(E\) be an event on \(\mathcal X^n\). For all \(\alpha {\gt} 0\),
Let \(\mu , \nu \) be two measures on \(\mathcal X\) and let \(E\) be an event. Let \(\mu _E\) and \(\nu _E\) be the two Bernoulli distributions with respective means \(\mu (E)\) and \(\nu (E)\). Then \(\operatorname{TV}(\mu , \nu ) \ge \operatorname{TV}(\mu _E, \nu _E)\).
On probability measures, the total variation distance \(\operatorname{TV}\) is an \(f\)-divergence for the function \(x \mapsto \frac{1}{2}\vert x - 1 \vert \).
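Concretely, for probability measures with \(\mu \ll \nu \), this reads
\[ \operatorname{TV}(\mu , \nu ) = \frac{1}{2} \nu \left[ \left\vert \frac{d \mu }{d \nu } - 1 \right\vert \right] \: , \]
the usual \(L^1\) form of the total variation distance.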
For finite measures \(\mu , \nu \),
For finite measures \(\mu , \nu \),
For finite measures \(\mu , \nu \), for any measure \(\zeta \) with \(\mu \ll \zeta \) and \(\nu \ll \zeta \) ,
This holds in particular for \(\zeta = P \circ \xi \).
For finite measures \(\mu , \nu \),
For finite measures \(\mu , \nu \), for any measure \(\zeta \) with \(\mu \ll \zeta \) and \(\nu \ll \zeta \) ,
This holds in particular for \(\zeta = P \circ \xi \).
Let \(\mathcal F = \{ f : \mathcal X \to \mathbb {R} \mid \Vert f \Vert _\infty \le 1\} \). Then for \(\mu , \nu \) finite measures with \(\mu (\mathcal X) = \nu (\mathcal X)\), \(\frac{1}{2} \sup _{f \in \mathcal F} \left( \mu [f] - \nu [f] \right) \le \operatorname{TV}(\mu , \nu )\).
Let \(\mu , \nu \) be two probability measures. Then \(\operatorname{TV}(\mu , \nu ) \le \sqrt{\operatorname{H^2}(\mu , \nu )(2 - \operatorname{H^2}(\mu , \nu ))}\).
Dummy node to summarize properties of the total variation distance.
The Bayes risk of simple binary hypothesis testing for prior \(\xi \in \mathcal M(\{ 0,1\} )\) is
Let \(\mu , \nu \in \mathcal P(\mathcal X)\). If \(f(1) = 0\),
If either \(\mathcal X\) is countable or \(\mathcal Y\) has a countably generated \(\sigma \)-algebra, and \(\mathcal Z\) is standard Borel, then for every s-finite kernel \(\kappa : \mathcal X \rightsquigarrow \mathcal Y \times \mathcal Z\), there exists a Markov kernel \(\kappa _{Z \mid Y} : \mathcal X \times \mathcal Y \rightsquigarrow \mathcal Z\) that disintegrates \(\kappa \).
Let \(\mu \) be a measure on a standard Borel space \(\mathcal X\) and let \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) be two finite kernels, such that \(\kappa (x) \ne 0\) for all \(x\). Then \(D_f(\kappa \circ \mu , \eta \circ \mu ) \le D_f(\mu \otimes \kappa , \mu \otimes \eta ) = D_f(\kappa , \eta \mid \mu )\).
Let \(\mu \) be a finite measure on \(\mathcal X\) and let \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) be two finite kernels. Then \(D_f(\kappa \circ \mu , \eta \circ \mu ) \le D_f(\mu \otimes \kappa , \mu \otimes \eta ) = D_f(\kappa , \eta \mid \mu )\).
Let \(\mu , \nu \) be two measures on \(\mathcal X\) and let \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) be a Markov kernel, where both \(\mathcal X\) and \(\mathcal Y\) are standard Borel. Then \(D_f(\kappa \circ \mu , \kappa \circ \nu ) \le D_f(\mu , \nu )\).
Suppose that \(f(1) = g(1) = 0\) and that \(f'(1) = g'(1) = 0\), and that \(g{\gt}0\) on \((0,1) \cup (1, +\infty )\). Suppose that the space \(\mathcal X\) has at least two disjoint non-empty measurable sets. Then
Let \(\mu \) and \(\nu \) be two measures on \(\mathcal X \times \mathcal Y\) where \(\mathcal Y\) is standard Borel, and let \(\mu _X, \nu _X\) be their marginals on \(\mathcal X\). Then \(D_f(\mu _X, \nu _X) \le D_f(\mu , \nu )\). Similarly, for \(\mathcal X\) standard Borel and \(\mu _Y, \nu _Y\) the marginals on \(\mathcal Y\), \(D_f(\mu _Y, \nu _Y) \le D_f(\mu , \nu )\).
Let \(\mu \) and \(\nu \) be two measures on \(\mathcal X \times \mathcal Y\), and let \(\mu _X, \nu _X\) be their marginals on \(\mathcal X\). Then \(D_f(\mu _X, \nu _X) \le D_f(\mu , \nu )\). Similarly, for \(\mu _Y, \nu _Y\) the marginals on \(\mathcal Y\), \(D_f(\mu _Y, \nu _Y) \le D_f(\mu , \nu )\).
Let \(\mu , \nu \) be two finite measures on \(\mathcal X\) and let \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) be two Markov kernels, with either \(\mathcal X\) countable or \(\mathcal{Y}\) countably generated. Then \(D_f(\mu , \nu ) \le D_f(\mu \otimes \kappa , \nu \otimes \eta )\).
Let \(\alpha {\gt} 0\), \(\mu , \nu \) be two finite measures on \(\mathcal X\) and let \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) be a Markov kernel. Then \(\operatorname{H}_\alpha (\kappa \circ \mu , \kappa \circ \nu ) \le \operatorname{H}_\alpha (\mu , \nu )\).
If \(f\) and \(g\) are two Stieltjes functions with associated measures \(\mu _f\) and \(\mu _g\) and \(f\) is continuous on \([a, b]\), then
When the generalized Bayes estimator is well defined, it is a Bayes estimator. The value of the Bayes risk with respect to the prior \(\pi \in \mathcal M(\Theta )\) is then
Let \(\mu , \nu \) be two finite measures on \(\mathcal X\). Then \(\sup _{\mathcal A \text{ finite}} D_f(\mu _{| \mathcal A}, \nu _{| \mathcal A}) = D_f(\mu , \nu )\), where the supremum is over finite sub-\(\sigma \)-algebras \(\mathcal A\) of \(\mathcal X\).
Two kernels \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) are equal iff for all measurable functions \(f : \mathcal Y \to \mathbb {R}_{+,\infty }\) and all \(x \in \mathcal X\), \(\kappa (x)[f] = \eta (x)[f]\).
Let \(\mu , \nu \) be two finite measures on \(\mathcal X\) and \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) two Markov kernels, with either \(\mathcal X\) countable or \(\mathcal{Y}\) countably generated. Then \(\operatorname{KL}(\mu \otimes \kappa , \nu \otimes \eta ) = \operatorname{KL}(\mu , \nu ) + \operatorname{KL}(\kappa , \eta \mid \mu )\).
Let \(\mu , \nu \) be two finite measures on \(\mathcal X\) and \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) two Markov kernels. Then \(\operatorname{KL}(\mu \otimes \kappa , \nu \otimes \eta ) = \operatorname{KL}(\mu , \nu ) + \operatorname{KL}(\mu \otimes \kappa , \mu \otimes \eta )\).
Let \(\mu , \nu \) be two finite measures on \(\mathcal X\) and \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) two Markov kernels with Bayesian inverses \(\kappa _\mu ^\dagger \) and \(\eta _\nu ^\dagger \). Then
Let \(\mu \) and \(\nu \) be two measures on \(\mathcal X \times \mathcal Y\), and let \(\mu _X, \nu _X\) be their marginals on \(\mathcal X\). Then \(\operatorname{KL}(\mu _X, \nu _X) \le \operatorname{KL}(\mu , \nu )\). Similarly, for \(\mu _Y, \nu _Y\) the marginals on \(\mathcal Y\), \(\operatorname{KL}(\mu _Y, \nu _Y) \le \operatorname{KL}(\mu , \nu )\).
Let \(I\) be a finite index set. Let \((\mu _i)_{i \in I}, (\nu _i)_{i \in I}\) be probability measures on spaces \((\mathcal X_i)_{i \in I}\). Then
For \(\mu _1\) a probability measure on \(\mathcal X\), \(\nu _1\) a finite measure on \(\mathcal{X}\) and \(\mu _2, \nu _2\) two probability measures on \(\mathcal Y\),
For \(\mu , \nu \in \mathcal P(\mathcal X)\), \(\operatorname{KL}(\mu _\tau , \nu _\tau ) = \mu [\tau ] \operatorname{KL}(\mu , \nu )\).
For \(\alpha , \beta \in (0, 1/2)\),
As a consequence,
In particular, \(\log \frac{\alpha }{B_\alpha (\mu , \nu )} \le R_{1/2}(\mu , \nu ) + \log 2\) .
For \(\rho \in \mathcal M(\mathcal X \times \mathcal Y)\), \(\kappa : \mathcal X \rightsquigarrow \mathcal X'\) and \(\eta : \mathcal Y \rightsquigarrow \mathcal Y'\) two Markov kernels,
If the divergence \(D\) satisfies the data-processing inequality, then for all \(\mu \in \mathcal M(\mathcal X)\), \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) and all Markov kernels \(\eta : \mathcal Y \rightsquigarrow \mathcal Z\),
Let \(\mu , \nu \) be two measures on \(\mathcal X\) and let \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) be two Markov kernels. Then \(R_\alpha (\mu \otimes \kappa , \nu \otimes \eta ) = R_\alpha (\mu , \nu ) + R_\alpha (\kappa , \eta \mid \mu ^{(\alpha , \nu )})\).
Let \(\mu , \nu \) be two finite measures on \(\mathcal X\) and \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) two Markov kernels with Bayesian inverses \(\kappa _\mu ^\dagger \) and \(\eta _\nu ^\dagger \). Then
Let \(I\) be a finite index set. Let \((\mu _i)_{i \in I}, (\nu _i)_{i \in I}\) be probability measures on measurable spaces \((\mathcal X_i)_{i \in I}\). Then \(R_\alpha (\prod _{i \in I} \mu _i, \prod _{i \in I} \nu _i) = \sum _{i \in I} R_\alpha (\mu _i, \nu _i)\).
Let \(I\) be a countable index set. Let \((\mu _i)_{i \in I}, (\nu _i)_{i \in I}\) be probability measures on measurable spaces \((\mathcal X_i)_{i \in I}\). Then \(R_\alpha (\prod _{i \in I} \mu _i, \prod _{i \in I} \nu _i) = \sum _{i \in I} R_\alpha (\mu _i, \nu _i)\).
For \(\mu , \nu \) two probability measures on \(\mathcal X\) and \(\alpha \in (0,1)\), \(R_\alpha (\mu _\tau , \nu _\tau ) = \mu ^{(\alpha , \nu )}[\tau ] R_\alpha (\mu , \nu )\).
Let \(\mu , \nu \) be two finite measures on \(\mathcal X\) and let \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) be two finite kernels with \(\mu \otimes \kappa \ll \mu \otimes \eta \). Then for \((\nu \otimes \eta )\)-almost all \((x,y)\),
For finite measures \(\mu , \nu \) and \(\xi \in \mathcal M(\{ 0,1\} )\),
Let \(\mu , \nu \in \mathcal P(\mathcal X)\) and let \(\alpha \in (0, 1)\).
in which \(h_2: x \mapsto x\log \frac{1}{x} + (1 - x)\log \frac{1}{1 - x}\) is the binary entropy function.
Let \(\mu , \nu \) be two probability measures on \(\mathcal X\) and let \((E_n)_{n \in \mathbb {N}}\) be events on \(\mathcal X^n\). For all \(\gamma \in (0,1)\),
Let \(\mathcal F = \{ f : \mathcal X \to \mathbb {R} \mid \Vert f \Vert _\infty \le 1\} \). Then for \(\mu , \nu \) finite measures with \(\mu (\mathcal X) = \nu (\mathcal X)\), \(\operatorname{TV}(\mu , \nu ) = \frac{1}{2} \sup _{f \in \mathcal F} \left( \mu [f] - \nu [f] \right)\).
Let \(\mu , \nu \) be two finite measures on \(\mathcal X\) with \(\mu (\mathcal X) \leq \nu (\mathcal X)\). Then \(\operatorname{TV}(\mu , \nu ) = \sup _{E \text{ event}} \left( \mu (E) - \nu (E) \right)\).