Introduction
The goal of this project is to formalize information divergences between probability measures, as well as results about error bounds for (sequential) hypothesis testing. This is not an introductory text and we will not explain why the objects described here are useful, where they come from or what applications the results might have. For all those considerations, we refer the reader to the recent book [PW24].
Our goal is to compile properties of information divergences and their uses in hypothesis testing. Most software libraries come with tutorials that teach how to use them, as well as documentation listing all the available objects and methods of the library: in this project, we write the “documentation” of a part of the “information theory library”.
This document and the Lean implementation
This document is generated from a LaTeX file written by the authors, thanks to the leanblueprint library developed by Patrick Massot (https://github.com/PatrickMassot/leanblueprint). It contains links to a Lean implementation of the results, but there is no technical guarantee that the results written here and those in the code match in any way. In particular, nothing in this document is automatically generated from the code. In summary, although the authors strive to keep this document as close to the Lean implementation as possible, the only formally verified statements are those written in Lean.
Assumptions
Information divergences are usually considered for probability measures, but we will be more granular about the assumptions we impose on the measures. Since Radon-Nikodym derivatives do not have good properties unless the measures are sigma-finite, and since those derivatives are the basic building block of all the information divergences we consider, every measure in this document will be assumed to be sigma-finite. Many results additionally require the measures to be finite, and some will be about probability measures specifically.
In most results, the measures can be defined on a general measurable space, without additional assumptions. However, some statements, notably those involving conditional divergences, will require the sigma-algebra to be countably generated or the space to be standard Borel. The countably generated assumption means that there is a countable family of sets that generates the sigma-algebra. This is the case, for example, in any standard Borel space, which covers the vast majority of applications.
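As an illustration, here is a minimal sketch of how these assumptions typically appear as typeclass hypotheses in Lean statements. The names are those of a recent Mathlib version; the conclusions below are mere placeholders, not statements from this project.

```lean
import Mathlib

open MeasureTheory

-- Measure-level assumptions (sigma-finite, probability measure) are typeclass
-- hypotheses on the measures; `μ ≪ ν` (absolute continuity) is a placeholder
-- for the kind of conclusion a divergence lemma might state.
example {𝒳 : Type*} [MeasurableSpace 𝒳]
    (μ ν : Measure 𝒳) [SigmaFinite μ] [IsProbabilityMeasure ν] : Prop :=
  μ ≪ ν

-- Space-level assumptions: a countably generated sigma-algebra, or the
-- stronger standard Borel assumption.
example {𝒳 : Type*} [MeasurableSpace 𝒳] [MeasurableSpace.CountablyGenerated 𝒳] :
    Prop := True

example {𝒳 : Type*} [MeasurableSpace 𝒳] [StandardBorelSpace 𝒳] : Prop := True
```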
Notation
\(\mathcal X\), \(\mathcal Y\) will denote measurable spaces. Additional assumptions such as \(\mathcal X\) being standard Borel will be specified where necessary.
\(\mu \), \(\nu \) denote measures on a space \(\mathcal X\). A measure \(\mu \) is said to be finite if \(\mu (\mathcal X) {\lt} \infty \), and to be a probability measure if \(\mu (\mathcal X) = 1\). A measure is s-finite if it can be written as a countable sum of finite measures, and is \(\sigma \)-finite if it can be written as a countable sum of finite measures which are pairwise mutually singular. The set of measures on \(\mathcal X\) is denoted by \(\mathcal M(\mathcal X)\), and the set of probability measures by \(\mathcal P(\mathcal X)\).
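For reference, these finiteness notions correspond to typeclasses in Mathlib; the following sketch lists the relevant names (taken from a recent Mathlib version, subject to change).

```lean
import Mathlib

open MeasureTheory

variable {𝒳 : Type*} [MeasurableSpace 𝒳] (μ : Measure 𝒳)

#check IsFiniteMeasure μ       -- finite measure: μ Set.univ < ∞
#check IsProbabilityMeasure μ  -- probability measure: μ Set.univ = 1
#check SFinite μ               -- s-finite: countable sum of finite measures
#check SigmaFinite μ           -- sigma-finite (Mathlib states this via an
                               -- exhausting sequence of sets of finite measure)
```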
For a function \(g : \mathcal X \to \mathcal Y\) which is a.e.-measurable with respect to \(\mu \), \(g_* \mu \) denotes the pushforward or map of \(\mu \) by \(g\). This is the measure on \(\mathcal Y\) such that \(g_* \mu (A) = \mu (g^{-1} A)\). For \(g : \mathcal X \to \mathbb {R}_{+,\infty }\), \(g \cdot \mu \) denotes the measure on \(\mathcal X\) with density \(g\) with respect to \(\mu \). That is, \(g \cdot \mu (A) = \int _{x \in A} g(x) \, \partial \mu \). The Radon-Nikodym derivative of a measure \(\mu \) with respect to another measure \(\nu \) is denoted by \(\frac{d\mu }{d\nu }\). The singular part of \(\mu \) with respect to \(\nu \) is \(\mu _{\perp \nu }\).
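These constructions all exist in Mathlib; the following sketch records the corresponding names (from a recent Mathlib version).

```lean
import Mathlib

open MeasureTheory
open scoped ENNReal

variable {𝒳 𝒴 : Type*} [MeasurableSpace 𝒳] [MeasurableSpace 𝒴]
  (μ ν : Measure 𝒳) (g : 𝒳 → 𝒴) (f : 𝒳 → ℝ≥0∞)

#check μ.map g           -- pushforward g_* μ, a measure on 𝒴
#check μ.withDensity f   -- the measure f · μ, with density f with respect to μ
#check μ.rnDeriv ν       -- Radon-Nikodym derivative dμ/dν
#check μ.singularPart ν  -- singular part of μ with respect to ν
-- Lebesgue decomposition, for sigma-finite measures:
--   μ = μ.singularPart ν + ν.withDensity (μ.rnDeriv ν)
```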
\(\kappa , \eta \) are used to refer to transition kernels, and we write \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) to indicate that \(\kappa \) is a transition kernel from \(\mathcal X\) to \(\mathcal Y\). That is, \(\kappa \) is a measurable map from \(\mathcal X\) to the measures on \(\mathcal Y\). A kernel is finite if there exists \(C {\lt} + \infty \) such that \(\kappa (x, \mathcal Y) \le C\) for all \(x \in \mathcal X\) (note that this is stronger than the condition that all \(\kappa (x)\) are finite since it requires a uniform bound). It is s-finite if it can be written as a countable sum of finite kernels, and is called a Markov kernel if \(\kappa (x)\) is a probability measure for all \(x \in \mathcal X\). See Appendix A for more details about kernels and Appendix B for results about Radon-Nikodym derivatives of kernels.
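In Lean, transition kernels and the classes above live in Mathlib's ProbabilityTheory namespace; the sketch below uses names from a recent Mathlib version (the exact spelling of the kernel type has changed between versions).

```lean
import Mathlib

open MeasureTheory ProbabilityTheory

variable {𝒳 𝒴 : Type*} [MeasurableSpace 𝒳] [MeasurableSpace 𝒴]
  (κ : Kernel 𝒳 𝒴) (x : 𝒳)

#check (κ x : Measure 𝒴)  -- the measure on 𝒴 attached to the point x
#check IsFiniteKernel κ   -- uniform bound: ∃ C < ∞, ∀ x, κ x Set.univ ≤ C
#check IsSFiniteKernel κ  -- countable sum of finite kernels
#check IsMarkovKernel κ   -- κ x is a probability measure for every x
```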
Some references
A very useful book (draft) about information theory and hypothesis testing: [PW24]
Main reference for the properties of the Rényi divergence: [VEH14]