A Probability transition kernels
Consider, informally, a process where we first observe a random state (for example the result of a coin flip), and then, depending on the value of that first state, we produce another random value. For example, if the coin lands on heads, we roll two dice and observe the sum. If on the other hand the coin lands on tails, we roll only one die and observe the result. The law of the second observation is a function of the first observation. This function from one space to measures on an other one is a transition kernel. To get an usable definition for general measurable spaces, we need to enforce that the function is measurable.
A.1 Definition
Let \(\mathcal B\) be the Borel \(\sigma \)-algebra on \(\mathbb {R}_{+,\infty }\). Let \(\mathcal X\) be a measurable space. For a measurable set \(s\) of \(\mathcal X\), let \((\mu \mapsto \mu (s))^* \mathcal B\) be the \(\sigma \)-algebra on \(\mathcal M(\mathcal X)\) defined by the comap of the evaluation function at \(s\). Then \(\mathcal M(\mathcal X)\) is a measurable space with \(\sigma \)-algebra \(\bigsqcup _{s} (\mu \mapsto \mu (s))^* \mathcal B\) where the supremum is over all measurable sets \(s\).
It then makes sense to talk about a measurable function from a measurable space \(\mathcal X\) to the measures on another measurable space \(\mathcal Y\). We now write more explicitly what it means for such a function to be measurable.
Let \(\mathcal X, \mathcal Y\) be measurable spaces, and let \(f : \mathcal X \to \mathcal M(\mathcal Y)\) such that for all measurable sets \(s\) of \(\mathcal X\), the function \(x \mapsto f(x)(s)\) is measurable. Then \(f\) is measurable.
We now have the tools needed to define probability transition kernels.
Let \(\mathcal X, \mathcal Y\) be two measurable spaces. A probability transition kernel (or simply kernel) from \(\mathcal X\) to \(\mathcal Y\) is a measurable map from \(\mathcal X\) to \(\mathcal M (\mathcal Y)\), the measurable space of measures on \(\mathcal Y\). We write \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) for a kernel \(\kappa \) from \(\mathcal X\) to \(\mathcal Y\).
Measures and measurable functions are special cases of kernels. A measure on \(\mathcal Y\) can be seen as a constant kernel from any space to \(\mathcal Y\) (for example from a space with 1 element to \(\mathcal Y\)). A measurable function can be represented by a deterministic kernel.
The deterministic kernel defined by a measurable function \(f : \mathcal X \to \mathcal Y\) is the kernel \(d_f: \mathcal X \rightsquigarrow \mathcal Y\) defined by \(d_f(x) = \delta _{f(x)}\), where for any \(y \in \mathcal Y\), \(\delta _y\) is the Dirac probability measure at \(y\).
A few deterministic kernels are particularly useful.
The identity kernel, deterministic kernel given by the identity function.
The copy kernel \(c_{\mathcal X} : \mathcal X \rightsquigarrow \mathcal X \times \mathcal X\), from the function \(x \mapsto (x, x)\).
The discard kernel \(d_{\mathcal X}\) (TODO: conflicts with deterministic kernel notation). This is a kernel from \(\mathcal X\) to \(*\), the space with one point, corresponding to the constant function.
Kernels can be described by the integrals of measurable functions, as demonstrated in the following extensionality theorem.
Two kernels \(\kappa , \eta : \mathcal X \rightsquigarrow \mathcal Y\) are equal iff for all measurable functions \(f : \mathcal Y \to \mathbb {R}_{+,\infty }\) and all \(x \in \mathcal X\),
A.2 Classes of kernels
A kernel \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) is said to be finite if there exists \(C {\lt} +\infty \) such that for all \(x \in \mathcal X\), \(\kappa (x)(\mathcal Y) \le C\).
Note the uniform bound: it is not enough to have \(\kappa (x)(\mathcal Y) {\lt} \infty \) for all \(x \in \mathcal X\). TODO: why? This is required for composition of finite kernels to still be finite (iirc): be precise about that.
A kernel \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) is said to be a Markov kernel if for all \(x \in \mathcal X\), \(\kappa (x)\) is a probability measure.
A Markov kernel is finite, with constant \(C = 1\).
A kernel \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) is said to be a s-finite if it is equal to a countable sum of finite kernels.
That last, most general class of kernels, is the right assumption to define operations on kernels like their product.
A.3 Composition and product
Let \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) and \(\eta : (\mathcal X \times \mathcal Y) \rightsquigarrow \mathcal Z\) be two s-finite kernels. The composition-product of \(\kappa \) and \(\eta \) is a kernel \(\kappa \otimes \eta : \mathcal X \rightsquigarrow (\mathcal Y \times \mathcal Z)\) such that for all measurable functions \(f : \mathcal Y \times \mathcal Z \to \mathbb {R}_{+,\infty }\) and \(x \in \mathcal X\) ,
As for all other kernel definitions in this section, we omit the proof that this indeed defines a kernel and refer the reader to the Mathlib implementation.
TODO: explanation of that composition-product on an example.
The name composition-product is not standard in the literature, where it can be called either composition or product depending on the source. We reserve the name composition for an operation that from kernels \(\mathcal X \rightsquigarrow \mathcal Y\) and \(\mathcal Y \rightsquigarrow \mathcal Z\) produces a kernel \(\mathcal X \rightsquigarrow \mathcal Z\), and the name product for the one that takes kernels \(\mathcal X \rightsquigarrow \mathcal Y\) and \(\mathcal X \rightsquigarrow \mathcal Z\) and produces a kernel \(\mathcal X \rightsquigarrow (\mathcal Y \times \mathcal Z)\). Both are special cases of the composition-product, which justifies that name.
The composition-product is defined for \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) and \(\eta : \mathcal X \times \mathcal Y \rightsquigarrow \mathcal Z\), but since a kernel \(\mathcal Y \rightsquigarrow \mathcal Z\) can be seen as a kernel \(\mathcal X \times \mathcal Y \rightsquigarrow \mathcal Z\) independent of the first coordinate, we abuse notation and also write \(\kappa \otimes \xi \) for \(\xi : \mathcal Y \rightsquigarrow \mathcal Z\).
Since a measure is a constant kernel from the space with one element, we can also define the composition-product of a measure and a kernel. Note that this definition uses the abuse of notation we just introduced.
Let \(\mu \in \mathcal M(\mathcal X)\) be an s-finite measure and \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) be an s-finite kernel. Let \(\mathcal U\) be a measurable space with a unique element \(u\). Let \(\mu _k : \mathcal U \rightsquigarrow \mathcal X\) be the constant kernel with value \(\mu \). The composition-product of \(\mu \) and \(\kappa \) is the measure on \(\mathcal M(\mathcal X \times \mathcal Y)\) defined by \((\mu _k \otimes \kappa ) (u)\) .
Let \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) and \(\eta : \mathcal Y \rightsquigarrow \mathcal Z\) be two kernels. The composition of \(\kappa \) and \(\eta \) is the kernel \(\eta \circ \kappa : \mathcal X \rightsquigarrow \mathcal Z\) such that for all measurable functions \(f : \mathcal Z \to \mathbb {R}_{+,\infty }\) and all \(x \in \mathcal X\),
We can also define the composition of a measure and a kernel, by \(\kappa \circ \mu = (\kappa \circ \mu _k) (u)\), where \(\mathcal U\) is a measurable space with a unique element \(u\) and \(\mu _k : \mathcal U \rightsquigarrow \mathcal X\) is the constant kernel with value \(\mu \).
Let \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) and \(\eta : \mathcal X \rightsquigarrow \mathcal Z\) be two s-finite kernels. The product of \(\kappa \) and \(\eta \) is the kernel \(\kappa \times \eta : \mathcal X \rightsquigarrow \mathcal Y \times \mathcal Z\) such that for all measurable functions \(f : \mathcal Y \times \mathcal Z \to \mathbb {R}_{+,\infty }\) and all \(x \in \mathcal X\),
For \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) and \(\eta : \mathcal Y \rightsquigarrow \mathcal Z\) (independent of \(\mathcal X\)), the composition-product can be written as a composition: \(\kappa \otimes \eta = (d_{\text{id}} \times \eta ) \circ \kappa \) where \(d_{\text{id}}\) is the deterministic kernel for the identity. In particular, the composition-product of a measure and a kernel \(\kappa \) is the composition of that measure with the kernel \(d_{\text{id}} \times \kappa \).
Let \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) and \(\eta : \mathcal X' \rightsquigarrow \mathcal Y'\) be two s-finite kernels. The parallel product of \(\kappa \) and \(\eta \) is the kernel \(\kappa \parallel \eta : \mathcal X \times \mathcal X' \rightsquigarrow \mathcal Y \times \mathcal Y'\) such that for all measurable functions \(f : \mathcal Y \times \mathcal Y' \to \mathbb {R}_{+,\infty }\) and all \(x \in \mathcal X \times \mathcal X'\),
A.4 Copy-discard and Markov categories
Measurable spaces and s-finite kernels are a copy-discard category (and in particular a symmetric monoidal category) with the following data:
Objects: measurable spaces
Morphisms: s-finite kernels, with identity kernels and composition of kernels
Tensor product of objects: product of measurable spaces \(\mathcal X \times \mathcal Y\)
Tensor product of morphisms: parallel product \(\kappa \parallel \eta \)
Unit: measurable space with one element (denoted by \(*\))
The associator, left and right unitors and braiding are given by the deterministic kernels of the corresponding equivalences between measurable spaces
Copy morphism: \(c_{\mathcal X}: \mathcal X \rightsquigarrow \mathcal X \times \mathcal X\)
Delete morphism: \(d_{\mathcal X} : \mathcal X \rightsquigarrow *\)
If we restrict the morphisms to Markov kernels, we get a Markov category (TODO ref), which is a copy-discard category in which for every morphism \(\kappa \), \(d_{\mathcal Y} \circ \kappa = d_{\mathcal X}\).
In that view of transition kernels, the following operations are not primitive, but derived from others:
Product: for \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) and \(\eta : \mathcal X \rightsquigarrow \mathcal Z\), \(\kappa \times \eta = (\kappa \parallel \eta ) \circ c_{\mathcal X}\)
Composition-product: for \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) and \(\eta : \mathcal X \times \mathcal Y \rightsquigarrow \mathcal Z\), \(\kappa \otimes \eta = \mathrm{swap_{\mathcal Z, \mathcal Y}} \circ (\eta \parallel \mathrm{id}) \circ \alpha _{\mathcal X, \mathcal Y, \mathcal Y} \circ (\mathrm{id} \parallel c_{\mathcal Y}) \circ (\mathrm{id} \parallel \kappa ) \circ c_{\mathcal X}\) (where \(\alpha \) is the associator and \(\mathrm{swap}\) is the braiding).
A.5 Ordering
We define a partial order on kernels by the following. Let \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) and \(\eta : \mathcal X \rightsquigarrow \mathcal Z\) (with same domain as \(\kappa \)). Then \(\kappa \) is Blackwell sufficient for \(\eta \), denoted by \(\eta \le _B \kappa \), if there exists a Markov kernel \(\xi : \mathcal Y \rightsquigarrow \mathcal Z\) such that \(\eta = \xi \circ \kappa \).
A.6 Disintegration
A kernel \(\eta : \mathcal X \times \mathcal Y \rightsquigarrow \mathcal Z\) is said to disintegrate a kernel \(\kappa : \mathcal X \rightsquigarrow \mathcal Y \times \mathcal Z\) if \(\kappa = \kappa _{Y} \otimes \eta \).
TODO: if a disintegrating kernel \(\eta \) exits for \(\kappa \), the measure \(\eta (x,y)\) is unique for all \(x \in \mathcal X\) and \(\kappa _{Y}(x)\)-almost every \(y \in \mathcal Y\).
TODO: if a disintegrating kernel \(\eta \) exits for \(\kappa \), the measure \(\eta (x,y)\) is a probability measure for all \(x \in \mathcal X\) and \(\kappa _{Y}(x)\)-almost every \(y \in \mathcal Y\).
If either \(\mathcal X\) is countable of \(\mathcal Y\) has a countably generated \(\sigma \)-algebra, and \(\mathcal Z\) is standard Borel, then for every s-finite kernel \(\kappa : \mathcal X \rightsquigarrow \mathcal Y \times \mathcal Z\), there exists a Markov kernel \(\kappa _{Z \mid Y} : \mathcal X \times \mathcal Y \rightsquigarrow \mathcal Z\) that disintegrates \(\kappa \).
TODO: the result in Mathlib is about finite kernels, but s-finite should be enough?
TODO: as a consequence, in standard Borel spaces we also have disintegration of measures.
TODO: find a way to hide the following dummy lemma
Dummy node to summarize kernel properties.
A.7 Bayesian inverse of a kernel
For \(\mu \in \mathcal M(\mathcal X)\) and \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\), a Bayesian inverse of \(\kappa \) is a Markov kernel \(\kappa _\mu ^\dagger : \mathcal Y \rightsquigarrow \mathcal X\) such that \(\mu \otimes \kappa = ((\kappa \circ \mu ) \otimes \kappa _\mu ^\dagger )_\leftrightarrow \) in which \((\cdot )_\leftrightarrow \) denotes swapping the two coordinates. If such an inverse exists it is unique up to a \((\kappa \circ \mu )\)-null set, and we talk about the Bayesian inverse of \(\kappa \) with respect to \(\mu \).
The bayesian interpretation of this definition is that \(\mu \) represents a prior over a parameter and \(\kappa \) gives the distribution of data given the parameter. \(\kappa \circ \mu \) is the distribution of the data, and \(\kappa _\mu ^\dagger \) gives the posterior distribution of the parameter given the data.
The measure \(\rho = \mu \otimes \kappa \) is a joint probability distribution over \(\mathcal X \times \mathcal Y\) with marginals \(\rho _X = \mu \) and \(\rho _Y = \kappa \circ \mu \) (if \(\kappa \) is Markov). By construction, that joint probability distribution can be written as the composition-product of the first marginal \(\rho _X\) and a kernel \(\kappa \). The Bayesian inverse, if it exists, gives a way to write the joint as a composition-product of the second marginal \(\rho _Y\) and a kernel.
If the joint space is such that every product distribution can be disintegrated (see Theorem A.6.1), then the Bayesian inverse exists.
For \(\mathcal X\) standard Borel, \(\mu \) and \(\kappa \) s-finite, the Bayesian inverse of \(\kappa \) with respect to \(\mu \) exists and is obtained by disintegration of the measure \(\mu \otimes \kappa \) on \(\mathcal X \times \mathcal Y\) into a measure \(\kappa \circ \mu \in \mathcal M(\mathcal Y)\) and a Markov kernel \(\kappa _\mu ^\dagger : \mathcal Y \rightsquigarrow \mathcal X\).
See [ CDDG17 ] and [ DSDG18 ] for a category theory point of view on Bayesian inversion.
Let \(\mu \in \mathcal M(\mathcal X)\) be a finite measure and let \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) be a finite kernel. If \(\eta : \mathcal Y \rightsquigarrow \mathcal X\) is such that \(\mu \otimes \kappa = ((\kappa \circ \mu ) \otimes \eta )_\leftrightarrow \), then \(\eta (y) = \kappa _\mu ^\dagger (y)\) for \((\kappa \circ \mu )\)-almost all \(y\).
TODO: uniqueness of the conditional kernel.
For \(\mu \in \mathcal M(\mathcal X)\) s-finite, \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) a Markov kernel and \(\kappa _\mu ^\dagger \) the Bayesian inverse of \(\kappa \) with respect to \(\mu \), these objects satisfy the equality \(\kappa _\mu ^\dagger \circ \kappa \circ \mu = \mu \).
The measure \(\kappa _\mu ^\dagger \circ (\kappa \circ \mu )\) is the projection on \(\mathcal X\) of \((\kappa \circ \mu ) \otimes \kappa _\mu ^\dagger \). But \((\kappa \circ \mu ) \otimes \kappa _\mu ^\dagger = (\mu \otimes \kappa )_\leftrightarrow \) by definition of the Bayesian inverse. Since \(\kappa \) is a Markov kernel, the projection on \(\mathcal X\) of that measure is simply \(\mu \).
For \(\mathcal X\) and \(\mathcal Y\) two standard Borel spaces, \(\mu \in \mathcal M(\mathcal X)\) s-finite and \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) a Markov kernel, \((\kappa _\mu ^\dagger )_{\kappa \circ \mu }^\dagger = \kappa \) (\(\mu \)-a.e.).
By uniqueness of the disintegration, it suffices to show that \((\kappa \circ \mu ) \otimes \kappa _\mu ^\dagger = ((\kappa _\mu ^\dagger \circ \kappa \circ \mu ) \otimes \kappa )_\leftrightarrow \) . By Lemma A.7.3, \(\kappa _\mu ^\dagger \circ \kappa \circ \mu = \mu \) and we need to prove \((\kappa \circ \mu ) \otimes \kappa _\mu ^\dagger = (\mu \otimes \kappa )_\leftrightarrow \) . This is true by definition of \(\kappa _\mu ^\dagger \) .
Let \(\mu \in \mathcal M (\mathcal X)\) and let \(\textup{id} : \mathcal X \rightsquigarrow \mathcal X\) be the identity kernel. Then \(\textup{id}_\mu ^\dagger = \textup{id}\).
It suffices to show that \(\mu \otimes \textup{id} = ((\textup{id} \circ \mu ) \otimes \textup{id})_\leftrightarrow \) , which is clear.
Let \(\mu \in \mathcal M(\mathcal X)\), \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) and \(\eta : \mathcal Y \rightsquigarrow \mathcal Z\). Then \((\eta \circ \kappa \circ \mu )\)-a.e.,
It suffices to show that \(\mu \otimes (\eta \circ \kappa ) = ((\eta \circ \kappa \circ \mu ) \otimes (\kappa _{\mu }^\dagger \circ \eta _{\kappa \circ \mu }^\dagger ))_\leftrightarrow \).
TODO: transcribe the code.
Let \(\mu \in \mathcal M(\mathcal X)\) and \(\kappa : \mathcal X \rightsquigarrow \mathcal Y\) such that \(\kappa _\mu ^\dagger \) exists and there exists \(\nu \in \mathcal M(\mathcal X)\) such that \(\kappa _\mu ^\dagger \). Then for \(\mu \)-almost all \(x\) and \((\kappa \circ \mu )\)-almost all \(y\),
\(\mu \)-almost all \(x\) and \((\kappa \circ \mu )\)-almost all \(y\) is the same as \((\mu \times (\kappa \circ \mu ))\)-almost all \((x, y)\). It suffices to show that for all measurable sets of \(\mathcal X \times \mathcal Y\), the integrals with respect to that product measure of the two sides of the equality on the set are equal. Furthermore, it is sufficient to show it for sets of the form \(s \times t\) since those are a \(\pi \)-system.
TODO: do we have \(\kappa _\mu ^\dagger (y) \ll \mu \) and \(\kappa (x) \ll \kappa \circ \mu \) a.e.?
By the defining property of the Bayesian inverse, those two quantities are equal.
TODO: find a way to hide the following dummy lemma
Dummy node to summarize properties of the Bayesian inverse.