LeanBandits

2 Iterative stochastic algorithms

Definition 2.1 Algorithm

A sequential, stochastic algorithm with actions in a measurable space \(\mathcal{A}\) and observations in a measurable space \(\mathcal{R}\) is described by the following data:

  • for all \(t \in \mathbb {N}\), a policy \(\pi _t : (\mathcal{A} \times \mathcal{R})^{t+1} \rightsquigarrow \mathcal{A}\), a Markov kernel which gives the distribution of the action of the algorithm at time \(t+1\) given the history of previous pulls and observations,

  • \(P_0 \in \mathcal{P}(\mathcal{A})\), a probability measure that gives the distribution of the first action.

After the algorithm takes an action, the environment generates an observation according to a Markov kernel \(\nu _t : (\mathcal{A} \times \mathcal{R})^{t+1} \times \mathcal{A} \rightsquigarrow \mathcal{R}\).

Definition 2.2 Environment

An environment with which an algorithm interacts is described by the following data:

  • for all \(t \in \mathbb {N}\), a feedback \(\nu _t : (\mathcal{A} \times \mathcal{R})^{t+1} \times \mathcal{A} \rightsquigarrow \mathcal{R}\), a Markov kernel which gives the distribution of the observation at time \(t+1\) given the history of previous pulls and observations, and the action of the algorithm at time \(t+1\),

  • \(\nu '_0 : \mathcal{A} \rightsquigarrow \mathcal{R}\), a Markov kernel that gives the distribution of the first observation given the first action.
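As a rough illustration, Definitions 2.1 and 2.2 could be sketched in Lean on top of Mathlib's `Kernel` type. The structure and field names below are illustrative, not the actual LeanBandits code; histories of length \(t+1\) are encoded as functions `Fin (t + 1) → 𝓐 × 𝓡`.

```lean
import Mathlib.Probability.Kernel.Basic

open ProbabilityTheory MeasureTheory

variable (𝓐 𝓡 : Type*) [MeasurableSpace 𝓐] [MeasurableSpace 𝓡]

/-- Sketch of Definition 2.1: a policy for every time step, together with the
law of the first action. -/
structure Algorithm where
  /-- Distribution of the action at time `t + 1` given the history. -/
  policy : (t : ℕ) → Kernel (Fin (t + 1) → 𝓐 × 𝓡) 𝓐
  policy_markov : ∀ t, IsMarkovKernel (policy t)
  /-- Distribution of the first action. -/
  P₀ : Measure 𝓐
  P₀_prob : IsProbabilityMeasure P₀

/-- Sketch of Definition 2.2: a feedback kernel for every time step, together
with a kernel for the first observation given the first action. -/
structure Environment where
  /-- Distribution of the observation at time `t + 1` given the history and
  the action at time `t + 1`. -/
  feedback : (t : ℕ) → Kernel ((Fin (t + 1) → 𝓐 × 𝓡) × 𝓐) 𝓡
  feedback_markov : ∀ t, IsMarkovKernel (feedback t)
  /-- Distribution of the first observation given the first action. -/
  ν₀' : Kernel 𝓐 𝓡
  ν₀'_markov : IsMarkovKernel ν₀'
```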

Let’s detail four examples of interactions between an algorithm and an environment.

  1. First order optimization. The objective of the algorithm is to find the minimum of a function \(f : \mathbb {R}^d \to \mathbb {R}\). The action space is \(\mathcal{A} = \mathbb {R}^d\) (a point at which the function will be queried) and the observation space is \(\mathcal{R} = \mathbb {R} \times \mathbb {R}^d\). The environment is described by a function \(g : \mathbb {R}^d \to \mathbb {R} \times \mathbb {R}^d\) such that for all \(x \in \mathbb {R}^d\), \(g(x) = (f(x), \nabla f(x))\). That is, the kernel \(\nu _t\) is deterministic and depends only on the action: it is given by \(\nu _t(h_t, x) = \delta _{g(x)}\) (the Dirac measure at \(g(x)\)). An example of an algorithm is gradient descent with fixed step size \(\eta > 0\): this is a deterministic algorithm defined by \(P_0 = \delta _{x_0}\) for some initial point \(x_0 \in \mathbb {R}^d\) and for all \(t \in \mathbb {N}\), \(\pi _t(h_t) = \delta _{x_{t+1}}\), where \(x_{t+1} = x_t - \eta \, g_2(x_t)\) and \(g_2(x_t) = \nabla f(x_t)\) is the gradient component of the last observation.

  2. Stochastic bandits. The action space is \(\mathcal{A} = [K]\) for some \(K \in \mathbb {N}\) (the set of arms) and the observation space is \(\mathcal{R} = \mathbb {R}\) (the reward obtained after pulling an arm). The kernel \(\nu _t\) is stationary and depends only on the action: there are probability distributions \((P_a)_{a \in [K]}\) such that for all \(t \in \mathbb {N}\), for all \(h_t \in (\mathcal{A} \times \mathcal{R})^{t+1}\), for all \(a \in \mathcal{A}\), \(\nu _t(h_t, a) = P_a\).

  3. Adversarial bandits. The action space is \(\mathcal{A} = [K]\) for some \(K \in \mathbb {N}\) (the set of arms) and the observation space is \(\mathcal{R} = \mathbb {R}\) (the reward obtained after pulling an arm). The reward kernels are usually taken to be deterministic and in an oblivious adversarial bandit they depend only on the time step: there is a sequence of vectors \((r_t)_{t \in \mathbb {N}}\) in \([0,1]^K\) such that for all \(t \in \mathbb {N}\), for all \(h_t \in (\mathcal{A} \times \mathcal{R})^{t+1}\), for all \(a \in \mathcal{A}\), \(\nu _t(h_t, a) = \delta _{r_{t,a}}\) (the Dirac measure at \(r_{t,a}\)).

  4. Reinforcement learning in Markov decision processes. TODO: the main feature is that \(\mathcal{R} = \mathcal{S} \times \mathbb {R}\), where \(\mathcal{S}\) is the state space, and the kernel \(\nu _t\) depends only on the last state.
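The stationary feedback kernel of example 2 can be written by pulling back a kernel on the arms along the projection that forgets the history. A minimal Lean sketch, assuming Mathlib's `Kernel.ofFunOfCountable` and `Kernel.comap` (the name `stationaryFeedback` is illustrative):

```lean
import Mathlib.Probability.Kernel.Basic

open ProbabilityTheory MeasureTheory

/-- Sketch of the stochastic-bandit feedback: the kernel ignores the history
and the time step, and draws the reward from `P a` for the pulled arm `a`. -/
noncomputable def stationaryFeedback {K : ℕ} (P : Fin K → Measure ℝ) (t : ℕ) :
    Kernel ((Fin (t + 1) → Fin K × ℝ) × Fin K) ℝ :=
  (Kernel.ofFunOfCountable P).comap Prod.snd measurable_snd
```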

2.1 Ionescu-Tulcea theorem

If we group together the policy of the algorithm and the kernel of the environment at each time step, we get a sequence of Markov kernels \((\kappa _t)_{t \in \mathbb {N}}\), with \(\kappa _t : (\mathcal{A} \times \mathcal{R})^{t+1} \rightsquigarrow (\mathcal{A} \times \mathcal{R})\). We will want to make global probabilistic statements about the whole sequence of actions and observations. For example, we may want to prove that an optimization algorithm converges to the minimum of a function almost surely. For such a statement to make sense, we need a probability space on which the whole sequence of actions and observations is defined as a random variable.
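Explicitly, grouping the policy and the feedback at time \(t\) means taking their composition-product: for a history \(h_t\),

\[ \kappa _t(h_t, \mathrm{d}a \times \mathrm{d}r) = \pi _t(h_t, \mathrm{d}a) \, \nu _t((h_t, a), \mathrm{d}r) , \]

that is, \(\kappa _t = \pi _t \otimes \nu _t\): first sample an action from the policy, then an observation from the environment given that action.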

We now abstract that situation and consider a sequence of measurable spaces \((\Omega _t)_{t \in \mathbb {N}}\), a probability measure \(\mu \) on \(\Omega _0\) and a sequence of Markov kernels \(\kappa _t : \prod _{s=0}^t \Omega _s \rightsquigarrow \Omega _{t+1}\). The Ionescu-Tulcea theorem builds a probability space from the sequence of kernels and the initial measure.

Theorem 2.3 Ionescu-Tulcea

Let \((\Omega _t)_{t \in \mathbb {N}}\) be a family of measurable spaces. Let \((\kappa _t)_{t \in \mathbb {N}}\) be a family of Markov kernels such that for any \(t\), \(\kappa _t\) is a kernel from \(\prod _{i=0}^t \Omega _{i}\) to \(\Omega _{t+1}\). Then there exists a unique Markov kernel \(\xi : \Omega _0 \rightsquigarrow \prod _{i = 1}^{\infty } \Omega _{i}\) such that for any \(n \ge 1\), \(\pi _{[1,n]*} \xi = \kappa _0 \otimes \ldots \otimes \kappa _{n-1}\). Here \(\pi _{[1,n]} : \prod _{i=1}^{\infty } \Omega _i \to \prod _{i=1}^n \Omega _i\) is the projection on the first \(n\) coordinates.

Proof

The Ionescu-Tulcea theorem in Mathlib [ Mar25 ] actually generates kernels \(\xi _t : \prod _{s=0}^t \Omega _s \rightsquigarrow \prod _{s=0}^{\infty } \Omega _s\) for any \(t\), with the property that the kernels are the identity on the first \(t+1\) coordinates.
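Concretely, up to the identification of the first coordinates, the kernels \(\xi _t\) satisfy, for all \(t\) and all \(n > t\),

\[ \pi _{[0,n]*} \, \xi _t(h_t) = \delta _{h_t} \otimes \kappa _t \otimes \ldots \otimes \kappa _{n-1} \]

for every history \(h_t \in \prod _{s=0}^t \Omega _s\), where \(\pi _{[0,n]}\) denotes the projection on the coordinates \(0, \ldots , n\).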

Definition 2.4

For \(\mu \in \mathcal{P}(\Omega _0)\), the trajectory measure is the probability measure \(\xi _0 \circ \mu \) on \(\Omega _{\mathcal{T}} := \prod _{i=0}^{\infty } \Omega _i\). We denote it by \(P_{\mathcal{T}}\); the \(\mathcal{T}\) subscript stands for “trajectory”.
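Unfolding the composition \(\xi _0 \circ \mu \): for every measurable set \(B \subseteq \Omega _{\mathcal{T}}\),

\[ P_{\mathcal{T}}(B) = \int _{\Omega _0} \xi _0(\omega _0, B) \, \mu (\mathrm{d}\omega _0) . \]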

Definition 2.5

For \(t \in \mathbb {N}\), we denote by \(X_t \in \Omega _t\) the random variable describing the time step \(t\), and by \(H_t \in \prod _{s=0}^t \Omega _s\) the history up to time \(t\). Formally, these are measurable functions on \(\Omega _{\mathcal{T}}\), defined by \(X_t(\omega ) = \omega _t\) and \(H_t(\omega ) = (\omega _0, \ldots , \omega _t)\).

Note: \((X_t)_{t \in \mathbb {N}}\) is the canonical process on \(\Omega _{\mathcal{T}}\). \(H_t\) is equal to \(\pi _{[0,t]}\).
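In Lean, these are simply coordinate evaluations on the product space. A minimal sketch (illustrative names, not the LeanBandits source):

```lean
import Mathlib.MeasureTheory.MeasurableSpace.Basic

variable {Ω : ℕ → Type*} [∀ t, MeasurableSpace (Ω t)]

/-- The canonical process: `X t` reads off coordinate `t` of a trajectory. -/
def X (t : ℕ) (ω : ∀ s, Ω s) : Ω t := ω t

/-- The history up to time `t`: the restriction to coordinates `s ≤ t`. -/
def H (t : ℕ) (ω : ∀ s, Ω s) : ∀ s : Set.Iic t, Ω s := fun s => ω s

theorem measurable_X (t : ℕ) : Measurable (X (Ω := Ω) t) :=
  measurable_pi_apply t

theorem measurable_H (t : ℕ) : Measurable (H (Ω := Ω) t) :=
  measurable_pi_lambda _ fun s => measurable_pi_apply s
```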

We now list properties of those random variables that follow from the construction of the trajectory measure.

We write \(P[X \mid Y]\) for the conditional distribution of a random variable \(X\) given another random variable \(Y\) under a probability measure \(P\).

Lemma 2.6

For any \(t \in \mathbb {N}\), the conditional distribution \(P_{\mathcal{T}}\left[X_{t+1} \mid H_t\right]\) is \(((H_t)_* P_{\mathcal{T}})\)-almost surely equal to \(\kappa _t\).

Proof

This is proved through the defining property of the conditional distribution: it is the almost surely unique Markov kernel \(\eta \) such that \(((H_t)_* P_{\mathcal{T}}) \otimes \eta = (H_t, X_{t+1})_*P_{\mathcal{T}}\).

TODO: complete proof.

Lemma 2.7

The law of \(X_0\) under \(P_{\mathcal{T}}\) is \(\mu \).

Proof

By the remark after Theorem 2.3, the kernel \(\xi _0\) is the identity on the coordinate in \(\Omega _0\), so that \(X_{0*} \xi _0(\omega _0) = \delta _{\omega _0}\) for all \(\omega _0 \in \Omega _0\). Hence \(X_{0*} P_{\mathcal{T}} = X_{0*}(\xi _0 \circ \mu ) = \int \delta _{\omega _0} \, \mu (\mathrm{d}\omega _0) = \mu \).

Case of a composition-product

We suppose now that \(\Omega _t = \mathcal{A}_t \times \mathcal{R}_t\) for some measurable spaces \(\mathcal{A}_t\) and \(\mathcal{R}_t\), and that for all \(t \in \mathbb {N}\), \(\kappa _t = \pi _t \otimes \nu _t\) for kernels \(\pi _t : \prod _{s=0}^t(\mathcal{A}_s \times \mathcal{R}_s) \rightsquigarrow \mathcal{A}_{t+1}\) and \(\nu _t : \prod _{s=0}^t(\mathcal{A}_s \times \mathcal{R}_s) \times \mathcal{A}_{t+1} \rightsquigarrow \mathcal{R}_{t+1}\). Likewise, \(\mu = \alpha _0 \otimes \nu '_0\) for a probability measure \(\alpha _0\) on \(\mathcal{A}_0\) and a Markov kernel \(\nu '_0 : \mathcal{A}_0 \rightsquigarrow \mathcal{R}_0\). This is the case of an algorithm interacting with an environment as described above. We write \(A_t\) and \(R_t\) for the projections of \(X_t\) on \(\mathcal{A}_t\) and \(\mathcal{R}_t\) respectively.

Lemma 2.8

For any \(t \in \mathbb {N}\), the conditional distribution \(P_{\mathcal{T}}\left[A_{t+1} \mid H_t\right]\) is \(((H_t)_* P_{\mathcal{T}})\)-almost surely equal to \(\pi _t\).

Proof

By Lemma 2.6, \(P_{\mathcal{T}}\left[X_{t+1} \mid H_t\right]\) is \(((H_t)_* P_{\mathcal{T}})\)-almost surely equal to \(\kappa _t = \pi _t \otimes \nu _t\). Since \(A_{t+1}\) is the projection of \(X_{t+1}\) on \(\mathcal{A}_{t+1}\), \(P_{\mathcal{T}}\left[A_{t+1} \mid H_t\right]\) is \(((H_t)_* P_{\mathcal{T}})\)-almost surely equal to the projection of \(\kappa _t\) on \(\mathcal{A}_{t+1}\), which is \(\pi _t\).

Lemma 2.9

For any \(t \in \mathbb {N}\), the conditional distribution \(P_{\mathcal{T}}\left[R_{t+1} \mid H_t, A_{t+1}\right]\) is \(((H_t, A_{t+1})_* P_{\mathcal{T}})\)-almost surely equal to \(\nu _t\).

Proof

It suffices to show that \(((H_t, A_{t+1})_* P_{\mathcal{T}}) \otimes \nu _t = (H_t, A_{t+1}, R_{t+1})_* P_{\mathcal{T}} = (H_t, X_{t+1})_* P_{\mathcal{T}}\). By Lemma 2.6, \(P_{\mathcal{T}}\left[X_{t+1} \mid H_t\right]\) is \(((H_t)_* P_{\mathcal{T}})\)-almost surely equal to \(\kappa _t = \pi _t \otimes \nu _t\). Thus \(((H_t)_* P_{\mathcal{T}}) \otimes (\pi _t \otimes \nu _t) = (H_t, X_{t+1})_* P_{\mathcal{T}}\).

We thus have to prove that \(((H_t)_* P_{\mathcal{T}}) \otimes (\pi _t \otimes \nu _t) = ((H_t, A_{t+1})_* P_{\mathcal{T}}) \otimes \nu _t\).

By Lemma 2.8, \((H_t, A_{t+1})_* P_{\mathcal{T}} = (H_t)_* P_{\mathcal{T}} \otimes \pi _t\), and replacing this in the right-hand side gives the left-hand side (using associativity of the composition-product).
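In one line, the chain of equalities reads

\[ ((H_t)_* P_{\mathcal{T}}) \otimes (\pi _t \otimes \nu _t) = \left( ((H_t)_* P_{\mathcal{T}}) \otimes \pi _t \right) \otimes \nu _t = ((H_t, A_{t+1})_* P_{\mathcal{T}}) \otimes \nu _t . \]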

Lemma 2.10

The law of \(A_0\) under \(P_{\mathcal{T}}\) is \(\alpha _0\).

Proof

\(X_0\) has law \(\mu = \alpha _0 \otimes \nu '_0\). \(A_0\) is the projection of \(X_0\) on the first space \(\mathcal{A}_0\) and \(\nu _0'\) is Markov, so \(A_0\) has law \(\alpha _0\).

Lemma 2.11

The conditional distribution \(P_{\mathcal{T}}\left[R_0 \mid A_0\right]\) is \((A_{0*} P_{\mathcal{T}})\)-almost surely equal to \(\nu '_0\).

Proof

To prove almost sure equality, it is enough to prove that \((A_{0*} P_{\mathcal{T}}) \otimes P_{\mathcal{T}}\left[R_0 \mid A_0\right] = (A_{0*} P_{\mathcal{T}}) \otimes \nu '_0\). By definition of the conditional distribution, we have \((A_{0*} P_{\mathcal{T}}) \otimes P_{\mathcal{T}}\left[R_0 \mid A_0\right] = (A_0, R_0)_* P_{\mathcal{T}} = X_{0*} P_{\mathcal{T}}\). By Lemma 2.7, \(X_{0*} P_{\mathcal{T}} = \mu = \alpha _0 \otimes \nu '_0\). By Lemma 2.10, \(A_{0*} P_{\mathcal{T}} = \alpha _0\). Thus the two sides are equal.

2.2 Independence and Markov property

The structure of the sequence of kernels \((\kappa _t)_{t \in \mathbb {N}}\) is reflected in independence properties of the sequence of random variables \((X_t)_{t \in \mathbb {N}}\).