LeanBandits

2 Iterative stochastic algorithms

Warning: all times start at zero.

TODO: notations

All measurable spaces are assumed to be standard Borel.

Definition 2.1 Algorithm

A sequential, stochastic algorithm with actions in a measurable space \(\mathcal{A}\) and observations in a measurable space \(\mathcal{R}\) is described by the following data:

  • for all \(t \in \mathbb {N}\), a policy \(\pi _t : (\mathcal{A} \times \mathcal{R})^{t+1} \rightsquigarrow \mathcal{A}\), a Markov kernel which gives the distribution of the action of the algorithm at time \(t+1\) given the history of previous actions and observations,

  • \(P_0 \in \mathcal{P}(\mathcal{A})\), a probability measure that gives the distribution of the first action.
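To fix ideas, here is a minimal Lean sketch of this definition on top of Mathlib's Markov kernels; the structure and field names (`Algorithm`, `p0`, `policy`) are hypothetical and need not match the actual LeanBandits declarations. A history in \((\mathcal{A} \times \mathcal{R})^{t+1}\) is encoded as a function `Fin (t + 1) → 𝓐 × 𝓡`.

```lean
import Mathlib

open MeasureTheory ProbabilityTheory

/-- Sketch of Definition 2.1 (hypothetical names): a sequential stochastic algorithm,
given by the distribution of the first action and a policy kernel at every time step. -/
structure Algorithm (𝓐 𝓡 : Type*) [MeasurableSpace 𝓐] [MeasurableSpace 𝓡] where
  /-- distribution of the first action -/
  p0 : Measure 𝓐
  /-- the initial distribution is a probability measure -/
  p0_isProb : IsProbabilityMeasure p0
  /-- distribution of the action at time `t + 1` given the history up to time `t` -/
  policy : (t : ℕ) → Kernel (Fin (t + 1) → 𝓐 × 𝓡) 𝓐
  /-- every policy is a Markov kernel -/
  policy_isMarkov : ∀ t, IsMarkovKernel (policy t)
```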

After the algorithm takes an action, the environment generates an observation according to a Markov kernel \(\nu _t : (\mathcal{A} \times \mathcal{R})^{t+1} \times \mathcal{A} \rightsquigarrow \mathcal{R}\).

Definition 2.2 Environment

An environment with which an algorithm interacts is described by the following data:

  • for all \(t \in \mathbb {N}\), a feedback \(\nu _t : (\mathcal{A} \times \mathcal{R})^{t+1} \times \mathcal{A} \rightsquigarrow \mathcal{R}\), a Markov kernel which gives the distribution of the observation at time \(t+1\) given the history of previous pulls and observations, and the action of the algorithm at time \(t+1\),

  • \(\nu '_0 : \mathcal{A} \rightsquigarrow \mathcal{R}\), a Markov kernel that gives the distribution of the first observation given the first action.
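Continuing the Lean sketch above (same imports and `open`, hypothetical names), the data of an environment can be encoded as follows.

```lean
/-- Sketch of Definition 2.2 (hypothetical names): an environment, given by the feedback
kernel for the first observation and a feedback kernel at every later time step. -/
structure Environment (𝓐 𝓡 : Type*) [MeasurableSpace 𝓐] [MeasurableSpace 𝓡] where
  /-- distribution of the first observation given the first action -/
  ν0 : Kernel 𝓐 𝓡
  /-- the first feedback is a Markov kernel -/
  ν0_isMarkov : IsMarkovKernel ν0
  /-- distribution of the observation at time `t + 1` given the history up to time `t`
  and the action at time `t + 1` -/
  feedback : (t : ℕ) → Kernel ((Fin (t + 1) → 𝓐 × 𝓡) × 𝓐) 𝓡
  /-- every feedback kernel is a Markov kernel -/
  feedback_isMarkov : ∀ t, IsMarkovKernel (feedback t)
```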

Definition 2.3 Deterministic algorithm

An algorithm is deterministic if its initial probability measure \(P_0\) is a Dirac measure and if all its policies \(\pi _t\) are deterministic kernels.
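For instance, the simplest deterministic algorithm always plays a fixed action `a0`: its initial measure is a Dirac measure and all its policies are deterministic kernels. The following sketch continues the `Algorithm` sketch above; `constantAlgorithm` is a hypothetical name.

```lean
/-- The deterministic algorithm that always plays the action `a0`: Dirac initial
distribution and constant deterministic policy kernels. -/
noncomputable def constantAlgorithm {𝓐 𝓡 : Type*} [MeasurableSpace 𝓐] [MeasurableSpace 𝓡]
    (a0 : 𝓐) : Algorithm 𝓐 𝓡 where
  p0 := Measure.dirac a0
  p0_isProb := inferInstance
  policy _t := Kernel.deterministic (fun _ => a0) measurable_const
  policy_isMarkov _t := inferInstance
```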

Definition 2.4 Stationary environment

An environment is stationary if there exists a Markov kernel \(\nu : \mathcal{A} \rightsquigarrow \mathcal{R}\) such that \(\nu '_0 = \nu \) and for all \(t \in \mathbb {N}\), for all \(h_t \in (\mathcal{A} \times \mathcal{R})^{t+1}\), for all \(a \in \mathcal{A}\), \(\nu _t(h_t, a) = \nu (a)\).
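In the Lean sketch above, stationarity can be phrased as a predicate on an environment (hypothetical name): a single Markov kernel \(\nu \) gives the first feedback and, after forgetting the history, every later feedback.

```lean
/-- Sketch of Definition 2.4: `E` is stationary with kernel `ν` if the first feedback is `ν`
and every later feedback ignores the history and applies `ν` to the current action. -/
def Environment.IsStationary {𝓐 𝓡 : Type*} [MeasurableSpace 𝓐] [MeasurableSpace 𝓡]
    (E : Environment 𝓐 𝓡) (ν : Kernel 𝓐 𝓡) : Prop :=
  E.ν0 = ν ∧ ∀ t, E.feedback t = ν.comap Prod.snd measurable_snd
```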

TODO: possibly change the “stationary” name.

Let’s detail four examples of interactions between an algorithm and an environment.

  1. First order optimization. The objective of the algorithm is to find the minimum of a function \(f : \mathbb {R}^d \to \mathbb {R}\). The action space is \(\mathcal{A} = \mathbb {R}^d\) (a point at which the function will be queried) and the observation space is \(\mathcal{R} = \mathbb {R} \times \mathbb {R}^d\). The environment is described by a function \(g : \mathbb {R}^d \to \mathbb {R} \times \mathbb {R}^d\) such that for all \(x \in \mathbb {R}^d\), \(g(x) = (f(x), \nabla f(x))\). That is, the kernel \(\nu _t\) is deterministic and depends only on the action: it is given by \(\nu _t(h_t, x) = \delta _{g(x)}\) (the Dirac measure at \(g(x)\)). An example of an algorithm is gradient descent with fixed step size \(\eta {\gt} 0\): this is a deterministic algorithm defined by \(P_0 = \delta _{x_0}\) for some initial point \(x_0 \in \mathbb {R}^d\) and for all \(t \in \mathbb {N}\), \(\pi _t(h_t) = \delta _{x_{t+1}}\) where \(x_{t+1} = x_t - \eta g_2(x_t) = x_t - \eta \nabla f(x_t)\).

  2. Stochastic bandits. The action space is \(\mathcal{A} = [K]\) for some \(K \in \mathbb {N}\) (the set of arms) and the observation space is \(\mathcal{R} = \mathbb {R}\) (the reward obtained after pulling an arm). The environment is stationary and the feedback depends only on the action: there are probability distributions \((P_a)_{a \in [K]}\) such that for all \(t \in \mathbb {N}\), for all \(h_t \in (\mathcal{A} \times \mathcal{R})^{t+1}\), for all \(a \in \mathcal{A}\), \(\nu _t(h_t, a) = P_a\).

  3. Adversarial bandits. The action space is \(\mathcal{A} = [K]\) for some \(K \in \mathbb {N}\) (the set of arms) and the observation space is \(\mathcal{R} = \mathbb {R}\) (the reward obtained after pulling an arm). The reward kernels are usually taken to be deterministic and in an oblivious adversarial bandit they depend only on the time step: there is a sequence of vectors \((r_t)_{t \in \mathbb {N}}\) in \([0,1]^K\) such that for all \(t \in \mathbb {N}\), for all \(h_t \in (\mathcal{A} \times \mathcal{R})^{t+1}\), for all \(a \in \mathcal{A}\), \(\nu _t(h_t, a) = \delta _{r_{t,a}}\) (the Dirac measure at \(r_{t,a}\)).

  4. Reinforcement learning in Markov decision processes. TODO: main feature is that \(\mathcal{R} = \mathcal{S} \times \mathbb {R}\) where \(\mathcal{S}\) is the state space, and the kernel \(\nu _t\) depends on the last state only.

We will want to make global probabilistic statements about the whole sequence of actions and observations. For example, we may want to prove that an optimization algorithm converges to the minimum of a function almost surely. For such a statement to make sense, we need a probability space on which the whole sequence of actions and observations is defined as a random variable.

We denote by \(P[X \mid Y]\) the conditional distribution of a random variable \(X\) given another random variable \(Y\) under a probability measure \(P\). When we write that \(P[X \mid Y] = \kappa \), or that \(X\) has conditional distribution \(\kappa \) given \(Y\), the equality should be understood as holding \(Y_* P\)-almost surely.
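Explicitly, \(P[X \mid Y]\) is a Markov kernel \(\kappa \) characterized, up to \(Y_* P\)-almost sure equality, by the equation

\begin{align*} (Y_* P) \otimes \kappa = (Y, X)_* P \: , \end{align*}

where \(\otimes \) denotes the composition-product of a measure and a kernel. This is the defining property we will use repeatedly in the proofs below.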

Definition 2.5 Algorithm-environment interaction

Let \(\mathfrak {A}\) be an algorithm as in Definition 2.1 and \(\mathfrak {E}\) be an environment as in Definition 2.2. A probability space \((\Omega , P)\) and two sequences of random variables \(A : \mathbb {N} \to \Omega \to \mathcal{A}\) and \(R : \mathbb {N} \to \Omega \to \mathcal{R}\) form an algorithm-environment interaction for \(\mathfrak {A}\) and \(\mathfrak {E}\) if the following conditions hold:

  1. The law of \(A_0\) is \(P_0\).

  2. \(P \left[ R_0 \mid A_0 \right] = \nu '_0\).

  3. For all \(t \in \mathbb {N}\), \(P\left[A_{t+1} \mid A_0, R_0, \ldots , A_t, R_t \right] = \pi _t\).

  4. For all \(t \in \mathbb {N}\), \(P\left[R_{t+1} \mid A_0, R_0, \ldots , A_t, R_t, A_{t+1}\right] = \nu _t\).
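Unrolling these four conditions determines the distribution of the whole interaction: written informally (using associativity of the composition-product and dropping the canonical identifications of product spaces), for all \(t \in \mathbb {N}\), the law of \((A_0, R_0, \ldots , A_t, R_t)\) under \(P\) is

\begin{align*} (P_0 \otimes \nu '_0) \otimes (\pi _0 \otimes \nu _0) \otimes \cdots \otimes (\pi _{t-1} \otimes \nu _{t-1}) \: . \end{align*}

This is the content of the uniqueness result below (Theorem 2.9); Section 2.2 shows that a probability space realizing it always exists.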

Definition 2.6 History

For two sequences of random variables \(A : \mathbb {N} \to \Omega \to \mathcal{A}\) and \(R : \mathbb {N} \to \Omega \to \mathcal{R}\) (actions and observations), we call step of the interaction at time \(t\) the random variable \(X_t : \Omega \to \mathcal{A} \times \mathcal{R}\) defined by \(X_t(\omega ) = (A_t(\omega ), R_t(\omega ))\). We call history up to time \(t\) the random variable \(H_t : \Omega \to (\mathcal{A} \times \mathcal{R})^{t+1}\) defined by \(H_t(\omega ) = (X_0(\omega ), \ldots , X_t(\omega ))\).

Lemma 2.7

In an algorithm-environment interaction \((A, R, P)\) as in Definition 2.5,

  • the law of the initial step \(X_0\) is \(P_0 \otimes \nu '_0\),

  • for all \(t \in \mathbb {N}\), \(P \left[ X_{t+1} \mid H_t \right] = \pi _t \otimes \nu _t\).

Proof

Immediate from the properties of an algorithm-environment interaction.

For an algorithm-environment interaction \((A, R, P)\) as in Definition 2.5, we denote by \(\mathcal{F}_t\) the sigma-algebra generated by the history up to time \(t\): \(\mathcal{F}_t = \sigma (H_t)\). We denote by \(\mathcal{F}^A_t\) the sigma-algebra generated by the history up to time \(t-1\) and the action at time \(t\): \(\mathcal{F}^A_t = \sigma (H_{t-1}, A_t)\).

Theorem 2.9 [LS20], Proposition 4.8

If \((A, R, P)\) and \((A', R', P')\) are two algorithm-environment interactions for the same algorithm \(\mathfrak {A}\) and environment \(\mathfrak {E}\), then the joint distributions of the sequences of actions and observations are equal: the law of \((A_i, R_i)_{i \in \mathbb {N}}\) under \(P\) is equal to the law of \((A'_i, R'_i)_{i \in \mathbb {N}}\) under \(P'\).

Proof

2.1 Stationary environment

Recall that in a stationary environment, there exists a Markov kernel \(\nu : \mathcal{A} \rightsquigarrow \mathcal{R}\) such that \(\nu '_0 = \nu \) and for all \(t \in \mathbb {N}\), for all \(h_t \in (\mathcal{A} \times \mathcal{R})^{t+1}\), for all \(a \in \mathcal{A}\), \(\nu _t(h_t, a) = \nu (a)\).

Let \((A, R, P)\) be an algorithm-environment interaction in a stationary environment with kernel \(\nu \).

Lemma 2.10

In a stationary environment, for any \(t \in \mathbb {N}\), the conditional distribution \(P\left[R_t \mid A_t\right]\) is \((A_{t*} P)\)-almost surely equal to \(\nu \).

Proof

Lemma 2.11

In a stationary environment, for any \(t \in \mathbb {N}\), the reward \(R_{t+1}\) is conditionally independent of the history \(H_t\) given the action \(A_{t+1}\) (more succinctly, \(R_{t+1} \perp \! \! \! \! \perp H_t \mid A_{t+1}\)).

Proof

2.2 Probability space: Ionescu-Tulcea theorem

In Theorem 2.9, we saw that the distribution of the sequence of actions and observations in a suitable probability space is uniquely determined by the algorithm and the environment. We now show that such a probability space actually exists: for any algorithm and environment, we build an algorithm-environment interaction as in Definition 2.5.

2.2.1 Ionescu-Tulcea theorem

If we group together the policy of the algorithm and the kernel of the environment at each time step, we get a sequence of Markov kernels \((\kappa _t)_{t \in \mathbb {N}}\), with \(\kappa _t : (\mathcal{A} \times \mathcal{R})^{t+1} \rightsquigarrow (\mathcal{A} \times \mathcal{R})\).
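Concretely, \(\kappa _t = \pi _t \otimes \nu _t\): given a history \(h_t\), the kernel first draws the action from the policy and then the observation from the feedback kernel. In informal disintegration notation,

\begin{align*} \kappa _t(h_t, \mathrm{d}(a, r)) = \pi _t(h_t, \mathrm{d}a) \, \nu _t((h_t, a), \mathrm{d}r) \: . \end{align*}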

We now abstract that situation and consider a sequence of measurable spaces \((\Omega _t)_{t \in \mathbb {N}}\), a probability measure \(\mu \) on \(\Omega _0\) and a sequence of Markov kernels \(\kappa _t : \prod _{s=0}^t \Omega _s \rightsquigarrow \Omega _{t+1}\). The Ionescu-Tulcea theorem builds a probability space from the sequence of kernels and the initial measure.

Theorem 2.12 Ionescu-Tulcea

Let \((\Omega _t)_{t \in \mathbb {N}}\) be a family of measurable spaces. Let \((\kappa _t)_{t \in \mathbb {N}}\) be a family of Markov kernels such that for any \(t\), \(\kappa _t\) is a kernel from \(\prod _{i=0}^t \Omega _{i}\) to \(\Omega _{t+1}\). Then there exists a unique Markov kernel \(\xi : \Omega _0 \rightsquigarrow \prod _{i = 1}^{\infty } \Omega _{i}\) such that for any \(n \ge 1\), \(\pi _{[1,n]*} \xi = \kappa _0 \otimes \ldots \otimes \kappa _{n-1}\). Here \(\pi _{[1,n]} : \prod _{i=1}^{\infty } \Omega _i \to \prod _{i=1}^n \Omega _i\) is the projection on the first \(n\) coordinates.

Proof

The Ionescu-Tulcea theorem in Mathlib [Mar25] actually generates kernels \(\xi _t : \prod _{s=0}^t \Omega _s \rightsquigarrow \prod _{s=0}^{\infty } \Omega _s\) for any \(t\), with the property that the kernels are the identity on the first \(t+1\) coordinates.

Definition 2.13 Trajectory measure

For \(\mu \in \mathcal{P}(\Omega _0)\), we call trajectory measure the probability measure \(\xi _0 \circ \mu \) on \(\Omega _{\mathcal{T}} := \prod _{i=0}^{\infty } \Omega _i\). We denote it by \(P_{\mathcal{T}}\). The \(\mathcal{T}\) subscript stands for “trajectory”.

Definition 2.14 Step and history

For \(t \in \mathbb {N}\), we denote by \(X_t \in \Omega _t\) the random variable describing the time step \(t\), and by \(H_t \in \prod _{s=0}^t \Omega _s\) the history up to time \(t\). Formally, these are measurable functions on \(\Omega _{\mathcal{T}}\), defined by \(X_t(\omega ) = \omega _t\) and \(H_t(\omega ) = (\omega _0, \ldots , \omega _t)\).

Note: \((X_t)_{t \in \mathbb {N}}\) is the canonical process on \(\Omega _{\mathcal{T}}\). \(H_t\) is equal to \(\pi _{[0,t]}\).

Definition 2.15 Filtration

For \(t \in \mathbb {N}\), we denote by \(\mathcal{F}_t\) the sigma-algebra generated by the history up to time \(t\): \(\mathcal{F}_t = \sigma (H_t)\). The family \((\mathcal{F}_t)_{t \in \mathbb {N}}\) is a filtration on \(\Omega _{\mathcal{T}}\).

\((\mathcal{F}_t)_{t \in \mathbb {N}}\) is the canonical filtration on \(\Omega _{\mathcal{T}}\), and is the natural filtration for the canonical process \((X_t)_{t \in \mathbb {N}}\).

Lemma 2.16

The random variables \(X_t\) and \(H_t\) are \(\mathcal{F}_t\)-measurable. Said differently, the processes \((X_t)_{t \in \mathbb {N}}\) and \((H_t)_{t \in \mathbb {N}}\) are adapted to the filtration \((\mathcal{F}_t)_{t \in \mathbb {N}}\).

Proof

Lemma 2.17

For any \(t \in \mathbb {N}\), the conditional distribution \(P_{\mathcal{T}}\left[X_{t+1} \mid H_t\right]\) is \(((H_t)_* P_{\mathcal{T}})\)-almost surely equal to \(\kappa _t\).

Proof

This is proved through the defining property of the conditional distribution: it is the almost surely unique Markov kernel \(\eta \) such that \(((H_t)_* P_{\mathcal{T}}) \otimes \eta = (H_t, X_{t+1})_*P_{\mathcal{T}}\).

TODO: complete proof.

Lemma 2.18

The law of \(X_0\) under \(P_{\mathcal{T}}\) is \(\mu \).

Proof

2.2.2 Case of an algorithm-environment interaction

We now go back to the setting of an algorithm interacting with an environment and suppose that \(\Omega _t = \mathcal{A} \times \mathcal{R}\) for some measurable spaces \(\mathcal{A}\) and \(\mathcal{R}\), and that for all \(t \in \mathbb {N}\), \(\kappa _t = \pi _t \otimes \nu _t\) for policy kernels \(\pi _t : (\mathcal{A} \times \mathcal{R})^{t+1} \rightsquigarrow \mathcal{A}\) and feedback kernels \(\nu _t : (\mathcal{A} \times \mathcal{R})^{t+1} \times \mathcal{A} \rightsquigarrow \mathcal{R}\). Likewise, \(\mu = P_0 \otimes \nu '_0\) for a probability measure \(P_0\) on \(\mathcal{A}\) and a Markov kernel \(\nu '_0 : \mathcal{A} \rightsquigarrow \mathcal{R}\). The step random variable \(X_t\) takes values in \(\mathcal{A} \times \mathcal{R}\).

Definition 2.19

We write \(A_t\) and \(R_t\) for the projections of \(X_t\) on \(\mathcal{A}\) and \(\mathcal{R}\) respectively. \(A_t\) is the action taken at time \(t\) and \(R_t\) is the reward received at time \(t\). Formally, \(A_t(\omega ) = \omega _{t,1}\) and \(R_t(\omega ) = \omega _{t,2}\) for \(\omega = (\omega _{s,1}, \omega _{s,2})_{s \in \mathbb {N}} \in \Omega _{\mathcal{T}} = \prod _{s=0}^{+\infty } \mathcal{A} \times \mathcal{R}\).

Lemma 2.20

The random variables \(A_t\) and \(R_t\) are \(\mathcal{F}_t\)-measurable. Said differently, the processes \((A_t)_{t \in \mathbb {N}}\) and \((R_t)_{t \in \mathbb {N}}\) are adapted to the filtration \((\mathcal{F}_t)_{t \in \mathbb {N}}\).

Proof

We need to check that the random variables \(A_t\) and \(R_t\) have the expected conditional distributions.

Lemma 2.21

For any \(t \in \mathbb {N}\), \(P_{\mathcal{T}}\left[A_{t+1} \mid H_t\right] = \pi _t\).

Proof

By Lemma 2.17, \(P_{\mathcal{T}}\left[X_{t+1} \mid H_t\right] = \kappa _t = \pi _t \otimes \nu _t\). Since \(A_{t+1}\) is the projection of \(X_{t+1}\) on \(\mathcal{A}\), \(P_{\mathcal{T}}\left[A_{t+1} \mid H_t\right]\) is \(((H_t)_* P_{\mathcal{T}})\)-almost surely equal to the projection of \(\kappa _t\) on \(\mathcal{A}\), which is \(\pi _t\).

Lemma 2.22

For any \(t \in \mathbb {N}\), \(P_{\mathcal{T}}\left[R_{t+1} \mid H_t, A_{t+1}\right] = \nu _t\).

Proof

It suffices to show that \(((H_t, A_{t+1})_* P_{\mathcal{T}}) \otimes \nu _t = (H_t, A_{t+1}, R_{t+1})_* P_{\mathcal{T}} = (H_t, X_{t+1})_* P_{\mathcal{T}}\). By Lemma 2.17, \(P_{\mathcal{T}}\left[X_{t+1} \mid H_t\right] = \pi _t \otimes \nu _t\). Thus \(((H_t)_* P_{\mathcal{T}}) \otimes (\pi _t \otimes \nu _t) = (H_t, X_{t+1})_* P_{\mathcal{T}}\).

We thus have to prove that \(((H_t)_* P_{\mathcal{T}}) \otimes (\pi _t \otimes \nu _t) = ((H_t, A_{t+1})_* P_{\mathcal{T}}) \otimes \nu _t\).

By Lemma 2.21, \((H_t, A_{t+1})_* P_{\mathcal{T}} = (H_t)_* P_{\mathcal{T}} \otimes \pi _t\), and replacing this in the right-hand side gives the left-hand side (using associativity of the composition-product).

Lemma 2.23

The law of \(A_0\) under \(P_{\mathcal{T}}\) is \(P_0\).

Proof

\(X_0\) has law \(\mu = P_0 \otimes \nu '_0\). \(A_0\) is the projection of \(X_0\) on the first space \(\mathcal{A}\) and \(\nu _0'\) is Markov, so \(A_0\) has law \(P_0\).

Lemma 2.24

\(P_{\mathcal{T}}\left[R_0 \mid A_0\right] = \nu '_0\).

Proof

To prove almost sure equality, it is enough to prove that \((A_{0*} P_{\mathcal{T}}) \otimes P_{\mathcal{T}}\left[R_0 \mid A_0\right] = (A_{0*} P_{\mathcal{T}}) \otimes \nu '_0\). By definition of the conditional distribution, we have \((A_{0*} P_{\mathcal{T}}) \otimes P_{\mathcal{T}}\left[R_0 \mid A_0\right] = (A_0, R_0)_* P_{\mathcal{T}} = X_{0*} P_{\mathcal{T}}\). By Lemma 2.18, \(X_{0*} P_{\mathcal{T}} = \mu = P_0 \otimes \nu '_0\). By Lemma 2.23, \(A_{0*} P_{\mathcal{T}} = P_0\). Thus the two sides are equal.

Theorem 2.25

In the probability space \((\Omega _{\mathcal{T}}, P_{\mathcal{T}})\) constructed from an algorithm \(\mathfrak {A}\) and an environment \(\mathfrak {E}\) as above, the sequences of random variables \(A : \mathbb {N} \to \Omega _{\mathcal{T}} \to \mathcal{A}\) and \(R : \mathbb {N} \to \Omega _{\mathcal{T}} \to \mathcal{R}\) form an algorithm-environment interaction for \(\mathfrak {A}\) and \(\mathfrak {E}\).

Proof

The four conditions of Definition 2.5 are exactly the statements of Lemmas 2.23, 2.24, 2.21 and 2.22.

2.3 Finitely many actions

When the number of actions is finite, it makes sense to count how many times each action was chosen up to a certain time. We can also define the time step at which an action was chosen a certain number of times, and the value of the reward obtained when pulling an action for the \(m\)-th time.

Definition 2.26 Pull counts

For an action \(a \in \mathcal{A}\) and a time \(t \in \mathbb {N}\), we denote by \(N_{t,a}\) the number of times that action \(a\) has been chosen before time \(t\), that is \(N_{t,a} = \sum _{s=0}^{t-1} \mathbb {I}\{ A_s = a\} \).

Note that the sum stops at \(s = t - 1\): the action chosen at time \(t\) itself is not counted in \(N_{t,a}\).
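As a Lean sketch (hypothetical name, stated for an arbitrary action process on a generic space \(\Omega \)), the pull count is a finite sum of indicators:

```lean
import Mathlib

/-- Sketch of Definition 2.26 (hypothetical name): number of times the action process `A`
has chosen the action `a` strictly before time `t`. -/
def pullCount {Ω 𝓐 : Type*} [DecidableEq 𝓐] (A : ℕ → Ω → 𝓐) (a : 𝓐) (t : ℕ) (ω : Ω) : ℕ :=
  ∑ s ∈ Finset.range t, if A s ω = a then 1 else 0
```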

Remark 2.27 Building vs analyzing algorithms

When we describe an algorithm, we give the data of the policies \(\pi _t\), which are functions of the partial history up to time \(t\), in \((\mathcal{A} \times \mathcal{R})^{t+1}\). That means that any tool used to define a policy must be a function defined on \((\mathcal{A} \times \mathcal{R})^{t+1}\). For example, a definition of the empirical mean of an action must be a function \((t : \mathbb {N}) \to (\mathcal{A} \times \mathcal{R})^{t+1} \to \mathbb {R}\), depending on both the time \(t\) and the partial history.

When we analyze an algorithm, we instead work on a probability space \((\Omega , P)\), where \(\Omega \) could be for example \((\mathcal{A} \times \mathcal{R})^{\mathbb {N}}\), the space of full histories, which describes the whole sequence of actions and rewards. As a stochastic process, the empirical mean of an action is then a function \(\mathbb {N} \to (\mathcal{A} \times \mathcal{R})^{\mathbb {N}} \to \mathbb {R}\).

Thus there are two similar but still distinct types of objects: those defined on the partial history, which are used to build algorithms, and those defined on a generic probability space (the full history in the Ionescu-Tulcea construction), which are used to analyze algorithms.
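The distinction is visible in the Lean types involved; the following sketch (hypothetical names) records only the two signatures.

```lean
import Mathlib

variable (𝓐 𝓡 : Type*)

/-- type of a statistic (for instance an empirical mean) as used to *build* a policy:
a function of the time `t` and of the partial history of length `t + 1`. -/
abbrev PartialHistoryStat := (t : ℕ) → (Fin (t + 1) → 𝓐 × 𝓡) → ℝ

/-- type of the same statistic as used to *analyze* an algorithm: a stochastic process
indexed by time on the full trajectory space `(𝓐 × 𝓡)^ℕ`. -/
abbrev TrajectoryStat := ℕ → (ℕ → 𝓐 × 𝓡) → ℝ
```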

Lemma 2.28

We note the following basic properties of \(N_{t,a}\):

  • \(N_{0,a} = 0\).

  • \(N_{t,a}\) is non-decreasing in \(t\).

  • \(N_{t + 1, A_t} = N_{t, A_t} + 1\) and for \(a \ne A_t\), \(N_{t + 1, a} = N_{t, a}\).

  • \(N_{t, a} \le t\).

  • If for all \(s \le t\), \(A_s(\omega ) = A_s(\omega ')\), then \(N_{t+1, a}(\omega ) = N_{t+1, a}(\omega ')\).

Proof
Lemma 2.29

Let \(a \in \mathcal{A}\). The process \((N_{t,a})_{t \in \mathbb {N}}\) is predictable with respect to the filtration \(\mathcal{F}\) of the algorithm-environment interaction.

Proof
Definition 2.30

For an action \(a \in \mathcal{A}\) and a time \(n \in \mathbb {N}\), we denote by \(T_{n,a} \in \mathbb {N} \cup \{ +\infty \} \) the time at which action \(a\) was chosen for the \(n\)-th time, that is \(T_{n,a} = \min \{ s \in \mathbb {N} \mid N_{s+1,a} = n\} \). Note that \(T_{n, a}\) can be infinite if the action is not chosen \(n\) times.

By definition, \(T_{n, a}\) is the hitting time of the set \(\{ n\} \) by the process \(t \mapsto N_{t+1,a}\), which is adapted since \(N_{t,a}\) is predictable. Equivalently, \(T_{n, a}\) is the hitting time of the set \([n, +\infty ]\) by that process.

Lemma 2.31

We note the following basic properties of \(T_{n,a}\):

  • \(T_{0,a} = 0\) for \(a \ne A_0\). \(T_{0,A_0} = \infty \).

  • \(T_{N_{t+1, a}, a} \le t\).

  • \(T_{N_{t + 1, A_t}, A_t} = t\).

  • If \(T_{n, a} \ne \infty \) and \(n {\gt} 0\), then \(A_{T_{n, a}} = a\).

  • If \(T_{n, a} \ne \infty \), then \(N_{T_{n, a} + 1, a} = n\).

  • If \(T_{n, a} \ne \infty \) and \(n {\gt} 0\), then \(N_{T_{n, a}, a} = n - 1\).

  • If for all \(s \le t\), \(A_s(\omega ) = A_s(\omega ')\), then \(T_{n, a}(\omega ) = t \iff T_{n, a}(\omega ') = t\).

Proof
Lemma 2.32

Let \(a \in \mathcal{A}\). For any \(n {\gt} 0\), the random variable \(T_{n,a}\) is a stopping time with respect to the filtration \(\mathcal{F}\).

Proof

A hitting time of a set by an adapted process is a stopping time.

Let \(\Omega ' = \mathcal{R}^{\mathbb {N} \times \mathcal{A}}\) and let \(\Omega = \Omega _{\mathcal{T}} \times \Omega '\) (which will be an extension of the trajectory probability space once we choose a measure on \(\Omega '\)). Let \(Z_{n, a} : \Omega \to \mathcal{R}\) be the projection on the coordinate indexed by \((n,a)\) in \(\Omega '\). Extending the probability space in that way allows us to define without ambiguity the reward received when choosing an action for the \(n\)-th time, even if that action is never actually chosen \(n\) times.

Definition 2.33 nth reward

We define \(Y_{n, a} = R_{T_{n,a}} \mathbb {I}\{ T_{n, a} {\lt} \infty \} + Z_{n,a} \mathbb {I}\{ T_{n, a} = \infty \} \), the reward received when choosing action \(a\) for the \(n\)-th time if that time is finite, and equal to \(Z_{n,a}\) otherwise. In that expression, we see \(R_{T_{n,a}}\) and \(T_{n, a}\) as random variables on \(\Omega \) instead of \(\Omega _{\mathcal{T}}\).

Lemma 2.34

\(Y_{N_{t, A_t} + 1, A_t} = R_t\).

Proof

This is perhaps hard to parse at first sight, but it follows directly from the definitions. At time \(t\), the action chosen is \(A_t\) and we see a reward \(R_t\). That action had already been chosen \(N_{t, A_t}\) times before time \(t\), so the reward \(R_t\) is the reward received when choosing action \(A_t\) for the \((N_{t, A_t} + 1)\)-th time, which is \(Y_{N_{t, A_t} + 1, A_t}\) by definition.
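For instance, suppose the first three actions are \(A_0 = 1\), \(A_1 = 2\), \(A_2 = 1\). Then \(N_{2, 1} = 1\) (action 1 was chosen once before time 2), \(T_{1,1} = 0\) and \(T_{2,1} = 2\), and at \(t = 2\) the lemma reads \(Y_{N_{2,1} + 1, 1} = Y_{2, 1} = R_2\): the reward observed at time 2 is the second reward received from action 1.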

2.4 Scalar rewards

TODO: change the name “reward” to “observation” throughout the chapter?

We now focus on the case where the reward space is \(\mathcal{R} = \mathbb {R}\).

Definition 2.35 Sum of rewards

Let \(S_{t, a} = \sum _{s=0}^{t-1} R_s \mathbb {I}\{ A_s = a\} \) be the sum of the rewards obtained by choosing action \(a\) before time \(t\).

Definition 2.36 Empirical mean

Let \(\hat{\mu }_{t, a} = \frac{S_{t, a}}{N_{t, a}} = \frac{1}{N_{t,a}} \sum _{s=0}^{t-1} R_s \mathbb {I}\{ A_s = a\} \) if \(N_{t, a} {\gt} 0\), and \(\hat{\mu }_{t, a} = 0\) otherwise. This is the empirical mean of the rewards obtained by choosing action \(a\) before time \(t\).

Note: in bandit papers it is common to (implicitly) define the empirical mean as \(+\infty \) when the action was never chosen, but in Lean it has to be a real number, and the Lean default value for division by zero is \(0\).
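Here is a Lean sketch of these two definitions (hypothetical names, in the same spirit as the pull count sketch of Section 2.3); the \(0\) convention for an action that was never chosen comes for free from division by zero being \(0\) in Lean:

```lean
import Mathlib

noncomputable section

/-- Sketch of Definition 2.35: sum of the rewards obtained by choosing `a` before time `t`. -/
def sumRewards {Ω 𝓐 : Type*} [DecidableEq 𝓐] (A : ℕ → Ω → 𝓐) (R : ℕ → Ω → ℝ)
    (a : 𝓐) (t : ℕ) (ω : Ω) : ℝ :=
  ∑ s ∈ Finset.range t, if A s ω = a then R s ω else 0

/-- Sketch of Definition 2.36: empirical mean of action `a` before time `t`, equal to `0`
when the action was never chosen since `x / 0 = 0` in Mathlib. -/
def empMean {Ω 𝓐 : Type*} [DecidableEq 𝓐] (A : ℕ → Ω → 𝓐) (R : ℕ → Ω → ℝ)
    (a : 𝓐) (t : ℕ) (ω : Ω) : ℝ :=
  sumRewards A R a t ω / (∑ s ∈ Finset.range t, if A s ω = a then (1 : ℝ) else 0)

end
```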

The following lemma is very useful to relate the two ways of indexing the rewards: by time step and by pull count.

Lemma 2.37

The sum of the first \(N_{t, a}\) rewards received when choosing action \(a\) is equal to the sum of the rewards obtained by choosing action \(a\) before time \(t\):

\begin{align*} \sum _{n=1}^{N_{t, a}} Y_{n, a} = S_{t,a} \: . \end{align*}
Proof