LeanBandits

3 Stochastic multi-armed bandits

3.1 Algorithm, bandit and probability space

A bandit algorithm is an algorithm in the sense of Definition 2.1. We call the actions arms and the observations rewards.

The first arm pulled by the algorithm is sampled from \(P_0\); the arm pulled at time \(1\) is sampled from \(\pi _0(H_0)\), where \(H_0 \in \mathcal{A} \times \mathcal{R}\) is the pair formed by the first arm pulled and the first reward received; and so on.

Definition 3.1 Bandit

A stochastic bandit is simply a reward distribution for each arm: a Markov kernel \(\nu : \mathcal{A} \rightsquigarrow \mathbb {R}\), the conditional distribution of the rewards given the arm pulled. It is a stationary environment in which the observation space is \(\mathcal{R} = \mathbb {R}\).

An algorithm can interact with a bandit to produce a sequence of arms and rewards. After time \(t\), the history \(H_t \in (\mathcal{A} \times \mathcal{R})^{t+1}\) contains the arms pulled and rewards received up to that time, and the interaction proceeds as follows (a deterministic Lean caricature of this loop is sketched after the list):

  • the algorithm chooses an arm \(A_{t+1}\) sampled according to its policy \(\pi _t(H_t)\),

  • the bandit generates a reward \(R_{t+1}\) according to the distribution \(\nu (A_{t+1})\),

  • the history is updated to \(H_{t+1} = ((A_0, R_0), \ldots , (A_{t+1}, R_{t+1}))\).
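To make the loop concrete, here is a minimal, purely deterministic Lean caricature of the interaction. The names `Alg`, `step` and `history` are illustrative and not part of the formalization: draws from \(P_0\), \(\pi _t(H_t)\) and \(\nu \) are replaced by plain functions, and all measurability considerations are ignored.

```lean
-- Deterministic caricature of the algorithm-bandit interaction
-- (illustrative names; sampling is replaced by plain functions).
structure Alg (A R : Type) where
  first  : A                   -- stands in for a draw from P₀
  policy : List (A × R) → A    -- stands in for a draw from πₜ(Hₜ)

-- One round: choose an arm from the history, receive a reward from the
-- environment, and append the pair to the history.
def step {A R : Type} (alg : Alg A R) (env : A → R)
    (h : List (A × R)) : List (A × R) :=
  let a := alg.policy h
  h ++ [(a, env a)]

-- History after time t, starting from the first pull at time 0.
def history {A R : Type} (alg : Alg A R) (env : A → R) : Nat → List (A × R)
  | 0     => [(alg.first, env alg.first)]
  | t + 1 => step alg env (history alg env t)
```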

We now want to define a probability space on which we can study the sequences of arms and rewards, and formulate probabilistic statements about the interaction between the algorithm and the bandit.

Definition 3.2 Bandit probability space

As in Definition 2.13, an algorithm \((\pi , P_0)\) and a bandit \(\nu \) together define a probability distribution \(\mathbb {P}_{\mathcal{T}}\) on the space \(\Omega _{\mathcal{T}} := (\mathcal{A} \times \mathbb {R})^{\mathbb {N}}\), the space of infinite sequences of arms and rewards. We augment that probability space with a stream of rewards from each arm, independent of the bandit interaction, to get the probability space \((\Omega , \mathbb {P})\), where \(\Omega = \Omega _{\mathcal{T}} \times \mathbb {R}^{\mathbb {N} \times \mathcal{A}}\) and \(\mathbb {P} = \mathbb {P}_{\mathcal{T}} \otimes (\bigotimes _{n \in \mathbb {N}, a \in \mathcal{A}} \nu (a))\).

3.2 The array model of rewards

Using the Ionescu-Tulcea theorem, we previously built a probability space on which the sequence of arms and rewards generated by the interaction between the algorithm and the bandit can be defined. From Theorem 2.9, we know that the law of that sequence does not depend on the probability space used to define it. Nonetheless, we now build an alternative model of the rewards, on which it will be easier to prove concentration inequalities. By the uniqueness of the law, these statements will then transfer to any algorithm-environment interaction.

In the “array model”, we consider a probability space carrying an infinite array of rewards for each arm, all independent of each other. When pulling an arm, the algorithm sees the next previously unseen reward for that arm in the array.

Definition 3.3

Let \(I = [0,1]\) and let \(P_U\) be the uniform distribution on \(I\). We define the probability space \((\Omega _{\mathcal{A}}, P_{\mathcal{A}})\), where

\begin{align*} \Omega _{\mathcal{A}} & := I^{\mathbb {N}} \times \mathcal{R}^{\mathbb {N} \times \mathcal{A}} \: , \\ P_{\mathcal{A}} & := \left( \bigotimes _{n \in \mathbb {N}} P_U \right) \otimes \left( \bigotimes _{n \in \mathbb {N}, a \in \mathcal{A}} \nu (a) \right) \: . \end{align*}
Definition 3.4

Let \(\mathfrak {A}\) be an algorithm with action space \(\mathcal{A}\) and reward space \(\mathcal{R}\), policy \(\pi \) and initial distribution \(P_0\). For \(\mathcal{A}\) and \(\mathcal{R}\) standard Borel spaces, there exist a measurable function \(f'_0 : I \to \mathcal{A}\) and jointly measurable functions \(f_t : (\mathcal{A} \times \mathcal{R})^{t+1} \times I \to \mathcal{A}\) such that

  • the law of \(f'_0\) under \(P_U\) is \(P_0\),

  • for every history \(h_t \in (\mathcal{A} \times \mathcal{R})^{t+1}\), the law of \(f_t(h_t, \cdot )\) under \(P_U\) is \(\pi _t(h_t)\).

The history, actions and rewards on the array model probability space \((\Omega _{\mathcal{A}}, P_{\mathcal{A}})\) are defined as follows (a Lean sketch of the index bookkeeping follows the list):

  • the action at time \(0\) is \(A_0(\omega ) = f'_0(\omega _{1,0})\), the reward at time \(0\) is \(R_0(\omega ) = \omega _{2,0,A_0(\omega )}\), and the history at time \(0\) is \(H_0(\omega ) = (A_0(\omega ), R_0(\omega ))\),

  • for \(t \ge 0\), the action at time \(t+1\) is \(A_{t+1}(\omega ) = f_t(H_t(\omega ), \omega _{1,t+1})\), the reward at time \(t+1\) is \(R_{t+1}(\omega ) = \omega _{2,N_{t+1,A_{t+1}(\omega )}(\omega ),A_{t+1}(\omega )}\), where \(N_{t+1,a}(\omega ) = \sum _{s=0}^{t} \mathbb {I}\{ A_s(\omega ) = a\} \) is the number of pulls of arm \(a\) up to and including time \(t\), and the history at time \(t+1\) is \(H_{t+1}(\omega ) = (H_t(\omega ), (A_{t+1}(\omega ), R_{t+1}(\omega )))\).
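The index bookkeeping is the delicate part of this definition, so here is a minimal Lean sketch of it. The names `pulls` and `reward` are illustrative, with the action sequence and the reward array given as plain functions: each pull of an arm reads the next fresh entry of that arm's column.

```lean
-- pulls acts t a = number of times arm a appears among the first t
-- actions (times 0, ..., t−1); this is the index N_{t,a}.
def pulls {A : Type} [DecidableEq A] (acts : Nat → A) (t : Nat) (a : A) : Nat :=
  ((List.range t).filter (fun s => acts s = a)).length

-- Reward at time t in the array model: arm a = acts t has been pulled
-- `pulls acts t a` times so far, so it reads that entry of its column.
def reward {A R : Type} [DecidableEq A] (array : Nat → A → R)
    (acts : Nat → A) (t : Nat) : R :=
  array (pulls acts t (acts t)) (acts t)
```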

The goal of this section is to show that \((\Omega _{\mathcal{A}}, P_{\mathcal{A}})\) with the actions and rewards defined above is an algorithm-environment sequence as in Definition 2.5.

3.2.1 Measurability

TODO: some of those results are proved for \(\mathcal{A}\) countable. Add that assumption where needed.

Remark 3.6 Proving measurability with respect to a sub-sigma-algebra

In this section, we often need to prove that a random variable \(X : \Omega _{\mathcal{A}} \to \mathcal{X}\) is measurable with respect to the sigma-algebra \(\sigma ((\omega _p)_{p \in S})\) generated by a subset \(S\) of the independent coordinates defining \(\Omega _{\mathcal{A}}\). In most cases, this holds because \(X\) is defined (possibly recursively, in a complicated way) using only those coordinates. However, it might be difficult to exhibit an explicit function \(f\) such that \(X = f((\omega _p)_{p \in S})\).

Here is a general strategy to prove such measurability results in Lean:

  1. Prove that \(X\) is measurable with respect to the full sigma-algebra on \(\Omega _{\mathcal{A}}\).

  2. Prove a congruence lemma: for any \(\omega , \omega ' \in \Omega _{\mathcal{A}}\), if \(\omega _p = \omega '_p\) for all \(p \in S\), then \(X(\omega ) = X(\omega ')\).

  3. Define \(g : (\prod _{p \in S} \Omega _p) \to \Omega _{\mathcal{A}}\) by \(g((\omega _p)_{p \in S}) = \omega '\), where \(\omega '_p = \omega _p\) for \(p \in S\) and \(\omega '_p\) is some fixed value for \(p \notin S\).

  4. Write \(X = X \circ g \circ \mathrm{proj}_S\), where \(\mathrm{proj}_S : \Omega _{\mathcal{A}} \to \prod _{p \in S} \Omega _p\) is the projection on the coordinates in \(S\); this equality holds by the congruence lemma of step 2, since \(g(\mathrm{proj}_S(\omega ))\) agrees with \(\omega \) on all coordinates in \(S\).

  5. Conclude that \(X\) is the composition of a measurable function \(X \circ g\) with the random variable \(\mathrm{proj}_S\) generating the sub-sigma-algebra, and is thus measurable with respect to that sub-sigma-algebra (sketched in Lean below).
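A minimal Lean sketch of this final composition step, under simplified assumptions: the names `proj`, `F` and `X` are illustrative, and the sketch relies on Mathlib's `measurable_iff_comap_le` for the measurability of the projection with respect to its own comap sigma-algebra.

```lean
import Mathlib

-- Step 5 of the strategy, in isolation (illustrative names): if `X`
-- factors through the projection `proj` via a measurable map `F` (the
-- factorization `X = F ∘ proj` is what steps 2-4 construct), then `X`
-- is measurable for the sigma-algebra generated by `proj`, which is
-- the `comap` sigma-algebra.
example {Ω ΩS 𝒳 : Type*} [mS : MeasurableSpace ΩS] [MeasurableSpace 𝒳]
    (proj : Ω → ΩS) (F : ΩS → 𝒳) (X : Ω → 𝒳)
    (hF : Measurable F) (hX : X = F ∘ proj) :
    Measurable[mS.comap proj] X := by
  subst hX
  -- `proj` is measurable for its own comap sigma-algebra by definition.
  exact hF.comp (measurable_iff_comap_le.mpr le_rfl)
```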

Lemma 3.7

\(H_t\), \(N_{t,A_t}\), \(A_t\) and \(R_t\) are measurable for all \(t \in \mathbb {N}\).

Proof
Lemma 3.8 Congruence for the history

Let \(\omega , \omega ' \in \Omega _{\mathcal{A}}\) and \(t \in \mathbb {N}\). Suppose that

\begin{align*} \forall s \le t, \ \omega _{1,s} & = \omega '_{1,s} \: , \\ \forall a \in \mathcal{A}, \forall s {\lt} N_{t+1,a}(\omega ), \ \omega _{2,s,a} & = \omega '_{2,s,a} \: . \end{align*}

Then \(H_t(\omega ) = H_t(\omega ')\).

Proof
Lemma 3.9 Congruence for the number of pulls and action

Let \(\omega , \omega ' \in \Omega _{\mathcal{A}}\), \(t, m \in \mathbb {N}\) and \(a \in \mathcal{A}\). Suppose that

\begin{align*} \omega _{1} & = \omega '_{1} \: , \\ \forall s {\lt} m, \ \omega _{2,s,a} & = \omega '_{2,s,a} \: , \\ \forall b \ne a, \forall s \in \mathbb {N}, \ \omega _{2,s,b} & = \omega '_{2,s,b} \: . \end{align*}

Then \((N_{t+1, a}(\omega ) = m \wedge A_{t+1}(\omega ) = a) \iff (N_{t+1, a}(\omega ') = m \wedge A_{t+1}(\omega ') = a)\).

Proof
Definition 3.10

We define the following functions on \(\Omega _{\mathcal{A}}\):

\begin{align*} F_{1, t}(\omega ) & = ((\omega _{1,s})_{s \le t}, (\omega _{2,s})_{s \in \mathbb {N}}) \: , \\ F_{2, a, t}(\omega ) & = ((\omega _{1,s})_{s \in \mathbb {N}}, (\omega _{2, \min \{ s,N_{t+1,a}(\omega )-1\} , a})_{s \in \mathbb {N}}, (\omega _{2, s, b})_{s \in \mathbb {N}, b \ne a}) \: , \\ F_{2, a}^m(\omega ) & = ((\omega _{1,s})_{s \in \mathbb {N}}, (\omega _{2, \min \{ s,m-1\} , a})_{s \in \mathbb {N}}, (\omega _{2, s, b})_{s \in \mathbb {N}, b \ne a}) \: . \end{align*}

In the definition of \(F_{2, a, t}\) and \(F_{2, a}^m\), if \(N_{t+1,a}(\omega ) = 0\) (resp. \(m = 0\)), then the second component is a constant sequence equal to an arbitrary value.

\(F_{1, t}(\omega )\) contains all the information in \(\omega \) except for the action selection randomness after time \(t\).

\(F_{2, a, t}(\omega )\) contains all the information in \(\omega \) except the rewards for arm \(a\) indexed by \(N_{t+1,a}(\omega )\) or more. \(F_{2, a}^m(\omega )\) is similar, but removes the rewards for arm \(a\) indexed by \(m\) or more.
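As an illustration, here is a Lean sketch of \(F_{2, a}^m\) alone, with simplified hypothetical types (reward columns as plain functions, `Float` standing in for both \(I\) and \(\mathbb {R}\), and the arms other than \(a\) gathered in a subtype). The truncation \(\min \{ s, m-1\} \) ensures that no entry of arm \(a\)'s column at index \(m\) or above is ever read; for \(m = 0\) the sketch degenerates to repeating the \(0\)-th entry, where the definition above uses an arbitrary constant instead.

```lean
-- Sketch of F_{2,a}^m: keep the action randomness ω₁ and the other
-- arms' reward columns, but cap the index into arm a's column at m − 1.
def F2am {A R : Type} (a : A) (m : Nat)
    (ω₁ : Nat → Float) (ω₂ : Nat → A → R) :
    (Nat → Float) × (Nat → R) × (Nat → {b : A // b ≠ a} → R) :=
  (ω₁, fun s => ω₂ (min s (m - 1)) a, fun s b => ω₂ s b.1)
```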

Lemma 3.11

For all \(t \in \mathbb {N}\), \(H_t\) is measurable with respect to the sigma-algebra generated by \(F_{1, t}\), and with respect to the sigma-algebra generated by \(F_{2, a, t}\) for any arm \(a \in \mathcal{A}\).

Proof

Lemma 3.12

\(A_{t+1}\) is measurable with respect to the sigma-algebra generated by \(F_{2, a, t}\) for any arm \(a \in \mathcal{A}\).

Proof

Lemma 3.13

\(N_{t+1,a}\) is measurable with respect to the sigma-algebra generated by \(F_{2, a, t}\).

Proof

Lemma 3.14

For \(t, m \in \mathbb {N}\) and \(a \in \mathcal{A}\), the indicator function \(\mathbb {I}\{ N_{t+1,a} = m \wedge A_{t+1} = a\} : \Omega _{\mathcal{A}} \to \{ 0, 1\} \) is measurable with respect to the sigma-algebra generated by \(F_{2, a}^m\).

Proof

Lemma 3.15

For \(t \in \mathbb {N}\), the function \(N_{t+1, A_{t+1}}\) is measurable with respect to the sigma-algebra generated by \(H_t\) and \(A_{t+1}\).

Proof

3.2.2 Independence

Lemma 3.16

\(\omega \mapsto \omega _{1, t+1}\) is independent of \(F_{1, t}\).

Proof

Lemma 3.17

\(\omega \mapsto \omega _{1, t+1}\) is independent of \(H_t\).

Proof
Lemma 3.18

For \(a \in \mathcal{A}\) and \(m \in \mathbb {N}\), \(\omega \mapsto \omega _{2, m, a}\) is independent of \(F_{2, a}^m\).

Proof

Lemma 3.19

For \(a \in \mathcal{A}\) and \(t, m \in \mathbb {N}\), \(\omega \mapsto \omega _{2, m, a}\) is independent of the indicator function \(\mathbb {I}\{ N_{t+1,a} = m \wedge A_{t+1} = a\} \).

Proof

Lemma 3.20

For \(a \in \mathcal{A}\) and \(t, m \in \mathbb {N}\), \(\omega \mapsto \omega _{2, m, a}\) is independent of \(H_t\) given that \(N_{t+1,a} = m\) and \(A_{t+1} = a\).

Proof

3.2.3 Laws

Lemma 3.21

The law of \(A_0\) in the array model is \(P_0\).

Proof
Lemma 3.22

In the array model, \(P_{\mathcal{A}}[R_0 \mid A_0] = \nu \).

Proof

Lemma 3.23

In the array model, \(P_{\mathcal{A}}[A_{t+1} \mid H_t] = \pi _t\).

Proof

Lemma 3.24

In the array model, \(P_{\mathcal{A}}[R_{t+1} \mid N_{t+1,A_{t+1}}, A_{t+1}] = \nu \).

Proof

Lemma 3.25

In the array model, \(P_{\mathcal{A}}[R_{t+1} \mid H_t, A_{t+1}, N_{t+1,A_{t+1}}] = \nu \).

Proof
Lemma 3.26

For \(t \ge 0\), \(R_{t+1} \perp \! \! \! \! \perp H_t \mid A_{t+1}, N_{t+1, A_{t+1}}\).

Proof

Lemma 3.27

In the array model, \(P_{\mathcal{A}}[R_{t+1} \mid H_t, A_{t+1}] = \nu \).

Proof

Theorem 3.28

The actions and rewards defined on the array model probability space \((\Omega _{\mathcal{A}}, P_{\mathcal{A}})\) form an algorithm-environment sequence for the algorithm \(\mathfrak {A}\) and the bandit \(\nu \).

Proof

The four conditions of Definition 2.5 are satisfied by Lemmas 3.21, 3.22, 3.23 and 3.27.

3.3 Law of the nth pull

This section describes the law of \(Y_{n, a}\), the \(n\)-th reward obtained from arm \(a \in \mathcal{A}\) (Definition 2.33); \(T_{n, a}\) denotes the time at which arm \(a\) is pulled for the \(n\)-th time.

We augment the probability space \(\Omega \) on which we have an algorithm-environment sequence with \(\Omega ' = \mathbb {R}^{\mathbb {N} \times \mathcal{A}}\), on which we put the product measure \(\bigotimes _{n \in \mathbb {N}, a \in \mathcal{A}} \nu (a)\). Writing \(Z_{n,a}\) for the \((n, a)\) coordinate on \(\Omega '\), the law of \(Z_{n,a}\) under that measure is \(\nu (a)\).

Lemma 3.29

The function \(\mathbb {I}\{ T_{n,a} = t\} : \Omega \to \{ 0, 1\} \) is measurable with respect to the sigma-algebra generated by \((H_{t-1}, A_t)\).

Proof

Lemma 3.30

For \(t {\gt} 0\), \(R_t \perp \! \! \! \! \perp \mathbb {I}\{ T_{n, a} = t\} \mid A_t\).

Proof

\(\mathbb {I}\{ T_{n, a} = t\} \) is measurable with respect to the sigma-algebra generated by \((H_{t-1}, A_t)\) by Lemma 3.29.

It thus suffices to show that \(R_t \perp \! \! \! \! \perp (H_{t-1}, A_t) \mid A_t\), which is implied by \(R_t \perp \! \! \! \! \perp H_{t-1} \mid A_t\) (Lemma A.1), which is Lemma 2.11.

Lemma 3.31

Let \(n {\gt} 0\), \(t \in \mathbb {N}\) and suppose that \(\mathbb {P}(T_{n, a} = t) {\gt} 0\). Then \(\mathcal{L}(R_t \mid T_{n, a} = t) = \nu (a)\).

Proof

First, if \(T_{n, a} = t\), then \(A_t = a\) (Lemma 2.31), so that \(\mathcal{L}(R_t \mid T_{n, a} = t) = \mathcal{L}(R_t \mid T_{n, a} = t, A_t = a)\).

Then, using first the independence from Lemma 3.30 and then the conditional distribution from Lemma 2.10, we have

\begin{align*} \mathcal{L}(R_t \mid T_{n, a} = t, A_t = a) & = \mathcal{L}(R_t \mid A_t = a) = \nu (a) \: . \end{align*}
Lemma 3.32

For a random variable \(X\) on a countable space with the discrete sigma-algebra, \(\mathcal{L}(Y \mid X) = (x \mapsto \mathcal{L}(Y \mid X = x))\), \((X_*P)\)-almost surely. Furthermore, that almost sure equality means that for all \(x\) such that \(P(X = x) {\gt} 0\), we have \(\mathcal{L}(Y \mid X = x) = \mathcal{L}(Y \mid X)(x)\).

Proof
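This lemma will be applied with \(X = T_{n,a}\), valued in the countable space \(\mathbb {N} \cup \{ \infty \} \): it lets us compute \(\mathcal{L}(Y_{n,a} \mid T_{n,a})\) pointwise on the events \(\{ T_{n,a} = t\} \) of positive probability, as done in the next proof.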

Lemma 3.33

For \(n {\gt} 0\) and \(a \in \mathcal{A}\), \(\mathcal{L}(Y_{n,a} \mid T_{n,a}) = \nu (a)\) (in which the measure on the r.h.s. is seen as a constant kernel).

Proof

It suffices to show that for all \(t \in \mathbb {N} \cup \{ \infty \} \) such that \(\mathbb {P}(T_{n, a} = t) {\gt} 0\), the law of \(Y_{n,a}\) conditioned on \(T_{n,a} = t\) is \(\nu (a)\).

If \(t {\lt} \infty \), then \(\mathcal{L}(Y_{n, a} \mid T_{n, a} = t) = \mathcal{L}(R_t \mid T_{n, a} = t) = \nu (a)\) by Lemma 3.31.

If \(t = \infty \), then \(\mathcal{L}(Y_{n, a} \mid T_{n, a} = \infty ) = \mathcal{L}(Z_{n, a} \mid T_{n, a} = \infty )\). By independence of \(Z_{n,a}\) and \(T_{n, a}\), this is just \(\nu (a)\), the law of \(Z_{n,a}\).

Lemma 3.34

For \(n {\gt} 0\) and \(a \in \mathcal{A}\), \(\mathcal{L}(Y_{n,a}) = \nu (a)\).

Proof

The law of \(Y_{n,a}\) is given by the composition \(\mathcal{L}(Y_{n, a}) = \mathcal{L}(Y_{n, a} \mid T_{n, a}) \circ \mathcal{L}(T_{n, a})\). By Lemma 3.33, \(\mathcal{L}(Y_{n, a} \mid T_{n, a})\) is the constant kernel equal to \(\nu (a)\), and composing a constant kernel with the probability measure \(\mathcal{L}(T_{n, a})\) returns that constant: \(\mathcal{L}(Y_{n, a}) = \nu (a)\).

3.4 Regret and other bandit quantities

Definition 3.35 Arm means

For an arm \(a \in \mathcal{A}\), we denote by \(\mu _a\) the mean of the rewards for that arm, that is \(\mu _a = \nu (a)[\mathrm{id}]\). We denote by \(\mu ^*\) the mean of the best arm, that is \(\mu ^* = \max _{a \in \mathcal{A}} \mu _a\).

Definition 3.36 Regret

The regret \(R_T\) of a sequence of arms \(A_0, \ldots , A_{T-1}\) after \(T\) pulls is the difference between the expected cumulative reward of always playing an optimal arm and the expected cumulative reward of the sequence:

\begin{align*} R_T = T \mu ^* - \sum _{t=0}^{T-1} \mu _{A_t} \: . \end{align*}
Definition 3.37

For an arm \(a \in \mathcal{A}\), its gap is defined as the difference between the mean of the best arm and the mean of that arm: \(\Delta _a = \mu ^* - \mu _a\).
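For example, with two arms of means \(\mu _1 = 0.9\) and \(\mu _2 = 0.5\), we have \(\mu ^* = 0.9\), \(\Delta _1 = 0\) and \(\Delta _2 = 0.4\): a sequence of \(T\) pulls that plays arm \(2\) ten times incurs regret \(R_T = 10 \times 0.4 = 4\), as the decomposition of Lemma 3.39 below makes explicit.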

Lemma 3.38

Let \(f : \mathcal{A} \to \mathbb {R}\) be a function on the arms. For all \(t \in \mathbb {N}\),

\begin{align*} \sum _{a \in \mathcal{A}} N_{t,a} f(a) = \sum _{s=0}^{t-1} f(A_s) \: . \end{align*}
Proof
\begin{align*} \sum _{a \in \mathcal{A}} N_{t,a} f(a) & = \sum _{a \in \mathcal{A}} \sum _{s=0}^{t-1} \mathbb {I}\{ A_s = a\} f(a) \\ & = \sum _{s=0}^{t-1} \sum _{a \in \mathcal{A}} \mathbb {I}\{ A_s = a\} f(a) \\ & = \sum _{s=0}^{t-1} f(A_s) \: . \end{align*}

Lemma 3.39

For \(\mathcal{A}\) finite, the regret \(R_T\) can be expressed as a sum over the arms, weighting each gap by the number of pulls:

\begin{align*} R_T = \sum _{a \in \mathcal{A}} N_{T,a} \Delta _a \: . \end{align*}
Proof

Apply Lemma 3.38 with \(f(a) = \Delta _a\) to obtain:

\begin{align*} \sum _{a \in \mathcal{A}} N_{T,a} \Delta _a & = \sum _{s=0}^{T-1} \Delta _{A_s} \\ & = \sum _{s=0}^{T-1} \mu ^* - \sum _{s=0}^{T-1} \mu _{A_s} \\ & = R_T \: . \end{align*}
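As a quick sanity check of this identity, here is a small Lean computation on hypothetical values (`Float` arithmetic, two arms with means \(0.9\) and \(0.5\)); both sides of Lemma 3.39 evaluate to the same number.

```lean
-- Toy check of R_T = T μ* − Σ μ_{A_s} = Σ_a N_{T,a} Δ_a on made-up data.
def μ : Nat → Float := fun a => if a = 0 then 0.9 else 0.5
def best : Float := 0.9
def acts : List Nat := [0, 1, 0, 1, 1]   -- arm pulled at times 0, ..., 4

-- Regret from Definition 3.36.
def regretDef : Float :=
  acts.length.toFloat * best - acts.foldl (fun s a => s + μ a) 0.0

-- Regret from the gap decomposition of Lemma 3.39.
def regretGap : Float :=
  (acts.count 0).toFloat * (best - μ 0) + (acts.count 1).toFloat * (best - μ 1)

#eval (regretDef, regretGap)  -- both print as 1.200000 (up to Float rounding)
```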