3 Stochastic multi-armed bandits
3.1 Algorithm, bandit and probability space
A bandit algorithm is an algorithm in the sense of Definition 2.1. We call the actions arms and the observations rewards.
The first arm pulled by the algorithm is sampled from \(P_0\); the arm pulled at time \(1\) is sampled from \(\pi _0(H_0)\), where \(H_0 \in \mathcal{A} \times \mathcal{R}\) consists of the first arm pulled and the first observation received; and so on.
A stochastic bandit is simply a reward distribution for each arm: a Markov kernel \(\nu : \mathcal{A} \rightsquigarrow \mathbb {R}\), the conditional distribution of the rewards given the arm pulled. It is a stationary environment in which the observation space is \(\mathcal{R} = \mathbb {R}\).
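For example, a \(K\)-armed Bernoulli bandit is obtained by taking \(\mathcal{A} = \{ 1, \ldots , K\} \) and \(\nu (a) = \mathrm{Ber}(\mu _a)\) for means \(\mu _1, \ldots , \mu _K \in [0,1]\): pulling arm \(a\) returns a reward in \(\{ 0, 1\} \) equal to \(1\) with probability \(\mu _a\).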
An algorithm can interact with a bandit to produce a sequence of arms and rewards: after a time \(t\), the history \(H_t \in (\mathcal{A} \times \mathcal{R})^{t+1}\) contains the arms pulled and rewards received up to that time,
the algorithm chooses an arm \(A_{t+1}\) sampled according to its policy \(\pi _t(H_t)\),
the bandit generates a reward \(R_{t+1}\) according to the distribution \(\nu (A_{t+1})\),
the history is updated to \(H_{t+1} = ((A_0, R_0), \ldots , (A_{t+1}, R_{t+1}))\).
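In distributional terms, the interaction can be summarized as
\[ A_0 \sim P_0 , \qquad A_{t+1} \mid H_t \sim \pi _t(H_t) , \qquad R_t \mid A_t \sim \nu (A_t) \quad \text{for all } t \in \mathbb {N} . \]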
We now want to define a probability space on which we can study the sequences of arms and rewards, and formulate probabilistic statements about the interaction between the algorithm and the bandit.
As in Definition 2.13, an algorithm \((\pi , P_0)\) and a bandit \(\nu \) together define a probability distribution \(\mathbb {P}_{\mathcal{T}}\) on the space \(\Omega _{\mathcal{T}} := (\mathcal{A} \times \mathbb {R})^{\mathbb {N}}\), the space of infinite sequences of arms and rewards. We augment that probability space with a stream of rewards from each arm, independent of the bandit interaction, to get the probability space \((\Omega , \mathbb {P})\), where \(\Omega = \Omega _{\mathcal{T}} \times \mathbb {R}^{\mathbb {N} \times \mathcal{A}}\) and \(\mathbb {P} = \mathbb {P}_{\mathcal{T}} \otimes (\bigotimes _{n \in \mathbb {N}, a \in \mathcal{A}} \nu (a))\).
3.2 The array model of rewards
We previously built a probability space on which we can define the sequence of arms and rewards generated by the interaction between the algorithm and the bandit, using the Ionescu-Tulcea theorem. From Theorem 2.9, we know that the law of the sequence of arms and rewards does not depend on the probability space used to define them. Nonetheless, we now build an alternative model of the rewards, on which it will be easier to prove concentration inequalities. By the uniqueness of the law, these statements will then transfer to any algorithm-environment interaction.
In the “array model”, we consider a probability space on which we have an infinite array of rewards from each arm, independent of each other. When pulling an arm, the algorithm sees the next previously unseen reward from that arm in the array.
Let \(I = [0,1]\) and let \(P_U\) be the uniform distribution on \(I\). We define the probability space \((\Omega _{\mathcal{A}}, P_{\mathcal{A}})\), where
\[ \Omega _{\mathcal{A}} := I^{\mathbb {N}} \times \mathbb {R}^{\mathbb {N} \times \mathcal{A}} , \qquad P_{\mathcal{A}} := \Big ( \bigotimes _{t \in \mathbb {N}} P_U \Big ) \otimes \Big ( \bigotimes _{n \in \mathbb {N}, a \in \mathcal{A}} \nu (a) \Big ) . \]
For \(\omega \in \Omega _{\mathcal{A}}\), we write \(\omega _{1,t} \in I\) for the coordinates of its first component and \(\omega _{2,n,a} \in \mathbb {R}\) for the coordinates of its second component.
Let \(\mathfrak {A}\) be an algorithm with action space \(\mathcal{A}\) and reward space \(\mathcal{R}\), policy \(\pi \) and initial distribution \(P_0\). For \(\mathcal{A}\) and \(\mathcal{R}\) standard Borel spaces, there exist jointly measurable functions \(f'_0 : I \to \mathcal{A}\) and \(f_t : (\mathcal{A} \times \mathcal{R})^{t+1} \times I \to \mathcal{A}\) such that
the law of \(f'_0\) is \(P_0\),
for every history \(h_t \in (\mathcal{A} \times \mathcal{R})^{t+1}\), the law of \(f_t(h_t, \cdot )\) is \(\pi _t(h_t)\).
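Equivalently, these two conditions state that \(f'_0\) and \(f_t(h_t, \cdot )\) push the uniform distribution forward to the prescribed arm distributions:
\[ (f'_0)_* P_U = P_0 , \qquad (f_t(h_t, \cdot ))_* P_U = \pi _t(h_t) . \]
The uniform coordinates \(\omega _{1,t}\) thus serve as the source of randomness for the arm selection.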
The history, actions and rewards on the array model probability space \((\Omega _{\mathcal{A}}, P_{\mathcal{A}})\) are defined as follows:
the action at time \(0\) is \(A_0(\omega ) = f'_0(\omega _{1,0})\), the reward at time \(0\) is \(R_0(\omega ) = \omega _{2,0,A_0(\omega )}\), and the history at time \(0\) is \(H_0(\omega ) = (A_0(\omega ), R_0(\omega ))\),
for \(t \ge 0\), the action at time \(t+1\) is \(A_{t+1}(\omega ) = f_t(H_t(\omega ), \omega _{1,t+1})\), the reward at time \(t+1\) is \(R_{t+1}(\omega ) = \omega _{2,N_{t+1,A_{t+1}(\omega )},A_{t+1}(\omega )}\), and the history at time \(t+1\) is \(H_{t+1}(\omega ) = (H_t(\omega ), (A_{t+1}(\omega ), R_{t+1}(\omega )))\).
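For example, if \(A_0(\omega ) = a\) then \(R_0(\omega ) = \omega _{2,0,a}\); if the algorithm pulls \(a\) again at time \(1\), it sees the next previously unseen reward of that arm, \(R_1(\omega ) = \omega _{2,1,a}\), while if it pulls another arm \(b\) it sees \(R_1(\omega ) = \omega _{2,0,b}\).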
The goal of this section is to show that the actions and rewards defined above on \((\Omega _{\mathcal{A}}, P_{\mathcal{A}})\) form an algorithm-environment sequence as in Definition 2.5.
3.2.1 Measurability
TODO: some of those results are proved for \(\mathcal{A}\) countable. Add that assumption where needed.
In this section, we often need to prove that a random variable \(X : \Omega _{\mathcal{A}} \to \mathcal{X}\) is measurable with respect to the sigma-algebra \(\sigma ((\omega _p)_{p \in S})\) generated by a subset of the independent random variables defining the probability space \(\Omega _{\mathcal{A}}\), where \(S\) is a subset of the indices of those random variables. In most cases, this holds because \(X\) is defined (possibly recursively, in a complicated way) using only those random variables. However, it might be difficult to exhibit an explicit function \(f\) such that \(X = f((\omega _p)_{p \in S})\).
Here is a general strategy to prove such measurability results in Lean (a sketch in Lean is given after the list):
Prove that \(X\) is measurable with respect to the full sigma-algebra on \(\Omega _{\mathcal{A}}\).
Prove a congruence lemma: for any \(\omega , \omega ' \in \Omega _{\mathcal{A}}\), if \(\omega _p = \omega '_p\) for all \(p \in S\), then \(X(\omega ) = X(\omega ')\).
Define \(g : (\prod _{p \in S} \Omega _p) \to \Omega _{\mathcal{A}}\) by \(g((\omega _p)_{p \in S}) = \omega '\), where \(\omega '_p = \omega _p\) for \(p \in S\) and \(\omega '_p\) is some fixed value for \(p \notin S\).
Write \(X = X \circ g \circ \mathrm{proj}_S\), where \(\mathrm{proj}_S : \Omega _{\mathcal{A}} \to \prod _{p \in S} \Omega _p\) is the projection on the coordinates in \(S\).
Conclude that \(X\) is the composition of a measurable function \((X \circ g)\) and the random variable generating the sub-sigma-algebra, and thus is measurable with respect to that sub-sigma-algebra.
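As an illustration, here is a minimal Lean sketch of this strategy on a generic product space. The statement and the names (X, S, ω₀, hX, hcongr) are illustrative only and not those of the development; the concrete congruence lemmas below supply the hypothesis playing the role of hcongr for \(H_t\), \(N_{t+1,a}\) and the indicator functions.

import Mathlib

open MeasurableSpace

variable {ι E : Type*} {Ω : ι → Type*} [∀ i, MeasurableSpace (Ω i)] [MeasurableSpace E]

/- Sketch: if `X` is measurable for the product sigma-algebra and only depends
on the coordinates in `S` (congruence hypothesis `hcongr`), then `X` is
measurable with respect to the sigma-algebra generated by the projection on
the coordinates in `S`. -/
example (S : Set ι) (ω₀ : ∀ i, Ω i) (X : (∀ i, Ω i) → E) (hX : Measurable X)
    (hcongr : ∀ ω ω' : (∀ i, Ω i), (∀ i ∈ S, ω i = ω' i) → X ω = X ω') :
    Measurable[comap (fun (ω : ∀ i, Ω i) (i : S) => ω i) inferInstance] X := by
  classical
  -- step 3: `g` extends a partial assignment on `S` by the fixed values `ω₀`
  set g : (∀ i : S, Ω i) → ∀ i, Ω i :=
    fun ω i => if h : i ∈ S then ω ⟨i, h⟩ else ω₀ i with hg
  have hg_meas : Measurable g := by
    refine measurable_pi_lambda _ fun i => ?_
    by_cases h : i ∈ S
    · simp only [hg, dif_pos h]
      exact measurable_pi_apply _
    · simp only [hg, dif_neg h]
      exact measurable_const
  -- the projection on the coordinates in `S` is measurable for the comap
  -- sigma-algebra, by definition of `comap`
  have hproj : Measurable[comap (fun (ω : ∀ i, Ω i) (i : S) => ω i) inferInstance]
      fun (ω : ∀ i, Ω i) (i : S) => ω i := measurable_iff_comap_le.mpr le_rfl
  -- step 4: `X = (X ∘ g) ∘ proj_S`, by the congruence hypothesis
  have hXg : X = (X ∘ g) ∘ fun (ω : ∀ i, Ω i) (i : S) => ω i := by
    funext ω
    exact hcongr ω _ fun i hi => by simp [hg, dif_pos hi]
  -- step 5: conclude by composition
  rw [hXg]
  exact (hX.comp hg_meas).comp hproj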
\(H_t\), \(N_{t,A_t}\), \(A_t\) and \(R_t\) are measurable for all \(t \in \mathbb {N}\).
Let \(\omega , \omega ' \in \Omega _{\mathcal{A}}\) and \(t \in \mathbb {N}\). Suppose that
Then \(H_t(\omega ) = H_t(\omega ')\).
Let \(\omega , \omega ' \in \Omega _{\mathcal{A}}\), \(t, m \in \mathbb {N}\) and \(a \in \mathcal{A}\). Suppose that
Then \((N_{t+1, a}(\omega ) = m \wedge A_{t+1}(\omega ) = a) \iff (N_{t+1, a}(\omega ') = m \wedge A_{t+1}(\omega ') = a)\).
We define the following functions on \(\Omega _{\mathcal{A}}\):
In the definition of \(F_{2, a, t}\) and \(F_{2, a}^m\), if \(N_{t+1,a}(\omega ) = 0\) (resp. \(m = 0\)), then the second component is a constant sequence equal to an arbitrary value.
\(F_{1, t}(\omega )\) contains all the information in \(\omega \) except for the action selection randomness after time \(t\).
\(F_{2, a, t}(\omega )\) contains all the information in \(\omega \) except the rewards for arm \(a\) indexed by \(N_{t+1,a}(\omega )\) or more. \(F_{2, a}^m(\omega )\) is similar, but removes the rewards for arm \(a\) indexed by \(m\) or more.
For all \(t \in \mathbb {N}\), \(H_t\) is measurable with respect to the sigma-algebra generated by \(F_{1, t}\), and with respect to the sigma-algebra generated by \(F_{2, a, t}\) for any arm \(a \in \mathcal{A}\).
\(A_{t+1}\) is measurable with respect to the sigma-algebra generated by \(F_{2, a, t}\) for any arm \(a \in \mathcal{A}\).
\(N_{t+1,a}\) is measurable with respect to the sigma-algebra generated by \(F_{2, a, t}\).
For \(t, m \in \mathbb {N}\) and \(a \in \mathcal{A}\), the indicator function \(\mathbb {I}\{ N_{t+1,a} = m \wedge A_{t+1} = a\} : \Omega _{\mathcal{A}} \to \{ 0, 1\} \) is measurable with respect to the sigma-algebra generated by \(F_{2, a}^m\).
For \(t \in \mathbb {N}\), the function \(N_{t+1, A_{t+1}}\) is measurable with respect to the sigma-algebra generated by \(H_t\) and \(A_{t+1}\).
3.2.2 Independence
For \(t \in \mathbb {N}\), \(\omega \mapsto \omega _{1, t+1}\) is independent of \(F_{1, t}\).
For \(t \in \mathbb {N}\), \(\omega \mapsto \omega _{1, t+1}\) is independent of \(H_t\).
For \(a \in \mathcal{A}\) and \(m \in \mathbb {N}\), \(\omega \mapsto \omega _{2, m, a}\) is independent of \(F_{2, a}^m\).
For \(t, m \in \mathbb {N}\) and \(a \in \mathcal{A}\), \(\omega \mapsto \omega _{2, m, a}\) is independent of the indicator function \(\mathbb {I}\{ N_{t+1,a} = m \wedge A_{t+1} = a\} \).
For \(a \in \mathcal{A}\) and \(t, m \in \mathbb {N}\), \(\omega \mapsto \omega _{2, m, a}\) is independent of \(H_t\) given that \(N_{t+1,a} = m\) and \(A_{t+1} = a\).
3.2.3 Laws
The law of \(A_0\) in the array model is \(P_0\).
In the array model, \(P_{\mathcal{A}}[R_0 \mid A_0] = \nu \).
In the array model, \(P_{\mathcal{A}}[A_{t+1} \mid H_t] = \pi _t\).
In the array model, \(P_{\mathcal{A}}[R_{t+1} \mid N_{t+1,A_{t+1}}, A_{t+1}] = \nu \).
In the array model, \(P_{\mathcal{A}}[R_{t+1} \mid H_t, A_{t+1}, N_{t+1,A_{t+1}}] = \nu \).
For \(t \ge 0\), \(R_{t+1} \perp \! \! \! \! \perp H_t \mid A_{t+1}, N_{t+1, A_{t+1}}\).
In the array model, \(P_{\mathcal{A}}[R_{t+1} \mid H_t, A_{t+1}] = \nu \).
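One way to deduce the last statement from the previous ones: since \(N_{t+1, A_{t+1}}\) is a measurable function of \((H_t, A_{t+1})\), conditioning on \((H_t, A_{t+1})\) is the same as conditioning on \((H_t, A_{t+1}, N_{t+1, A_{t+1}})\), and the conditional independence of \(R_{t+1}\) and \(H_t\) then allows dropping \(H_t\):
\[ P_{\mathcal{A}}[R_{t+1} \mid H_t, A_{t+1}] = P_{\mathcal{A}}[R_{t+1} \mid H_t, A_{t+1}, N_{t+1,A_{t+1}}] = P_{\mathcal{A}}[R_{t+1} \mid A_{t+1}, N_{t+1,A_{t+1}}] = \nu . \]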
The actions and rewards defined on the array model probability space \((\Omega _{\mathcal{A}}, P_{\mathcal{A}})\) form an algorithm-environment sequence for the algorithm \(\mathfrak {A}\) and bandit \(\nu \).
3.3 Law of the nth pull
This section describes the law of \(Y_{n, a}\), the \(n^{th}\) reward obtained from an arm \(a \in \mathcal{A}\) (Definition 2.33).
We augment the probability space \(\Omega \) on which we have an algorithm-environment sequence with \(\Omega ' = \mathbb {R}^{\mathbb {N} \times \mathcal{A}}\), on which we put the product measure \(\bigotimes _{n \in \mathbb {N}, a \in \mathcal{A}} \nu (a)\). With that measure, the law of \(Z_{n,a}\), the \((n, a)\) coordinate of \(\Omega '\), is \(\nu (a)\).
For \(t {\gt} 0\), the function \(\mathbb {I}\{ T_{n,a} = t\} : \Omega \to \{ 0, 1\} \) is measurable with respect to the sigma-algebra generated by \((H_{t-1}, A_t)\).
For \(t {\gt} 0\), \(R_t \perp \! \! \! \! \perp \mathbb {I}\{ T_{n, a} = t\} \mid A_t\).
\(\mathbb {I}\{ T_{n, a} = t\} \) is measurable with respect to the sigma-algebra generated by \((H_{t-1}, A_t)\) by Lemma 3.29.
It thus suffices to show that \(R_t \perp \! \! \! \! \perp (H_{t-1}, A_t) \mid A_t\), which is implied by \(R_t \perp \! \! \! \! \perp H_{t-1} \mid A_t\) (Lemma A.1), which is Lemma 2.11.
Let \(n {\gt} 0\), \(t \in \mathbb {N}\) and suppose that \(P(T_{n, a} = t) {\gt} 0\). Then \(\mathcal{L}(R_t \mid T_{n, a} = t) = \nu (a)\).
First, if \(T_{n, a} = t\), then \(A_t = a\) (Lemma 2.31), so that \(\mathcal{L}(R_t \mid T_{n, a} = t) = \mathcal{L}(R_t \mid T_{n, a} = t, A_t = a)\).
Then, using first the independence from Lemma 3.30 and then the conditional distribution from Lemma 2.10, we have
\[ \mathcal{L}(R_t \mid T_{n, a} = t, A_t = a) = \mathcal{L}(R_t \mid A_t = a) = \nu (a) . \]
For a random variable \(X\) on a countable space with the discrete sigma-algebra, \(\mathcal{L}(Y \mid X) = (x \mapsto \mathcal{L}(Y \mid X = x))\), \((X_*P)\)-almost surely. Furthermore, that almost sure equality means that for all \(x\) such that \(P(X = x) {\gt} 0\), we have \(\mathcal{L}(Y \mid X = x) = \mathcal{L}(Y \mid X)(x)\).
For \(n {\gt} 0\) and \(a \in \mathcal{A}\), \(\mathcal{L}(Y_{n,a} \mid T_{n,a}) = \nu (a)\) (in which the measure on the r.h.s. is seen as a constant kernel).
It suffices to show that for all \(t \in \mathbb {N} \cup \{ \infty \} \) such that \(\mathbb {P}(T_{n, a} = t) {\gt} 0\), the law of \(Y_{n,a}\) conditioned on \(T_{n,a} = t\) is \(\nu (a)\).
If \(t {\lt} \infty \), then \(\mathcal{L}(Y_{n, a} \mid T_{n, a} = t) = \mathcal{L}(R_t \mid T_{n, a} = t) = \nu (a)\) by Lemma 3.31.
If \(t = \infty \), then \(\mathcal{L}(Y_{n, a} \mid T_{n, a} = \infty ) = \mathcal{L}(Z_{n, a} \mid T_{n, a} = \infty )\). By independence of \(Z_{n,a}\) and \(T_{n, a}\), this is just \(\nu (a)\), the law of \(Z_{n,a}\).
For \(n {\gt} 0\) and \(a \in \mathcal{A}\), \(\mathcal{L}(Y_{n,a}) = \nu (a)\).
The law of \(Y_{n,a}\) is given by \(\mathcal{L}(Y_{n, a}) = \mathcal{L}(Y_{n, a} \mid T_{n, a}) \circ \mathcal{L}(T_{n, a})\). By Lemma 3.33, \(\mathcal{L}(Y_{n, a} \mid T_{n, a}) = \nu (a)\), a constant kernel. Thus the composition is just \(\nu (a)\).
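Spelled out, that composition of a constant kernel with the law of \(T_{n, a}\) gives, for every measurable set \(B\),
\[ \mathcal{L}(Y_{n, a})(B) = \int \mathcal{L}(Y_{n, a} \mid T_{n, a})(t)(B) \, \mathrm{d}\mathcal{L}(T_{n, a})(t) = \int \nu (a)(B) \, \mathrm{d}\mathcal{L}(T_{n, a})(t) = \nu (a)(B) , \]
since \(\mathcal{L}(T_{n, a})\) is a probability measure.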
3.4 Regret and other bandit quantities
For an arm \(a \in \mathcal{A}\), we denote by \(\mu _a\) the mean of the rewards for that arm, that is \(\mu _a = \nu (a)[\mathrm{id}]\). We denote by \(\mu ^*\) the mean of the best arm, that is \(\mu ^* = \max _{a \in \mathcal{A}} \mu _a\).
The regret \(R_T\) of a sequence of arms \(A_0, \ldots , A_{T-1}\) after \(T\) pulls is the difference between the cumulative expected reward of always playing the best arm and the cumulative expected reward of the sequence:
\[ R_T = T \mu ^* - \sum _{t=0}^{T-1} \mu _{A_t} . \]
For an arm \(a \in \mathcal{A}\), its gap is defined as the difference between the mean of the best arm and the mean of that arm: \(\Delta _a = \mu ^* - \mu _a\).
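For example, for two arms with means \(\mu _1 = 0.5\) and \(\mu _2 = 0.3\), we have \(\mu ^* = 0.5\), \(\Delta _1 = 0\) and \(\Delta _2 = 0.2\): by the decomposition below, the regret increases by \(0.2\) each time arm \(2\) is pulled and is unchanged when arm \(1\) is pulled.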
Let \(f : \mathcal{A} \to \mathbb {R}\) be a function on the arms. For all \(t \in \mathbb {N}\),
\[ \sum _{s=0}^{t-1} f(A_s) = \sum _{a \in \mathcal{A}} N_{t, a} f(a) . \]
For \(\mathcal{A}\) finite, the regret \(R_T\) can be expressed as a sum over the arms and their gaps:
\[ R_T = \sum _{a \in \mathcal{A}} N_{T, a} \Delta _a . \]
Apply Lemma 3.38 with \(f(a) = \Delta _a\) to obtain
\[ R_T = T \mu ^* - \sum _{t=0}^{T-1} \mu _{A_t} = \sum _{t=0}^{T-1} \Delta _{A_t} = \sum _{a \in \mathcal{A}} N_{T, a} \Delta _a . \]