3 Stochastic multi-armed bandits
3.1 Algorithm, bandit and probability space
A bandit algorithm is an algorithm in the sense of Definition 2.1. We call the actions arms and the observations rewards.
The first arm pulled by the algorithm is sampled from \(P_0\); the arm pulled at time \(1\) is sampled from \(\pi _0(H_0)\), where \(H_0 \in \mathcal{A} \times \mathcal{R}\) consists of the first arm pulled and the first observation received; and so on.
A stochastic bandit is simply a reward distribution for each arm: a Markov kernel \(\nu : \mathcal{A} \rightsquigarrow \mathbb {R}\), which specifies the conditional distribution of the reward given the arm pulled.
An algorithm can interact with a bandit to produce a sequence of arms and rewards: after a time \(t\), the history \(H_t \in (\mathcal{A} \times \mathcal{R})^{t+1}\) contains the arms pulled and rewards received up to that time; the algorithm chooses an arm \(A_{t+1}\) sampled according to its policy \(\pi _t(H_t)\); the bandit generates a reward \(X_{t+1}\) according to the distribution \(\nu (A_{t+1})\); and the history is updated to \(H_{t+1} = ((A_0, X_0), \ldots , (A_{t+1}, X_{t+1}))\).
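To make this interaction concrete, here is a minimal simulation sketch in Python. It is purely illustrative and not part of the formal development: `run_bandit`, `first_arm_dist`, `policy` and `reward_dist` are placeholder names standing in for the interaction of an algorithm \((\pi , P_0)\) with a bandit \(\nu \).

import random

def run_bandit(first_arm_dist, policy, reward_dist, horizon, seed=0):
    """Simulate the algorithm/bandit interaction for `horizon` steps.

    first_arm_dist: rng -> arm, plays the role of P_0.
    policy: (t, history, rng) -> arm, plays the role of pi_t applied to H_t.
    reward_dist: (arm, rng) -> float, plays the role of the kernel nu.
    Returns the history [(A_0, X_0), ..., (A_{horizon-1}, X_{horizon-1})].
    """
    rng = random.Random(seed)
    # Time 0: the first arm is sampled from P_0 and its reward from nu(A_0).
    a = first_arm_dist(rng)
    history = [(a, reward_dist(a, rng))]
    # Time t+1: the arm is sampled from pi_t(H_t) and its reward from nu(A_{t+1}).
    for t in range(horizon - 1):
        a = policy(t, history, rng)
        history.append((a, reward_dist(a, rng)))
    return history

# Example: two Bernoulli arms and a policy that picks an arm uniformly at random.
arms = [0, 1]
means = {0: 0.3, 1: 0.7}
hist = run_bandit(
    first_arm_dist=lambda rng: rng.choice(arms),
    policy=lambda t, h, rng: rng.choice(arms),
    reward_dist=lambda a, rng: 1.0 if rng.random() < means[a] else 0.0,
    horizon=10,
)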
We now want to define a probability space on which we can study the sequences of arms and rewards, and formulate probabilistic statements about the interaction between the algorithm and the bandit.
By an application of the Ionescu-Tulcea theorem, an algorithm \((\pi , P_0)\) and a bandit \(\nu \) together define a probability distribution \(\mathbb {P}_B\) on the space \(\Omega _B := (\mathcal{A} \times \mathbb {R})^{\mathbb {N}}\), the space of infinite sequences of arms and rewards. We augment that probability space with a stream of rewards from each arm, independent of the bandit interaction, to get the probability space \((\Omega , \mathbb {P})\), where \(\Omega = \Omega _B \times \mathbb {R}^{\mathbb {N} \times \mathcal{A}}\) and \(\mathbb {P} = \mathbb {P}_B \otimes (\bigotimes _{n \in \mathbb {N}, a \in \mathcal{A}} \nu (a))\).
TODO: explain how the probability distribution \(\mathbb {P}_B\) is constructed.
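A possible sketch of that construction, with notation (\(\kappa _t\), \(\otimes \) for composition-products) introduced only for this remark: for each \(t \in \mathbb {N}\), let \(\kappa _t : (\mathcal{A} \times \mathbb {R})^{t+1} \rightsquigarrow \mathcal{A} \times \mathbb {R}\) be the kernel that maps a history \(h\) to the distribution \(\pi _t(h) \otimes \nu \) (draw an arm from \(\pi _t(h)\), then a reward from \(\nu \) at that arm). Applying the Ionescu-Tulcea theorem to the initial distribution \(P_0 \otimes \nu \) on \(\mathcal{A} \times \mathbb {R}\) and to the sequence of kernels \((\kappa _t)_{t \in \mathbb {N}}\) gives a unique probability measure \(\mathbb {P}_B\) on \(\Omega _B\) whose marginal distribution of \(((A_0, X_0), \ldots , (A_{t+1}, X_{t+1}))\) is the composition-product
\[
(P_0 \otimes \nu ) \otimes \kappa _0 \otimes \kappa _1 \otimes \cdots \otimes \kappa _t .
\]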
The reason for adding the extra stream of rewards is explained in Section 3.2.
For \(t \in \mathbb {N}\), we denote by \(A_t\) the arm pulled at time \(t\) and by \(X_t\) the reward received at time \(t\). Formally, these are measurable functions on \(\Omega = (\mathcal{A} \times \mathbb {R})^{\mathbb {N}} \times \mathbb {R}^{\mathbb {N} \times \mathcal{A}}\), defined by \(A_t(\omega ) = \omega _{1,t,1}\) and \(X_t(\omega ) = \omega _{1,t,2}\). We denote by \(H_t \in (\mathcal{A} \times \mathbb {R})^{t+1}\) the history of pulls and rewards up to and including time \(t\), that is \(H_t = ((A_0, X_0), \ldots , (A_t, X_t))\). Formally, \(H_t(\omega ) = (\omega _{1,0}, \ldots , \omega _{1,t})\).
For an arm \(a \in \mathcal{A}\) and a time \(t \in \mathbb {N}\), we denote by \(N_{t,a}\) the number of times that arm \(a\) has been pulled before time \(t\), that is \(N_{t,a} = \sum _{s=0}^{t-1} \mathbb {I}\{ A_s = a\} \).
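In the illustrative simulation sketch above, these quantities correspond to simple functions of the history list; the helpers below are again purely illustrative.

def arm(history, t):
    """A_t: the arm pulled at time t, for a history given as a list of (arm, reward) pairs."""
    return history[t][0]

def reward(history, t):
    """X_t: the reward received at time t."""
    return history[t][1]

def pull_count(history, t, a):
    """N_{t,a}: the number of times arm a is pulled at times s = 0, ..., t-1."""
    return sum(1 for (b, _) in history[:t] if b == a)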
When we describe an algorithm, we give the data of the policies \(\pi _t\), which are functions of the partial history up to time \(t\), an element of \((\mathcal{A} \times \mathcal{R})^{t+1}\). That means that any tool used to define a policy must be a function defined on \((\mathcal{A} \times \mathcal{R})^{t+1}\). For example, a definition of the empirical mean of an arm must be a function that, for each \(t \in \mathbb {N}\), maps a partial history in \((\mathcal{A} \times \mathcal{R})^{t+1}\) to \(\mathbb {R}\).
When we analyze an algorithm, we work instead on the bandit probability space \((\Omega , \mathbb {P})\) defined above, whose first component \(\Omega _B = (\mathcal{A} \times \mathbb {R})^{\mathbb {N}}\) describes the whole sequence of arms and rewards. As a stochastic process, the empirical mean of an arm is a function \(\mathbb {N} \to \Omega \to \mathbb {R}\): it takes as arguments a time and the full history, not a partial one.
Thus there are two very distinct types of objects: those defined on the partial history, which are used to build algorithms, and those defined on the full history, which are used to analyze algorithms.
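The distinction can be mirrored in the illustrative Python sketch: the first function below only ever sees a partial history, which is all a policy is allowed to use, while the second is a function of a time index and of the full sequence of arms and rewards, which is the kind of object manipulated in the analysis.

def empirical_mean_partial(partial_history, a):
    """Empirical mean of arm a computed from a partial history ((A_0, X_0), ..., (A_t, X_t))."""
    rewards = [x for (b, x) in partial_history if b == a]
    return sum(rewards) / len(rewards) if rewards else 0.0

def empirical_mean_process(full_history, t, a):
    """The same quantity seen as a stochastic process: a function of the time t and of the
    whole sequence of arms and rewards, of which only the first t + 1 entries are used."""
    return empirical_mean_partial(full_history[: t + 1], a)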
The filtration \(\mathcal{F}_B\) on \(\Omega _B\) generated by the history of pulls and rewards is the increasing family of \(\sigma \)-algebras \(\mathcal{F}_{B,t} = \sigma (H_0, \ldots , H_t)\), for \(t \in \mathbb {N}\).
Seen as processes defined on \(\Omega _B\), the processes \((H_t)_{t \in \mathbb {N}}\), \((A_t)_{t \in \mathbb {N}}\), and \((X_t)_{t \in \mathbb {N}}\) are adapted to the filtration \(\mathcal{F}_B\).
Let \(a \in \mathcal{A}\). Seen as a process defined on \(\Omega _B\), \((N_{t,a})_{t \in \mathbb {N}}\) is predictable with respect to the filtration \(\mathcal{F}_B\) (that is, \((N_{t,a})_{t \in \mathbb {N}}\) is adapted to \((\mathcal{F}_{B, t-1})_{t \in \mathbb {N}}\)).
The conditional distribution of the reward \(X_t\) given the arm \(A_t\) in the bandit probability space \((\Omega , \mathbb {P})\) is \(\nu (A_t)\).
The law of the arm \(A_0\) in the bandit probability space \((\Omega , \mathbb {P})\) is \(P_0\).
The conditional distribution of the arm \(A_{t+1}\) given the history \(H_t\) in the bandit probability space \((\Omega , \mathbb {P})\) is \(\pi _t(H_t)\).
3.2 Alternative model
The description of the bandit model above considers that at time \(t\), a reward \(X_t\) is generated, depending on the arm \(A_t\) pulled at that time. An alternative way to talk about that process is to imagine that there is a stream of rewards from each arm, and that the algorithm sees the first, then the second, and so on, reward of an arm as it pulls it. We introduce definitions to talk about the \(n\)-th reward of an arm and the time at which that reward is received.
For an arm \(a \in \mathcal{A}\) and a time \(n \in \mathbb {N}\), we denote by \(T_{n,a}\) the time at which arm \(a\) was pulled for the \(n\)-th time, that is \(T_{n,a} = \min \{ s \in \mathbb {N} \mid N_{s+1,a} = n\} \). Note that \(T_{n, a}\) can be infinite if the arm is not pulled \(n\) times.
For \(a \in \mathcal{A}\) and \(n \in \mathbb {N}\), let \(Z_{n,a} \sim \nu (a)\), independent of the bandit interaction and of the other \(Z_{m,b}\). In our probability space \(\Omega \), we can take for \(Z_{n,a}\) the function \(\omega \mapsto \omega _{2,n,a}\). We define \(Y_{n, a} = X_{T_{n,a}} \mathbb {I}\{ T_{n, a} < \infty \} + Z_{n,a} \mathbb {I}\{ T_{n, a} = \infty \} \): the reward received when pulling arm \(a\) for the \(n\)-th time if that time is finite, and \(Z_{n,a}\) otherwise.
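In the illustrative Python sketch, \(T_{n,a}\) and \(Y_{n,a}\) can be computed from a simulated history as follows, with `math.inf` standing for the infinite case and the argument `z` standing for the auxiliary sample \(Z_{n,a}\).

import math

def pull_time(history, n, a):
    """T_{n,a} for n >= 1: the time at which arm a is pulled for the n-th time,
    or math.inf if there is no such time in the (finite) simulated history."""
    count = 0
    for s, (b, _) in enumerate(history):
        if b == a:
            count += 1
            if count == n:
                return s
    return math.inf

def nth_reward(history, n, a, z):
    """Y_{n,a}: the reward of the n-th pull of arm a, or the auxiliary sample z
    (standing for Z_{n,a}) if arm a is not pulled n times."""
    t = pull_time(history, n, a)
    return history[t][1] if t != math.inf else z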
\(T_{n,a}\) is a stopping time for the filtration \(\mathcal{F}_B\) generated by the history of pulls and rewards.
It is the hitting time of the set \(\{ n\} \) by the adapted process \((N_{s+1, a})_{s \in \mathbb {N}}\), hence a stopping time.
For \(n > 0\) and \(a \in \mathcal{A}\), the law of \(Y_{n,a}\) is \(\nu (a)\).
It suffices to show that for all \(t \in \mathbb {N} \cup \{ \infty \} \), the law of \(Y_{n,a}\) conditioned on the event \(\{ T_{n,a} = t\} \) is \(\nu (a)\).
If \(t = \infty \), then on that event \(Y_{n,a} = Z_{n,a}\); since \(Z_{n,a}\) has law \(\nu (a)\) and is independent of the bandit interaction, hence of the event \(\{ T_{n,a} = \infty \} \), its conditional law given that event is \(\nu (a)\).
If \(t < \infty \), then on the event \(\{ T_{n,a} = t\} \) we have \(A_t = a\) and \(Y_{n,a} = X_t\). That event is determined by the arms \(A_0, \ldots , A_t\) (it is the event \(\{ N_{t,a} = n-1\} \cap \{ A_t = a\} \)), and the conditional distribution of \(X_t\) given the past pulls and rewards and the arm \(A_t\) is \(\nu (A_t)\); since \(A_t = a\) on that event, the law of \(X_t\) conditioned on it is \(\nu (a)\).
For \(a \in \mathcal{A}\), the rewards \((Y_{n,a})_{n \in \mathbb {N}}\) are independent.
For \(a \in \mathcal{A}\), let \(Y^{(a)} = (Y_{n,a})_{n \in \mathbb {N}} \in \mathbb {R}^{\mathbb {N}}\) be the sequence of rewards obtained from pulling arm \(a\). Then the sequences \((Y^{(a)})_{a \in \mathcal{A}}\) are independent.
\(T_{N_{t+1, a}, a} \le t < \infty \) for all \(t \in \mathbb {N}\) and \(a \in \mathcal{A}\).
By definition, \(T_{N_{t+1,a}, a} = \min \{ s \in \mathbb {N} \mid N_{s+1,a} = N_{t+1,a}\} \); since \(s = t\) belongs to that set, the minimum is at most \(t < \infty \).
\(T_{N_{t+1, A_t}, A_t} = t\) for all \(t \in \mathbb {N}\).
\(Y_{N_{t+1, A_t}, A_t} = X_t\) for all \(t \in \mathbb {N}\).
By Lemma 3.19, we have \(T_{N_{t+1,A_t}, A_t} = t < \infty \), so \(Y_{N_{t+1,A_t}, A_t} = X_{T_{N_{t+1,A_t}, A_t}} = X_t\).
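Continuing the illustrative Python sketch, the identities of the last two lemmas can be checked numerically on the simulated history `hist`, using the helpers `pull_count`, `pull_time` and `nth_reward` defined above.

for t, (a_t, x_t) in enumerate(hist):
    n = pull_count(hist, t + 1, a_t)               # N_{t+1, A_t}
    assert pull_time(hist, n, a_t) == t            # T_{N_{t+1, A_t}, A_t} = t
    assert nth_reward(hist, n, a_t, z=0.0) == x_t  # Y_{N_{t+1, A_t}, A_t} = X_t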
3.3 Regret and other bandit quantities
For an arm \(a \in \mathcal{A}\), we denote by \(\mu _a\) the mean of the rewards for that arm, that is \(\mu _a = \nu (a)[X]\). We denote by \(\mu ^*\) the mean of the best arm, that is \(\mu ^* = \max _{a \in \mathcal{A}} \mu _a\).
The regret \(R_T\) of a sequence of arms \(A_0, \ldots , A_{T-1}\) after \(T\) pulls is the difference between the expected cumulative reward of always playing the best arm and the expected cumulative reward of the sequence:
\[
R_T = T \mu ^* - \mathbb {E}\left[ \sum _{t=0}^{T-1} X_t \right] .
\]
For an arm \(a \in \mathcal{A}\), its gap is defined as the difference between the mean of the best arm and the mean of that arm: \(\Delta _a = \mu ^* - \mu _a\).
For \(\mathcal{A}\) finite, the regret \(R_T\) can be expressed as a sum over the arms and their gaps:
\[
R_T = \sum _{a \in \mathcal{A}} \Delta _a \, \mathbb {E}\left[ N_{T,a} \right] .
\]
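A sketch of the computation behind that identity, using the tower rule (the conditional mean of \(X_t\) given \(A_t\) is \(\mu _{A_t}\)) and the fact that \(\sum _{t=0}^{T-1} \mathbb {I}\{ A_t = a\} = N_{T,a}\) for each arm \(a\), valid for \(\mathcal{A}\) finite:
\[
R_T = T \mu ^* - \mathbb {E}\left[ \sum _{t=0}^{T-1} \mu _{A_t} \right]
= \mathbb {E}\left[ \sum _{t=0}^{T-1} (\mu ^* - \mu _{A_t}) \right]
= \mathbb {E}\left[ \sum _{a \in \mathcal{A}} N_{T,a} \, \Delta _a \right]
= \sum _{a \in \mathcal{A}} \Delta _a \, \mathbb {E}\left[ N_{T,a} \right] .
\]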