LeanBandits

3 Stochastic multi-armed bandits

3.1 Algorithm, bandit and probability space

A bandit algorithm is an algorithm in the sense of Definition 2.1. We call the actions arms and the observations rewards.

The first arm pulled by the algorithm is sampled from \(P_0\), the arm pulled at time \(1\) is sampled from \(\pi _0(H_0)\), where \(H_0 \in \mathcal{A} \times \mathcal{R}\) consists of the first arm pulled and the first reward received, and so on.

Definition 3.1 Bandit

A stochastic bandit is simply a reward distribution for each arm: a Markov kernel \(\nu : \mathcal{A} \rightsquigarrow \mathbb {R}\), conditional distribution of the rewards given the arm pulled.

An algorithm can interact with a bandit to produce a sequence of arms and rewards: after a time \(t\), the history \(H_t \in (\mathcal{A} \times \mathcal{R})^{t+1}\) contains the arms pulled and rewards received up to and including that time,

  • the algorithm chooses an arm \(A_{t+1}\) sampled according to its policy \(\pi _t(H_t)\),

  • the bandit generates a reward \(X_{t+1}\) according to the distribution \(\nu (A_{t+1})\),

  • the history is updated to \(H_{t+1} = ((A_0, X_0), \ldots , (A_{t+1}, X_{t+1}))\).

We now want to define a probability space on which we can study the sequences of arms and rewards, and formulate probabilistic statements about the interaction between the algorithm and the bandit.

Definition 3.2 Bandit probability space

By an application of the Ionescu-Tulcea theorem, an algorithm \((\pi , P_0)\) and a bandit \(\nu \) together define a probability distribution \(\mathbb {P}_B\) on the space \(\Omega _B := (\mathcal{A} \times \mathbb {R})^{\mathbb {N}}\), the space of infinite sequences of arms and rewards. We augment that probability space with a stream of rewards from each arm, independent of the bandit interaction, to get the probability space \((\Omega , \mathbb {P})\), where \(\Omega = \Omega _B \times \mathbb {R}^{\mathbb {N} \times \mathcal{A}}\) and \(\mathbb {P} = \mathbb {P}_B \otimes (\bigotimes _{n \in \mathbb {N}, a \in \mathcal{A}} \nu (a))\).

TODO: explain how the probability distribution \(\mathbb {P}_B\) is constructed.
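A sketch of the intended construction: one round of interaction after a history \(h \in (\mathcal{A} \times \mathbb {R})^{t+1}\) is described by the Markov kernel \(\kappa _t : (\mathcal{A} \times \mathbb {R})^{t+1} \rightsquigarrow \mathcal{A} \times \mathbb {R}\) which samples an arm from \(\pi _t(h)\) and then a reward from \(\nu \) at that arm, that is, for measurable \(B \subseteq \mathcal{A}\) and \(C \subseteq \mathbb {R}\),

\begin{align*} \kappa _t(h)(B \times C) = \int _{\mathcal{A}} \mathbb {I}\{ a \in B\} \, \nu (a)(C) \, \pi _t(h)(\mathrm{d}a) \: . \end{align*}

The first pair \((A_0, X_0)\) is obtained in the same way from \(P_0\) and \(\nu \). The Ionescu-Tulcea theorem then yields a unique probability measure \(\mathbb {P}_B\) on \(\Omega _B = (\mathcal{A} \times \mathbb {R})^{\mathbb {N}}\) whose finite-dimensional distributions are the successive compositions of that initial distribution with the kernels \(\kappa _0, \kappa _1, \ldots \)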

The reason for adding the extra stream of rewards is explained in Section 3.2.

Definition 3.3 Arms, rewards and history

For \(t \in \mathbb {N}\), we denote by \(A_t\) the arm pulled at time \(t\) and by \(X_t\) the reward received at time \(t\). Formally, these are measurable functions on \(\Omega = (\mathcal{A} \times \mathbb {R})^{\mathbb {N}} \times \mathbb {R}^{\mathbb {N} \times \mathcal{A}}\), defined by \(A_t(\omega ) = \omega _{1,t,1}\) and \(X_t(\omega ) = \omega _{1,t,2}\). We denote by \(H_t \in (\mathcal{A} \times \mathbb {R})^{t+1}\) the history of pulls and rewards up to and including time \(t\), that is \(H_t = ((A_0, X_0), \ldots , (A_t, X_t))\). Formally, \(H_t(\omega ) = (\omega _{1,0}, \ldots , \omega _{1,t})\).

Definition 3.4 Pull counts

For an arm \(a \in \mathcal{A}\) and a time \(t \in \mathbb {N}\), we denote by \(N_{t,a}\) the number of times that arm \(a\) has been pulled before time \(t\), that is \(N_{t,a} = \sum _{s=0}^{t-1} \mathbb {I}\{ A_s = a\} \).
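In particular, \(N_{0,a} = 0\), and the counts satisfy the recurrence

\begin{align*} N_{t+1, a} = N_{t, a} + \mathbb {I}\{ A_t = a\} \: , \end{align*}

which follows directly from the definition.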

Remark 3.5 Building vs analyzing algorithms

When we describe an algorithm, we give the data of the policies \(\pi _t\), which are functions of the partial history up to time \(t\), an element of \((\mathcal{A} \times \mathcal{R})^{t+1}\). That means that any tool used to define a policy must be a function defined on \((\mathcal{A} \times \mathcal{R})^{t+1}\). For example, a definition of the empirical mean of an arm must be a function of type \((t : \mathbb {N}) \to (\mathcal{A} \times \mathcal{R})^{t+1} \to \mathbb {R}\).

When we analyze an algorithm, we instead work on the bandit probability space \((\Omega , \mathbb {P})\) of Definition 3.2, whose first component \(\Omega _B = (\mathcal{A} \times \mathbb {R})^{\mathbb {N}}\) contains the full history, that is the whole infinite sequence of arms and rewards. As a stochastic process, the empirical mean of an arm is a function \(\mathbb {N} \to \Omega \to \mathbb {R}\).

Thus there are two very distinct types of objects: those defined on the partial history, which are used to build algorithms, and those defined on the full history, which are used to analyze algorithms.
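For example, writing \(\hat{\mu }^{\mathrm{hist}}_{t,a} : (\mathcal{A} \times \mathcal{R})^{t+1} \to \mathbb {R}\) for a partial-history definition of the empirical mean of arm \(a\) (the name is used only for illustration), the corresponding object of the second kind is the stochastic process obtained by precomposition with the history:

\begin{align*} \hat{\mu }_{t,a} = \hat{\mu }^{\mathrm{hist}}_{t,a} \circ H_t : \Omega \to \mathbb {R} \: . \end{align*}

This is the general pattern relating the two kinds of objects: a tool defined on the partial history becomes a process on \(\Omega \) by evaluating it on \(H_t\).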

Definition 3.6 Filtration

The filtration \(\mathcal{F}_B\) on \(\Omega _B\) generated by the history of pulls and rewards is the increasing family of \(\sigma \)-algebras \(\mathcal{F}_{B,t} = \sigma (H_0, \ldots , H_t)\), for \(t \in \mathbb {N}\).

Lemma 3.7

Seen as processes defined on \(\Omega _B\), \((H_t)_{t \in \mathbb {N}}\), \((A_t)_{t \in \mathbb {N}}\), and \((X_t)_{t \in \mathbb {N}}\) are adapted to the filtration \(\mathcal{F}_B\).

Proof
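A possible argument: by definition of \(\mathcal{F}_{B,t}\), the history \(H_t\) is \(\mathcal{F}_{B,t}\)-measurable, and \(A_t\) and \(X_t\) are obtained from \(H_t\) by the measurable projection onto its last entry, hence are also \(\mathcal{F}_{B,t}\)-measurable.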

Lemma 3.8

Let \(a \in \mathcal{A}\). Seen as a process defined on \(\Omega _B\), \((N_{t,a})_{t \in \mathbb {N}}\) is predictable with respect to the filtration \(\mathcal{F}_B\) (that is, \((N_{t,a})_{t \in \mathbb {N}}\) is adapted to \((\mathcal{F}_{B, t-1})_{t \in \mathbb {N}}\)).

Proof
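A possible argument: \(N_{0,a} = 0\) is constant, and for \(t \ge 1\) the count \(N_{t,a} = \sum _{s=0}^{t-1} \mathbb {I}\{ A_s = a\} \) is a measurable function of \(H_{t-1}\), hence \(\mathcal{F}_{B,t-1}\)-measurable.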
Lemma 3.9

The conditional distribution of the reward \(X_t\) given the arm \(A_t\) in the bandit probability space \((\Omega , \mathbb {P})\) is \(\nu (A_t)\).

Proof
Lemma 3.10

The law of the arm \(A_0\) in the bandit probability space \((\Omega , \mathbb {P})\) is \(P_0\).

Proof
Lemma 3.11

The conditional distribution of the arm \(A_{t+1}\) given the history \(H_t\) in the bandit probability space \((\Omega , \mathbb {P})\) is \(\pi _t(H_t)\).

Proof

3.2 Alternative model

The description of the bandit model above considers that at time \(t\), a reward \(X_t\) is generated, depending on the arm \(A_t\) pulled at that time. An alternative way to talk about that process is to imagine that there is a stream of rewards from each arm, and that the algorithm sees the first, then the second, etc. reward from an arm as it pulls it. We introduce definitions to talk about the \(n^{th}\) reward of an arm, and the time at which that reward is received.

Definition 3.12

For an arm \(a \in \mathcal{A}\) and \(n \in \mathbb {N}\), we denote by \(T_{n,a}\) the time at which arm \(a\) was pulled for the \(n\)-th time, that is \(T_{n,a} = \min \{ s \in \mathbb {N} \mid N_{s+1,a} = n\} \), with the convention \(\min \emptyset = \infty \). Note that \(T_{n, a}\) is infinite if the arm is not pulled \(n\) times.

Definition 3.13

For \(a \in \mathcal{A}\) and \(n \in \mathbb {N}\), let \(Z_{n,a} \sim \nu (a)\), independent of the bandit interaction and other \(Z_{m,b}\). In our probability space \(\Omega \), we can take for \(Z_{n,a}\) the function \(\omega \mapsto \omega _{2,n,a}\). We define \(Y_{n, a} = X_{T_{n,a}} \mathbb {I}\{ T_{n, a} {\lt} \infty \} + Z_{n,a} \mathbb {I}\{ T_{n, a} = \infty \} \), the reward received when pulling arm \(a\) for the \(n\)-th time if that time is finite, and equal to \(Z_{n,a}\) otherwise.
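For illustration, suppose the first three arms pulled are \(A_0 = a\), \(A_1 = b\) and \(A_2 = a\) with \(a \ne b\). Then \(T_{1,a} = 0\), \(T_{2,a} = 2\) and \(T_{1,b} = 1\), so that \(Y_{1,a} = X_0\), \(Y_{2,a} = X_2\) and \(Y_{1,b} = X_1\): the sequence \((Y_{n,a})_{n \ge 1}\) lists the rewards of arm \(a\) in the order in which they are received.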

Lemma 3.14

\(T_{n,a}\) is a stopping time for the filtration generated by the history of pulls and rewards.

Proof

It is the hitting time of the measurable set \(\{ n\} \) by the adapted process \((N_{s+1, a})_{s \in \mathbb {N}}\), hence a stopping time.

Lemma 3.15

For \(n {\gt} 0\) and \(a \in \mathcal{A}\), the law of \(Y_{n,a}\) is \(\nu (a)\).

Proof

It suffices to show that for all \(t \in \mathbb {N} \cup \{ \infty \} \), the law of \(Y_{n,a}\) conditioned on \(T_{n,a} = t\) is \(\nu (a)\). If \(t = \infty \), then

\begin{align*} \mathcal{L}(Y_{n,a} \mid T_{n,a} = t) = \mathcal{L}(Z_{n,a} \mid T_{n,a} = t) = \mathcal{L}(Z_{n,a}) = \nu (a) \end{align*}

If \(t {\lt} \infty \), then

\begin{align*} \mathcal{L}(Y_{n,a} \mid T_{n,a} = t) & = \mathcal{L}(X_t \mid T_{n,a} = t) \\ & = \mathcal{L}(X_t \mid T_{n,a} = t, A_t = a) \\ & = \mathcal{L}(X_t \mid A_t = a) \\ & = \nu (a) \: . \end{align*}

TODO: explain that chain of equalities. There is independence involved.
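A sketch of the argument behind that chain: the first equality holds because on \(\{ T_{n,a} = t\} \) with \(t {\lt} \infty \) we have \(Y_{n,a} = X_{T_{n,a}} = X_t\). For \(n {\gt} 0\), the event \(\{ T_{n,a} = t\} \) equals \(\{ N_{t,a} = n-1\} \cap \{ A_t = a\} \), and in particular is included in \(\{ A_t = a\} \), which gives the second equality. The event \(\{ N_{t,a} = n-1\} \) is a measurable function of the history up to time \(t-1\); since by construction the conditional distribution of \(X_t\) given that history and \(A_t\) is \(\nu (A_t)\), the reward \(X_t\) is conditionally independent of \(\{ N_{t,a} = n-1\} \) given \(A_t\), which gives the third equality. The last equality is Lemma 3.9.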

Lemma 3.16

For \(a \in \mathcal{A}\), the rewards \((Y_{n,a})_{n \in \mathbb {N}}\) are independent.

Proof
Lemma 3.17

For \(a \in \mathcal{A}\), let \(Y^{(a)} = (Y_{n,a})_{n \in \mathbb {N}} \in \mathbb {R}^{\mathbb {N}}\) be the sequence of rewards obtained from pulling arm \(a\). Then the sequences \((Y^{(a)})_{a \in \mathcal{A}}\) are independent.

Proof
Lemma 3.18

\(T_{N_{t+1, a}, a} \le t {\lt} \infty \) for all \(t \in \mathbb {N}\) and \(a \in \mathcal{A}\).

Proof

By definition, \(T_{N_{t+1,a}, a} = \min \{ s \in \mathbb {N} \mid N_{s+1,a} = N_{t+1,a}\} \), and that set contains \(s = t\), hence \(T_{N_{t+1,a}, a} \le t {\lt} \infty \).

Lemma 3.19

\(T_{N_{t+1, A_t}, A_t} = t\) for all \(t \in \mathbb {N}\).

Proof
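A possible argument: since \(A_t\) is pulled at time \(t\), \(N_{t+1, A_t} = N_{t, A_t} + 1\), while \(N_{s+1, A_t} \le N_{t, A_t}\) for every \(s {\lt} t\); hence \(t\) is the smallest \(s\) with \(N_{s+1, A_t} = N_{t+1, A_t}\), which is the definition of \(T_{N_{t+1, A_t}, A_t}\).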

Lemma 3.20

\(Y_{N_{t+1, A_t}, A_t} = X_t\) for all \(t \in \mathbb {N}\).

Proof

By Lemma 3.19, we have \(T_{N_{t+1,A_t}, A_t} = t {\lt} \infty \), so \(Y_{N_{t+1,A_t}, A_t} = X_{T_{N_{t+1,A_t}, A_t}} = X_t\).

Lemma 3.21
For \(t \in \mathbb {N}\) and \(a \in \mathcal{A}\),
\begin{align*} \sum _{n=1}^{N_{t, a}} Y_{n, a} = \sum _{s=0}^{t-1} \mathbb {I}\{ A_s = a\} X_s \: . \end{align*}
Proof
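A possible argument: the map \(s \mapsto N_{s+1, a}\) is a bijection from \(\{ s {\lt} t \mid A_s = a\} \) to \(\{ 1, \ldots , N_{t,a}\} \), since the count increases by exactly one at each pull of \(a\), and for every \(s\) in that set Lemma 3.20 gives \(Y_{N_{s+1,a}, a} = X_s\). Reindexing the left-hand sum along this bijection gives the right-hand side.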

3.3 Regret and other bandit quantities

Definition 3.22

For an arm \(a \in \mathcal{A}\), we denote by \(\mu _a\) the mean of the rewards for that arm, that is \(\mu _a = \nu (a)[X]\). We denote by \(\mu ^*\) the mean of the best arm, that is \(\mu ^* = \max _{a \in \mathcal{A}} \mu _a\).

Definition 3.23 Regret

The regret \(R_T\) of a sequence of arms \(A_0, \ldots , A_{T-1}\) after \(T\) pulls is the difference between the cumulative mean reward of always playing the best arm and the cumulative mean reward of the arms pulled:

\begin{align*} R_T = T \mu ^* - \sum _{t=0}^{T-1} \mu _{A_t} \: . \end{align*}
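For instance (an illustrative computation), with two arms of means \(\mu _1 = 1\) and \(\mu _2 = 0.5\), a sequence of \(T = 10\) pulls in which the suboptimal arm is pulled \(3\) times has regret \(R_{10} = 10 \cdot 1 - (7 \cdot 1 + 3 \cdot 0.5) = 1.5\): each pull of a suboptimal arm contributes its gap to the regret, as the gap decomposition below makes precise.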
Definition 3.24

For an arm \(a \in \mathcal{A}\), its gap is defined as the difference between the mean of the best arm and the mean of that arm: \(\Delta _a = \mu ^* - \mu _a\).

For \(\mathcal{A}\) finite, the regret \(R_T\) can be expressed as a sum over the arms of the pull counts weighted by the gaps:

\begin{align*} R_T = \sum _{a \in \mathcal{A}} N_{T,a} \Delta _a \: . \end{align*}
Proof
\begin{align*} R_T = T \mu ^* - \sum _{t=0}^{T-1} \mu _{A_t} & = T \mu ^* - \sum _{a \in \mathcal{A}} \sum _{t=0}^{T-1} \mathbb {I}\{ A_t = a\} \mu _a \\ & = T \mu ^* - \sum _{a \in \mathcal{A}} N_{T,a} \mu _a \\ & = \sum _{a \in \mathcal{A}} N_{T,a} \mu ^* - \sum _{a \in \mathcal{A}} N_{T,a} \mu _a \\ & = \sum _{a \in \mathcal{A}} N_{T,a} \Delta _a \: . \end{align*}