2 Stochastic multi-armed bandits
2.1 Bandit model and probability space
The interaction of an algorithm with a stochastic bandit with arms in \(\mathcal{A}\) (a measurable space) and real rewards is described by the following data:
\(\nu : \mathcal{A} \rightsquigarrow \mathbb {R}\), a Markov kernel giving the conditional distribution of the reward given the arm pulled,
for all \(t \in \mathbb {N}\), a policy \(\pi _t : (\mathcal{A} \times \mathbb {R})^{t+1} \rightsquigarrow \mathcal{A}\), a Markov kernel which gives the distribution of the arm to pull at time \(t+1\) given the history of previous pulls and rewards,
\(P_0 \in \mathcal{P}(\mathcal{A})\), a probability measure that gives the distribution of the first arm to pull.
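For example, a \(K\)-armed Bernoulli bandit is obtained by taking \(\mathcal{A} = \{ 1, \ldots , K\} \) with the discrete \(\sigma \)-algebra, \(\nu (a) = \mathrm{Ber}(\mu _a)\) for parameters \(\mu _1, \ldots , \mu _K \in [0, 1]\), \(P_0\) the uniform distribution on \(\mathcal{A}\), and \(\pi _t\) any measurable map from histories to distributions over arms, for instance the constant kernel equal to the uniform distribution (uniform exploration).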
By an application of the Ionescu-Tulcea theorem, a bandit \(\mathcal{B} = (\nu , \pi , P_0)\) defines a probability distribution on the space \(\Omega := (\mathcal{A} \times \mathbb {R})^{\mathbb {N}}\), the space of infinite sequences of arms and rewards. We denote that distribution by \(\mathbb {P}\). TODO: explain how the probability distribution is constructed.
For \(t \in \mathbb {N}\), we denote by \(A_t\) the arm pulled at time \(t\) and by \(X_t\) the reward received at time \(t\). Formally, these are measurable functions on \(\Omega = (\mathcal{A} \times \mathbb {R})^{\mathbb {N}}\), defined by \(A_t(\omega ) = \omega _{t,1}\) and \(X_t(\omega ) = \omega _{t,2}\). We denote by \(H_t \in (\mathcal{A} \times \mathbb {R})^{t+1}\) the history of pulls and rewards up to and including time \(t\), that is \(H_t = ((A_0, X_0), \ldots , (A_t, X_t))\). Formally, \(H_t(\omega ) = (\omega _0, \ldots , \omega _t)\).
The conditional distribution of the reward \(X_t\) given the arm \(A_t\) in the bandit probability space \((\Omega , \mathbb {P})\) is \(\nu (A_t)\).
The law of the arm \(A_0\) in the bandit probability space \((\Omega , \mathbb {P})\) is \(P_0\).
The conditional distribution of the arm \(A_{t+1}\) given the history \(H_t\) in the bandit probability space \((\Omega , \mathbb {P})\) is \(\pi _t(H_t)\).
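In outline, the measure \(\mathbb {P}\) given by the Ionescu-Tulcea theorem is characterized by its values on cylinder sets, obtained by composing the kernels in the order in which arms and rewards are generated: for all \(t \in \mathbb {N}\) and measurable sets \(B_0, \ldots , B_t \subseteq \mathcal{A}\) and \(C_0, \ldots , C_t \subseteq \mathbb {R}\),
\[ \mathbb {P}\left(A_0 \in B_0, X_0 \in C_0, \ldots , A_t \in B_t, X_t \in C_t\right) = \int _{B_0} P_0(\mathrm{d}a_0) \int _{C_0} \nu (a_0)(\mathrm{d}x_0) \int _{B_1} \pi _0(h_0)(\mathrm{d}a_1) \int _{C_1} \nu (a_1)(\mathrm{d}x_1) \cdots \int _{B_t} \pi _{t-1}(h_{t-1})(\mathrm{d}a_t) \int _{C_t} \nu (a_t)(\mathrm{d}x_t) , \]
where \(h_s = ((a_0, x_0), \ldots , (a_s, x_s))\) collects the integration variables for the history up to time \(s\).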
2.2 Regret and other bandit quantities
For an arm \(a \in \mathcal{A}\), we denote by \(\mu _a\) the mean of the rewards for that arm, that is \(\mu _a = \nu (a)[X]\). We denote by \(\mu ^*\) the mean of the best arm, that is \(\mu ^* = \max _{a \in \mathcal{A}} \mu _a\).
The regret \(R_T\) of a sequence of arms \(A_0, \ldots , A_{T-1}\) after \(T\) pulls is the difference between the cumulative expected reward of always playing the best arm and the cumulative expected reward of the sequence:
\[ R_T = T \mu ^* - \sum _{t=0}^{T-1} \mu _{A_t} . \]
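For example, with two arms of means \(\mu _1 = 0.5\) and \(\mu _2 = 0.3\), a sequence that pulls arm \(2\) in \(m\) of the \(T\) rounds (and arm \(1\) otherwise) has regret \(R_T = 0.5\, T - \big(0.5\, (T-m) + 0.3\, m\big) = 0.2\, m\).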
For an arm \(a \in \mathcal{A}\), its gap is defined as the difference between the mean of the best arm and the mean of that arm: \(\Delta _a = \mu ^* - \mu _a\).
For an arm \(a \in \mathcal{A}\) and a time \(t \in \mathbb {N}\), we denote by \(N_{t,a}\) the number of times that arm \(a\) has been pulled before time \(t\), that is \(N_{t,a} = \sum _{s=0}^{t-1} \mathbb {I}\{ A_s = a\} \).
For \(\mathcal{A}\) finite, the regret \(R_T\) can be expressed as a sum over the arms and their gaps:
\[ R_T = \sum _{a \in \mathcal{A}} N_{T,a} \Delta _a . \]
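Indeed, each term of the regret can be rewritten with the gaps and then grouped by arm:
\[ R_T = \sum _{t=0}^{T-1} (\mu ^* - \mu _{A_t}) = \sum _{t=0}^{T-1} \Delta _{A_t} = \sum _{t=0}^{T-1} \sum _{a \in \mathcal{A}} \mathbb {I}\{ A_t = a\} \Delta _a = \sum _{a \in \mathcal{A}} \Delta _a \sum _{t=0}^{T-1} \mathbb {I}\{ A_t = a\} = \sum _{a \in \mathcal{A}} N_{T,a} \Delta _a , \]
where the exchange of the two sums uses the finiteness of \(\mathcal{A}\).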
2.3 Alternative model
The description of the bandit model above considers that at time \(t\), a reward \(X_t\) is generated, depending on the arm \(A_t\) pulled at that time. An alternative way to describe that process is to imagine that there is a stream of rewards from each arm, and that the algorithm sees the first, then the second, etc., reward from an arm as it pulls it. We introduce definitions to talk about the \(n\)-th reward of an arm, and the time at which that reward is received.
For an arm \(a \in \mathcal{A}\) and \(n \in \mathbb {N}\), we denote by \(T_{n,a}\) the time at which arm \(a\) is pulled for the \(n\)-th time, that is \(T_{n,a} = \min \{ s \in \mathbb {N} \mid N_{s+1,a} = n\} \). Note that \(T_{n, a}\) can be infinite if the arm is not pulled \(n\) times.
For \(a \in \mathcal{A}\) and \(n \in \mathbb {N}\), let \(Z_{n,a} \sim \nu (a)\), independent of everything else. We define \(Y_{n, a} = X_{T_{n,a}} \mathbb {I}\{ T_{n, a} {\lt} \infty \} + Z_{n,a} \mathbb {I}\{ T_{n, a} = \infty \} \), the reward received when pulling arm \(a\) for the \(n\)-th time if that time is finite, and equal to \(Z_{n,a}\) otherwise.
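For example, if the first three pulls are \(A_0 = a\), \(A_1 = b\), \(A_2 = a\) for two distinct arms \(a \ne b\), then \(N_{3,a} = 2\) and \(N_{3,b} = 1\), the pull times are \(T_{1,a} = 0\), \(T_{2,a} = 2\), \(T_{1,b} = 1\), and the corresponding rewards are \(Y_{1,a} = X_0\), \(Y_{2,a} = X_2\), \(Y_{1,b} = X_1\). If arm \(b\) is never pulled again, then \(T_{2,b} = \infty \) and \(Y_{2,b} = Z_{2,b}\).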
TODO: that definition requires changing the probability space to \(\Omega \times \mathbb {R}^{\mathbb {N} \times \mathcal{A}}\).
\(T_{N_{t, a}, a} \le t - 1 {\lt} \infty \) for all \(t \in \mathbb {N}\) and \(a \in \mathcal{A}\).
By definition, \(T_{N_{t,a}, a} = \min \{ s \in \mathbb {N} \mid N_{s+1,a} = N_{t,a}\} \). Since \(N_{t,a}\) counts the pulls of arm \(a\) before time \(t\), the choice \(s = t - 1\) satisfies \(N_{s+1,a} = N_{t,a}\), so the set is nonempty and its minimum is at most \(t - 1 {\lt} \infty \).
\(T_{N_{t+1, A_t}, A_t} = t\) for all \(t \in \mathbb {N}\).
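Indeed, since arm \(A_t\) is pulled at time \(t\), \(N_{t+1, A_t} = N_{t, A_t} + 1\), while \(N_{s+1, A_t} \le N_{t, A_t}\) for every \(s {\lt} t\); hence \(t\) is the smallest \(s \in \mathbb {N}\) such that \(N_{s+1, A_t} = N_{t+1, A_t}\), that is, \(T_{N_{t+1, A_t}, A_t} = t\).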
\(Y_{N_{t+1,A_t}, A_t} = X_t\) for all \(t \in \mathbb {N}\).
By Lemma 15, we have \(T_{N_{t+1,A_t}, A_t} = t {\lt} \infty \), so \(Y_{N_{t+1,A_t}, A_t} = X_{T_{N_{t+1,A_t}, A_t}} = X_t\).