- Boxes: definitions
- Ellipses: theorems and lemmas
- Blue border: the statement of this result is ready to be formalized; all prerequisites are done
- Orange border: the statement of this result is not ready to be formalized; the blueprint needs more work
- Blue background: the proof of this result is ready to be formalized; all prerequisites are done
- Green border: the statement of this result is formalized
- Green background: the proof of this result is formalized
- Dark green background: the proof of this result and all its ancestors are formalized
- Dark green border: the statement of this result is in Mathlib
Let \(\mathfrak {A}\) be an algorithm with action space \(\mathcal{A}\) and reward space \(\mathcal{R}\), policy \(\pi \) and initial distribution \(P_0\). If \(\mathcal{A}\) and \(\mathcal{R}\) are standard Borel spaces, there exist jointly measurable functions \(f'_0 : I \to \mathcal{A}\) and \(f_t : (\mathcal{A} \times \mathcal{R})^{t+1} \times I \to \mathcal{A}\) such that
the law of \(f'_0\) is \(P_0\),
for every history \(h_t \in (\mathcal{A} \times \mathcal{R})^{t+1}\), the law of \(f_t(h_t, \cdot )\) is \(\pi _t(h_t)\).
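Concretely, this is the usual noise-outsourcing construction. The following Python sketch (an informal illustration, not part of the formalization) shows such a map for a policy over finitely many actions via the inverse CDF; the helpers `policy` and `action_from_uniform` are ours.

```python
import random

def action_from_uniform(probs, u):
    """Inverse-CDF map: turn a uniform draw u in [0, 1) into a sample from the
    finite distribution `probs`, where probs[a] is the probability of action a."""
    cumulative = 0.0
    for action, p in enumerate(probs):
        cumulative += p
        if u < cumulative:
            return action
    return len(probs) - 1  # guard against floating-point rounding near u = 1

def policy(history):
    """Toy policy over 3 actions: favour action 0 on an empty history."""
    return [0.7, 0.2, 0.1] if not history else [1 / 3, 1 / 3, 1 / 3]

u = random.random()                        # plays the role of the uniform input in I
a = action_from_uniform(policy([]), u)     # f_t(history, u): deterministic given (history, u)
```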
A sequential, stochastic algorithm with actions in a measurable space \(\mathcal{A}\) and observations in a measurable space \(\mathcal{R}\) is described by the following data:
for all \(t \in \mathbb {N}\), a policy \(\pi _t : (\mathcal{A} \times \mathcal{R})^{t+1} \rightsquigarrow \mathcal{A}\), a Markov kernel which gives the distribution of the action of the algorithm at time \(t+1\) given the history of previous actions and observations,
\(P_0 \in \mathcal{P}(\mathcal{A})\), a probability measure that gives the distribution of the first action.
The history, actions and rewards on the array model probability space \((\Omega _{\mathcal{A}}, P_{\mathcal{A}})\) are defined as follows:
the action at time \(0\) is \(A_0(\omega ) = f'_0(\omega _{1,0})\), the reward at time \(0\) is \(R_0(\omega ) = \omega _{2,0,A_0(\omega )}\), and the history at time \(0\) is \(H_0(\omega ) = (A_0(\omega ), R_0(\omega ))\),
for \(t \ge 0\), the action at time \(t+1\) is \(A_{t+1}(\omega ) = f_t(H_t(\omega ), \omega _{1,t+1})\), the reward at time \(t+1\) is \(R_{t+1}(\omega ) = \omega _{2,N_{t+1,A_{t+1}(\omega )},A_{t+1}(\omega )}\), and the history at time \(t+1\) is \(H_{t+1}(\omega ) = (H_t(\omega ), (A_{t+1}(\omega ), R_{t+1}(\omega )))\).
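To make the indexing concrete, here is an illustrative Python sketch of the array model: the rewards of each arm are pre-drawn into a table, and the reward of a pull is read off at the row given by that arm's current pull count. The helper `choose_action` stands in for the maps \(f_t\); it and the example means are ours.

```python
import random

K, T = 3, 10                       # number of arms and horizon (illustrative)
mu = [0.2, 0.5, 0.8]               # illustrative arm means

# omega_2: reward table, reward_table[n][a] = reward of the (n+1)-th pull of arm a
reward_table = [[random.gauss(mu[a], 1.0) for a in range(K)] for _ in range(T)]
# omega_1: fresh uniform randomness used by the action-selection maps f_t
action_noise = [random.random() for _ in range(T)]

def choose_action(history, u):
    """Stand-in for f_t: here simply a uniformly random arm."""
    return int(u * K) % K

history, pulls = [], [0] * K       # pulls[a] tracks N_{t,a}
for t in range(T):
    a = choose_action(history, action_noise[t])   # A_t from the history and omega_{1,t}
    r = reward_table[pulls[a]][a]                 # R_t = omega_{2, N_{t, A_t}, A_t}
    pulls[a] += 1                                 # the pulled arm's count increases by one
    history.append((a, r))                        # H_t = (H_{t-1}, (A_t, R_t))
```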
We define the following functions on \(\Omega _{\mathcal{A}}\):
In the definition of \(F_{2, a, t}\) and \(F_{2, a}^m\), if \(N_{t+1,a}(\omega ) = 0\) (resp. \(m = 0\)), then the second component is a constant sequence equal to an arbitrary value.
\(F_{1, t}(\omega )\) contains all the information in \(\omega \) except for the action selection randomness after time \(t\).
\(F_{2, a, t}(\omega )\) contains all the information in \(\omega \) except the rewards for arm \(a\) indexed by \(N_{t+1,a}(\omega )\) or more. \(F_{2, a}^m(\omega )\) is similar, but removes the rewards for arm \(a\) indexed by \(m\) or more.
For an arm \(a \in \mathcal{A}\), we denote by \(\mu _a\) the mean of the rewards for that arm, that is \(\mu _a = \nu (a)[\mathrm{id}]\). We denote by \(\mu ^*\) the mean of the best arm, that is \(\mu ^* = \max _{a \in \mathcal{A}} \mu _a\).
Let \(I = [0,1]\) and let \(P_U\) be the uniform distribution on \(I\). We define the probability space \((\Omega _{\mathcal{A}}, P_{\mathcal{A}})\), where \(\Omega _{\mathcal{A}} = I^{\mathbb{N}} \times \mathbb{R}^{\mathbb{N} \times \mathcal{A}}\) and \(P_{\mathcal{A}} = P_U^{\otimes \mathbb{N}} \otimes \left(\bigotimes _{n \in \mathbb{N}, a \in \mathcal{A}} \nu (a)\right)\).
A stochastic bandit is simply a reward distribution for each arm: a Markov kernel \(\nu : \mathcal{A} \rightsquigarrow \mathbb {R}\), the conditional distribution of the reward given the arm pulled. It is a stationary environment in which the observation space is \(\mathcal{R} = \mathbb {R}\).
As in Definition 2.13, an algorithm \((\pi , P_0)\) and a bandit \(\nu \) together define a probability distribution \(\mathbb {P}_{\mathcal{T}}\) on the space \(\Omega _{\mathcal{T}} := (\mathcal{A} \times \mathbb {R})^{\mathbb {N}}\), the space of infinite sequences of arms and rewards. We augment that probability space with a stream of rewards from each arm, independent of the bandit interaction, to get the probability space \((\Omega , \mathbb {P})\), where \(\Omega = \Omega _{\mathcal{T}} \times \mathbb {R}^{\mathbb {N} \times \mathcal{A}}\) and \(\mathbb {P} = \mathbb {P}_{\mathcal{T}} \otimes (\bigotimes _{n \in \mathbb {N}, a \in \mathcal{A}} \nu (a))\).
Let \(\hat{\mu }_{t, a} = \frac{S_{t, a}}{N_{t, a}} = \frac{1}{N_{t,a}} \sum _{s=0}^{t-1} R_s \mathbb {I}\{ A_s = a\} \) if \(N_{t, a} > 0\), and \(\hat{\mu }_{t, a} = 0\) otherwise. This is the empirical mean of the rewards obtained by choosing action \(a\) before time \(t\).
An environment with which an algorithm interacts is described by the following data:
for all \(t \in \mathbb {N}\), a feedback \(\nu _t : (\mathcal{A} \times \mathcal{R})^{t+1} \times \mathcal{A} \rightsquigarrow \mathcal{R}\), a Markov kernel which gives the distribution of the observation at time \(t+1\) given the history of previous pulls and observations, and the action of the algorithm at time \(t+1\),
\(\nu '_0 : \mathcal{A} \rightsquigarrow \mathcal{R}\), a Markov kernel that gives the distribution of the first observation given the first action.
The Explore-Then-Commit (ETC) algorithm with parameter \(m \in \mathbb {N}\) is defined as follows:
for \(t < Km\), \(A_t = t \mod K\) (pull each arm \(m\) times),
compute \(\hat{A}_m^* = \arg \max _{a \in [K]} \hat{\mu }_a\), where \(\hat{\mu }_a = \frac{1}{m} \sum _{t=0}^{Km-1} \mathbb {I}(A_t = a) R_t\) is the empirical mean of the rewards for arm \(a\),
for \(t \ge Km\), \(A_t = \hat{A}_m^*\) (pull the empirical best arm).
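For illustration only, a minimal Python sketch of Explore-Then-Commit run against a Gaussian bandit; the function names and the example means are ours.

```python
import random

def etc(K, m, T, pull):
    """Explore-Then-Commit: pull each of the K arms m times, then commit to the
    arm with the largest empirical mean; `pull(a)` returns a reward for arm a."""
    sums = [0.0] * K
    for t in range(K * m):                             # exploration: A_t = t mod K
        a = t % K
        sums[a] += pull(a)
    best = max(range(K), key=lambda a: sums[a] / m)    # empirical best arm
    rewards = [pull(best) for _ in range(K * m, T)]    # commit phase: A_t = best
    return best, rewards

mu = [0.2, 0.5, 0.8]                                   # illustrative means
best, _ = etc(K=3, m=20, T=200, pull=lambda a: random.gauss(mu[a], 1.0))
```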
For two sequences of random variables \(A : \mathbb {N} \to \Omega \to \mathcal{A}\) and \(R : \mathbb {N} \to \Omega \to \mathcal{R}\) (actions and observations), the step of the interaction at time \(t\) is the random variable \(X_t : \Omega \to \mathcal{A} \times \mathcal{R}\) defined by \(X_t(\omega ) = (A_t(\omega ), R_t(\omega ))\). The history up to time \(t\) is the random variable \(H_t : \Omega \to (\mathcal{A} \times \mathcal{R})^{t+1}\) defined by \(H_t(\omega ) = (X_0(\omega ), \ldots , X_t(\omega ))\).
Let \(\mathfrak {A}\) be an algorithm as in Definition 2.1 and \(\mathfrak {E}\) be an environment as in Definition 2.2. A probability space \((\Omega , P)\) and two sequences of random variables \(A : \mathbb {N} \to \Omega \to \mathcal{A}\) and \(R : \mathbb {N} \to \Omega \to \mathcal{R}\) form an algorithm-environment interaction for \(\mathfrak {A}\) and \(\mathfrak {E}\) if the following conditions hold:
The law of \(A_0\) is \(P_0\).
\(P \left[ R_0 \mid A_0 \right] = \nu '_0\).
For all \(t \in \mathbb {N}\), \(P\left[A_{t+1} \mid A_0, R_0, \ldots , A_t, R_t \right] = \pi _t\).
For all \(t \in \mathbb {N}\), \(P\left[R_{t+1} \mid A_0, R_0, \ldots , A_t, R_t, A_{t+1}\right] = \nu _t\).
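The four conditions above describe the familiar sequential sampling loop. The following Python sketch is an informal illustration of that loop in which each kernel is represented by a sampling function; all names and the toy environment are ours.

```python
import random

def interact(sample_p0, sample_policy, sample_nu0, sample_feedback, T):
    """Run T steps of an algorithm-environment interaction.
    sample_p0() ~ P_0, sample_policy(h) ~ pi_t(h),
    sample_nu0(a) ~ nu'_0(a), sample_feedback(h, a) ~ nu_t(h, a)."""
    a = sample_p0()                          # the law of A_0 is P_0
    r = sample_nu0(a)                        # R_0 given A_0 is drawn from nu'_0
    history = [(a, r)]
    for _ in range(T - 1):
        a = sample_policy(history)           # A_{t+1} given the history is drawn from pi_t
        r = sample_feedback(history, a)      # R_{t+1} given the history and A_{t+1}, from nu_t
        history.append((a, r))
    return history

# Toy example with two actions and Gaussian observations
hist = interact(
    sample_p0=lambda: random.randrange(2),
    sample_policy=lambda h: random.randrange(2),
    sample_nu0=lambda a: random.gauss(0.5 * a, 1.0),
    sample_feedback=lambda h, a: random.gauss(0.5 * a, 1.0),
    T=10,
)
```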
For an algorithm-environment interaction \((A, R, P)\) as in Definition 2.5, we denote by \(\mathcal{F}_t\) the sigma-algebra generated by the history up to time \(t\): \(\mathcal{F}_t = \sigma (H_t)\). We denote by \(\mathcal{F}^A_t\) the sigma-algebra generated by the history up to time \(t-1\) and the action at time \(t\): \(\mathcal{F}^A_t = \sigma (H_{t-1}, A_t)\).
We write \(A_t\) and \(R_t\) for the projections of \(X_t\) on \(\mathcal{A}\) and \(\mathcal{R}\) respectively. \(A_t\) is the action taken at time \(t\) and \(R_t\) is the reward received at time \(t\). Formally, \(A_t(\omega ) = \omega _{t,1}\) and \(R_t(\omega ) = \omega _{t,2}\) for \(\omega = \prod _{t=0}^{+\infty }(\omega _{t,1}, \omega _{t,2}) \in \Omega _{\mathcal{T}} = \prod _{t=0}^{+\infty } \mathcal{A} \times \mathcal{R}\).
For \(t \in \mathbb {N}\), we denote by \(X_t \in \Omega _t\) the random variable describing the time step \(t\), and by \(H_t \in \prod _{s=0}^t \Omega _s\) the history up to time \(t\). Formally, these are measurable functions on \(\Omega _{\mathcal{T}}\), defined by \(X_t(\omega ) = \omega _t\) and \(H_t(\omega ) = (\omega _0, \ldots , \omega _t)\).
The regret \(R_T\) of a sequence of arms \(A_0, \ldots , A_{T-1}\) after \(T\) pulls is the difference between the cumulative reward of always playing the best arm and the cumulative reward of the sequence:
\[ R_T = T \mu ^* - \sum _{t=0}^{T-1} \mu _{A_t} \; . \]
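For instance, with two arms of means \(\mu _1 = 0.5\) and \(\mu _2 = 0.7\) (so \(\mu ^* = 0.7\)), the arm sequence \(A_0, \ldots , A_3 = (1, 2, 1, 2)\) has regret \(R_4 = 4 \cdot 0.7 - (0.5 + 0.7 + 0.5 + 0.7) = 0.4\).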
We define \(Y_{n, a} = R_{T_{n,a}} \mathbb {I}\{ T_{n, a} < \infty \} + Z_{n,a} \mathbb {I}\{ T_{n, a} = \infty \} \), the reward received when choosing action \(a\) for the \(n\)-th time if that time is finite, and equal to \(Z_{n,a}\) otherwise. In that expression, we see \(R_{T_{n,a}}\) and \(T_{n, a}\) as random variables on \(\Omega \) instead of \(\Omega _{\mathcal{T}}\).
An environment is stationary if there exists a Markov kernel \(\nu : \mathcal{A} \rightsquigarrow \mathcal{R}\) such that \(\nu '_0 = \nu \) and for all \(t \in \mathbb {N}\), for all \(h_t \in (\mathcal{A} \times \mathcal{R})^{t+1}\), for all \(a \in \mathcal{A}\), \(\nu _t(h_t, a) = \nu (a)\).
For an action \(a \in \mathcal{A}\) and a time \(n \in \mathbb {N}\), we denote by \(T_{n,a} \in \mathbb {N} \cup \{ +\infty \} \) the time at which action \(a\) was chosen for the \(n\)-th time, that is \(T_{n,a} = \min \{ s \in \mathbb {N} \mid N_{s+1,a} = n\} \). Note that \(T_{n, a}\) can be infinite if the action is not chosen \(n\) times.
For \(\mu \in \mathcal{P}(\Omega _0)\), the probability measure \(\xi _0 \circ \mu \) on \(\Omega _{\mathcal{T}} := \prod _{i=0}^{\infty } \Omega _i\) is called the trajectory measure and denoted by \(P_{\mathcal{T}}\). The \(\mathcal{T}\) subscript stands for “trajectory”.
The UCB algorithm with parameter \(c \in \mathbb {R}_+\) is defined as follows:
for \(t < K\), \(A_t = t \mod K\) (pull each arm once),
for \(t \ge K\), \(A_t = \arg \max _{a \in [K]} \left( \hat{\mu }_{t,a} + \sqrt{\frac{c \log (t + 1)}{N_{t,a}}} \right)\), where \(\hat{\mu }_{t,a} = \frac{1}{N_{t,a}} \sum _{s=0}^{t-1} \mathbb {I}(A_s = a) R_s\) is the empirical mean of the rewards for arm \(a\).
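Again for illustration only, a minimal Python sketch of UCB on a Gaussian bandit; the function names and example means are ours.

```python
import math
import random

def ucb(K, c, T, pull):
    """UCB with exploration parameter c: pull each arm once, then pull the arm
    maximising the empirical mean plus sqrt(c * log(t + 1) / N_{t,a})."""
    counts = [0] * K                  # N_{t,a}
    sums = [0.0] * K                  # S_{t,a}
    actions = []
    for t in range(T):
        if t < K:
            a = t                     # pull each arm once
        else:
            a = max(range(K), key=lambda i: sums[i] / counts[i]
                    + math.sqrt(c * math.log(t + 1) / counts[i]))
        r = pull(a)
        counts[a] += 1
        sums[a] += r
        actions.append(a)
    return actions

mu = [0.2, 0.5, 0.8]                  # illustrative means
actions = ucb(K=3, c=2.0, T=500, pull=lambda a: random.gauss(mu[a], 1.0))
```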
Let \(\omega , \omega ' \in \Omega _{\mathcal{A}}\) and \(t \in \mathbb {N}\). Suppose that
Then \(H_t(\omega ) = H_t(\omega ')\).
\(H_t\), \(N_{t,A_t}\), \(A_t\) and \(R_t\) are measurable for all \(t \in \mathbb {N}\).
For all \(t \in \mathbb {N}\), \(H_t\) is measurable with respect to the sigma-algebra generated by \(F_{1, t}\), and with respect to the sigma-algebra generated by \(F_{2, a, t}\) for any arm \(a \in \mathcal{A}\).
Let \(\omega , \omega ' \in \Omega _{\mathcal{A}}\), \(t, m \in \mathbb {N}\) and \(a \in \mathcal{A}\). Suppose that
Then \((N_{t+1, a}(\omega ) = m \wedge A_{t+1}(\omega ) = a) \iff (N_{t+1, a}(\omega ') = m \wedge A_{t+1}(\omega ') = a)\).
For a random variable \(X\) on a countable space with the discrete sigma-algebra, \(\mathcal{L}(Y \mid X) = (x \mapsto \mathcal{L}(Y \mid X = x))\), \((X_*P)\)-almost surely. Furthermore, that almost sure equality means that for all \(x\) such that \(P(X = x) > 0\), we have \(\mathcal{L}(Y \mid X = x) = \mathcal{L}(Y \mid X)(x)\).
If \(X \perp \! \! \! \! \perp Y \mid Z, W\) and \(X \perp \! \! \! \! \perp Z \mid W\), then \(X \perp \! \! \! \! \perp (Y, Z) \mid W\).
Suppose that \(\nu (a)\) is 1-sub-Gaussian for all arms \(a \in [K]\). For the UCB algorithm with parameter \(c > 0\), for any time \(n \in \mathbb {N}\) and any arm \(a \in [K]\) with positive gap, we have
Suppose that the following three conditions hold:
\(\mu ^* \le \hat{\mu }_{t, a^*} + \sqrt{\frac{c \log (t + 1)}{N_{t,a^*}}}\),
\(\hat{\mu }_{t,A_t} - \sqrt{\frac{c \log (t + 1)}{N_{t,A_t}}} \le \mu _{A_t}\),
\(\hat{\mu }_{t, a^*} + \sqrt{\frac{c \log (t + 1)}{N_{t,a^*}}} \le \hat{\mu }_{t,A_t} + \sqrt{\frac{c \log (t + 1)}{N_{t,A_t}}}\).
Then if \(N_{t,A_t} > 0\), chaining the three inequalities gives \(\mu ^* \le \mu _{A_t} + 2\sqrt{\frac{c \log (t + 1)}{N_{t,A_t}}}\), that is,
\[ \Delta _{A_t} \le 2\sqrt{\frac{c \log (t + 1)}{N_{t,A_t}}} \; . \]
And in turn, if \(\Delta _{A_t} > 0\) we get
\[ N_{t,A_t} \le \frac{4 c \log (t + 1)}{\Delta _{A_t}^2} \; . \]
Note that the third condition is always satisfied for UCB by Lemma 5.7, but this lemma, as stated, is independent of the UCB algorithm.
If \(X \perp \! \! \! \! \perp Y \mid Z\) and \(X \perp \! \! \! \! \perp Z\), then \(X \perp \! \! \! \! \perp (Y, Z)\).
The random variables \(A_t\) and \(R_t\) are \(\mathcal{F}_t\)-measurable. Said differently, the processes \((A_t)_{t \in \mathbb {N}}\) and \((R_t)_{t \in \mathbb {N}}\) are adapted to the filtration \((\mathcal{F}_t)_{t \in \mathbb {N}}\).
The random variables \(X_t\) and \(H_t\) are \(\mathcal{F}_t\)-measurable. Said differently, the processes \((X_t)_{t \in \mathbb {N}}\) and \((H_t)_{t \in \mathbb {N}}\) are adapted to the filtration \((\mathcal{F}_t)_{t \in \mathbb {N}}\).
Let \(X_1, \ldots , X_n\) be random variables such that \(X_i - P[X_i]\) is \(\sigma _{X,i}^2\)-sub-Gaussian for \(i \in [n]\). Let \(Y_1, \ldots , Y_m\) be random variables such that \(Y_i - P[Y_i]\) is \(\sigma _{Y,i}^2\)-sub-Gaussian for \(i \in [m]\). Suppose further that the vectors \(X\) and \(Y\) are independent and that \(\sum _{i = 1}^m P[Y_i] \le \sum _{i = 1}^n P[X_i]\). Then
Suppose that \(\nu (a)\) is 1-sub-Gaussian for all arms \(a \in [K]\). Then for the Explore-Then-Commit algorithm with parameter \(m\), for any arm \(a \in [K]\) with \(\Delta _a > 0\), we have \(\mathbb {P}(\hat{A}_m^* = a) \le \exp \left(- \frac{m \Delta _a^2}{4}\right)\).
In the array model, for \(t \in \mathbb {N}\), \(a \in \mathcal{A}\), and a measurable set \(B \subseteq \mathbb {N} \times \mathbb {R}\),
As a consequence, this also holds for any algorithm-environment sequence.
Let \(\nu (a)\) be a 1-sub-Gaussian distribution on \(\mathbb {R}\) for each arm \(a \in \mathcal{A}\). Let \(c \ge 0\) be a real number and \(k\) a positive natural number. Then
The same upper bound holds for the upper tail:
In the array model,
As a consequence, this also holds for any algorithm-environment sequence.
Suppose that \(\nu (a)\) is 1-sub-Gaussian for all arms \(a \in [K]\). Let \(c \ge 0\) be a real number. Then for any time \(n \in \mathbb {N}\) and any arm \(a \in [K]\), we have
And also,
Let \(\nu (a)\) be a 1-sub-Gaussian distribution on \(\mathbb {R}\) for each arm \(a \in \mathcal{A}\).
We note the following basic properties of \(N_{t,a}\):
\(N_{0,a} = 0\).
\(N_{t,a}\) is non-decreasing in \(t\).
\(N_{t + 1, A_t} = N_{t, A_t} + 1\) and for \(a \ne A_t\), \(N_{t + 1, a} = N_{t, a}\).
\(N_{t, a} \le t\).
If for all \(s \le t\), \(A_s(\omega ) = A_s(\omega ')\), then \(N_{t+1, a}(\omega ) = N_{t+1, a}(\omega ')\).
For \(C\) a natural number, for any time \(n \in \mathbb {N}\) and any arm \(a \in [K]\), we have
We note the following basic properties of \(T_{n,a}\):
\(T_{0,a} = 0\) for \(a \ne A_0\). \(T_{0,A_0} = \infty \).
\(T_{N_{t+1, a}, a} \le t\).
\(T_{N_{t + 1, A_t}, A_t} = t\).
If \(T_{n, a} \ne \infty \) and \(n > 0\), then \(A_{T_{n, a}} = a\).
If \(T_{n, a} \ne \infty \), then \(N_{T_{n, a} + 1, a} = n\).
If \(T_{n, a} \ne \infty \) and \(n > 0\), then \(N_{T_{n, a}, a} = n - 1\).
If for all \(s \le t\), \(A_s(\omega ) = A_s(\omega ')\), then \(T_{n, a}(\omega ) = t \iff T_{n, a}(\omega ') = t\).
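These identities, together with the properties of \(N_{t,a}\) above, are easy to sanity-check numerically. The following illustrative Python snippet verifies a few of them on a random action sequence; the helpers `N` and `T_pull` are ours.

```python
import random

K = 3
actions = [random.randrange(K) for _ in range(50)]   # a random action sequence A_0, A_1, ...

def N(t, a):
    """Number of times arm a is pulled before time t."""
    return sum(1 for s in range(t) if actions[s] == a)

def T_pull(n, a):
    """Time of the n-th pull of arm a: min{s : N(s + 1, a) = n}, or None if no such s."""
    return next((s for s in range(len(actions)) if N(s + 1, a) == n), None)

for t in range(len(actions) - 1):
    a_t = actions[t]
    assert N(t, a_t) <= t                                        # N_{t,a} <= t
    assert N(t + 1, a_t) == N(t, a_t) + 1                        # pulled arm: count + 1
    assert all(N(t + 1, a) == N(t, a) for a in range(K) if a != a_t)
    assert T_pull(N(t + 1, a_t), a_t) == t                       # T_{N_{t+1,A_t}, A_t} = t
```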
Suppose that \(\nu (a)\) is 1-sub-Gaussian for all arms \(a \in [K]\). For the UCB algorithm with parameter \(c > 0\), for any time \(n \in \mathbb {N}\), we have
Let \(X_1, \ldots , X_n\) be independent random variables such that \(X_i\) is \(\sigma _i^2\)-sub-Gaussian for \(i \in [n]\). Then for any \(t \ge 0\),
Let \((\Omega _t)_{t \in \mathbb {N}}\) be a family of measurable spaces. Let \((\kappa _t)_{t \in \mathbb {N}}\) be a family of Markov kernels such that for any \(t\), \(\kappa _t\) is a kernel from \(\prod _{i=0}^t \Omega _{i}\) to \(\Omega _{t+1}\). Then there exists a unique Markov kernel \(\xi : \Omega _0 \rightsquigarrow \prod _{i = 1}^{\infty } \Omega _{i}\) such that for any \(n \ge 1\), \(\pi _{[1,n]*} \xi = \kappa _0 \otimes \ldots \otimes \kappa _{n-1}\). Here \(\pi _{[1,n]} : \prod _{i=1}^{\infty } \Omega _i \to \prod _{i=1}^n \Omega _i\) is the projection on the first \(n\) coordinates.
In the probability space \((\Omega _{\mathcal{T}}, P_{\mathcal{T}})\) constructed from an algorithm \(\mathfrak {A}\) and an environment \(\mathfrak {E}\) as above, the sequences of random variables \(A : \mathbb {N} \to \Omega _{\mathcal{T}} \to \mathcal{A}\) and \(R : \mathbb {N} \to \Omega _{\mathcal{T}} \to \mathcal{R}\) form an algorithm-environment interaction for \(\mathfrak {A}\) and \(\mathfrak {E}\).
If \((A, R, P)\) and \((A', R', P')\) are two algorithm-environment interactions for the same algorithm \(\mathfrak {A}\) and environment \(\mathfrak {E}\), then the joint distributions of the sequences of actions and observations are equal: the law of \((A_i, R_i)_{i \in \mathbb {N}}\) under \(P\) is equal to the law of \((A'_i, R'_i)_{i \in \mathbb {N}}\) under \(P'\).
Suppose that \(\nu (a)\) is 1-sub-Gaussian for all arms \(a \in [K]\). Then for the Explore-Then-Commit algorithm with parameter \(m\), the expected regret after \(T\) pulls with \(T \ge Km\) is bounded by
\[ m \sum _{a \in [K]} \Delta _a + (T - Km) \sum _{a \in [K]} \Delta _a \exp \left(- \frac{m \Delta _a^2}{4}\right) \; . \]