A sequential, stochastic algorithm with actions in a measurable space \(\mathcal{A}\) and observations in a measurable space \(\mathcal{R}\) is described by the following data:
for all \(t \in \mathbb {N}\), a policy \(\pi _t : (\mathcal{A} \times \mathcal{R})^{t+1} \rightsquigarrow \mathcal{A}\), a Markov kernel which gives the distribution of the action of the algorithm at time \(t+1\) given the history of previous pulls and observations,
\(P_0 \in \mathcal{P}(\mathcal{A})\), a probability measure that gives the distribution of the first action.
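As a concrete illustration (not taken from the source), one of the simplest examples is the uniform random algorithm on a finite action set \(\mathcal{A} = [K]\), which ignores the history:
\[ P_0 = \mathrm{Unif}([K]), \qquad \pi_t(h) = \mathrm{Unif}([K]) \quad \text{for all } t \in \mathbb{N} \text{ and } h \in (\mathcal{A} \times \mathcal{R})^{t+1}. \]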
For \(t \in \mathbb {N}\), we denote by \(A_t\) the arm pulled at time \(t\) and by \(X_t\) the reward received at time \(t\). Formally, these are measurable functions on \(\Omega = (\mathcal{A} \times \mathbb {R})^{\mathbb {N}} \times \mathbb {R}^{\mathbb {N} \times \mathcal{A}}\), defined by \(A_t(\omega ) = \omega _{1,t,1}\) and \(X_t(\omega ) = \omega _{1,t,2}\). We denote by \(H_t \in (\mathcal{A} \times \mathbb {R})^{t+1}\) the history of pulls and rewards up to and including time \(t\), that is \(H_t = ((A_0, X_0), \ldots , (A_t, X_t))\). Formally, \(H_t(\omega ) = (\omega _{1,0}, \ldots , \omega _{1,t})\).
A stochastic bandit is simply a reward distribution for each arm: a Markov kernel \(\nu : \mathcal{A} \rightsquigarrow \mathbb{R}\), the conditional distribution of the rewards given the arm pulled.
For an arm \(a \in \mathcal{A}\), we denote by \(\mu_a\) the mean of the rewards for that arm, that is \(\mu_a = \nu(a)[X]\). We denote by \(\mu^*\) the mean of the best arm, that is \(\mu^* = \max_{a \in \mathcal{A}} \mu_a\).
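For example (an illustration, not from the source), a Bernoulli bandit with parameters \(p_a \in [0,1]\) is given by
\[ \nu(a) = \mathrm{Ber}(p_a), \qquad \mu_a = p_a, \qquad \mu^* = \max_{a \in \mathcal{A}} p_a. \]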
By an application of the Ionescu-Tulcea theorem, an algorithm \((\pi, P_0)\) and a bandit \(\nu\) together define a probability distribution \(\mathbb{P}_B\) on the space \(\Omega_B := (\mathcal{A} \times \mathbb{R})^{\mathbb{N}}\), the space of infinite sequences of arms and rewards. We augment that probability space with a stream of rewards from each arm, independent of the bandit interaction, to get the probability space \((\Omega, \mathbb{P})\), where \(\Omega = \Omega_B \times \mathbb{R}^{\mathbb{N} \times \mathcal{A}}\) and \(\mathbb{P} = \mathbb{P}_B \otimes \left(\bigotimes_{n \in \mathbb{N}, a \in \mathcal{A}} \nu(a)\right)\).
TODO: explain how the probability distribution \(\mathbb {P}_B\) is constructed.
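One possible way to make this precise, sketched here under the conventions above (the exact construction used in the source is an assumption): at each step, the next action is drawn from the policy and the next reward from the bandit, giving for each \(t\) a Markov kernel from the history to the next action-reward pair,
\[ \kappa_t : (\mathcal{A} \times \mathbb{R})^{t+1} \rightsquigarrow \mathcal{A} \times \mathbb{R}, \qquad \kappa_t(h) = \text{law of } (a, x) \text{ where } a \sim \pi_t(h) \text{ and } x \sim \nu(a), \]
together with a first-step distribution \(\mu_0\) on \(\mathcal{A} \times \mathbb{R}\), the law of \((a, x)\) with \(a \sim P_0\) and \(x \sim \nu(a)\). The Ionescu-Tulcea theorem (stated below) applied to \((\kappa_t)_{t \in \mathbb{N}}\) gives a kernel \(\xi : \mathcal{A} \times \mathbb{R} \rightsquigarrow \prod_{i \ge 1} (\mathcal{A} \times \mathbb{R})\), and one can take \(\mathbb{P}_B = \mu_0 \otimes \xi\) under the identification \((\mathcal{A} \times \mathbb{R}) \times \prod_{i \ge 1} (\mathcal{A} \times \mathbb{R}) \simeq (\mathcal{A} \times \mathbb{R})^{\mathbb{N}}\).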
An environment with which an algorithm interacts is described by the following data:
for all \(t \in \mathbb {N}\), a feedback \(\nu _t : (\mathcal{A} \times \mathcal{R})^{t+1} \times \mathcal{A} \rightsquigarrow \mathcal{R}\), a Markov kernel which gives the distribution of the observation at time \(t+1\) given the history of previous pulls and observations, and the action of the algorithm at time \(t+1\),
\(\nu'_0 : \mathcal{A} \rightsquigarrow \mathcal{R}\), a Markov kernel that gives the distribution of the first observation given the first action.
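In particular (a natural reading of the definitions, not a quote of the source), a stochastic bandit \(\nu\) corresponds to the memoryless environment with \(\mathcal{R} = \mathbb{R}\), in which the observation depends only on the action just played:
\[ \nu'_0(a) = \nu(a), \qquad \nu_t(h, a) = \nu(a) \quad \text{for all } t \in \mathbb{N}, \ h \in (\mathcal{A} \times \mathcal{R})^{t+1}, \ a \in \mathcal{A}. \]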
The Explore-Then-Commit (ETC) algorithm on \(K\) arms (with \(\mathcal{A} = [K]\)) and with parameter \(m \in \mathbb{N}\) is defined as follows:
for \(t {\lt} Km\), \(A_t = t \mod K\) (pull each arm \(m\) times),
compute \(\hat{A}_m^* = \arg \max _{a \in [K]} \hat{\mu }_a\), where \(\hat{\mu }_a = \frac{1}{m} \sum _{t=0}^{Km-1} \mathbb {I}(A_t = a) X_t\) is the empirical mean of the rewards for arm \(a\),
for \(t \ge Km\), \(A_t = \hat{A}_m^*\) (pull the empirical best arm).
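For instance (a worked illustration, not from the source), with \(K = 2\) arms and \(m = 3\), the exploration phase lasts \(Km = 6\) rounds:
\[ (A_0, \ldots, A_5) = (0, 1, 0, 1, 0, 1), \qquad A_t = \hat{A}_3^* \text{ for all } t \ge 6, \]
so each arm is pulled three times before the algorithm commits to the arm with the larger empirical mean.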
For \(t \in \mathbb{N}\), we denote by \(X_t \in \Omega_t\) the random variable describing time step \(t\), and by \(H_t \in \prod_{s=0}^t \Omega_s\) the history up to time \(t\). Formally, these are measurable functions on \(\Omega_{\mathcal{T}}\), defined by \(X_t(\omega) = \omega_t\) and \(H_t(\omega) = (\omega_0, \ldots, \omega_t)\).
The regret \(R_T\) of a sequence of arms \(A_0, \ldots, A_{T-1}\) after \(T\) pulls is the difference between the cumulative reward of always playing the best arm and the cumulative reward of the sequence.
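In terms of the arm means \(\mu_a\) defined above, one standard way to write this, assuming the usual pseudo-regret convention (an assumption on the intended convention here), is
\[ R_T = T \mu^* - \sum_{t=0}^{T-1} \mu_{A_t} = \sum_{t=0}^{T-1} \left( \mu^* - \mu_{A_t} \right). \]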
For an arm \(a \in \mathcal{A}\) and a time \(n \in \mathbb{N}\), we denote by \(T_{n,a}\) the time at which arm \(a\) was pulled for the \(n\)-th time, that is \(T_{n,a} = \min\{ s \in \mathbb{N} \mid N_{s+1,a} = n \}\), where \(N_{t,a} = \sum_{s=0}^{t-1} \mathbb{I}(A_s = a)\) denotes the number of times arm \(a\) has been pulled before time \(t\). Note that \(T_{n,a}\) can be infinite if the arm is not pulled \(n\) times.
For \(a \in \mathcal{A}\) and \(n \in \mathbb{N}\), let \(Z_{n,a} \sim \nu(a)\), independent of the bandit interaction and of the other \(Z_{m,b}\). In our probability space \(\Omega\), we can take for \(Z_{n,a}\) the function \(\omega \mapsto \omega_{2,n,a}\). We define \(Y_{n,a} = X_{T_{n,a}} \mathbb{I}\{ T_{n,a} {\lt} \infty \} + Z_{n,a} \mathbb{I}\{ T_{n,a} = \infty \}\), the reward received when pulling arm \(a\) for the \(n\)-th time if that time is finite, and equal to \(Z_{n,a}\) otherwise.
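As a small illustration (not from the source), if arm \(a\) is pulled exactly at times \(0\), \(2\) and \(5\), then
\[ T_{1,a} = 0, \quad T_{2,a} = 2, \quad T_{3,a} = 5, \quad T_{n,a} = \infty \ \text{for } n \ge 4, \]
so that \(Y_{1,a} = X_0\), \(Y_{2,a} = X_2\), \(Y_{3,a} = X_5\), and \(Y_{n,a} = Z_{n,a}\) for \(n \ge 4\).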
Seen as processes on \(\Omega_B\), \((H_t)_{t \in \mathbb{N}}\), \((A_t)_{t \in \mathbb{N}}\), and \((X_t)_{t \in \mathbb{N}}\) are adapted to the filtration \(\mathcal{F}_B\).
For \(a \in \mathcal{A}\), let \(Y^{(a)} = (Y_{n,a})_{n \in \mathbb {N}} \in \mathbb {R}^{\mathbb {N}}\) be the sequence of rewards obtained from pulling arm \(a\). Then the sequences \((Y^{(a)})_{a \in \mathcal{A}}\) are independent.
\(T_{n,a}\) is a stopping time for the filtration generated by the history of pulls and rewards.
Let \(a \in \mathcal{A}\). Seen as a process defined on \(\Omega _B\), \((N_{t,a})_{t \in \mathbb {N}}\) is predictable with respect to the filtration \(\mathcal{F}_B\) (that is, \((N_{t,a})_{t \in \mathbb {N}}\) is adapted to \((\mathcal{F}_{B, t-1})_{t \in \mathbb {N}}\)).
Suppose that \(\nu(a)\) is 1-sub-Gaussian for all arms \(a \in [K]\). Then for the Explore-Then-Commit algorithm with parameter \(m\), for any arm \(a \in [K]\) with suboptimality gap \(\Delta_a = \mu^* - \mu_a {\gt} 0\), we have \(\mathbb{P}(\hat{A}_m^* = a) \le \exp\left(- \frac{m \Delta_a^2}{4}\right)\).
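A sketch of the underlying calculation (assuming the standard sub-Gaussian conventions; this is not necessarily the argument as formalized): each empirical mean \(\hat{\mu}_a\) averages \(m\) independent 1-sub-Gaussian rewards, so \(\hat{\mu}_a - \mu_a\) is \(1/m\)-sub-Gaussian, and for a fixed optimal arm \(a^*\) the difference \((\hat{\mu}_a - \mu_a) - (\hat{\mu}_{a^*} - \mu_{a^*})\) is \(2/m\)-sub-Gaussian by independence across arms. Hoeffding's inequality below then gives
\[ \mathbb{P}(\hat{A}_m^* = a) \le \mathbb{P}\big( \hat{\mu}_a \ge \hat{\mu}_{a^*} \big) = \mathbb{P}\big( (\hat{\mu}_a - \mu_a) - (\hat{\mu}_{a^*} - \mu_{a^*}) \ge \Delta_a \big) \le \exp\left( - \frac{\Delta_a^2}{2 \cdot (2/m)} \right) = \exp\left( - \frac{m \Delta_a^2}{4} \right). \]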
Let \(X_1, \ldots, X_n\) be independent random variables such that \(X_i\) is \(\sigma_i^2\)-sub-Gaussian for each \(i \in [n]\), and let \(t \ge 0\).
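Assuming the convention that each \(X_i\) is centered (otherwise replace \(X_i\) by \(X_i - \mathbb{E}[X_i]\)), the usual form of the conclusion is
\[ \mathbb{P}\left( \sum_{i=1}^n X_i \ge t \right) \le \exp\left( - \frac{t^2}{2 \sum_{i=1}^n \sigma_i^2} \right). \]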
Let \((\Omega _t)_{t \in \mathbb {N}}\) be a family of measurable spaces. Let \((\kappa _t)_{t \in \mathbb {N}}\) be a family of Markov kernels such that for any \(t\), \(\kappa _t\) is a kernel from \(\prod _{i=0}^t \Omega _{i}\) to \(\Omega _{t+1}\). Then there exists a unique Markov kernel \(\xi : \Omega _0 \rightsquigarrow \prod _{i = 1}^{\infty } \Omega _{i}\) such that for any \(n \ge 1\), \(\pi _{[1,n]*} \xi = \kappa _0 \otimes \ldots \otimes \kappa _{n-1}\). Here \(\pi _{[1,n]} : \prod _{i=1}^{\infty } \Omega _i \to \prod _{i=1}^n \Omega _i\) is the projection on the first \(n\) coordinates.
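For instance (an illustration, not from the source), if every kernel is constant, say \(\kappa_t(\cdot) = \mu_{t+1}\) for probability measures \(\mu_{t+1}\) on \(\Omega_{t+1}\), then the characterizing property forces \(\xi\) to be the constant kernel equal to the infinite product measure:
\[ \xi(\omega_0) = \bigotimes_{i=1}^{\infty} \mu_i \quad \text{for all } \omega_0 \in \Omega_0. \]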
Suppose that \(\nu(a)\) is 1-sub-Gaussian for all arms \(a \in [K]\). Then for the Explore-Then-Commit algorithm with parameter \(m\) and any horizon \(T \ge Km\), the expected regret after \(T\) pulls admits an explicit bound in terms of \(m\) and the gaps \((\Delta_a)_{a \in [K]}\).
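A standard form of this bound, obtained by combining the regret decomposition \(\mathbb{E}[R_T] = \sum_{a \in [K]} \Delta_a\, \mathbb{E}[N_{T,a}]\) with the misidentification probability bound above (the precise constants in the source statement are an assumption here), is
\[ \mathbb{E}[R_T] \le m \sum_{a \in [K]} \Delta_a + (T - Km) \sum_{a \in [K]} \Delta_a \exp\left( - \frac{m \Delta_a^2}{4} \right). \]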