- Boxes: definitions
- Ellipses: theorems and lemmas
- Blue border: the statement of this result is ready to be formalized; all prerequisites are done
- Orange border: the statement of this result is not ready to be formalized; the blueprint needs more work
- Blue background: the proof of this result is ready to be formalized; all prerequisites are done
- Green border: the statement of this result is formalized
- Green background: the proof of this result is formalized
- Dark green background: the proof of this result and all its ancestors are formalized
- Dark green border: this is in Mathlib
For \(a \in \mathcal{A}\) and \(n \in \mathbb {N}\), let \(Z_{n,a} \sim \nu (a)\), independent of everything else. We define \(Y_{n, a} = X_{T_{n,a}} \mathbb {I}\{ T_{n, a} {\lt} \infty \} + Z_{n,a} \mathbb {I}\{ T_{n, a} = \infty \} \), the reward received when pulling arm \(a\) for the \(n\)-th time if that time is finite, and equal to \(Z_{n,a}\) otherwise.
For \(t \in \mathbb {N}\), we denote by \(A_t\) the arm pulled at time \(t\) and by \(X_t\) the reward received at time \(t\). Formally, these are measurable functions on \(\Omega = (\mathcal{A} \times \mathbb {R})^{\mathbb {N}}\), defined by \(A_t(\omega ) = \omega _{t,1}\) and \(X_t(\omega ) = \omega _{t,2}\). We denote by \(H_t \in (\mathcal{A} \times \mathbb {R})^{t+1}\) the history of pulls and rewards up to and including time \(t\), that is \(H_t = ((A_0, X_0), \ldots , (A_t, X_t))\). Formally, \(H_t(\omega ) = (\omega _0, \ldots , \omega _t)\).
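As a purely illustrative aid (not part of the blueprint), the following Python sketch represents a finite prefix of an outcome \(\omega\) as a list of (arm, reward) pairs and implements the coordinate maps \(A_t\), \(X_t\) and the history \(H_t\); all names are hypothetical.

```python
# Hypothetical illustration: a finite prefix of an outcome omega, stored as a
# list of (arm, reward) pairs omega_0, omega_1, ...
omega = [("a", 0.7), ("b", 0.1), ("a", 0.4)]

def A(t, omega):
    """Arm pulled at time t: the first coordinate of omega_t."""
    return omega[t][0]

def X(t, omega):
    """Reward received at time t: the second coordinate of omega_t."""
    return omega[t][1]

def H(t, omega):
    """History up to and including time t: the prefix (omega_0, ..., omega_t)."""
    return omega[: t + 1]

print(A(1, omega), X(1, omega), H(1, omega))  # b 0.1 [('a', 0.7), ('b', 0.1)]
```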
For an arm \(a \in \mathcal{A}\), we denote by \(\mu _a\) the mean of the rewards for that arm, that is \(\mu _a = \nu (a)[X]\). We denote by \(\mu ^*\) the mean of the best arm, that is \(\mu ^* = \max _{a \in \mathcal{A}} \mu _a\).
The interaction of an algorithm with a stochastic bandit with arms in \(\mathcal{A}\) (a measurable space) and real rewards is described by the following data:
- \(\nu : \mathcal{A} \rightsquigarrow \mathbb {R}\), a Markov kernel giving the conditional distribution of the rewards given the arm pulled,
- for all \(t \in \mathbb {N}\), a policy \(\pi _t : (\mathcal{A} \times \mathbb {R})^{t+1} \rightsquigarrow \mathcal{A}\), a Markov kernel giving the distribution of the arm to pull at time \(t+1\) given the history of previous pulls and rewards,
- \(P_0 \in \mathcal{P}(\mathcal{A})\), a probability measure giving the distribution of the first arm to pull.
By an application of the Ionescu-Tulcea theorem, a bandit \(\mathcal{B} = (\nu , \pi , P_0)\) defines a probability distribution on the space \(\Omega := (\mathcal{A} \times \mathbb {R})^{\mathbb {N}}\), the space of infinite sequences of arms and rewards. We denote that distribution by \(\mathbb {P}\). The construction alternates the kernels of the bandit: the first arm \(A_0\) is drawn from \(P_0\) and its reward \(X_0\) from \(\nu (A_0)\); then, for each \(t\), the arm \(A_{t+1}\) is drawn from \(\pi _t(H_t)\) and its reward \(X_{t+1}\) from \(\nu (A_{t+1})\). The Ionescu-Tulcea theorem extends this sequence of kernel compositions to a unique probability measure on \(\Omega\).
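The following Python sketch (a minimal, hypothetical simulation, not the formal construction) samples a finite prefix of a trajectory by alternating the kernels in exactly this order: the first arm from \(P_0\), each reward from \(\nu\), and each subsequent arm from the policy applied to the current history.

```python
import random

def sample_trajectory(P0, policy, nu, horizon, seed=0):
    """Sample [(A_0, X_0), ..., (A_{horizon-1}, X_{horizon-1})].

    P0(rng) samples the first arm, policy(t, history, rng) samples the arm
    prescribed by pi_t given the history H_t, and nu(arm, rng) samples a reward.
    """
    rng = random.Random(seed)
    history = []
    for t in range(horizon):
        # The arm at time 0 comes from P0; the arm at time t+1 comes from pi_t(H_t).
        arm = P0(rng) if t == 0 else policy(t - 1, history, rng)
        reward = nu(arm, rng)
        history.append((arm, reward))
    return history

# Example: two arms with Gaussian rewards and a uniformly random policy.
means = {"a": 0.5, "b": 0.3}
P0 = lambda rng: rng.choice(sorted(means))
policy = lambda t, hist, rng: rng.choice(sorted(means))
nu = lambda arm, rng: rng.gauss(means[arm], 1.0)
print(sample_trajectory(P0, policy, nu, horizon=5))
```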
The Explore-Then-Commit (ETC) algorithm with parameter \(m \in \mathbb {N}\) is defined as follows:
- for \(t {\lt} Km\), \(A_t = t \mod K\) (pull each arm \(m\) times),
- compute \(\hat{A}_m^* = \arg \max _{a \in [K]} \hat{\mu }_a\), where \(\hat{\mu }_a = \frac{1}{m} \sum _{t=0}^{Km-1} \mathbb {I}(A_t = a) X_t\) is the empirical mean of the rewards for arm \(a\),
- for \(t \ge Km\), \(A_t = \hat{A}_m^*\) (pull the empirical best arm); a short illustrative sketch of this rule is given after the list.
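As a non-authoritative illustration of this rule, here is a Python sketch of the arm selected by ETC at time \(t\) given the history of (arm, reward) pairs; arms are indexed by \(0, \ldots , K-1\), and the tie-breaking of the arg max is left unspecified in the definition above.

```python
def etc_arm(t, history, m, K):
    """Arm pulled by Explore-Then-Commit at time t.

    history is the list of (arm, reward) pairs observed so far, with arms in
    {0, ..., K-1}.  For t < K*m the arms are pulled in round-robin order; for
    t >= K*m the empirical best arm of the exploration phase is pulled.
    """
    if t < K * m:
        return t % K  # exploration: pull each arm m times
    # commitment: empirical means over the first K*m rounds
    sums = [0.0] * K
    for arm, reward in history[: K * m]:
        sums[arm] += reward
    emp_means = [s / m for s in sums]
    return max(range(K), key=lambda a: emp_means[a])
```

For instance, plugging `policy = lambda t, hist, rng: etc_arm(t + 1, hist, m, K)` into the trajectory sketch above, with \(P_0\) concentrated on arm 0, reproduces the ETC interaction.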
The regret \(R_T\) of a sequence of arms \(A_0, \ldots , A_{T-1}\) after \(T\) pulls is the difference between the cumulative expected reward of always playing the best arm and the cumulative expected reward of the sequence, that is \(R_T = T \mu ^* - \sum _{t=0}^{T-1} \mu _{A_t}\).
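For a concrete check of this definition, here is a hypothetical Python helper, assuming finitely many arms with known means and the formula above.

```python
def regret(mu, arms):
    """Regret R_T of the arm sequence `arms` for arm means `mu` (a dict)."""
    mu_star = max(mu.values())
    return len(arms) * mu_star - sum(mu[a] for a in arms)

# With means 0.5 and 0.3, pulling the suboptimal arm twice in four rounds
# costs 2 * (0.5 - 0.3) = 0.4.
print(regret({"a": 0.5, "b": 0.3}, ["a", "b", "a", "b"]))  # 0.4
```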
For an arm \(a \in \mathcal{A}\) and \(n \in \mathbb {N}\), we denote by \(T_{n,a}\) the time at which arm \(a\) is pulled for the \(n\)-th time, that is \(T_{n,a} = \min \{ s \in \mathbb {N} \mid N_{s+1,a} = n\} \), where \(N_{t,a} = \sum _{s=0}^{t-1} \mathbb {I}(A_s = a)\) is the number of times arm \(a\) was pulled before time \(t\). Note that \(T_{n, a}\) can be infinite if the arm is not pulled \(n\) times.
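The pull counts \(N_{t,a}\) and pull times \(T_{n,a}\) (which also enter the definition of \(Y_{n,a}\) above) are easy to compute on a finite prefix of the arm sequence; the following hypothetical Python helpers mirror the definitions, with \(\infty\) standing in for the case where arm \(a\) is not pulled \(n\) times within the prefix.

```python
import math

def pull_count(arms, a, t):
    """N_{t,a}: number of times arm a appears among A_0, ..., A_{t-1}."""
    return sum(1 for s in range(t) if arms[s] == a)

def pull_time(arms, a, n):
    """T_{n,a}: the smallest s with N_{s+1,a} = n, or infinity if arm a is
    not pulled n times within the given prefix."""
    for s in range(len(arms)):
        if pull_count(arms, a, s + 1) == n:
            return s
    return math.inf

arms = ["a", "b", "a"]
print(pull_count(arms, "a", 3))  # 2
print(pull_time(arms, "a", 2))   # 2: the second pull of arm "a" is at time 2
```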
\(Y_{N_{t+1,A_t}, A_t} = X_t\) for all \(t \in \mathbb {N}\).
The law of the arm \(A_0\) in the bandit probability space \((\Omega , \mathbb {P})\) is \(P_0\).
For the Explore-Then-Commit algorithm with parameter \(m\), for any arm \(a \in [K]\) and any time \(t \ge Km\), we have
\(T_{N_{t, a}, a} \le t - 1 {\lt} \infty \) for all \(t \in \mathbb {N}\) and \(a \in \mathcal{A}\).
Suppose that \(\nu (a)\) is 1-sub-Gaussian for all arms \(a \in [K]\). Then for the Explore-Then-Commit algorithm with parameter \(m\), the expected regret after \(T\) pulls with \(T \ge Km\) is bounded by
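For orientation, the standard textbook form of this guarantee for 1-sub-Gaussian rewards (as in the usual analysis of ETC, e.g. Lattimore and Szepesvári, Bandit Algorithms), written with the suboptimality gaps \(\Delta _a = \mu ^* - \mu _a\), is the following; the exact constants in the blueprint's statement may differ.
\[
  \mathbb {E}[R_T] \le m \sum _{a=1}^{K} \Delta _a + (T - mK) \sum _{a=1}^{K} \Delta _a \exp \left( - \frac{m \Delta _a^2}{4} \right).
\]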