4 Bandit algorithms
4.1 Explore-Then-Commit
Note: times start at 0 to be consistent with Lean.
Note: we will describe the algorithm by writing \(A_t = ...\), but our formal bandit model needs a policy \(\pi _t\) that gives the distribution of the arm to pull. What we mean is that \(\pi _t\) is the Dirac distribution at that arm.
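For instance, the policy associated with a deterministic arm choice could be sketched in Lean as follows (a minimal sketch; the name `diracPolicy` and the plain-function encoding of the arm choice are illustrative assumptions, not the definitions of the formal development):

```lean
import Mathlib

open MeasureTheory

/-- Illustrative sketch: the policy associated with a deterministic arm
choice `A : ℕ → Fin K` is, at each time `t`, the Dirac distribution at the
chosen arm `A t`. -/
noncomputable def diracPolicy {K : ℕ} (A : ℕ → Fin K) (t : ℕ) :
    Measure (Fin K) :=
  Measure.dirac (A t)
```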
The Explore-Then-Commit (ETC) algorithm with parameter \(m \in \mathbb {N}\) is defined as follows:
for \(t {\lt} Km\), \(A_t = t \mod K\) (pull each arm \(m\) times),
compute \(\hat{A}_m^* = \arg \max _{a \in [K]} \hat{\mu }_a\), where \(\hat{\mu }_a = \frac{1}{m} \sum _{t=0}^{Km-1} \mathbb {I}(A_t = a) X_t\) is the empirical mean of the rewards for arm \(a\),
for \(t \ge Km\), \(A_t = \hat{A}_m^*\) (pull the empirical best arm).
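To make the description above concrete, here is a minimal Lean sketch of the two ingredients (the names `etcArm` and `empMean`, and the way the history \((A_t, X_t)\) is passed as plain functions, are illustrative assumptions, not the definitions used in the formal development):

```lean
import Mathlib

/-- Sketch of the ETC arm-selection rule: during the first `K * m` rounds the
arms are pulled in round-robin order; afterwards the empirical best arm
`empBest`, computed at the end of the exploration phase, is pulled forever. -/
def etcArm (K m : ℕ) (hK : 0 < K) (empBest : Fin K) (t : ℕ) : Fin K :=
  if t < K * m then ⟨t % K, Nat.mod_lt t hK⟩ else empBest

/-- Sketch of the empirical mean of arm `a` after the exploration phase, where
`A t` is the arm pulled and `X t` the reward received at time `t`. -/
noncomputable def empMean (K m : ℕ) (A : ℕ → Fin K) (X : ℕ → ℝ) (a : Fin K) :
    ℝ :=
  (∑ t ∈ Finset.range (K * m), if A t = a then X t else 0) / m
```

In `empMean`, the indicator sum matches the formula for \(\hat{\mu }_a\) above: each arm is pulled exactly \(m\) times during exploration, so dividing by \(m\) yields the empirical mean of arm \(a\)'s rewards.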
For the Explore-Then-Commit algorithm with parameter \(m\), for any arm \(a \in [K]\) and any time \(t \ge Km\), we have \(\mathbb {P}(A_t = a) = \mathbb {P}(\hat{A}_m^* = a)\).
Suppose that \(\nu (a)\) is 1-sub-Gaussian for all arms \(a \in [K]\). Then for the Explore-Then-Commit algorithm with parameter \(m\), the expected regret after \(T\) pulls with \(T \ge Km\) is bounded by
\[\mathbb {E}[R_T] \le \sum _{a \in [K]} \Delta _a \left( m + (T - Km) \exp \left(- \frac{m \Delta _a^2}{4}\right) \right) .\]
By Lemma 11, we have \(\mathbb {E}[R_T] = \sum _{a=1}^K \mathbb {E}\left[N_{T,a}\right] \Delta _a\). It thus suffices to bound \(\mathbb {E}[N_{T,a}]\) for each arm \(a\) with \(\Delta _a {\gt} 0\), and more precisely to prove that
\[\mathbb {E}[N_{T,a}] \le m + (T - Km) \exp \left(- \frac{m \Delta _a^2}{4}\right) .\]
By definition of the Explore-Then-Commit algorithm (or by Lemma 19),
\[\mathbb {E}[N_{T,a}] = m + (T - Km) \, \mathbb {P}(\hat{A}_m^* = a) .\]
It thus suffices to prove the inequality \(\mathbb {P}(\hat{A}_m^* = a) \le \exp \left(- \frac{m \Delta _a^2}{4}\right)\) for \(\Delta _a {\gt} 0\).
TODO: here we need to state in a rigorous way that the empirical means are means of \(m\) i.i.d. samples \(Y_{a,i}\) and \(Y_{a^*,i}\) from the distributions \(\nu (a)\) and \(\nu (a^*)\).
Since \(\hat{A}_m^* = a\) implies \(\hat{\mu }_a \ge \hat{\mu }_{a^*}\), we have \(\mathbb {P}(\hat{A}_m^* = a) \le \mathbb {P}\left(\frac{1}{m} \sum _{i=1}^m (Y_{a,i} - Y_{a^*,i} + \Delta _a) \ge \Delta _a\right)\). The random variable \(Y_{a,i} - Y_{a^*,i} + \Delta _a = (Y_{a,i} - \mu _a) - (Y_{a^*,i} - \mu _{a^*})\) has mean 0 and is 2-sub-Gaussian, since \(Y_{a,i}-\mu _a\) and \(Y_{a^*,i} - \mu _{a^*}\) are 1-sub-Gaussian and independent. By Hoeffding’s inequality, we have
\[\mathbb {P}\left(\frac{1}{m} \sum _{i=1}^m (Y_{a,i} - Y_{a^*,i} + \Delta _a) \ge \Delta _a\right) \le \exp \left(- \frac{m \Delta _a^2}{4}\right) .\]
This concludes the proof.
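As an illustration of the theorem's bound (a sanity check with hypothetical values, not part of the formal development): specializing to \(K = 2\) arms with gap \(\Delta\) for the single suboptimal arm gives \(\mathbb {E}[R_T] \le \Delta \left( m + (T - 2m) \exp \left(- \frac{m \Delta ^2}{4}\right) \right)\), in which the first term grows linearly in \(m\) while the second decays exponentially. The snippet below evaluates this for \(\Delta = 0.5\), \(m = 50\), \(T = 1000\):

```lean
-- Numeric check of the K = 2 bound Δ * (m + (T - 2m) * exp (-m * Δ² / 4))
-- for hypothetical values Δ = 0.5, m = 50, T = 1000; prints ≈ 44.8.
#eval
  let Δ : Float := 0.5
  let m : Float := 50
  let T : Float := 1000
  Δ * (m + (T - 2 * m) * Float.exp (-(m * Δ * Δ / 4)))
```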