LeanBandits

5 Bandit algorithms

5.1 Explore-Then-Commit

Note: times start at 0 to be consistent with Lean.

Note: we will describe the algorithm by writing \(A_t = ...\), but our formal bandit model needs a policy \(\pi _t\) that gives the distribution of the arm to pull. What we mean is that \(\pi _t\) is a Dirac distribution at that arm.

Definition 5.1 Explore-Then-Commit algorithm

The Explore-Then-Commit (ETC) algorithm with parameter \(m \in \mathbb {N}\) is defined as follows:

  1. for \(t {\lt} Km\), \(A_t = t \mod K\) (pull each arm \(m\) times),

  2. compute \(\hat{A}_m^* = \arg \max _{a \in [K]} \hat{\mu }_a\), where \(\hat{\mu }_a = \frac{1}{m} \sum _{t=0}^{Km-1} \mathbb {I}(A_t = a) X_t\) is the empirical mean of the rewards for arm \(a\),

  3. for \(t \ge Km\), \(A_t = \hat{A}_m^*\) (pull the empirical best arm).
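The three steps above can be sketched in Python (a hypothetical simulation helper, not part of the Lean development; `etc_arm` and its arguments are names chosen here for illustration):

```python
def etc_arm(t, K, m, rewards):
    """Arm pulled by Explore-Then-Commit at time t (times start at 0).

    `rewards[s]` is the reward observed at time s; only the first K*m
    entries are used, so the commit decision is fixed after exploration.
    """
    if t < K * m:
        return t % K  # round-robin exploration: each arm is pulled m times
    # empirical mean of each arm over the exploration phase
    mu_hat = [sum(rewards[s] for s in range(K * m) if s % K == a) / m
              for a in range(K)]
    # commit to the empirical best arm (ties broken by smallest index)
    return max(range(K), key=lambda a: mu_hat[a])
```

With \(K = 2\) and \(m = 2\), times \(0, 1, 2, 3\) pull arms \(0, 1, 0, 1\), and every later time pulls the arm with the larger empirical mean.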

Lemma 5.2

For the Explore-Then-Commit algorithm with parameter \(m\), for any arm \(a \in [K]\) and any time \(t \ge Km\), we have

\begin{align*} N_{t,a} & = m + (t - Km) \mathbb {I}\{ \hat{A}_m^* = a\} \: . \end{align*}
Proof

Lemma 5.3

If \(\hat{A}_m^* = a\), then we have \(S_{Km, a^*} \le S_{Km, a}\).

Proof
Lemma 5.4

Suppose that \(\nu (a)\) is 1-sub-Gaussian for all arms \(a \in [K]\). Then for the Explore-Then-Commit algorithm with parameter \(m\), for any arm \(a \in [K]\) with \(\Delta _a {\gt} 0\), we have \(\mathbb {P}(\hat{A}_m^* = a) \le \exp \left(- \frac{m \Delta _a^2}{4}\right)\).

Proof

By Lemma 5.3,

\begin{align*} \mathbb {P}(\hat{A}_m^* = a) & \le \mathbb {P}(S_{Km, a} \ge S_{Km, a^*}) \: . \end{align*}

By Lemma 4.9 and then the concentration inequality of Lemma 4.10, we have

\begin{align*} P_{\mathcal{A}}\left(S_{Km, a^*} \le S_{Km, a}\right) & \le (\otimes _a \nu (a))^{\otimes \mathbb {N}} \left( \sum _{s=0}^{m-1} \omega _{s, a^*} \le \sum _{s=0}^{m-1} \omega _{s, a} \right) \\ & \le \exp \left( -m \frac{\Delta _a^2}{4} \right) \: . \end{align*}
Theorem 5.5

Suppose that \(\nu (a)\) is 1-sub-Gaussian for all arms \(a \in [K]\). Then for the Explore-Then-Commit algorithm with parameter \(m\), the expected regret after \(T\) pulls with \(T \ge Km\) is bounded by

\begin{align*} \mathbb {E}[R_T] & \le m \sum _{a=1}^K \Delta _a + (T - Km) \sum _{a=1}^K \Delta _a \exp \left(- \frac{m \Delta _a^2}{4}\right) \: . \end{align*}
Proof

By Lemma 3.39, we have \(\mathbb {E}[R_T] = \sum _{a=1}^K \mathbb {E}\left[N_{T,a}\right] \Delta _a\). It thus suffices to prove that for each arm \(a\) with \(\Delta _a {\gt} 0\),

\begin{align*} \mathbb {E}[N_{T,a}] & \le m + (T - Km) \exp \left(- \frac{m \Delta _a^2}{4}\right) \: . \end{align*}

By Lemma 5.2,

\begin{align*} N_{T,a} & = m + (T - Km) \mathbb {I}\{ \hat{A}_m^* = a\} \: . \end{align*}

It thus suffices to prove the inequality \(\mathbb {P}(\hat{A}_m^* = a) \le \exp \left(- \frac{m \Delta _a^2}{4}\right)\) for \(\Delta _a {\gt} 0\). This is done in Lemma 5.4.
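The right-hand side of the bound in Theorem 5.5 is easy to evaluate numerically; the sketch below is a hypothetical helper (not part of the formalization) that computes it from the gaps:

```python
import math

def etc_regret_bound(m, T, gaps):
    """Right-hand side of the ETC regret bound of Theorem 5.5.

    `gaps[a]` is the suboptimality gap Delta_a; requires T >= K*m.
    """
    K = len(gaps)
    assert T >= K * m
    explore = m * sum(gaps)  # cost of the exploration phase
    commit = (T - K * m) * sum(d * math.exp(-m * d ** 2 / 4) for d in gaps)
    return explore + commit
```

The first term grows linearly in \(m\) (exploration cost) while the second decays exponentially in \(m\) (probability of committing to a bad arm), which is the trade-off governing the choice of \(m\).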

5.2 UCB

Definition 5.6 UCB algorithm

The UCB algorithm with parameter \(c \in \mathbb {R}_+\) is defined as follows:

  1. for \(t {\lt} K\), \(A_t = t \mod K\) (pull each arm once),

  2. for \(t \ge K\), \(A_t = \arg \max _{a \in [K]} \left( \hat{\mu }_{t,a} + \sqrt{\frac{c \log (t + 1)}{N_{t,a}}} \right)\), where \(\hat{\mu }_{t,a} = \frac{1}{N_{t,a}} \sum _{s=0}^{t-1} \mathbb {I}(A_s = a) X_s\) is the empirical mean of the rewards for arm \(a\).

Note: the argmax in the second step is chosen in a measurable way.
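The definition can be sketched as follows (a hypothetical Python transcription, not the formal object; Python's `max` returns the first maximal element, so ties in the argmax break to the smallest index, which is one measurable choice):

```python
import math

def ucb_arm(t, K, c, history):
    """Arm pulled by UCB at time t, given history = list of (arm, reward)
    pairs for times 0 .. t-1."""
    if t < K:
        return t % K  # pull each arm once
    def index(a):
        pulls = [x for (arm, x) in history if arm == a]
        n = len(pulls)  # N_{t,a} >= 1 once every arm has been pulled
        mu_hat = sum(pulls) / n
        return mu_hat + math.sqrt(c * math.log(t + 1) / n)  # UCB index
    return max(range(K), key=index)
```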

Lemma 5.7

For the UCB algorithm, for all time \(t \ge K\) and arm \(a \in [K]\), we have

\begin{align*} \hat{\mu }_{t,a} + \sqrt{\frac{c \log (t + 1)}{N_{t,a}}} & \le \hat{\mu }_{t,A_t} + \sqrt{\frac{c \log (t + 1)}{N_{t,A_t}}} \: . \end{align*}
Proof

By definition of the algorithm.

Lemma 5.8

Suppose that we have the following three conditions:

  1. \(\mu ^* \le \hat{\mu }_{t, a^*} + \sqrt{\frac{c \log (t + 1)}{N_{t,a^*}}}\),

  2. \(\hat{\mu }_{t,A_t} - \sqrt{\frac{c \log (t + 1)}{N_{t,A_t}}} \le \mu _{A_t}\),

  3. \(\hat{\mu }_{t, a^*} + \sqrt{\frac{c \log (t + 1)}{N_{t,a^*}}} \le \hat{\mu }_{t,A_t} + \sqrt{\frac{c \log (t + 1)}{N_{t,A_t}}}\).

Then if \(N_{t,A_t} {\gt} 0\) we have

\begin{align*} \Delta _{A_t} & \le 2 \sqrt{\frac{c \log (t + 1)}{N_{t,A_t}}} \: . \end{align*}

And in turn, if \(\Delta _{A_t} {\gt} 0\) we get

\begin{align*} N_{t,A_t} & \le \frac{4 c \log (t + 1)}{\Delta _{A_t}^2} \: . \end{align*}

Note that the third condition is always satisfied for UCB by Lemma 5.7, but this lemma, as stated, is independent of the UCB algorithm.
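The implication can be checked on concrete numbers (a hypothetical sanity check; `r_star` and `r_A` stand for the bonuses \(\sqrt{c \log (t+1) / N_{t,a^*}}\) and \(\sqrt{c \log (t+1) / N_{t,A_t}}\)):

```python
def gap_bound_holds(mu_star, mu_A, mu_hat_star, mu_hat_A, r_star, r_A):
    """Check that conditions 1-3 of the lemma imply Delta_{A_t} <= 2 r_A."""
    cond1 = mu_star <= mu_hat_star + r_star      # optimism at a*
    cond2 = mu_hat_A - r_A <= mu_A               # lower confidence at A_t
    cond3 = mu_hat_star + r_star <= mu_hat_A + r_A  # A_t maximizes the index
    if not (cond1 and cond2 and cond3):
        return None  # conditions fail, the lemma says nothing
    return (mu_star - mu_A) <= 2 * r_A  # conclusion: Delta_{A_t} <= 2 r_A
```

Whenever the three conditions hold the function returns `True`, mirroring the chain \(\Delta _{A_t} \le (\hat{\mu }_{t,a^*} + r^*) - \mu _{A_t} \le (\hat{\mu }_{t,A_t} + r_A) - \mu _{A_t} \le 2 r_A\).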

Proof

Lemma 5.9

Suppose that \(\nu (a)\) is 1-sub-Gaussian for all arms \(a \in [K]\). Let \(c \ge 0\) be a real number. Then for any time \(n \in \mathbb {N}\) and any arm \(a \in [K]\), we have

\begin{align*} P\left(0 {\lt} N_{n, a} \ \wedge \ \hat{\mu }_{n, a} + \sqrt{\frac{c \log (n + 1)}{N_{n,a}}} \le \mu _a\right) & \le \frac{1}{(n + 1)^{c / 2 - 1}} \: . \end{align*}

And also,

\begin{align*} P\left(0 {\lt} N_{n, a} \ \wedge \ \hat{\mu }_{n, a} - \sqrt{\frac{c \log (n + 1)}{N_{n,a}}} \ge \mu _a\right) & \le \frac{1}{(n + 1)^{c / 2 - 1}} \: . \end{align*}
Proof

Lemma 5.10

For \(C\) a natural number, for any time \(n \in \mathbb {N}\) and any arm \(a \in [K]\), we have

\begin{align*} N_{n,a} & \le C + 1 \\ & \quad + \sum _{s=1}^{n-1} \mathbb {I}\{ A_s = a \ \wedge \ C {\lt} N_{s,a} \ \wedge \mu ^* \le \hat{\mu }_{s, a^*} + \sqrt{\frac{c \log (s + 1)}{N_{s,a^*}}} \ \wedge \hat{\mu }_{s, A_s} - \sqrt{\frac{c \log (s + 1)}{N_{s,A_s}}} \le \mu _{A_s}\} \\ & \quad + \sum _{s=1}^{n-1} \mathbb {I}\{ 0 {\lt} N_{s, a^*} \ \wedge \ \hat{\mu }_{s, a^*} + \sqrt{\frac{c \log (s + 1)}{N_{s,a^*}}} {\lt} \mu ^*\} \\ & \quad + \sum _{s=1}^{n-1} \mathbb {I}\{ 0 {\lt} N_{s, a} \ \wedge \ \mu _a {\lt} \hat{\mu }_{s, a} - \sqrt{\frac{c \log (s + 1)}{N_{s,a}}}\} \end{align*}
Proof

Lemma 5.11

For the UCB algorithm with parameter \(c \ge 0\), for any time \(n \in \mathbb {N}\) and any arm \(a \in [K]\) with positive gap, the first sum in Lemma 5.10 is equal to zero for positive \(C\) such that \(C \ge \frac{4 c \log (n + 1)}{\Delta _a^2}\).

Proof

Lemma 5.12

Suppose that \(\nu (a)\) is 1-sub-Gaussian for all arms \(a \in [K]\). For the UCB algorithm with parameter \(c {\gt} 0\), for any time \(n \in \mathbb {N}\) and any arm \(a \in [K]\) with positive gap, we have

\begin{align*} \mathbb {E}[N_{n,a}] & \le \frac{4 c \log (n + 1)}{\Delta _a^2} + 2 + 2 \sum _{s=0}^{n-1} \frac{1}{(s + 1)^{c / 2 - 1}} \: . \end{align*}
Proof
Lemma 5.13

Suppose that \(\nu (a)\) is 1-sub-Gaussian for all arms \(a \in [K]\). For the UCB algorithm with parameter \(c {\gt} 0\), for any time \(n \in \mathbb {N}\), we have

\begin{align*} R_n & \le \sum _{a : \Delta _a {\gt} 0} \left(\frac{4 c \log (n + 1)}{\Delta _a} + 2 \Delta _a\left(1 + \sum _{s=0}^{n-1} \frac{1}{(s + 1)^{c / 2 - 1}}\right)\right) \: . \end{align*}
Proof

TODO: for \(c {\gt} 4\), the sum converges to a constant, so we get a logarithmic regret bound.
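As a numeric illustration of this note (a hypothetical check, not part of the development): the exponent in the sum is \(c/2 - 1\), which exceeds 1 exactly when \(c {\gt} 4\), so the partial sums stabilize for, say, \(c = 5\), while for \(c = 4\) the sum is harmonic and grows like \(\log n\).

```python
def partial_sum(c, n):
    """Partial sum sum_{s=0}^{n-1} 1/(s+1)^(c/2 - 1) from the UCB bound."""
    return sum(1.0 / (s + 1) ** (c / 2 - 1) for s in range(n))
```

For \(c = 5\) the partial sums stay below \(\zeta (3/2) \approx 2.61\) for all \(n\); for \(c = 4\) they are unbounded.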