LeanBandits

5 Bandit algorithms

5.1 Round-Robin

This is not an interesting bandit algorithm per se, but it is used as a subroutine in other algorithms and can serve as a simple baseline. The algorithm simply cycles through the arms in order.

Definition 5.1

The Round-Robin algorithm is defined as follows: at time \(t \in \mathbb {N}\), \(A_t = t \mod K\).

Lemma 5.2

For the Round-Robin algorithm, for any arm \(a \in [K]\), at time \(Km\) we have

\begin{align*} N_{Km,a} & = m \: . \end{align*}
Proof

TODO: regret.
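As an informal sanity check of the definition and the counting lemma above, here is a small Python sketch (illustrative only; the formal statements live in Lean):

```python
def round_robin_arm(t: int, K: int) -> int:
    """Arm pulled by Round-Robin at time t: A_t = t mod K."""
    return t % K

def pull_counts(horizon: int, K: int) -> list[int]:
    """N_{horizon, a} for each arm a under Round-Robin."""
    counts = [0] * K
    for t in range(horizon):
        counts[round_robin_arm(t, K)] += 1
    return counts

# After K * m rounds, every arm has been pulled exactly m times: N_{Km, a} = m.
K, m = 3, 5
assert pull_counts(K * m, K) == [m] * K
```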

5.2 Explore-Then-Commit

Note: times start at 0 to be consistent with Lean.

Note: we will describe the algorithm by writing \(A_t = ...\), but our formal bandit model needs a policy \(\pi _t\) that gives the distribution of the arm to pull. What we mean is that \(\pi _t\) is a Dirac distribution at that arm.

Definition 5.3 Explore-Then-Commit algorithm

The Explore-Then-Commit (ETC) algorithm with parameter \(m \in \mathbb {N}\) is defined as follows:

  1. for \(t {\lt} Km\), \(A_t = t \mod K\) (pull each arm \(m\) times),

  2. compute \(\hat{A}_m^* = \arg \max _{a \in [K]} \hat{\mu }_a\), where \(\hat{\mu }_a = \frac{1}{m} \sum _{t=0}^{Km-1} \mathbb {I}(A_t = a) X_t\) is the empirical mean of the rewards for arm \(a\),

  3. for \(t \ge Km\), \(A_t = \hat{A}_m^*\) (pull the empirical best arm).
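The three steps above can be sketched as follows (a Python simulation sketch under illustrative assumptions, not the formal Lean policy; ties in the argmax are broken toward the smallest arm index):

```python
def etc_arm(t: int, K: int, m: int, rewards: list[tuple[int, float]]) -> int:
    """Arm pulled by ETC at time t; rewards is the list of (A_s, X_s) pairs so far."""
    if t < K * m:
        return t % K  # step 1: explore each arm m times, round-robin
    # step 2: empirical mean of each arm over the exploration phase
    mu_hat = [0.0] * K
    for arm, x in rewards[:K * m]:
        mu_hat[arm] += x / m
    # step 3: commit to the empirical best arm
    return max(range(K), key=lambda a: mu_hat[a])

# hypothetical run with K = 2, m = 2: arm 0 looks better after exploration
history = [(0, 1.0), (1, 0.2), (0, 0.8), (1, 0.4)]
assert etc_arm(4, 2, 2, history) == 0
```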

Lemma 5.4

An algorithm-environment sequence for the Explore-Then-Commit algorithm with parameter \(m\) is an algorithm-environment sequence for the Round-Robin algorithm until time \(Km - 1\). That is, ETC plays the same as Round-Robin until time \(Km - 1\).

Proof

Lemma 5.5

For the Explore-Then-Commit algorithm with parameter \(m\), for any arm \(a \in [K]\) and any time \(t \ge Km\), we have

\begin{align*} N_{t,a} & = m + (t - Km) \mathbb {I}\{ \hat{A}_m^* = a\} \: . \end{align*}
Proof

Lemma 5.7

Suppose that \(\nu (a)\) is \(\sigma ^2\)-sub-Gaussian for all arms \(a \in [K]\). Then for the Explore-Then-Commit algorithm with parameter \(m\), for any arm \(a \in [K]\) with \(\Delta _a {\gt} 0\), we have \(P(\hat{A}_m^* = a) \le \exp \left(- \frac{m \Delta _a^2}{4 \sigma ^2}\right)\).

Proof

By Lemma 5.6,

\begin{align*} P(\hat{A}_m^* = a) & \le P(S_{Km, a} \ge S_{Km, a^*}) \: . \end{align*}

By Lemma 4.9 and then the concentration inequality of Lemma 4.10, we have

\begin{align*} P\left(S_{Km, a^*} \le S_{Km, a}\right) & \le (\otimes _a \nu (a))^{\otimes \mathbb {N}} \left( \sum _{s=0}^{m-1} \omega _{s, a^*} \le \sum _{s=0}^{m-1} \omega _{s, a} \right) \\ & \le \exp \left( -m \frac{\Delta _a^2}{4 \sigma ^2} \right) \: . \end{align*}

Theorem 5.8

Suppose that \(\nu (a)\) is \(\sigma ^2\)-sub-Gaussian for all arms \(a \in [K]\). Then for the Explore-Then-Commit algorithm with parameter \(m\), the expected regret after \(T\) pulls with \(T \ge Km\) is bounded by

\begin{align*} P[R_T] & \le m \sum _{a=1}^K \Delta _a + (T - Km) \sum _{a=1}^K \Delta _a \exp \left(- \frac{m \Delta _a^2}{4 \sigma ^2}\right) \: . \end{align*}
Proof

By Lemma 3.38, we have \(P[R_T] = \sum _{a=1}^K P\left[N_{T,a}\right] \Delta _a\). It thus suffices to bound \(P[N_{T,a}]\) for each arm \(a\) with \(\Delta _a {\gt} 0\); more precisely, it suffices to prove that

\begin{align*} P[N_{T,a}] & \le m + (T - Km) \exp \left(- \frac{m \Delta _a^2}{4 \sigma ^2}\right) \: . \end{align*}

By Lemma 5.5,

\begin{align*} N_{T,a} & = m + (T - Km) \mathbb {I}\{ \hat{A}_m^* = a\} \: . \end{align*}

It thus suffices to prove the inequality \(P(\hat{A}_m^* = a) \le \exp \left(- \frac{m \Delta _a^2}{4 \sigma ^2}\right)\) for \(\Delta _a {\gt} 0\). This is done in Lemma 5.7.
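To make the exploration-commitment tradeoff in the regret bound above concrete, its right-hand side can be evaluated numerically (a sketch on a hypothetical instance; `etc_regret_bound` is not part of the formal development):

```python
import math

def etc_regret_bound(T: int, K: int, m: int, gaps: list[float], sigma2: float) -> float:
    """Right-hand side of the ETC regret bound, valid for T >= K * m."""
    explore = m * sum(gaps)  # regret paid during the K*m exploration rounds
    commit = (T - K * m) * sum(
        d * math.exp(-m * d * d / (4 * sigma2)) for d in gaps
    )  # regret paid after committing to a suboptimal arm
    return explore + commit

# hypothetical instance: two arms with gap 0.5, sigma^2 = 1, horizon T = 1000;
# exploring longer (m = 100) beats barely exploring (m = 1) here
assert etc_regret_bound(1000, 2, 100, [0.0, 0.5], 1.0) < etc_regret_bound(1000, 2, 1, [0.0, 0.5], 1.0)
```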

5.3 UCB

Definition 5.9 UCB algorithm

The UCB algorithm with parameter \(c \in \mathbb {R}_+\) is defined as follows:

  1. for \(t {\lt} K\), \(A_t = t \mod K\) (pull each arm once),

  2. for \(t \ge K\), \(A_t = \arg \max _{a \in [K]} \left( \hat{\mu }_{t,a} + \sqrt{\frac{2 c \log (t + 1)}{N_{t,a}}} \right)\), where \(\hat{\mu }_{t,a} = \frac{1}{N_{t,a}} \sum _{s=0}^{t-1} \mathbb {I}(A_s = a) X_s\) is the empirical mean of the rewards for arm \(a\).

Note: the argmax in the second step is chosen in a measurable way.
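The definition can be sketched in Python (illustrative simulation code, not the Lean formalization; `max` with a key function breaks ties toward the smallest arm index, which is one measurable choice of argmax):

```python
import math

def ucb_arm(t: int, K: int, c: float, counts: list[int], sums: list[float]) -> int:
    """Arm pulled by UCB at time t, given N_{t,a} in counts and reward totals in sums."""
    if t < K:
        return t % K  # pull each arm once
    def ucb_index(a: int) -> float:
        mu_hat = sums[a] / counts[a]  # empirical mean of arm a
        return mu_hat + math.sqrt(2 * c * math.log(t + 1) / counts[a])
    return max(range(K), key=ucb_index)

# after one pull of each arm, the counts are equal, so the bonuses are equal
# and the arm with the higher empirical mean has the higher index
assert ucb_arm(2, 2, 1.0, [1, 1], [0.9, 0.1]) == 0
```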

Lemma 5.10

An algorithm-environment sequence for the UCB algorithm is an algorithm-environment sequence for the Round-Robin algorithm until time \(K - 1\). That is, UCB plays the same as Round-Robin until time \(K - 1\).

Proof

Lemma 5.11

For the UCB algorithm, for any time \(t \ge K\) and any arm \(a \in [K]\), we have

\begin{align*} \hat{\mu }_{t,a} + \sqrt{\frac{2 c \log (t + 1)}{N_{t,a}}} & \le \hat{\mu }_{t,A_t} + \sqrt{\frac{2 c \log (t + 1)}{N_{t,A_t}}} \: . \end{align*}
Proof

By definition of the algorithm.

Lemma 5.12

Suppose that the following three conditions hold:

  1. \(\mu ^* \le \hat{\mu }_{t, a^*} + \sqrt{\frac{2 c \log (t + 1)}{N_{t,a^*}}}\),

  2. \(\hat{\mu }_{t,A_t} - \sqrt{\frac{2 c \log (t + 1)}{N_{t,A_t}}} \le \mu _{A_t}\),

  3. \(\hat{\mu }_{t, a^*} + \sqrt{\frac{2 c \log (t + 1)}{N_{t,a^*}}} \le \hat{\mu }_{t,A_t} + \sqrt{\frac{2 c \log (t + 1)}{N_{t,A_t}}}\).

Then if \(N_{t,A_t} {\gt} 0\) we have

\begin{align*} \Delta _{A_t} & \le 2 \sqrt{\frac{2 c \log (t + 1)}{N_{t,A_t}}} \: . \end{align*}

And in turn, if \(\Delta _{A_t} {\gt} 0\) we get

\begin{align*} N_{t,A_t} & \le \frac{8 c \log (t + 1)}{\Delta _{A_t}^2} \: . \end{align*}

Note that the third condition is always satisfied for UCB by Lemma 5.11, but this lemma, as stated, is independent of the UCB algorithm.

Proof
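A sketch of the argument (informal; the formal proof is the Lean target): chaining the three conditions gives

\begin{align*} \mu ^* & \le \hat{\mu }_{t, a^*} + \sqrt{\frac{2 c \log (t + 1)}{N_{t,a^*}}} && \text{(condition 1)} \\ & \le \hat{\mu }_{t,A_t} + \sqrt{\frac{2 c \log (t + 1)}{N_{t,A_t}}} && \text{(condition 3)} \\ & \le \mu _{A_t} + 2 \sqrt{\frac{2 c \log (t + 1)}{N_{t,A_t}}} && \text{(condition 2)} \: . \end{align*}

Hence \(\Delta _{A_t} = \mu ^* - \mu _{A_t} \le 2 \sqrt{\frac{2 c \log (t + 1)}{N_{t,A_t}}}\), and squaring both sides and rearranging yields the bound on \(N_{t,A_t}\).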

Lemma 5.13

Suppose that \(\nu (a)\) is \(\sigma ^2\)-sub-Gaussian for all arms \(a \in [K]\). Let \(c \ge 0\) be a real number. Then for any time \(n \in \mathbb {N}\) and any arm \(a \in [K]\), we have

\begin{align*} P\left(0 {\lt} N_{n, a} \ \wedge \ \hat{\mu }_{n, a} + \sqrt{\frac{2 c \sigma ^2 \log (n + 1)}{N_{n,a}}} \le \mu _a\right) & \le \frac{1}{(n + 1)^{c - 1}} \: . \end{align*}

And also,

\begin{align*} P\left(0 {\lt} N_{n, a} \ \wedge \ \hat{\mu }_{n, a} - \sqrt{\frac{2 c \sigma ^2 \log (n + 1)}{N_{n,a}}} \ge \mu _a\right) & \le \frac{1}{(n + 1)^{c - 1}} \: . \end{align*}
Proof

Lemma 5.14

For \(C\) a natural number, for any time \(n \in \mathbb {N}\) and any arm \(a \in [K]\), we have

\begin{align*} N_{n,a} & \le C + 1 \\ & \quad + \sum _{s=1}^{n-1} \mathbb {I}\{ A_s = a \ \wedge \ C {\lt} N_{s,a} \ \wedge \ \mu ^* \le \hat{\mu }_{s, a^*} + \sqrt{\frac{2 c \sigma ^2 \log (s + 1)}{N_{s,a^*}}} \ \wedge \ \hat{\mu }_{s, A_s} - \sqrt{\frac{2 c \sigma ^2 \log (s + 1)}{N_{s,A_s}}} \le \mu _{A_s}\} \\ & \quad + \sum _{s=1}^{n-1} \mathbb {I}\{ 0 {\lt} N_{s, a^*} \ \wedge \ \hat{\mu }_{s, a^*} + \sqrt{\frac{2 c \sigma ^2 \log (s + 1)}{N_{s,a^*}}} {\lt} \mu ^*\} \\ & \quad + \sum _{s=1}^{n-1} \mathbb {I}\{ 0 {\lt} N_{s, a} \ \wedge \ \mu _a {\lt} \hat{\mu }_{s, a} - \sqrt{\frac{2 c \sigma ^2 \log (s + 1)}{N_{s,a}}}\} \: . \end{align*}
Proof

Lemma 5.15

For the UCB algorithm with parameter \(c \sigma ^2 \ge 0\), for any time \(n \in \mathbb {N}\) and any arm \(a \in [K]\) with positive gap, the first sum in Lemma 5.14 is equal to zero for positive \(C\) such that \(C \ge \frac{8 c \sigma ^2 \log (n + 1)}{\Delta _a^2}\).

Proof

Lemma 5.16

Suppose that \(\nu (a)\) is \(\sigma ^2\)-sub-Gaussian for all arms \(a \in [K]\). For the UCB algorithm with parameter \(c \sigma ^2 {\gt} 0\), for any time \(n \in \mathbb {N}\) and any arm \(a \in [K]\) with positive gap, we have

\begin{align*} P[N_{n,a}] & \le \frac{8 c \sigma ^2 \log (n + 1)}{\Delta _a^2} + 2 + 2 \sum _{s=0}^{n-1} \frac{1}{(s + 1)^{c - 1}} \: . \end{align*}
Proof

Theorem 5.17

Suppose that \(\nu (a)\) is \(\sigma ^2\)-sub-Gaussian for all arms \(a \in [K]\). For the UCB algorithm with parameter \(c \sigma ^2 {\gt} 0\), for any time \(n \in \mathbb {N}\), we have

\begin{align*} P[R_n] & \le \sum _{a : \Delta _a {\gt} 0} \left(\frac{8 c \sigma ^2 \log (n + 1)}{\Delta _a} + 2 \Delta _a\left(1 + \sum _{s=0}^{n-1} \frac{1}{(s + 1)^{c - 1}}\right)\right) \: . \end{align*}
Proof

TODO: for \(c {\gt} 2\), the sum converges to a constant, so we get a logarithmic regret bound.
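Regarding the TODO above: for \(c {\gt} 2\), \(\sum _{s=0}^{n-1} \frac{1}{(s+1)^{c-1}} \le \sum _{k=1}^{\infty } \frac{1}{k^{c-1}} = \zeta (c-1) {\lt} \infty \), so the second term of the regret bound is a constant and the bound is logarithmic in \(n\). A quick numeric sketch (illustrative only):

```python
def partial_sum(n: int, c: float) -> float:
    """sum_{s=0}^{n-1} 1 / (s + 1)^(c - 1), the non-logarithmic term in the bound."""
    return sum(1.0 / (s + 1) ** (c - 1) for s in range(n))

# for c = 3 the sum is bounded by zeta(2) = pi^2 / 6 ~= 1.645, uniformly in n
for n in (10, 1_000, 100_000):
    assert partial_sum(n, 3.0) < 1.6449340668482264
```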