5 Bandit algorithms
5.1 Round-Robin
Round-Robin is not an interesting bandit algorithm per se, but it is used as a subroutine in other algorithms and can serve as a simple baseline. The algorithm simply cycles through the arms in order.
The Round-Robin algorithm is defined as follows: at time \(t \in \mathbb {N}\), \(A_t = t \mod K\).
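As an informal sketch (Python, not part of the Lean formalization), the Round-Robin policy is a single index computation:

```python
def round_robin_arm(t: int, K: int) -> int:
    """Arm pulled by Round-Robin at time t (times start at 0), over K arms."""
    return t % K

# The first 2*K pulls cycle through the arms twice.
print([round_robin_arm(t, 3) for t in range(6)])  # → [0, 1, 2, 0, 1, 2]
```

In particular, after \(Km\) rounds every arm has been selected exactly \(m\) times.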
For the Round-Robin algorithm, for any arm \(a \in [K]\), at time \(Km\) we have \(N_{Km, a} = m\): each arm has been pulled exactly \(m\) times.
For the Round-Robin algorithm, the expected regret at time \(Km\) is \(P[R_{Km}] = m \sum _{a=1}^K \Delta _a\), since each arm is pulled exactly \(m\) times.
5.2 Explore-Then-Commit
Note: times start at 0 to be consistent with Lean.
Note: we will describe the algorithms by writing \(A_t = \ldots \), but our formal bandit model needs a policy \(\pi _t\) that gives the distribution of the arm to pull. What we mean is that \(\pi _t\) is the Dirac distribution at that arm.
The Explore-Then-Commit (ETC) algorithm with parameter \(m \in \mathbb {N}\) is defined as follows:
for \(t {\lt} Km\), \(A_t = t \mod K\) (pull each arm \(m\) times),
compute \(\hat{A}_m^* = \arg \max _{a \in [K]} \hat{\mu }_a\), where \(\hat{\mu }_a = \frac{1}{m} \sum _{t=0}^{Km-1} \mathbb {I}(A_t = a) X_t\) is the empirical mean of the rewards for arm \(a\),
for \(t \ge Km\), \(A_t = \hat{A}_m^*\) (pull the empirical best arm).
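The three steps above can be sketched in Python (an informal illustration, not the formal model; `pull` is a hypothetical reward oracle returning a sample for a given arm):

```python
def explore_then_commit(K: int, m: int, T: int, pull):
    """Sketch of ETC: pull each arm m times, then commit to the empirical best.

    `pull(a)` returns a reward sample for arm `a`; T >= K*m is the horizon.
    Returns the sequence of arms pulled over the T rounds.
    """
    sums = [0.0] * K
    arms = []
    # Exploration phase: pull each arm m times in round-robin order.
    for t in range(K * m):
        a = t % K
        sums[a] += pull(a)
        arms.append(a)
    # Commit to the arm with the largest empirical mean (ties -> lowest index).
    best = max(range(K), key=lambda a: sums[a] / m)
    arms.extend([best] * (T - K * m))
    return arms

# Deterministic rewards for illustration: arm 1 has the best mean.
print(explore_then_commit(3, 2, 10, lambda a: [0.0, 1.0, 0.5][a]))
# → [0, 1, 2, 0, 1, 2, 1, 1, 1, 1]
```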
An algorithm-environment sequence for the Explore-Then-Commit algorithm with parameter \(m\) is an algorithm-environment sequence for the Round-Robin algorithm until time \(Km - 1\). That is, ETC plays the same as Round-Robin until time \(Km - 1\).
For the Explore-Then-Commit algorithm with parameter \(m\), for any arm \(a \in [K]\) and any time \(t \ge Km\), we have \(N_{t, a} = m + (t - Km) \mathbb {I}(\hat{A}_m^* = a)\).
If \(\hat{A}_m^* = a\), then we have \(S_{Km, a^*} \le S_{Km, a}\).
Suppose that \(\nu (a)\) is \(\sigma ^2\)-sub-Gaussian for all arms \(a \in [K]\). Then for the Explore-Then-Commit algorithm with parameter \(m\), for any arm \(a \in [K]\) with \(\Delta _a {\gt} 0\), we have \(P(\hat{A}_m^* = a) \le \exp \left(- \frac{m \Delta _a^2}{4 \sigma ^2}\right)\).
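As a numerical sanity check (a simulation sketch, not part of the proof), one can compare the misidentification frequency of the exploration phase against the bound of the lemma for Gaussian arms, which are \(1\)-sub-Gaussian, with \(K = 2\), \(\Delta = 1\) and \(m = 20\):

```python
import math
import random

random.seed(0)

# Two Gaussian arms with gap Delta = 1 and sigma = 1; arm 0 is optimal.
m, sigma, delta = 20, 1.0, 1.0
trials = 10_000

miss = 0
for _ in range(trials):
    mean0 = sum(random.gauss(1.0, sigma) for _ in range(m)) / m
    mean1 = sum(random.gauss(0.0, sigma) for _ in range(m)) / m
    if mean1 >= mean0:  # the suboptimal arm has the larger empirical mean
        miss += 1

bound = math.exp(-m * delta**2 / (4 * sigma**2))
print(miss / trials, bound)  # the empirical frequency stays below the bound
```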
By Lemma 5.6, \(P(\hat{A}_m^* = a) \le P(S_{Km, a^*} \le S_{Km, a})\).
By Lemma 4.9 and then the concentration inequality of Lemma 4.10, we have \(P(S_{Km, a^*} \le S_{Km, a}) \le \exp \left(- \frac{m \Delta _a^2}{4 \sigma ^2}\right)\).
Suppose that \(\nu (a)\) is \(\sigma ^2\)-sub-Gaussian for all arms \(a \in [K]\). Then for the Explore-Then-Commit algorithm with parameter \(m\), the expected regret after \(T\) pulls with \(T \ge Km\) is bounded by \(P[R_T] \le \sum _{a=1}^K \Delta _a \left( m + (T - Km) \exp \left(- \frac{m \Delta _a^2}{4 \sigma ^2}\right) \right)\).
By Lemma 3.38, we have \(P[R_T] = \sum _{a=1}^K P\left[N_{T,a}\right] \Delta _a\). It thus suffices to bound \(P[N_{T,a}]\) for each arm \(a\) with \(\Delta _a {\gt} 0\), and to that end it suffices to prove that \(P[N_{T,a}] \le m + (T - Km) \exp \left(- \frac{m \Delta _a^2}{4 \sigma ^2}\right)\).
By Lemma 5.5, \(P[N_{T,a}] = m + (T - Km) P(\hat{A}_m^* = a)\).
It thus suffices to prove the inequality \(P(\hat{A}_m^* = a) \le \exp \left(- \frac{m \Delta _a^2}{4 \sigma ^2}\right)\) for \(\Delta _a {\gt} 0\). This is done in Lemma 5.7.
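Numerically, the quantity \(\sum _a \Delta _a \left( m + (T - Km) e^{- m \Delta _a^2 / (4 \sigma ^2)} \right)\) (the standard form of the ETC regret bound) exhibits the trade-off in \(m\) between exploration cost and commitment error; a small evaluation sketch:

```python
import math

def etc_regret_bound(deltas, m, T, sigma):
    """Evaluate sum_a Delta_a * (m + (T - K*m) * exp(-m * Delta_a^2 / (4 sigma^2)))."""
    K = len(deltas)
    tail = T - K * m
    return sum(d * (m + tail * math.exp(-m * d**2 / (4 * sigma**2)))
               for d in deltas if d > 0)

# With gap 0.5, sigma = 1 and horizon T = 10_000: more exploration shrinks
# the commitment term but grows the exploration term, so the bound is
# minimized at an intermediate m.
for m in (10, 100, 400):
    print(m, etc_regret_bound([0.0, 0.5], m, 10_000, 1.0))
```

Balancing the two terms suggests choosing \(m\) of order \(\frac{4 \sigma ^2}{\Delta ^2} \log T\), which is the usual tuning of ETC.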
5.3 UCB
The UCB algorithm with parameter \(c \in \mathbb {R}_+\) is defined as follows:
for \(t {\lt} K\), \(A_t = t \mod K\) (pull each arm once),
for \(t \ge K\), \(A_t = \arg \max _{a \in [K]} \left( \hat{\mu }_{t,a} + \sqrt{\frac{2 c \log (t + 1)}{N_{t,a}}} \right)\), where \(\hat{\mu }_{t,a} = \frac{1}{N_{t,a}} \sum _{s=0}^{t-1} \mathbb {I}(A_s = a) X_s\) is the empirical mean of the rewards for arm \(a\).
Note: the argmax in the second step is chosen in a measurable way.
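As with ETC, the arm selection can be sketched informally in Python (ties broken by lowest index, which is one measurable choice of argmax; `counts` and `sums` are assumed to track the pull counts and reward sums up to time \(t\)):

```python
import math

def ucb_arm(t: int, K: int, counts, sums, c: float) -> int:
    """Arm pulled by UCB at time t: each arm once, then the max-index arm.

    `counts[a]` and `sums[a]` are the pull count and reward sum of arm `a`
    before time t; for t >= K every count is positive, since each arm was
    pulled once during the first K rounds.
    """
    if t < K:
        return t % K
    def index(a):
        return sums[a] / counts[a] + math.sqrt(2 * c * math.log(t + 1) / counts[a])
    return max(range(K), key=index)

# A rarely pulled arm gets a large exploration bonus and can beat an arm
# with a higher empirical mean.
print(ucb_arm(6, 2, [5, 1], [2.5, 0.9], 1.0))  # → 1
```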
An algorithm-environment sequence for the UCB algorithm is an algorithm-environment sequence for the Round-Robin algorithm until time \(K - 1\). That is, UCB plays the same as Round-Robin until time \(K - 1\).
For the UCB algorithm, for all times \(t \ge K\) and arms \(a \in [K]\), we have \(\hat{\mu }_{t, a} + \sqrt{\frac{2 c \log (t + 1)}{N_{t,a}}} \le \hat{\mu }_{t,A_t} + \sqrt{\frac{2 c \log (t + 1)}{N_{t,A_t}}}\).
By definition of the algorithm.
Suppose that we have the following three conditions:
\(\mu ^* \le \hat{\mu }_{t, a^*} + \sqrt{\frac{2 c \log (t + 1)}{N_{t,a^*}}}\),
\(\hat{\mu }_{t,A_t} - \sqrt{\frac{2 c \log (t + 1)}{N_{t,A_t}}} \le \mu _{A_t}\),
\(\hat{\mu }_{t, a^*} + \sqrt{\frac{2 c \log (t + 1)}{N_{t,a^*}}} \le \hat{\mu }_{t,A_t} + \sqrt{\frac{2 c \log (t + 1)}{N_{t,A_t}}}\).
Then if \(N_{t,A_t} {\gt} 0\) we have \(\Delta _{A_t} \le 2 \sqrt{\frac{2 c \log (t + 1)}{N_{t,A_t}}}\).
And in turn, if \(\Delta _{A_t} {\gt} 0\) we get \(N_{t,A_t} \le \frac{8 c \log (t + 1)}{\Delta _{A_t}^2}\).
Note that the third condition is always satisfied for UCB by Lemma 5.11, but this lemma, as stated, is independent of the UCB algorithm.
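Written out, chaining the three conditions gives:

```latex
\begin{align*}
\mu^* &\le \hat{\mu}_{t, a^*} + \sqrt{\frac{2 c \log (t + 1)}{N_{t,a^*}}}
  && \text{(first condition)} \\
&\le \hat{\mu}_{t, A_t} + \sqrt{\frac{2 c \log (t + 1)}{N_{t,A_t}}}
  && \text{(third condition)} \\
&\le \mu_{A_t} + 2 \sqrt{\frac{2 c \log (t + 1)}{N_{t,A_t}}}
  && \text{(second condition)} .
\end{align*}
```

Hence \(\Delta _{A_t} = \mu ^* - \mu _{A_t} \le 2 \sqrt{\frac{2 c \log (t + 1)}{N_{t,A_t}}}\), and squaring yields the bound on \(N_{t,A_t}\) when \(\Delta _{A_t} {\gt} 0\).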
Suppose that \(\nu (a)\) is \(\sigma ^2\)-sub-Gaussian for all arms \(a \in [K]\). Let \(c \ge 0\) be a real number. Then for any time \(n \in \mathbb {N}\) and any arm \(a \in [K]\), we have
And also,
For \(C\) a natural number, for any time \(n \in \mathbb {N}\) and any arm \(a \in [K]\), we have
For the UCB algorithm with parameter \(c \sigma ^2 \ge 0\), for any time \(n \in \mathbb {N}\) and any arm \(a \in [K]\) with positive gap, the first sum in Lemma 5.14 is equal to zero for positive \(C\) such that \(C \ge \frac{8 c \sigma ^2 \log (n + 1)}{\Delta _a^2}\).
Suppose that \(\nu (a)\) is \(\sigma ^2\)-sub-Gaussian for all arms \(a \in [K]\). For the UCB algorithm with parameter \(c \sigma ^2 {\gt} 0\), for any time \(n \in \mathbb {N}\) and any arm \(a \in [K]\) with positive gap, we have
Suppose that \(\nu (a)\) is \(\sigma ^2\)-sub-Gaussian for all arms \(a \in [K]\). For the UCB algorithm with parameter \(c \sigma ^2 {\gt} 0\), for any time \(n \in \mathbb {N}\), we have
TODO: for \(c {\gt} 2\), the sum converges to a constant, so we get a logarithmic regret bound.