5 Bandit algorithms
5.1 Explore-Then-Commit
Note: times start at 0 to be consistent with Lean.
Note: we will describe the algorithms by writing \(A_t = ...\), but our formal bandit model needs a policy \(\pi _t\) that gives the distribution of the arm to pull. What we mean is that \(\pi _t\) is the Dirac distribution at that arm.
The Explore-Then-Commit (ETC) algorithm with parameter \(m \in \mathbb {N}\) is defined as follows (an illustrative simulation sketch is given after the list):
for \(t {\lt} Km\), \(A_t = t \mod K\) (pull each arm \(m\) times),
compute \(\hat{A}_m^* = \arg \max _{a \in [K]} \hat{\mu }_a\), where \(\hat{\mu }_a = \frac{1}{m} \sum _{t=0}^{Km-1} \mathbb {I}(A_t = a) X_t\) is the empirical mean of the rewards for arm \(a\),
for \(t \ge Km\), \(A_t = \hat{A}_m^*\) (pull the empirical best arm).
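To make the description concrete, here is a small simulation sketch in Python. It is not part of the formalization; the Gaussian reward model, the function name `etc`, and all parameter names are illustrative assumptions (any 1-sub-Gaussian reward distribution would do).

```python
import numpy as np

def etc(mu, m, T, rng=None):
    """Explore-Then-Commit on a K-armed Gaussian bandit (illustrative sketch).

    mu : true means of the K arms; rewards are drawn from N(mu[a], 1),
         a convenient 1-sub-Gaussian choice.
    m  : number of exploration pulls per arm.
    T  : total number of pulls, assumed to satisfy T >= K * m.
    Returns the list of pulled arms A_0, ..., A_{T-1}.
    """
    rng = np.random.default_rng() if rng is None else rng
    K = len(mu)
    sums = np.zeros(K)  # total reward collected from each arm
    arms = []
    # Exploration: times 0, ..., K*m - 1, round-robin over the arms.
    for t in range(K * m):
        a = t % K
        sums[a] += rng.normal(mu[a], 1.0)
        arms.append(a)
    # Commit: pull the empirical best arm for the remaining rounds.
    a_hat = int(np.argmax(sums / m))
    arms.extend([a_hat] * (T - K * m))
    return arms
```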
For the Explore-Then-Commit algorithm with parameter \(m\), for any arm \(a \in [K]\) and any time \(t \ge Km\), we have \(\mathbb {P}(A_t = a) = \mathbb {P}(\hat{A}_m^* = a)\).
If \(\hat{A}_m^* = a\), then we have \(S_{Km, a^*} \le S_{Km, a}\). Indeed, each arm is pulled exactly \(m\) times during the exploration phase, so \(\hat{\mu }_b = S_{Km, b} / m\) for every arm \(b\), and the maximality of \(\hat{\mu }_a\) gives the inequality.
Suppose that \(\nu (a)\) is 1-sub-Gaussian for all arms \(a \in [K]\). Then for the Explore-Then-Commit algorithm with parameter \(m\), for any arm \(a \in [K]\) with \(\Delta _a {\gt} 0\), we have \(\mathbb {P}(\hat{A}_m^* = a) \le \exp \left(- \frac{m \Delta _a^2}{4}\right)\).
By Lemma 5.3, \(\mathbb {P}(\hat{A}_m^* = a) \le \mathbb {P}(S_{Km, a^*} \le S_{Km, a})\).
By Lemma 4.9 and then the concentration inequality of Lemma 4.10, we have \(\mathbb {P}(S_{Km, a^*} \le S_{Km, a}) \le \exp \left(- \frac{m \Delta _a^2}{4}\right)\), which concludes the proof.
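To see where the constant \(4\) comes from, here is a sketch of the concentration step (an informal argument, assuming the exploration samples of distinct arms are independent): \(\hat{\mu }_a\) and \(\hat{\mu }_{a^*}\) are averages of \(m\) independent 1-sub-Gaussian observations, so the centered difference \((\hat{\mu }_a - \mu _a) - (\hat{\mu }_{a^*} - \mu _{a^*})\) is sub-Gaussian with variance proxy \(2/m\), and
\[ \mathbb {P}(\hat{\mu }_a \ge \hat{\mu }_{a^*}) = \mathbb {P}\left((\hat{\mu }_a - \mu _a) - (\hat{\mu }_{a^*} - \mu _{a^*}) \ge \Delta _a\right) \le \exp \left(- \frac{\Delta _a^2}{2 \cdot (2/m)}\right) = \exp \left(- \frac{m \Delta _a^2}{4}\right). \]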
Suppose that \(\nu (a)\) is 1-sub-Gaussian for all arms \(a \in [K]\). Then for the Explore-Then-Commit algorithm with parameter \(m\), the expected regret after \(T\) pulls with \(T \ge Km\) is bounded by
\[ \mathbb {E}[R_T] \le \sum _{a : \Delta _a {\gt} 0} \Delta _a \left( m + (T - Km) \exp \left(- \frac{m \Delta _a^2}{4}\right) \right). \]
By Lemma 3.39, we have \(\mathbb {E}[R_T] = \sum _{a=1}^K \mathbb {E}\left[N_{T,a}\right] \Delta _a\). It thus suffices to bound \(\mathbb {E}[N_{T,a}]\) for each arm \(a\) with \(\Delta _a {\gt} 0\), and more precisely to prove that
\[ \mathbb {E}[N_{T,a}] \le m + (T - Km) \, \mathbb {P}(\hat{A}_m^* = a) . \]
By Lemma 5.2, \(\mathbb {E}[N_{T,a}] = m + \sum _{t=Km}^{T-1} \mathbb {P}(A_t = a) = m + (T - Km) \, \mathbb {P}(\hat{A}_m^* = a)\).
It thus suffices to prove the inequality \(\mathbb {P}(\hat{A}_m^* = a) \le \exp \left(- \frac{m \Delta _a^2}{4}\right)\) for \(\Delta _a {\gt} 0\). This is done in Lemma 5.4.
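Continuing the illustrative `etc` sketch above, a quick Monte-Carlo experiment can be used to sanity-check the bound (the instance and all numbers below are arbitrary choices, not from the text):

```python
# Sanity check of the ETC regret bound on a 2-armed instance.
mu = np.array([0.0, -0.5])  # gaps: Delta = (0, 0.5)
K, m, T, runs = len(mu), 50, 2000, 1000
rng = np.random.default_rng(0)
regret = 0.0
for _ in range(runs):
    arms = etc(mu, m, T, rng)
    regret += sum(mu.max() - mu[a] for a in arms)  # pseudo-regret of one run
regret /= runs
deltas = mu.max() - mu
bound = sum(d * (m + (T - K * m) * np.exp(-m * d**2 / 4)) for d in deltas if d > 0)
print(f"empirical regret ~ {regret:.1f}, theoretical bound = {bound:.1f}")
```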
5.2 UCB
The UCB (Upper Confidence Bound) algorithm with parameter \(c \in \mathbb {R}_+\) is defined as follows (an illustrative simulation sketch is given after the list):
for \(t {\lt} K\), \(A_t = t \mod K\) (pull each arm once),
for \(t \ge K\), \(A_t = \arg \max _{a \in [K]} \left( \hat{\mu }_{t,a} + \sqrt{\frac{c \log (t + 1)}{N_{t,a}}} \right)\), where \(\hat{\mu }_{t,a} = \frac{1}{N_{t,a}} \sum _{s=0}^{t-1} \mathbb {I}(A_s = a) X_s\) is the empirical mean of the rewards for arm \(a\).
Note: the argmax in the second step is chosen in a measurable way.
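As for ETC, here is an illustrative Python sketch of the algorithm (again not the formalization; the Gaussian rewards and all names are assumptions). Ties in the argmax are broken towards the smallest arm index, one possible measurable choice.

```python
import numpy as np

def ucb(mu, c, T, rng=None):
    """UCB on a K-armed Gaussian bandit (illustrative sketch).

    mu : true means of the K arms; rewards are drawn from N(mu[a], 1).
    c  : exploration parameter of the bonus sqrt(c * log(t + 1) / N_{t,a}).
    T  : total number of pulls, assumed to satisfy T >= K.
    """
    rng = np.random.default_rng() if rng is None else rng
    K = len(mu)
    counts = np.zeros(K)  # N_{t,a}: number of pulls of each arm so far
    sums = np.zeros(K)    # total reward collected from each arm
    arms = []
    for t in range(T):
        if t < K:
            a = t  # pull each arm once
        else:
            index = sums / counts + np.sqrt(c * np.log(t + 1) / counts)
            a = int(np.argmax(index))  # ties broken by smallest index
        counts[a] += 1
        sums[a] += rng.normal(mu[a], 1.0)
        arms.append(a)
    return arms
```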
For the UCB algorithm, for all times \(t \ge K\) and all arms \(a \in [K]\), we have \(\hat{\mu }_{t, a} + \sqrt{\frac{c \log (t + 1)}{N_{t,a}}} \le \hat{\mu }_{t,A_t} + \sqrt{\frac{c \log (t + 1)}{N_{t,A_t}}}\).
By definition of the algorithm.
Suppose that the following three conditions hold:
\(\mu ^* \le \hat{\mu }_{t, a^*} + \sqrt{\frac{c \log (t + 1)}{N_{t,a^*}}}\),
\(\hat{\mu }_{t,A_t} - \sqrt{\frac{c \log (t + 1)}{N_{t,A_t}}} \le \mu _{A_t}\),
\(\hat{\mu }_{t, a^*} + \sqrt{\frac{c \log (t + 1)}{N_{t,a^*}}} \le \hat{\mu }_{t,A_t} + \sqrt{\frac{c \log (t + 1)}{N_{t,A_t}}}\).
Then if \(N_{t,A_t} {\gt} 0\) we have
\[ \Delta _{A_t} \le 2 \sqrt{\frac{c \log (t + 1)}{N_{t,A_t}}} . \]
And in turn, if \(\Delta _{A_t} {\gt} 0\) we get
\[ N_{t,A_t} \le \frac{4 c \log (t + 1)}{\Delta _{A_t}^2} . \]
Note that the third condition is always satisfied for UCB by Lemma 5.7, but this lemma, as stated, is independent of the UCB algorithm.
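For concreteness, the three conditions chain together as follows (simple algebra under the hypotheses of the lemma):
\[ \mu ^* \le \hat{\mu }_{t, a^*} + \sqrt{\frac{c \log (t + 1)}{N_{t,a^*}}} \le \hat{\mu }_{t,A_t} + \sqrt{\frac{c \log (t + 1)}{N_{t,A_t}}} \le \mu _{A_t} + 2 \sqrt{\frac{c \log (t + 1)}{N_{t,A_t}}} , \]
so \(\Delta _{A_t} = \mu ^* - \mu _{A_t} \le 2 \sqrt{c \log (t + 1) / N_{t,A_t}}\), and squaring and rearranging gives the bound on \(N_{t,A_t}\).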
Suppose that \(\nu (a)\) is 1-sub-Gaussian for all arms \(a \in [K]\). Let \(c \ge 0\) be a real number. Then for any time \(n \in \mathbb {N}\) and any arm \(a \in [K]\), we have
And also,
For \(C\) a natural number, for any time \(n \in \mathbb {N}\) and any arm \(a \in [K]\), we have
For the UCB algorithm with parameter \(c \ge 0\), for any time \(n \in \mathbb {N}\) and any arm \(a \in [K]\) with positive gap, the first sum in Lemma 5.10 is equal to zero for positive \(C\) such that \(C \ge \frac{4 c \log (n + 1)}{\Delta _a^2}\).
Suppose that \(\nu (a)\) is 1-sub-Gaussian for all arms \(a \in [K]\). For the UCB algorithm with parameter \(c {\gt} 0\), for any time \(n \in \mathbb {N}\) and any arm \(a \in [K]\) with positive gap, we have
Suppose that \(\nu (a)\) is 1-sub-Gaussian for all arms \(a \in [K]\). For the UCB algorithm with parameter \(c {\gt} 0\), for any time \(n \in \mathbb {N}\), we have
TODO: for \(c {\gt} 4\), the sum converges to a constant, so we get a logarithmic regret bound.
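For reference: if the summands in question are of order \((n+1)^{1 - c/2}\) (an assumption about the precise statement above, but the typical order produced by a bonus of the form \(\sqrt{c \log (t + 1) / N_{t,a}}\)), then
\[ \sum _{n=1}^{\infty } (n+1)^{1 - c/2} {\lt} \infty \iff 1 - \frac{c}{2} {\lt} -1 \iff c {\gt} 4 , \]
by comparison with the Riemann series \(\sum _n n^{-p}\), which converges if and only if \(p {\gt} 1\).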