- Boxes: definitions
- Ellipses: theorems and lemmas
- Blue border: the statement of this result is ready to be formalized; all prerequisites are done
- Orange border: the statement of this result is not ready to be formalized; the blueprint needs more work
- Blue background: the proof of this result is ready to be formalized; all prerequisites are done
- Green border: the statement of this result is formalized
- Green background: the proof of this result is formalized
- Dark green background: the proof of this result and all its ancestors are formalized
- Dark green border: the statement of this result is in Mathlib
Let \(\mathfrak {A}\) be an algorithm with action space \(\mathcal{A}\) and reward space \(\mathcal{R}\), policy \(\pi \) and initial distribution \(P_0\). If \(\mathcal{A}\) and \(\mathcal{R}\) are standard Borel spaces, there exist jointly measurable functions \(f'_0 : I \to \mathcal{A}\) and \(f_t : (\mathcal{A} \times \mathcal{R})^{t+1} \times I \to \mathcal{A}\) such that
the law of \(f'_0\) is \(P_0\),
for every history \(h_t \in (\mathcal{A} \times \mathcal{R})^{t+1}\), the law of \(f_t(h_t, \cdot )\) is \(\pi _t(h_t)\).
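Concretely, this is the usual noise-outsourcing construction. The following Python sketch (an informal illustration, not part of the formalization) shows such a map for a policy over finitely many actions via the inverse CDF; the helpers `policy` and `action_from_uniform` are ours.

```python
import random

def action_from_uniform(probs, u):
    """Inverse-CDF map: turn a uniform draw u in [0, 1) into a sample from the
    finite distribution `probs`, where probs[a] is the probability of action a."""
    cumulative = 0.0
    for action, p in enumerate(probs):
        cumulative += p
        if u < cumulative:
            return action
    return len(probs) - 1  # guard against floating-point rounding near u = 1

def policy(history):
    """Toy policy over 3 actions: favour action 0 on an empty history."""
    return [0.7, 0.2, 0.1] if not history else [1 / 3, 1 / 3, 1 / 3]

u = random.random()                        # plays the role of the uniform input in I
a = action_from_uniform(policy([]), u)     # f_t(history, u): deterministic given (history, u)
```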
A sequential, stochastic algorithm with actions in a measurable space \(\mathcal{A}\) and observations in a measurable space \(\mathcal{R}\) is described by the following data:
for all \(t \in \mathbb {N}\), a policy \(\pi _t : (\mathcal{A} \times \mathcal{R})^{t+1} \rightsquigarrow \mathcal{A}\), a Markov kernel which gives the distribution of the action of the algorithm at time \(t+1\) given the history of previous actions and observations,
\(P_0 \in \mathcal{P}(\mathcal{A})\), a probability measure that gives the distribution of the first action.
The history, actions and rewards on the array model probability space \((\Omega _{\mathcal{A}}, P_{\mathcal{A}})\) are defined as follows:
the action at time \(0\) is \(A_0(\omega ) = f'_0(\omega _{1,0})\), the reward at time \(0\) is \(R_0(\omega ) = \omega _{2,0,A_0(\omega )}\), and the history at time \(0\) is \(H_0(\omega ) = (A_0(\omega ), R_0(\omega ))\),
for \(t \ge 0\), the action at time \(t+1\) is \(A_{t+1}(\omega ) = f_t(H_t(\omega ), \omega _{1,t+1})\), the reward at time \(t+1\) is \(R_{t+1}(\omega ) = \omega _{2,N_{t+1,A_{t+1}(\omega )},A_{t+1}(\omega )}\), and the history at time \(t+1\) is \(H_{t+1}(\omega ) = (H_t(\omega ), (A_{t+1}(\omega ), R_{t+1}(\omega )))\).
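To make the indexing concrete, here is an illustrative Python sketch of the array model: the rewards of each arm are pre-drawn into a table, and the reward of a pull is read off at the row given by that arm's current pull count. The helper `choose_action` stands in for the maps \(f_t\); it and the example means are ours.

```python
import random

K, T = 3, 10                       # number of arms and horizon (illustrative)
mu = [0.2, 0.5, 0.8]               # illustrative arm means

# omega_2: reward table, reward_table[n][a] = reward of the (n+1)-th pull of arm a
reward_table = [[random.gauss(mu[a], 1.0) for a in range(K)] for _ in range(T)]
# omega_1: fresh uniform randomness used by the action-selection maps f_t
action_noise = [random.random() for _ in range(T)]

def choose_action(history, u):
    """Stand-in for f_t: here simply a uniformly random arm."""
    return int(u * K) % K

history, pulls = [], [0] * K       # pulls[a] tracks N_{t,a}
for t in range(T):
    a = choose_action(history, action_noise[t])   # A_t from the history and omega_{1,t}
    r = reward_table[pulls[a]][a]                 # R_t = omega_{2, N_{t, A_t}, A_t}
    pulls[a] += 1                                 # the pulled arm's count increases by one
    history.append((a, r))                        # H_t = (H_{t-1}, (A_t, R_t))
```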
We define the following functions on \(\Omega _{\mathcal{A}}\):
In the definition of \(F_{2, a, t}\) and \(F_{2, a}^m\), if \(N_{t+1,a}(\omega ) = 0\) (resp. \(m = 0\)), then the second component is a constant sequence equal to an arbitrary value.
\(F_{1, t}(\omega )\) contains all the information in \(\omega \) except for the action selection randomness after time \(t\).
\(F_{2, a, t}(\omega )\) contains all the information in \(\omega \) except the rewards for arm \(a\) indexed by \(N_{t+1,a}(\omega )\) or more. \(F_{2, a}^m(\omega )\) is similar, but removes the rewards for arm \(a\) indexed by \(m\) or more.
For an arm \(a \in \mathcal{A}\), we denote by \(\mu _a\) the mean of the rewards for that arm, that is \(\mu _a = \nu (a)[\mathrm{id}]\). We denote by \(\mu ^*\) the mean of the best arm, that is \(\mu ^* = \max _{a \in \mathcal{A}} \mu _a\).
Let \(I = [0,1]\) and let \(P_U\) be the uniform distribution on \(I\). We define the probability space \((\Omega _{\mathcal{A}}, P_{\mathcal{A}})\), where \(\Omega _{\mathcal{A}} = I^{\mathbb{N}} \times \mathbb{R}^{\mathbb{N} \times \mathcal{A}}\) and \(P_{\mathcal{A}} = P_U^{\otimes \mathbb{N}} \otimes \left(\bigotimes _{n \in \mathbb{N}, a \in \mathcal{A}} \nu (a)\right)\).
A stochastic bandit is simply a reward distribution for each arm: a Markov kernel \(\nu : \mathcal{A} \rightsquigarrow \mathbb {R}\), the conditional distribution of the reward given the arm pulled. It is a stationary environment in which the observation space is \(\mathcal{R} = \mathbb {R}\).
As in Definition 2.13, an algorithm \((\pi , P_0)\) and a bandit \(\nu \) together define a probability distribution \(\mathbb {P}_{\mathcal{T}}\) on the space \(\Omega _{\mathcal{T}} := (\mathcal{A} \times \mathbb {R})^{\mathbb {N}}\), the space of infinite sequences of arms and rewards. We augment that probability space with a stream of rewards from each arm, independent of the bandit interaction, to get the probability space \((\Omega , \mathbb {P})\), where \(\Omega = \Omega _{\mathcal{T}} \times \mathbb {R}^{\mathbb {N} \times \mathcal{A}}\) and \(\mathbb {P} = \mathbb {P}_{\mathcal{T}} \otimes (\bigotimes _{n \in \mathbb {N}, a \in \mathcal{A}} \nu (a))\).
Let \(\hat{\mu }_{t, a} = \frac{S_{t, a}}{N_{t, a}} = \frac{1}{N_{t,a}} \sum _{s=0}^{t-1} R_s \mathbb {I}\{ A_s = a\} \) if \(N_{t, a} > 0\), and \(\hat{\mu }_{t, a} = 0\) otherwise. This is the empirical mean of the rewards obtained by choosing action \(a\) before time \(t\).
An environment with which an algorithm interacts is described by the following data:
for all \(t \in \mathbb {N}\), a feedback \(\nu _t : (\mathcal{A} \times \mathcal{R})^{t+1} \times \mathcal{A} \rightsquigarrow \mathcal{R}\), a Markov kernel which gives the distribution of the observation at time \(t+1\) given the history of previous pulls and observations, and the action of the algorithm at time \(t+1\),
\(\nu '_0 : \mathcal{A} \rightsquigarrow \mathcal{R}\), a Markov kernel that gives the distribution of the first observation given the first action.
The Explore-Then-Commit (ETC) algorithm with parameter \(m \in \mathbb {N}\) is defined as follows:
for \(t < Km\), \(A_t = t \mod K\) (pull each arm \(m\) times),
compute \(\hat{A}_m^* = \arg \max _{a \in [K]} \hat{\mu }_a\), where \(\hat{\mu }_a = \frac{1}{m} \sum _{t=0}^{Km-1} \mathbb {I}(A_t = a) R_t\) is the empirical mean of the rewards for arm \(a\),
for \(t \ge Km\), \(A_t = \hat{A}_m^*\) (pull the empirical best arm).
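For illustration only, a minimal Python sketch of Explore-Then-Commit run against a Gaussian bandit; the function names and the example means are ours.

```python
import random

def etc(K, m, T, pull):
    """Explore-Then-Commit: pull each of the K arms m times, then commit to the
    arm with the largest empirical mean; `pull(a)` returns a reward for arm a."""
    sums = [0.0] * K
    for t in range(K * m):                             # exploration: A_t = t mod K
        a = t % K
        sums[a] += pull(a)
    best = max(range(K), key=lambda a: sums[a] / m)    # empirical best arm
    rewards = [pull(best) for _ in range(K * m, T)]    # commit phase: A_t = best
    return best, rewards

mu = [0.2, 0.5, 0.8]                                   # illustrative means
best, _ = etc(K=3, m=20, T=200, pull=lambda a: random.gauss(mu[a], 1.0))
```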
For two sequences of random variables \(A : \mathbb {N} \to \Omega \to \mathcal{A}\) and \(R : \mathbb {N} \to \Omega \to \mathcal{R}\) (actions and observations), the step of the interaction at time \(t\) is the random variable \(X_t : \Omega \to \mathcal{A} \times \mathcal{R}\) defined by \(X_t(\omega ) = (A_t(\omega ), R_t(\omega ))\). The history up to time \(t\) is the random variable \(H_t : \Omega \to (\mathcal{A} \times \mathcal{R})^{t+1}\) defined by \(H_t(\omega ) = (X_0(\omega ), \ldots , X_t(\omega ))\).
Let \(\mathfrak {A}\) be an algorithm as in Definition 2.1 and \(\mathfrak {E}\) be an environment as in Definition 2.2. A probability space \((\Omega , P)\) and two sequences of random variables \(A : \mathbb {N} \to \Omega \to \mathcal{A}\) and \(R : \mathbb {N} \to \Omega \to \mathcal{R}\) form an algorithm-environment interaction for \(\mathfrak {A}\) and \(\mathfrak {E}\) if the following conditions hold:
The law of \(A_0\) is \(P_0\).
\(P \left[ R_0 \mid A_0 \right] = \nu '_0\).
For all \(t \in \mathbb {N}\), \(P\left[A_{t+1} \mid A_0, R_0, \ldots , A_t, R_t \right] = \pi _t\).
For all \(t \in \mathbb {N}\), \(P\left[R_{t+1} \mid A_0, R_0, \ldots , A_t, R_t, A_{t+1}\right] = \nu _t\).
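The four conditions above describe the familiar sequential sampling loop. The following Python sketch is an informal illustration of that loop in which each kernel is represented by a sampling function; all names and the toy environment are ours.

```python
import random

def interact(sample_p0, sample_policy, sample_nu0, sample_feedback, T):
    """Run T steps of an algorithm-environment interaction.
    sample_p0() ~ P_0, sample_policy(h) ~ pi_t(h),
    sample_nu0(a) ~ nu'_0(a), sample_feedback(h, a) ~ nu_t(h, a)."""
    a = sample_p0()                          # the law of A_0 is P_0
    r = sample_nu0(a)                        # R_0 given A_0 is drawn from nu'_0
    history = [(a, r)]
    for _ in range(T - 1):
        a = sample_policy(history)           # A_{t+1} given the history is drawn from pi_t
        r = sample_feedback(history, a)      # R_{t+1} given the history and A_{t+1}, from nu_t
        history.append((a, r))
    return history

# Toy example with two actions and Gaussian observations
hist = interact(
    sample_p0=lambda: random.randrange(2),
    sample_policy=lambda h: random.randrange(2),
    sample_nu0=lambda a: random.gauss(0.5 * a, 1.0),
    sample_feedback=lambda h, a: random.gauss(0.5 * a, 1.0),
    T=10,
)
```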
For an algorithm-environment interaction \((A, R, P)\) as in Definition 2.5, we denote by \(\mathcal{F}_t\) the sigma-algebra generated by the history up to time \(t\): \(\mathcal{F}_t = \sigma (H_t)\). We denote by \(\mathcal{F}^A_t\) the sigma-algebra generated by the history up to time \(t-1\) and the action at time \(t\): \(\mathcal{F}^A_t = \sigma (H_{t-1}, A_t)\).
We write \(A_t\) and \(R_t\) for the projections of \(X_t\) on \(\mathcal{A}\) and \(\mathcal{R}\) respectively. \(A_t\) is the action taken at time \(t\) and \(R_t\) is the reward received at time \(t\). Formally, \(A_t(\omega ) = \omega _{t,1}\) and \(R_t(\omega ) = \omega _{t,2}\) for \(\omega = \prod _{t=0}^{+\infty }(\omega _{t,1}, \omega _{t,2}) \in \Omega _{\mathcal{T}} = \prod _{t=0}^{+\infty } \mathcal{A} \times \mathcal{R}\).
For \(t \in \mathbb {N}\), we denote by \(X_t \in \Omega _t\) the random variable describing the time step \(t\), and by \(H_t \in \prod _{s=0}^t \Omega _s\) the history up to time \(t\). Formally, these are measurable functions on \(\Omega _{\mathcal{T}}\), defined by \(X_t(\omega ) = \omega _t\) and \(H_t(\omega ) = (\omega _0, \ldots , \omega _t)\).
The regret \(R_T\) of a sequence of arms \(A_0, \ldots , A_{T-1}\) after \(T\) pulls is the difference between the cumulative reward of always playing the best arm and the cumulative reward of the sequence:
\[ R_T = T \mu ^* - \sum _{t=0}^{T-1} \mu _{A_t} \; . \]
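For instance, with two arms of means \(\mu _1 = 0.5\) and \(\mu _2 = 0.7\) (so \(\mu ^* = 0.7\)), the arm sequence \(A_0, \ldots , A_3 = (1, 2, 1, 2)\) has regret \(R_4 = 4 \cdot 0.7 - (0.5 + 0.7 + 0.5 + 0.7) = 0.4\).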
We define \(Y_{n, a} = R_{T_{n,a}} \mathbb {I}\{ T_{n, a} < \infty \} + Z_{n,a} \mathbb {I}\{ T_{n, a} = \infty \} \), the reward received when choosing action \(a\) for the \(n\)-th time if that time is finite, and equal to \(Z_{n,a}\) otherwise. In that expression, we see \(R_{T_{n,a}}\) and \(T_{n, a}\) as random variables on \(\Omega \) instead of \(\Omega _{\mathcal{T}}\).
An environment is stationary if there exists a Markov kernel \(\nu : \mathcal{A} \rightsquigarrow \mathcal{R}\) such that \(\nu '_0 = \nu \) and for all \(t \in \mathbb {N}\), for all \(h_t \in (\mathcal{A} \times \mathcal{R})^{t+1}\), for all \(a \in \mathcal{A}\), \(\nu _t(h_t, a) = \nu (a)\).
For an action \(a \in \mathcal{A}\) and a time \(n \in \mathbb {N}\), we denote by \(T_{n,a} \in \mathbb {N} \cup \{ +\infty \} \) the time at which action \(a\) was chosen for the \(n\)-th time, that is \(T_{n,a} = \min \{ s \in \mathbb {N} \mid N_{s+1,a} = n\} \). Note that \(T_{n, a}\) can be infinite if the action is not chosen \(n\) times.
For \(\mu \in \mathcal{P}(\Omega _0)\), the probability measure \(\xi _0 \circ \mu \) on \(\Omega _{\mathcal{T}} := \prod _{i=0}^{\infty } \Omega _i\) is called the trajectory measure and denoted by \(P_{\mathcal{T}}\). The \(\mathcal{T}\) subscript stands for “trajectory”.
The UCB algorithm with parameter \(c \in \mathbb {R}_+\) is defined as follows:
for \(t < K\), \(A_t = t \mod K\) (pull each arm once),
for \(t \ge K\), \(A_t = \arg \max _{a \in [K]} \left( \hat{\mu }_{t,a} + \sqrt{\frac{c \log (t + 1)}{N_{t,a}}} \right)\), where \(\hat{\mu }_{t,a} = \frac{1}{N_{t,a}} \sum _{s=0}^{t-1} \mathbb {I}(A_s = a) R_s\) is the empirical mean of the rewards for arm \(a\).
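Again for illustration only, a minimal Python sketch of UCB on a Gaussian bandit; the function names and example means are ours.

```python
import math
import random

def ucb(K, c, T, pull):
    """UCB with exploration parameter c: pull each arm once, then pull the arm
    maximising the empirical mean plus sqrt(c * log(t + 1) / N_{t,a})."""
    counts = [0] * K                  # N_{t,a}
    sums = [0.0] * K                  # S_{t,a}
    actions = []
    for t in range(T):
        if t < K:
            a = t                     # pull each arm once
        else:
            a = max(range(K), key=lambda i: sums[i] / counts[i]
                    + math.sqrt(c * math.log(t + 1) / counts[i]))
        r = pull(a)
        counts[a] += 1
        sums[a] += r
        actions.append(a)
    return actions

mu = [0.2, 0.5, 0.8]                  # illustrative means
actions = ucb(K=3, c=2.0, T=500, pull=lambda a: random.gauss(mu[a], 1.0))
```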
Let \(\omega , \omega ' \in \Omega _{\mathcal{A}}\) and \(t \in \mathbb {N}\). Suppose that
Then \(H_t(\omega ) = H_t(\omega ')\).
\(H_t\), \(N_{t,A_t}\), \(A_t\) and \(R_t\) are measurable for all \(t \in \mathbb {N}\).
For all \(t \in \mathbb {N}\), \(H_t\) is measurable with respect to the sigma-algebra generated by \(F_{1, t}\), and with respect to the sigma-algebra generated by \(F_{2, a, t}\) for any arm \(a \in \mathcal{A}\).
Let \(\omega , \omega ' \in \Omega _{\mathcal{A}}\), \(t, m \in \mathbb {N}\) and \(a \in \mathcal{A}\). Suppose that
Then \((N_{t+1, a}(\omega ) = m \wedge A_{t+1}(\omega ) = a) \iff (N_{t+1, a}(\omega ') = m \wedge A_{t+1}(\omega ') = a)\).
For a random variable \(X\) on a countable space with the discrete sigma-algebra, \(\mathcal{L}(Y \mid X) = (x \mapsto \mathcal{L}(Y \mid X = x))\), \((X_*P)\)-almost surely. Furthermore, that almost sure equality means that for all \(x\) such that \(P(X = x) > 0\), we have \(\mathcal{L}(Y \mid X = x) = \mathcal{L}(Y \mid X)(x)\).
If \(X \perp \! \! \! \! \perp Y \mid Z, W\) and \(X \perp \! \! \! \! \perp Z \mid W\), then \(X \perp \! \! \! \! \perp (Y, Z) \mid W\).
Suppose that \(\nu (a)\) is 1-sub-Gaussian for all arms \(a \in [K]\). For the UCB algorithm with parameter \(c > 0\), for any time \(n \in \mathbb {N}\) and any arm \(a \in [K]\) with positive gap, we have
Suppose that the following three conditions hold:
\(\mu ^* \le \hat{\mu }_{t, a^*} + \sqrt{\frac{c \log (t + 1)}{N_{t,a^*}}}\),
\(\hat{\mu }_{t,A_t} - \sqrt{\frac{c \log (t + 1)}{N_{t,A_t}}} \le \mu _{A_t}\),
\(\hat{\mu }_{t, a^*} + \sqrt{\frac{c \log (t + 1)}{N_{t,a^*}}} \le \hat{\mu }_{t,A_t} + \sqrt{\frac{c \log (t + 1)}{N_{t,A_t}}}\).
Then if \(N_{t,A_t} > 0\), chaining the three inequalities gives \(\mu ^* \le \mu _{A_t} + 2\sqrt{\frac{c \log (t + 1)}{N_{t,A_t}}}\), that is,
\[ \Delta _{A_t} \le 2\sqrt{\frac{c \log (t + 1)}{N_{t,A_t}}} \; . \]
And in turn, if \(\Delta _{A_t} > 0\) we get
\[ N_{t,A_t} \le \frac{4 c \log (t + 1)}{\Delta _{A_t}^2} \; . \]
Note that the third condition is always satisfied for UCB by Lemma 5.7, but this lemma, as stated, is independent of the UCB algorithm.
If \(X \perp \! \! \! \! \perp Y \mid Z\) and \(X \perp \! \! \! \! \perp Z\), then \(X \perp \! \! \! \! \perp (Y, Z)\).
The random variables \(A_t\) and \(R_t\) are \(\mathcal{F}_t\)-measurable. Said differently, the processes \((A_t)_{t \in \mathbb {N}}\) and \((R_t)_{t \in \mathbb {N}}\) are adapted to the filtration \((\mathcal{F}_t)_{t \in \mathbb {N}}\).
The random variables \(X_t\) and \(H_t\) are \(\mathcal{F}_t\)-measurable. Said differently, the processes \((X_t)_{t \in \mathbb {N}}\) and \((H_t)_{t \in \mathbb {N}}\) are adapted to the filtration \((\mathcal{F}_t)_{t \in \mathbb {N}}\).
Let \(X_1, \ldots , X_n\) be random variables such that \(X_i - P[X_i]\) is \(\sigma _{X,i}^2\)-sub-Gaussian for \(i \in [n]\). Let \(Y_1, \ldots , Y_m\) be random variables such that \(Y_i - P[Y_i]\) is \(\sigma _{Y,i}^2\)-sub-Gaussian for \(i \in [m]\). Suppose further that the vectors \(X\) and \(Y\) are independent and that \(\sum _{i = 1}^m P[Y_i] \le \sum _{i = 1}^n P[X_i]\). Then
Suppose that \(\nu (a)\) is 1-sub-Gaussian for all arms \(a \in [K]\). Then for the Explore-Then-Commit algorithm with parameter \(m\), for any arm \(a \in [K]\) with \(\Delta _a > 0\), we have \(\mathbb {P}(\hat{A}_m^* = a) \le \exp \left(- \frac{m \Delta _a^2}{4}\right)\).
In the array model, for \(t \in \mathbb {N}\), \(a \in \mathcal{A}\), and a measurable set \(B \subseteq \mathbb {N} \times \mathbb {R}\),
As a consequence, this also holds for any algorithm-environment sequence.
Let \(\nu (a)\) be a 1-sub-Gaussian distribution on \(\mathbb {R}\) for each arm \(a \in \mathcal{A}\). Let \(c \ge 0\) be a real number and \(k\) a positive natural number. Then
The same upper bound holds for the upper tail:
In the array model,
As a consequence, this also holds for any algorithm-environment sequence.
Suppose that \(\nu (a)\) is 1-sub-Gaussian for all arms \(a \in [K]\). Let \(c \ge 0\) be a real number. Then for any time \(n \in \mathbb {N}\) and any arm \(a \in [K]\), we have
And also,
Let \(\nu (a)\) be a 1-sub-Gaussian distribution on \(\mathbb {R}\) for each arm \(a \in \mathcal{A}\).
We note the following basic properties of \(N_{t,a}\):
\(N_{0,a} = 0\).
\(N_{t,a}\) is non-decreasing in \(t\).
\(N_{t + 1, A_t} = N_{t, A_t} + 1\) and for \(a \ne A_t\), \(N_{t + 1, a} = N_{t, a}\).
\(N_{t, a} \le t\).
If for all \(s \le t\), \(A_s(\omega ) = A_s(\omega ')\), then \(N_{t+1, a}(\omega ) = N_{t+1, a}(\omega ')\).
For \(C\) a natural number, for any time \(n \in \mathbb {N}\) and any arm \(a \in [K]\), we have
We note the following basic properties of \(T_{n,a}\):
\(T_{0,a} = 0\) for \(a \ne A_0\). \(T_{0,A_0} = \infty \).
\(T_{N_{t+1, a}, a} \le t\).
\(T_{N_{t + 1, A_t}, A_t} = t\).
If \(T_{n, a} \ne \infty \) and \(n > 0\), then \(A_{T_{n, a}} = a\).
If \(T_{n, a} \ne \infty \), then \(N_{T_{n, a} + 1, a} = n\).
If \(T_{n, a} \ne \infty \) and \(n > 0\), then \(N_{T_{n, a}, a} = n - 1\).
If for all \(s \le t\), \(A_s(\omega ) = A_s(\omega ')\), then \(T_{n, a}(\omega ) = t \iff T_{n, a}(\omega ') = t\).
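These identities, together with the properties of \(N_{t,a}\) above, are easy to sanity-check numerically. The following illustrative Python snippet verifies a few of them on a random action sequence; the helpers `N` and `T_pull` are ours.

```python
import random

K = 3
actions = [random.randrange(K) for _ in range(50)]   # a random action sequence A_0, A_1, ...

def N(t, a):
    """Number of times arm a is pulled before time t."""
    return sum(1 for s in range(t) if actions[s] == a)

def T_pull(n, a):
    """Time of the n-th pull of arm a: min{s : N(s + 1, a) = n}, or None if no such s."""
    return next((s for s in range(len(actions)) if N(s + 1, a) == n), None)

for t in range(len(actions) - 1):
    a_t = actions[t]
    assert N(t, a_t) <= t                                        # N_{t,a} <= t
    assert N(t + 1, a_t) == N(t, a_t) + 1                        # pulled arm: count + 1
    assert all(N(t + 1, a) == N(t, a) for a in range(K) if a != a_t)
    assert T_pull(N(t + 1, a_t), a_t) == t                       # T_{N_{t+1,A_t}, A_t} = t
```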
Suppose that \(\nu (a)\) is 1-sub-Gaussian for all arms \(a \in [K]\). For the UCB algorithm with parameter \(c > 0\), for any time \(n \in \mathbb {N}\), we have
Let \(X_1, \ldots , X_n\) be independent random variables such that \(X_i\) is \(\sigma _i^2\)-sub-Gaussian for \(i \in [n]\). Then for any \(t \ge 0\),
Let \((\Omega _t)_{t \in \mathbb {N}}\) be a family of measurable spaces. Let \((\kappa _t)_{t \in \mathbb {N}}\) be a family of Markov kernels such that for any \(t\), \(\kappa _t\) is a kernel from \(\prod _{i=0}^t \Omega _{i}\) to \(\Omega _{t+1}\). Then there exists a unique Markov kernel \(\xi : \Omega _0 \rightsquigarrow \prod _{i = 1}^{\infty } \Omega _{i}\) such that for any \(n \ge 1\), \(\pi _{[1,n]*} \xi = \kappa _0 \otimes \ldots \otimes \kappa _{n-1}\). Here \(\pi _{[1,n]} : \prod _{i=1}^{\infty } \Omega _i \to \prod _{i=1}^n \Omega _i\) is the projection on the first \(n\) coordinates.
In the probability space \((\Omega _{\mathcal{T}}, P_{\mathcal{T}})\) constructed from an algorithm \(\mathfrak {A}\) and an environment \(\mathfrak {E}\) as above, the sequences of random variables \(A : \mathbb {N} \to \Omega _{\mathcal{T}} \to \mathcal{A}\) and \(R : \mathbb {N} \to \Omega _{\mathcal{T}} \to \mathcal{R}\) form an algorithm-environment interaction for \(\mathfrak {A}\) and \(\mathfrak {E}\).
If \((A, R, P)\) and \((A', R', P')\) are two algorithm-environment interactions for the same algorithm \(\mathfrak {A}\) and environment \(\mathfrak {E}\), then the joint distributions of the sequences of actions and observations are equal: the law of \((A_i, R_i)_{i \in \mathbb {N}}\) under \(P\) is equal to the law of \((A'_i, R'_i)_{i \in \mathbb {N}}\) under \(P'\).
Suppose that \(\nu (a)\) is 1-sub-Gaussian for all arms \(a \in [K]\). Then for the Explore-Then-Commit algorithm with parameter \(m\), the expected regret after \(T\) pulls with \(T \ge Km\) is bounded by
\[ m \sum _{a \in [K]} \Delta _a + (T - Km) \sum _{a \in [K]} \Delta _a \exp \left(- \frac{m \Delta _a^2}{4}\right) \; . \]