
LeanBandits.BanditAlgorithms.ETC

The Explore-Then-Commit Algorithm

theorem ae_eq_set_iff {α : Type u_1} {mα : MeasurableSpace α} {μ : MeasureTheory.Measure α} {s t : Set α} :
s =ᵐ[μ] t ↔ ∀ᵐ (a : α) ∂μ, a ∈ s ↔ a ∈ t
noncomputable def Bandits.ETC.nextArm {K : ℕ} (hK : 0 < K) (m n : ℕ) (h : ↑(Finset.Iic n) → Fin K × ℝ) :
Fin K

Arm pulled by the ETC algorithm at time n + 1. For n < K * m - 1, this is arm (n + 1) % K, continuing the round-robin exploration. For n = K * m - 1, this is the arm with the highest empirical mean after the exploration phase. For n ≥ K * m, this is the same arm as the one pulled at time n.

Equations
Instances For
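    A minimal, self-contained sketch of this selection rule (a hypothetical helper, not the library's definition, which works with histories indexed by Finset.Iic n and real-valued rewards): here the history is a list of (arm, reward) pairs in chronological order, rewards are Floats, and the returned value is the arm to pull at round n + 1.

    def etcNextArmSketch (K m n : Nat) (hist : List (Nat × Float)) : Nat :=
      -- empirical mean of arm `a` over the recorded history (0 if never pulled)
      let empMean (a : Nat) : Float :=
        let rewards := (hist.filter (fun p => p.1 == a)).map Prod.snd
        if rewards.isEmpty then 0
        else rewards.foldl (· + ·) 0 / rewards.length.toFloat
      if n + 1 < K * m then
        (n + 1) % K  -- exploration phase: round-robin over the K arms
      else if n + 1 = K * m then
        -- end of exploration: commit to the arm with the largest empirical mean
        (List.range K).foldl (fun best a => if empMean a > empMean best then a else best) 0
      else
        -- committed phase: repeat the arm recorded in the most recent round
        ((hist.getLast?).map Prod.fst).getD 0

    In the committed phase the most recent recorded arm is already the committed one, so repeating it matches the behaviour stated in arm_add_one_of_ge below.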
    theorem Bandits.ETC.measurable_nextArm {K : ℕ} (hK : 0 < K) (m n : ℕ) :
    Measurable (nextArm hK m n)
    The next arm pulled by ETC is chosen in a measurable way.

    noncomputable def Bandits.etcAlgorithm {K : ℕ} (hK : 0 < K) (m : ℕ) :

    The Explore-Then-Commit algorithm: a deterministic algorithm that chooses the next arm according to ETC.nextArm.

    Equations
    Instances For
      theorem Bandits.ETC.arm_zero {K : ℕ} {hK : 0 < K} {m : ℕ} {ν : ProbabilityTheory.Kernel (Fin K) ℝ} [ProbabilityTheory.IsMarkovKernel ν] {Ω : Type u_1} {mΩ : MeasurableSpace Ω} {P : MeasureTheory.Measure Ω} [MeasureTheory.IsProbabilityMeasure P] {A : ℕ → Ω → Fin K} {R : ℕ → Ω → ℝ} [Nonempty (Fin K)] (h : Learning.IsAlgEnvSeq A R (etcAlgorithm hK m) (Learning.stationaryEnv ν) P) :
      A 0 =ᵐ[P] fun (x : Ω) => ⟨0, hK⟩
      theorem Bandits.ETC.arm_ae_eq_etcNextArm {K : ℕ} {hK : 0 < K} {m : ℕ} {ν : ProbabilityTheory.Kernel (Fin K) ℝ} [ProbabilityTheory.IsMarkovKernel ν] {Ω : Type u_1} {mΩ : MeasurableSpace Ω} {P : MeasureTheory.Measure Ω} [MeasureTheory.IsProbabilityMeasure P] {A : ℕ → Ω → Fin K} {R : ℕ → Ω → ℝ} [Nonempty (Fin K)] (h : Learning.IsAlgEnvSeq A R (etcAlgorithm hK m) (Learning.stationaryEnv ν) P) (n : ℕ) :
      A (n + 1) =ᵐ[P] fun (ω : Ω) => nextArm hK m n (Learning.IsAlgEnvSeq.hist A R n ω)
      theorem Bandits.ETC.arm_of_lt {K : ℕ} {hK : 0 < K} {m : ℕ} {ν : ProbabilityTheory.Kernel (Fin K) ℝ} [ProbabilityTheory.IsMarkovKernel ν] {Ω : Type u_1} {mΩ : MeasurableSpace Ω} {P : MeasureTheory.Measure Ω} [MeasureTheory.IsProbabilityMeasure P] {A : ℕ → Ω → Fin K} {R : ℕ → Ω → ℝ} [Nonempty (Fin K)] (h : Learning.IsAlgEnvSeq A R (etcAlgorithm hK m) (Learning.stationaryEnv ν) P) {n : ℕ} (hn : n < K * m) :
      A n =ᵐ[P] fun (x : Ω) => ⟨n % K, ⋯⟩

      For n < K * m, the arm pulled at time n is the arm n % K.
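      For instance, with K = 3 arms and m = 2, the exploration phase covers times 0, 1, 2, 3, 4, 5 and pulls arms 0, 1, 2, 0, 1, 2, so each arm is pulled exactly m = 2 times before the commit decision at time K * m = 6.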

      theorem Bandits.ETC.arm_mul {K : ℕ} {hK : 0 < K} {m : ℕ} {ν : ProbabilityTheory.Kernel (Fin K) ℝ} [ProbabilityTheory.IsMarkovKernel ν] {Ω : Type u_1} {mΩ : MeasurableSpace Ω} {P : MeasureTheory.Measure Ω} [MeasureTheory.IsProbabilityMeasure P] {A : ℕ → Ω → Fin K} {R : ℕ → Ω → ℝ} [Nonempty (Fin K)] (h : Learning.IsAlgEnvSeq A R (etcAlgorithm hK m) (Learning.stationaryEnv ν) P) (hm : m ≠ 0) :
      A (K * m) =ᵐ[P] fun (ω : Ω) => measurableArgmax (Learning.empMean' (K * m - 1)) (Learning.IsAlgEnvSeq.hist A R (K * m - 1) ω)

      The arm pulled at time K * m is the arm with the highest empirical mean after the exploration phase.

      theorem Bandits.ETC.arm_add_one_of_ge {K : ℕ} {hK : 0 < K} {m : ℕ} {ν : ProbabilityTheory.Kernel (Fin K) ℝ} [ProbabilityTheory.IsMarkovKernel ν] {Ω : Type u_1} {mΩ : MeasurableSpace Ω} {P : MeasureTheory.Measure Ω} [MeasureTheory.IsProbabilityMeasure P] {A : ℕ → Ω → Fin K} {R : ℕ → Ω → ℝ} [Nonempty (Fin K)] (h : Learning.IsAlgEnvSeq A R (etcAlgorithm hK m) (Learning.stationaryEnv ν) P) {n : ℕ} (hm : m ≠ 0) (hn : K * m ≤ n) :
      A (n + 1) =ᵐ[P] fun (ω : Ω) => A n ω

      For n ≥ K * m, the arm pulled at time n + 1 is the same as the arm pulled at time n.

      theorem Bandits.ETC.arm_of_ge {K : ℕ} {hK : 0 < K} {m : ℕ} {ν : ProbabilityTheory.Kernel (Fin K) ℝ} [ProbabilityTheory.IsMarkovKernel ν] {Ω : Type u_1} {mΩ : MeasurableSpace Ω} {P : MeasureTheory.Measure Ω} [MeasureTheory.IsProbabilityMeasure P] {A : ℕ → Ω → Fin K} {R : ℕ → Ω → ℝ} [Nonempty (Fin K)] (h : Learning.IsAlgEnvSeq A R (etcAlgorithm hK m) (Learning.stationaryEnv ν) P) {n : ℕ} (hm : m ≠ 0) (hn : K * m ≤ n) :
      A n =ᵐ[P] A (K * m)

      For n ≥ K * m, the arm pulled at time n is the same as the arm pulled at time K * m.

      theorem Bandits.ETC.pullCount_mul {K : ℕ} {hK : 0 < K} {m : ℕ} {ν : ProbabilityTheory.Kernel (Fin K) ℝ} [ProbabilityTheory.IsMarkovKernel ν] {Ω : Type u_1} {mΩ : MeasurableSpace Ω} {P : MeasureTheory.Measure Ω} [MeasureTheory.IsProbabilityMeasure P] {A : ℕ → Ω → Fin K} {R : ℕ → Ω → ℝ} [Nonempty (Fin K)] (h : Learning.IsAlgEnvSeq A R (etcAlgorithm hK m) (Learning.stationaryEnv ν) P) (a : Fin K) :
      Learning.pullCount A a (K * m) =ᵐ[P] fun (x : Ω) => m

      At time K * m, the number of pulls of each arm is equal to m.

      theorem Bandits.ETC.pullCount_add_one_of_ge {K : ℕ} {hK : 0 < K} {m : ℕ} {ν : ProbabilityTheory.Kernel (Fin K) ℝ} [ProbabilityTheory.IsMarkovKernel ν] {Ω : Type u_1} {mΩ : MeasurableSpace Ω} {P : MeasureTheory.Measure Ω} [MeasureTheory.IsProbabilityMeasure P] {A : ℕ → Ω → Fin K} {R : ℕ → Ω → ℝ} [Nonempty (Fin K)] (h : Learning.IsAlgEnvSeq A R (etcAlgorithm hK m) (Learning.stationaryEnv ν) P) (a : Fin K) (hm : m ≠ 0) {n : ℕ} (hn : K * m ≤ n) :
      Learning.pullCount A a (n + 1) =ᵐ[P] fun (ω : Ω) => Learning.pullCount A a n ω + {ω' : Ω | A (K * m) ω' = a}.indicator (fun (x : Ω) => 1) ω
      theorem Bandits.ETC.pullCount_of_ge {K : ℕ} {hK : 0 < K} {m : ℕ} {ν : ProbabilityTheory.Kernel (Fin K) ℝ} [ProbabilityTheory.IsMarkovKernel ν] {Ω : Type u_1} {mΩ : MeasurableSpace Ω} {P : MeasureTheory.Measure Ω} [MeasureTheory.IsProbabilityMeasure P] {A : ℕ → Ω → Fin K} {R : ℕ → Ω → ℝ} [Nonempty (Fin K)] (h : Learning.IsAlgEnvSeq A R (etcAlgorithm hK m) (Learning.stationaryEnv ν) P) (a : Fin K) (hm : m ≠ 0) {n : ℕ} (hn : K * m ≤ n) :
      Learning.pullCount A a n =ᵐ[P] fun (ω : Ω) => m + (n - K * m) * {ω' : Ω | A (K * m) ω' = a}.indicator (fun (x : Ω) => 1) ω

      For n ≥ K * m, the number of pulls of arm a at time n is m, plus an additional n - K * m if a is the arm the algorithm commits to at time K * m (the arm with the highest empirical mean after the exploration phase).
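      For example, with K = 2, m = 100 and n = 1000, the committed arm has 100 + (1000 - 200) = 900 pulls at time n, while the other arm stays at exactly m = 100 pulls.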

      theorem Bandits.ETC.sumRewards_bestArm_le_of_arm_mul_eq {K : ℕ} {hK : 0 < K} {m : ℕ} {ν : ProbabilityTheory.Kernel (Fin K) ℝ} [ProbabilityTheory.IsMarkovKernel ν] {Ω : Type u_1} {mΩ : MeasurableSpace Ω} {P : MeasureTheory.Measure Ω} [MeasureTheory.IsProbabilityMeasure P] {A : ℕ → Ω → Fin K} {R : ℕ → Ω → ℝ} [Nonempty (Fin K)] (h : Learning.IsAlgEnvSeq A R (etcAlgorithm hK m) (Learning.stationaryEnv ν) P) (a : Fin K) (hm : m ≠ 0) :
      ∀ᵐ (h : Ω) ∂P, A (K * m) h = a → Learning.sumRewards A R (bestArm ν) (K * m) h ≤ Learning.sumRewards A R a (K * m) h

      If at time K * m the algorithm chooses arm a, then the total reward collected from arm a over the first K * m rounds is at least the total reward collected from the best arm over those rounds.

      theorem Bandits.ETC.probReal_sumRewards_le_sumRewards_le {K : ℕ} {hK : 0 < K} {m : ℕ} {ν : ProbabilityTheory.Kernel (Fin K) ℝ} [ProbabilityTheory.IsMarkovKernel ν] {Ω : Type u_1} {mΩ : MeasurableSpace Ω} {P : MeasureTheory.Measure Ω} [MeasureTheory.IsProbabilityMeasure P] {A : ℕ → Ω → Fin K} {R : ℕ → Ω → ℝ} [Nonempty (Fin K)] (h : Learning.IsAlgEnvSeq A R (etcAlgorithm hK m) (Learning.stationaryEnv ν) P) (hν : ∀ (a : Fin K), ProbabilityTheory.HasSubgaussianMGF (fun (x : ℝ) => x - ∫ (x : ℝ), id x ∂ν a) 1 (ν a)) (a : Fin K) :
      P.real {ω : Ω | Learning.sumRewards A R (bestArm ν) (K * m) ω ≤ Learning.sumRewards A R a (K * m) ω} ≤ Real.exp (-↑m * gap ν a ^ 2 / 4)
      theorem Bandits.ETC.prob_arm_mul_eq_le {K : ℕ} {hK : 0 < K} {m : ℕ} {ν : ProbabilityTheory.Kernel (Fin K) ℝ} [ProbabilityTheory.IsMarkovKernel ν] {Ω : Type u_1} {mΩ : MeasurableSpace Ω} {P : MeasureTheory.Measure Ω} [MeasureTheory.IsProbabilityMeasure P] {A : ℕ → Ω → Fin K} {R : ℕ → Ω → ℝ} [Nonempty (Fin K)] (h : Learning.IsAlgEnvSeq A R (etcAlgorithm hK m) (Learning.stationaryEnv ν) P) (hν : ∀ (a : Fin K), ProbabilityTheory.HasSubgaussianMGF (fun (x : ℝ) => x - ∫ (x : ℝ), id x ∂ν a) 1 (ν a)) (a : Fin K) (hm : m ≠ 0) :
      P.real {ω : Ω | A (K * m) ω = a} ≤ Real.exp (-↑m * gap ν a ^ 2 / 4)

      The probability that at time K * m the ETC algorithm chooses arm a is at most exp(- m * Δ_a^2 / 4).
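      A heuristic sketch of where this bound comes from (the standard ETC argument, stated informally under the 1-subgaussian assumption hν): on the event A (K * m) = a, the empirical mean of arm a over its m exploration samples is at least that of the best arm, and the difference of the two centered empirical means is subgaussian with variance proxy 2 / m, so

      \[
        P(A_{Km} = a)
          \le P\big(\hat{\mu}_a \ge \hat{\mu}_{a^*}\big)
          = P\big((\hat{\mu}_a - \mu_a) - (\hat{\mu}_{a^*} - \mu_{a^*}) \ge \Delta_a\big)
          \le \exp\left(-\frac{\Delta_a^2}{2 \cdot (2/m)}\right)
          = \exp\left(-\frac{m \Delta_a^2}{4}\right),
      \]

      where \Delta_a = gap ν a and \hat{\mu} denotes empirical means at time K * m.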

      theorem Bandits.ETC.expectation_pullCount_le {K : ℕ} {hK : 0 < K} {m : ℕ} {ν : ProbabilityTheory.Kernel (Fin K) ℝ} [ProbabilityTheory.IsMarkovKernel ν] {Ω : Type u_1} {mΩ : MeasurableSpace Ω} {P : MeasureTheory.Measure Ω} [MeasureTheory.IsProbabilityMeasure P] {A : ℕ → Ω → Fin K} {R : ℕ → Ω → ℝ} [Nonempty (Fin K)] (h : Learning.IsAlgEnvSeq A R (etcAlgorithm hK m) (Learning.stationaryEnv ν) P) (hν : ∀ (a : Fin K), ProbabilityTheory.HasSubgaussianMGF (fun (x : ℝ) => x - ∫ (x : ℝ), id x ∂ν a) 1 (ν a)) (a : Fin K) (hm : m ≠ 0) {n : ℕ} (hn : K * m ≤ n) :
      ∫ (x : Ω), (fun (ω : Ω) => ↑(Learning.pullCount A a n ω)) x ∂P ≤ ↑m + (↑n - ↑K * ↑m) * Real.exp (-↑m * gap ν a ^ 2 / 4)

      Bound on the expectation of the number of pulls of each arm by the ETC algorithm.
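      Concretely, this combines pullCount_of_ge with prob_arm_mul_eq_le: writing T_a(n) for Learning.pullCount A a n,

      \[
        \mathbb{E}[T_a(n)] = m + (n - Km)\, P(A_{Km} = a) \le m + (n - Km)\, e^{-m \Delta_a^2 / 4}.
      \]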

      theorem Bandits.ETC.regret_le {K : ℕ} {hK : 0 < K} {m : ℕ} {ν : ProbabilityTheory.Kernel (Fin K) ℝ} [ProbabilityTheory.IsMarkovKernel ν] {Ω : Type u_1} {mΩ : MeasurableSpace Ω} {P : MeasureTheory.Measure Ω} [MeasureTheory.IsProbabilityMeasure P] {A : ℕ → Ω → Fin K} {R : ℕ → Ω → ℝ} [Nonempty (Fin K)] (h : Learning.IsAlgEnvSeq A R (etcAlgorithm hK m) (Learning.stationaryEnv ν) P) (hν : ∀ (a : Fin K), ProbabilityTheory.HasSubgaussianMGF (fun (x : ℝ) => x - ∫ (x : ℝ), id x ∂ν a) 1 (ν a)) (hm : m ≠ 0) (n : ℕ) (hn : K * m ≤ n) :
      ∫ (x : Ω), regret ν A n x ∂P ≤ ∑ a : Fin K, gap ν a * (↑m + (↑n - ↑K * ↑m) * Real.exp (-↑m * gap ν a ^ 2 / 4))

      Regret bound for the ETC algorithm.
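      This is the standard regret decomposition combined with the pull-count bound above: writing \Delta_a = gap ν a and T_a(n) = Learning.pullCount A a n,

      \[
        \mathbb{E}[R_n] = \sum_{a} \Delta_a\, \mathbb{E}[T_a(n)]
          \le \sum_{a} \Delta_a \left(m + (n - Km)\, e^{-m \Delta_a^2 / 4}\right).
      \]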