
LeanBandits.BanditAlgorithms.ETC

The Explore-Then-Commit Algorithm

theorem ae_eq_set_iff {α : Type u_1} {mα : MeasurableSpace α} {μ : MeasureTheory.Measure α} {s t : Set α} :
s =ᵐ[μ] t ↔ ∀ᵐ (a : α) ∂μ, a ∈ s ↔ a ∈ t
noncomputable def Bandits.ETC.nextArm {K : ℕ} (hK : 0 < K) (m n : ℕ) (h : ↑(Finset.Iic n) → Fin K × ℝ) :
Fin K

Arm pulled by the ETC algorithm at time n + 1. For n < K * m - 1, this is arm (n + 1) % K, continuing the round-robin exploration. For n = K * m - 1, this is the arm with the highest empirical mean after the exploration phase. For n ≥ K * m, this is the same arm as the one pulled at time n.

Equations
Instances For
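    A minimal, self-contained sketch of this selection rule (a hypothetical helper, not the library's definition, which works with histories indexed by Finset.Iic n and real-valued rewards): here the history is a list of (arm, reward) pairs in chronological order, rewards are Floats, and the returned value is the arm to pull at round n + 1.

    def etcNextArmSketch (K m n : Nat) (hist : List (Nat × Float)) : Nat :=
      -- empirical mean of arm `a` over the recorded history (0 if never pulled)
      let empMean (a : Nat) : Float :=
        let rewards := (hist.filter (fun p => p.1 == a)).map Prod.snd
        if rewards.isEmpty then 0
        else rewards.foldl (· + ·) 0 / rewards.length.toFloat
      if n + 1 < K * m then
        (n + 1) % K  -- exploration phase: round-robin over the K arms
      else if n + 1 = K * m then
        -- end of exploration: commit to the arm with the largest empirical mean
        (List.range K).foldl (fun best a => if empMean a > empMean best then a else best) 0
      else
        -- committed phase: repeat the arm recorded in the most recent round
        ((hist.getLast?).map Prod.fst).getD 0

    In the committed phase the most recent recorded arm is already the committed one, so repeating it matches the behaviour stated in arm_add_one_of_ge below.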
    theorem Bandits.ETC.measurable_nextArm {K : ℕ} (hK : 0 < K) (m n : ℕ) :
    Measurable (nextArm hK m n)
    The next arm pulled by ETC is chosen in a measurable way.

    noncomputable def Bandits.etcAlgorithm {K : ℕ} (hK : 0 < K) (m : ℕ) :

    The Explore-Then-Commit algorithm: a deterministic algorithm that chooses the next arm according to ETC.nextArm.

    Equations
    Instances For
      theorem Bandits.ETC.arm_zero {K : ℕ} {hK : 0 < K} {m : ℕ} {ν : ProbabilityTheory.Kernel (Fin K) ℝ} [ProbabilityTheory.IsMarkovKernel ν] {Ω : Type u_1} {mΩ : MeasurableSpace Ω} {P : MeasureTheory.Measure Ω} [MeasureTheory.IsProbabilityMeasure P] {A : ℕ → Ω → Fin K} {R : ℕ → Ω → ℝ} [Nonempty (Fin K)] (h : Learning.IsAlgEnvSeq A R (etcAlgorithm hK m) (Learning.stationaryEnv ν) P) :
      A 0 =ᵐ[P] fun (x : Ω) => ⟨0, hK⟩
      theorem Bandits.ETC.arm_ae_eq_etcNextArm {K : ℕ} {hK : 0 < K} {m : ℕ} {ν : ProbabilityTheory.Kernel (Fin K) ℝ} [ProbabilityTheory.IsMarkovKernel ν] {Ω : Type u_1} {mΩ : MeasurableSpace Ω} {P : MeasureTheory.Measure Ω} [MeasureTheory.IsProbabilityMeasure P] {A : ℕ → Ω → Fin K} {R : ℕ → Ω → ℝ} [Nonempty (Fin K)] (h : Learning.IsAlgEnvSeq A R (etcAlgorithm hK m) (Learning.stationaryEnv ν) P) (n : ℕ) :
      A (n + 1) =ᵐ[P] fun (ω : Ω) => nextArm hK m n (Learning.IsAlgEnvSeq.hist A R n ω)
      theorem Bandits.ETC.arm_of_lt {K : ℕ} {hK : 0 < K} {m : ℕ} {ν : ProbabilityTheory.Kernel (Fin K) ℝ} [ProbabilityTheory.IsMarkovKernel ν] {Ω : Type u_1} {mΩ : MeasurableSpace Ω} {P : MeasureTheory.Measure Ω} [MeasureTheory.IsProbabilityMeasure P] {A : ℕ → Ω → Fin K} {R : ℕ → Ω → ℝ} [Nonempty (Fin K)] (h : Learning.IsAlgEnvSeq A R (etcAlgorithm hK m) (Learning.stationaryEnv ν) P) {n : ℕ} (hn : n < K * m) :
      A n =ᵐ[P] fun (x : Ω) => ⟨n % K, ⋯⟩

      For n < K * m, the arm pulled at time n is the arm n % K.
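      For instance, with K = 3 arms and m = 2, the exploration phase covers times 0, 1, 2, 3, 4, 5 and pulls arms 0, 1, 2, 0, 1, 2, so each arm is pulled exactly m = 2 times before the commit decision at time K * m = 6.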

      theorem Bandits.ETC.arm_mul {K : ℕ} {hK : 0 < K} {m : ℕ} {ν : ProbabilityTheory.Kernel (Fin K) ℝ} [ProbabilityTheory.IsMarkovKernel ν] {Ω : Type u_1} {mΩ : MeasurableSpace Ω} {P : MeasureTheory.Measure Ω} [MeasureTheory.IsProbabilityMeasure P] {A : ℕ → Ω → Fin K} {R : ℕ → Ω → ℝ} [Nonempty (Fin K)] (h : Learning.IsAlgEnvSeq A R (etcAlgorithm hK m) (Learning.stationaryEnv ν) P) (hm : m ≠ 0) :
      A (K * m) =ᵐ[P] fun (ω : Ω) => measurableArgmax (Learning.empMean' (K * m - 1)) (Learning.IsAlgEnvSeq.hist A R (K * m - 1) ω)

      The arm pulled at time K * m is the arm with the highest empirical mean after the exploration phase.

      theorem Bandits.ETC.arm_add_one_of_ge {K : ℕ} {hK : 0 < K} {m : ℕ} {ν : ProbabilityTheory.Kernel (Fin K) ℝ} [ProbabilityTheory.IsMarkovKernel ν] {Ω : Type u_1} {mΩ : MeasurableSpace Ω} {P : MeasureTheory.Measure Ω} [MeasureTheory.IsProbabilityMeasure P] {A : ℕ → Ω → Fin K} {R : ℕ → Ω → ℝ} [Nonempty (Fin K)] (h : Learning.IsAlgEnvSeq A R (etcAlgorithm hK m) (Learning.stationaryEnv ν) P) {n : ℕ} (hm : m ≠ 0) (hn : K * m ≤ n) :
      A (n + 1) =ᵐ[P] fun (ω : Ω) => A n ω

      For n ≥ K * m, the arm pulled at time n + 1 is the same as the arm pulled at time n.

      theorem Bandits.ETC.arm_of_ge {K : ℕ} {hK : 0 < K} {m : ℕ} {ν : ProbabilityTheory.Kernel (Fin K) ℝ} [ProbabilityTheory.IsMarkovKernel ν] {Ω : Type u_1} {mΩ : MeasurableSpace Ω} {P : MeasureTheory.Measure Ω} [MeasureTheory.IsProbabilityMeasure P] {A : ℕ → Ω → Fin K} {R : ℕ → Ω → ℝ} [Nonempty (Fin K)] (h : Learning.IsAlgEnvSeq A R (etcAlgorithm hK m) (Learning.stationaryEnv ν) P) {n : ℕ} (hm : m ≠ 0) (hn : K * m ≤ n) :
      A n =ᵐ[P] A (K * m)

      For n ≥ K * m, the arm pulled at time n is the same as the arm pulled at time K * m.

      theorem Bandits.ETC.pullCount_mul {K : ℕ} {hK : 0 < K} {m : ℕ} {ν : ProbabilityTheory.Kernel (Fin K) ℝ} [ProbabilityTheory.IsMarkovKernel ν] {Ω : Type u_1} {mΩ : MeasurableSpace Ω} {P : MeasureTheory.Measure Ω} [MeasureTheory.IsProbabilityMeasure P] {A : ℕ → Ω → Fin K} {R : ℕ → Ω → ℝ} [Nonempty (Fin K)] (h : Learning.IsAlgEnvSeq A R (etcAlgorithm hK m) (Learning.stationaryEnv ν) P) (a : Fin K) :
      Learning.pullCount A a (K * m) =ᵐ[P] fun (x : Ω) => m

      At time K * m, the number of pulls of each arm is equal to m.

      theorem Bandits.ETC.pullCount_add_one_of_ge {K : ℕ} {hK : 0 < K} {m : ℕ} {ν : ProbabilityTheory.Kernel (Fin K) ℝ} [ProbabilityTheory.IsMarkovKernel ν] {Ω : Type u_1} {mΩ : MeasurableSpace Ω} {P : MeasureTheory.Measure Ω} [MeasureTheory.IsProbabilityMeasure P] {A : ℕ → Ω → Fin K} {R : ℕ → Ω → ℝ} [Nonempty (Fin K)] (h : Learning.IsAlgEnvSeq A R (etcAlgorithm hK m) (Learning.stationaryEnv ν) P) (a : Fin K) (hm : m ≠ 0) {n : ℕ} (hn : K * m ≤ n) :
      Learning.pullCount A a (n + 1) =ᵐ[P] fun (ω : Ω) => Learning.pullCount A a n ω + {ω' : Ω | A (K * m) ω' = a}.indicator (fun (x : Ω) => 1) ω
      theorem Bandits.ETC.pullCount_of_ge {K : ℕ} {hK : 0 < K} {m : ℕ} {ν : ProbabilityTheory.Kernel (Fin K) ℝ} [ProbabilityTheory.IsMarkovKernel ν] {Ω : Type u_1} {mΩ : MeasurableSpace Ω} {P : MeasureTheory.Measure Ω} [MeasureTheory.IsProbabilityMeasure P] {A : ℕ → Ω → Fin K} {R : ℕ → Ω → ℝ} [Nonempty (Fin K)] (h : Learning.IsAlgEnvSeq A R (etcAlgorithm hK m) (Learning.stationaryEnv ν) P) (a : Fin K) (hm : m ≠ 0) {n : ℕ} (hn : K * m ≤ n) :
      Learning.pullCount A a n =ᵐ[P] fun (ω : Ω) => m + (n - K * m) * {ω' : Ω | A (K * m) ω' = a}.indicator (fun (x : Ω) => 1) ω

      For n ≥ K * m, the number of pulls of arm a at time n is m, plus an additional n - K * m if a is the arm the algorithm commits to at time K * m (the arm with the highest empirical mean after the exploration phase).
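      For example, with K = 2, m = 100 and n = 1000, the committed arm has 100 + (1000 - 200) = 900 pulls at time n, while the other arm stays at exactly m = 100 pulls.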

      theorem Bandits.ETC.sumRewards_bestArm_le_of_arm_mul_eq {K : ℕ} {hK : 0 < K} {m : ℕ} {ν : ProbabilityTheory.Kernel (Fin K) ℝ} [ProbabilityTheory.IsMarkovKernel ν] {Ω : Type u_1} {mΩ : MeasurableSpace Ω} {P : MeasureTheory.Measure Ω} [MeasureTheory.IsProbabilityMeasure P] {A : ℕ → Ω → Fin K} {R : ℕ → Ω → ℝ} [Nonempty (Fin K)] (h : Learning.IsAlgEnvSeq A R (etcAlgorithm hK m) (Learning.stationaryEnv ν) P) (a : Fin K) (hm : m ≠ 0) :
      ∀ᵐ (h : Ω) ∂P, A (K * m) h = a → Learning.sumRewards A R (bestArm ν) (K * m) h ≤ Learning.sumRewards A R a (K * m) h

      If at time K * m the algorithm chooses arm a, then the total reward collected from arm a over the first K * m rounds is at least the total reward collected from the best arm over those rounds.

      theorem Bandits.ETC.probReal_sumRewards_le_sumRewards_le {K : ℕ} {hK : 0 < K} {m : ℕ} {ν : ProbabilityTheory.Kernel (Fin K) ℝ} [ProbabilityTheory.IsMarkovKernel ν] {Ω : Type u_1} {mΩ : MeasurableSpace Ω} {P : MeasureTheory.Measure Ω} [MeasureTheory.IsProbabilityMeasure P] {A : ℕ → Ω → Fin K} {R : ℕ → Ω → ℝ} [Nonempty (Fin K)] (h : Learning.IsAlgEnvSeq A R (etcAlgorithm hK m) (Learning.stationaryEnv ν) P) (hν : ∀ (a : Fin K), ProbabilityTheory.HasSubgaussianMGF (fun (x : ℝ) => x - ∫ (x : ℝ), id x ∂ν a) 1 (ν a)) (a : Fin K) :
      P.real {ω : Ω | Learning.sumRewards A R (bestArm ν) (K * m) ω ≤ Learning.sumRewards A R a (K * m) ω} ≤ Real.exp (-↑m * gap ν a ^ 2 / 4)
      theorem Bandits.ETC.prob_arm_mul_eq_le {K : ℕ} {hK : 0 < K} {m : ℕ} {ν : ProbabilityTheory.Kernel (Fin K) ℝ} [ProbabilityTheory.IsMarkovKernel ν] {Ω : Type u_1} {mΩ : MeasurableSpace Ω} {P : MeasureTheory.Measure Ω} [MeasureTheory.IsProbabilityMeasure P] {A : ℕ → Ω → Fin K} {R : ℕ → Ω → ℝ} [Nonempty (Fin K)] (h : Learning.IsAlgEnvSeq A R (etcAlgorithm hK m) (Learning.stationaryEnv ν) P) (hν : ∀ (a : Fin K), ProbabilityTheory.HasSubgaussianMGF (fun (x : ℝ) => x - ∫ (x : ℝ), id x ∂ν a) 1 (ν a)) (a : Fin K) (hm : m ≠ 0) :
      P.real {ω : Ω | A (K * m) ω = a} ≤ Real.exp (-↑m * gap ν a ^ 2 / 4)

      The probability that at time K * m the ETC algorithm chooses arm a is at most exp(- m * Δ_a^2 / 4).
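      A heuristic sketch of where this bound comes from (the standard ETC argument, stated informally under the 1-subgaussian assumption hν): on the event A (K * m) = a, the empirical mean of arm a over its m exploration samples is at least that of the best arm, and the difference of the two centered empirical means is subgaussian with variance proxy 2 / m, so

      \[
        P(A_{Km} = a)
          \le P\big(\hat{\mu}_a \ge \hat{\mu}_{a^*}\big)
          = P\big((\hat{\mu}_a - \mu_a) - (\hat{\mu}_{a^*} - \mu_{a^*}) \ge \Delta_a\big)
          \le \exp\left(-\frac{\Delta_a^2}{2 \cdot (2/m)}\right)
          = \exp\left(-\frac{m \Delta_a^2}{4}\right),
      \]

      where \Delta_a = gap ν a and \hat{\mu} denotes empirical means at time K * m.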

      theorem Bandits.ETC.expectation_pullCount_le {K : ℕ} {hK : 0 < K} {m : ℕ} {ν : ProbabilityTheory.Kernel (Fin K) ℝ} [ProbabilityTheory.IsMarkovKernel ν] {Ω : Type u_1} {mΩ : MeasurableSpace Ω} {P : MeasureTheory.Measure Ω} [MeasureTheory.IsProbabilityMeasure P] {A : ℕ → Ω → Fin K} {R : ℕ → Ω → ℝ} [Nonempty (Fin K)] (h : Learning.IsAlgEnvSeq A R (etcAlgorithm hK m) (Learning.stationaryEnv ν) P) (hν : ∀ (a : Fin K), ProbabilityTheory.HasSubgaussianMGF (fun (x : ℝ) => x - ∫ (x : ℝ), id x ∂ν a) 1 (ν a)) (a : Fin K) (hm : m ≠ 0) {n : ℕ} (hn : K * m ≤ n) :
      ∫ (x : Ω), (fun (ω : Ω) => ↑(Learning.pullCount A a n ω)) x ∂P ≤ ↑m + (↑n - ↑K * ↑m) * Real.exp (-↑m * gap ν a ^ 2 / 4)

      Bound on the expectation of the number of pulls of each arm by the ETC algorithm.
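      Concretely, this combines pullCount_of_ge with prob_arm_mul_eq_le: writing T_a(n) for Learning.pullCount A a n,

      \[
        \mathbb{E}[T_a(n)] = m + (n - Km)\, P(A_{Km} = a) \le m + (n - Km)\, e^{-m \Delta_a^2 / 4}.
      \]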

      theorem Bandits.ETC.regret_le {K : ℕ} {hK : 0 < K} {m : ℕ} {ν : ProbabilityTheory.Kernel (Fin K) ℝ} [ProbabilityTheory.IsMarkovKernel ν] {Ω : Type u_1} {mΩ : MeasurableSpace Ω} {P : MeasureTheory.Measure Ω} [MeasureTheory.IsProbabilityMeasure P] {A : ℕ → Ω → Fin K} {R : ℕ → Ω → ℝ} [Nonempty (Fin K)] (h : Learning.IsAlgEnvSeq A R (etcAlgorithm hK m) (Learning.stationaryEnv ν) P) (hν : ∀ (a : Fin K), ProbabilityTheory.HasSubgaussianMGF (fun (x : ℝ) => x - ∫ (x : ℝ), id x ∂ν a) 1 (ν a)) (hm : m ≠ 0) (n : ℕ) (hn : K * m ≤ n) :
      ∫ (x : Ω), regret ν A n x ∂P ≤ ∑ a : Fin K, gap ν a * (↑m + (↑n - ↑K * ↑m) * Real.exp (-↑m * gap ν a ^ 2 / 4))

      Regret bound for the ETC algorithm.
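      This is the standard regret decomposition combined with the pull-count bound above: writing \Delta_a = gap ν a and T_a(n) = Learning.pullCount A a n,

      \[
        \mathbb{E}[R_n] = \sum_{a} \Delta_a\, \mathbb{E}[T_a(n)]
          \le \sum_{a} \Delta_a \left(m + (n - Km)\, e^{-m \Delta_a^2 / 4}\right).
      \]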