Regret, gap, best arm #
noncomputable def Bandits.gap {α : Type u_1} {mα : MeasurableSpace α}
    (ν : ProbabilityTheory.Kernel α ℝ) (a : α) : ℝ

Gap of an action a: difference between the highest mean of the actions and the mean of a.
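In informal notation (a sketch, not the literal Lean definition), write $\mu_b$ for the mean of the reward distribution $\nu\,b$; the gap of an arm $a$ is then
$$\Delta_a \;=\; \sup_{b} \mu_b \;-\; \mu_a .$$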
theorem Bandits.gap_nonneg {α : Type u_1} {mα : MeasurableSpace α}
    {ν : ProbabilityTheory.Kernel α ℝ} {a : α} [Fintype α] :
    0 ≤ gap ν a
noncomputable def Bandits.regret {α : Type u_1} {Ω : Type u_2} {mα : MeasurableSpace α}
    (ν : ProbabilityTheory.Kernel α ℝ) (A : ℕ → Ω → α) (t : ℕ) (ω : Ω) : ℝ

Regret of the sequence of arm pulls A : ℕ → Ω → α at time t for the reward kernel ν : Kernel α ℝ.
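In the same informal notation (again a sketch of the intended quantity, not the literal Lean equation), the regret of the pull sequence $A$ after $t$ rounds, at the sample point $\omega$, is the shortfall of the collected mean rewards against always playing an arm with the highest mean:
$$R_t(\omega) \;=\; t \cdot \sup_a \mu_a \;-\; \sum_{s=0}^{t-1} \mu_{A_s(\omega)} .$$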
theorem Bandits.regret_eq_sum_gap {α : Type u_1} {Ω : Type u_2} {mα : MeasurableSpace α}
    {ν : ProbabilityTheory.Kernel α ℝ} {A : ℕ → Ω → α} {ω : Ω} {t : ℕ} :
    regret ν A t ω = ∑ s ∈ Finset.range t, gap ν (A s ω)
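In the informal notation this is the per-round rewriting of the regret, obtained by splitting $t \cdot \sup_a \mu_a$ over the $t$ rounds:
$$R_t(\omega) \;=\; \sum_{s=0}^{t-1} \bigl(\sup_b \mu_b - \mu_{A_s(\omega)}\bigr) \;=\; \sum_{s=0}^{t-1} \Delta_{A_s(\omega)} .$$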
theorem Bandits.regret_nonneg {α : Type u_1} {Ω : Type u_2} {mα : MeasurableSpace α}
    {ν : ProbabilityTheory.Kernel α ℝ} {A : ℕ → Ω → α} {ω : Ω} {t : ℕ} [Fintype α] :
    0 ≤ regret ν A t ω
theorem Bandits.regret_eq_sum_pullCount_mul_gap {α : Type u_1} {Ω : Type u_2} [DecidableEq α]
    {mα : MeasurableSpace α} {ν : ProbabilityTheory.Kernel α ℝ} {A : ℕ → Ω → α} {ω : Ω} {t : ℕ}
    [Fintype α] :
    regret ν A t ω = ∑ a, ↑(Learning.pullCount A a t ω) * gap ν a
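Grouping the per-round sum by which arm was pulled gives the classical regret decomposition; writing $N_a(t,\omega)$ for the pull count of arm $a$ up to time $t$ (the quantity Learning.pullCount A a t ω), this reads
$$R_t(\omega) \;=\; \sum_{a} N_a(t,\omega)\,\Delta_a .$$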
noncomputable def Bandits.bestArm {α : Type u_1} {mα : MeasurableSpace α} [Fintype α] [Nonempty α]
    (ν : ProbabilityTheory.Kernel α ℝ) :
    α

An action with the highest mean.
Equations
- Bandits.bestArm ν = ⋯.choose
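In the informal notation, bestArm ν is a choice of maximiser of the means: an arm $a^\ast$ with $\mu_{a^\ast} = \max_a \mu_a$. The Fintype and Nonempty assumptions guarantee that such a maximiser exists, and the definition picks one via choice (hence the ⋯.choose equation and the noncomputable keyword).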
@[simp]
theorem Bandits.gap_bestArm {α : Type u_1} {mα : MeasurableSpace α}
    {ν : ProbabilityTheory.Kernel α ℝ} [Fintype α] [Nonempty α] :
    gap ν (bestArm ν) = 0
theorem Bandits.avg_mean_reward_tendsto_of_sublinear_regret {α : Type u_1} {Ω : Type u_2}
    {mα : MeasurableSpace α} {ν : ProbabilityTheory.Kernel α ℝ} {A : ℕ → Ω → α} {ω : Ω}
    (hr : (fun (x : ℕ) => regret ν A x ω) =o[Filter.atTop] fun (t : ℕ) => ↑t) :

If the regret is sublinear, the average mean reward tends to the highest mean of the arms.
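A sketch of the statement in the informal notation (the precise Lean conclusion is not reproduced here): if $R_t(\omega) = o(t)$ as $t \to \infty$, then
$$\frac{1}{t} \sum_{s=0}^{t-1} \mu_{A_s(\omega)} \;\longrightarrow\; \sup_a \mu_a ,$$
which follows since the left-hand side differs from $\sup_a \mu_a$ by exactly $R_t(\omega)/t$.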
theorem Bandits.pullCount_rate_tendsto_of_sublinear_regret {α : Type u_1} {Ω : Type u_2}
    [DecidableEq α] {mα : MeasurableSpace α} {ν : ProbabilityTheory.Kernel α ℝ} {A : ℕ → Ω → α}
    {ω : Ω} {a : α} [Fintype α]
    (hr : (fun (x : ℕ) => regret ν A x ω) =o[Filter.atTop] fun (t : ℕ) => ↑t)
    (hg : 0 < gap ν a) :
    Filter.Tendsto (fun (t : ℕ) => ↑(Learning.pullCount A a t ω) / ↑t) Filter.atTop (nhds 0)

If the regret is sublinear, the rate of suboptimal arm pulls tends to zero.
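One way to see this, using the decomposition above: every term of $\sum_b N_b(t,\omega)\,\Delta_b$ is nonnegative, so $N_a(t,\omega)\,\Delta_a \le R_t(\omega)$, and hence $N_a(t,\omega)/t \le R_t(\omega)/(t\,\Delta_a) \to 0$ whenever $\Delta_a > 0$ and the regret is sublinear.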