Regret, gap, best arm #
noncomputable def Bandits.gap {α : Type u_1} {mα : MeasurableSpace α}
    (ν : ProbabilityTheory.Kernel α ℝ) (a : α) : ℝ

Gap of an action a: difference between the highest mean of the actions and the mean of a.
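In informal notation (a sketch, not the literal Lean definition), write $\mu_b$ for the mean of the reward distribution $\nu\,b$; the gap of an arm $a$ is then
$$\Delta_a \;=\; \sup_{b} \mu_b \;-\; \mu_a .$$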
theorem Bandits.gap_nonneg {α : Type u_1} {mα : MeasurableSpace α}
    {ν : ProbabilityTheory.Kernel α ℝ} {a : α} [Fintype α] :
    0 ≤ gap ν a
noncomputable def Bandits.regret {α : Type u_1} {Ω : Type u_2} {mα : MeasurableSpace α}
    (ν : ProbabilityTheory.Kernel α ℝ) (A : ℕ → Ω → α) (t : ℕ) (ω : Ω) : ℝ

Regret of the sequence of arm pulls A : ℕ → Ω → α at time t for the reward kernel ν : Kernel α ℝ.
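In the same informal notation (again a sketch of the intended quantity, not the literal Lean equation), the regret of the pull sequence $A$ after $t$ rounds, at the sample point $\omega$, is the shortfall of the collected mean rewards against always playing an arm with the highest mean:
$$R_t(\omega) \;=\; t \cdot \sup_a \mu_a \;-\; \sum_{s=0}^{t-1} \mu_{A_s(\omega)} .$$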
theorem Bandits.regret_eq_sum_gap {α : Type u_1} {Ω : Type u_2} {mα : MeasurableSpace α}
    {ν : ProbabilityTheory.Kernel α ℝ} {A : ℕ → Ω → α} {ω : Ω} {t : ℕ} :
    regret ν A t ω = ∑ s ∈ Finset.range t, gap ν (A s ω)
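In the informal notation this is the per-round rewriting of the regret, obtained by splitting $t \cdot \sup_a \mu_a$ over the $t$ rounds:
$$R_t(\omega) \;=\; \sum_{s=0}^{t-1} \bigl(\sup_b \mu_b - \mu_{A_s(\omega)}\bigr) \;=\; \sum_{s=0}^{t-1} \Delta_{A_s(\omega)} .$$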
theorem Bandits.regret_nonneg {α : Type u_1} {Ω : Type u_2} {mα : MeasurableSpace α}
    {ν : ProbabilityTheory.Kernel α ℝ} {A : ℕ → Ω → α} {ω : Ω} {t : ℕ} [Fintype α] :
    0 ≤ regret ν A t ω
theorem Bandits.regret_eq_sum_pullCount_mul_gap {α : Type u_1} {Ω : Type u_2} [DecidableEq α]
    {mα : MeasurableSpace α} {ν : ProbabilityTheory.Kernel α ℝ} {A : ℕ → Ω → α} {ω : Ω} {t : ℕ}
    [Fintype α] :
    regret ν A t ω = ∑ a, ↑(Learning.pullCount A a t ω) * gap ν a
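Grouping the per-round sum by which arm was pulled gives the classical regret decomposition; writing $N_a(t,\omega)$ for the pull count of arm $a$ up to time $t$ (the quantity Learning.pullCount A a t ω), this reads
$$R_t(\omega) \;=\; \sum_{a} N_a(t,\omega)\,\Delta_a .$$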
noncomputable def Bandits.bestArm {α : Type u_1} {mα : MeasurableSpace α} [Fintype α] [Nonempty α]
    (ν : ProbabilityTheory.Kernel α ℝ) :
    α

An action with the highest mean.
Equations
- Bandits.bestArm ν = ⋯.choose
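In the informal notation, bestArm ν is a choice of maximiser of the means: an arm $a^\ast$ with $\mu_{a^\ast} = \max_a \mu_a$. The Fintype and Nonempty assumptions guarantee that such a maximiser exists, and the definition picks one via choice (hence the ⋯.choose equation and the noncomputable keyword).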
@[simp]
theorem Bandits.gap_bestArm {α : Type u_1} {mα : MeasurableSpace α}
    {ν : ProbabilityTheory.Kernel α ℝ} [Fintype α] [Nonempty α] :
    gap ν (bestArm ν) = 0
theorem Bandits.avg_mean_reward_tendsto_of_sublinear_regret {α : Type u_1} {Ω : Type u_2}
    {mα : MeasurableSpace α} {ν : ProbabilityTheory.Kernel α ℝ} {A : ℕ → Ω → α} {ω : Ω}
    (hr : (fun (x : ℕ) => regret ν A x ω) =o[Filter.atTop] fun (t : ℕ) => ↑t) :

If the regret is sublinear, the average mean reward tends to the highest mean of the arms.
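A sketch of the statement in the informal notation (the precise Lean conclusion is not reproduced here): if $R_t(\omega) = o(t)$ as $t \to \infty$, then
$$\frac{1}{t} \sum_{s=0}^{t-1} \mu_{A_s(\omega)} \;\longrightarrow\; \sup_a \mu_a ,$$
which follows since the left-hand side differs from $\sup_a \mu_a$ by exactly $R_t(\omega)/t$.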
theorem Bandits.pullCount_rate_tendsto_of_sublinear_regret {α : Type u_1} {Ω : Type u_2}
    [DecidableEq α] {mα : MeasurableSpace α} {ν : ProbabilityTheory.Kernel α ℝ} {A : ℕ → Ω → α}
    {ω : Ω} {a : α} [Fintype α]
    (hr : (fun (x : ℕ) => regret ν A x ω) =o[Filter.atTop] fun (t : ℕ) => ↑t)
    (hg : 0 < gap ν a) :
    Filter.Tendsto (fun (t : ℕ) => ↑(Learning.pullCount A a t ω) / ↑t) Filter.atTop (nhds 0)

If the regret is sublinear, the rate of suboptimal arm pulls tends to zero.
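One way to see this, using the decomposition above: every term of $\sum_b N_b(t,\omega)\,\Delta_b$ is nonnegative, so $N_a(t,\omega)\,\Delta_a \le R_t(\omega)$, and hence $N_a(t,\omega)/t \le R_t(\omega)/(t\,\Delta_a) \to 0$ whenever $\Delta_a > 0$ and the regret is sublinear.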