Bandit
A stochastic, sequential algorithm.
- policy (n : ℕ) : ProbabilityTheory.Kernel ({ x : ℕ // x ∈ Finset.Iic n } → α × R) α
Policy or sampling rule: distribution of the next action, as a function of the history up to time n.
- h_policy (n : ℕ) : ProbabilityTheory.IsMarkovKernel (self.policy n)
- p0 : MeasureTheory.Measure α
Distribution of the first action.
- hp0 : MeasureTheory.IsProbabilityMeasure self.p0
Instances For
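Collecting the fields above, the algorithm amounts to a structure along the lines of the following sketch. The name Algorithm and the binder layout are assumptions; only the four fields and their types are taken from the listing above.

  import Mathlib

  open MeasureTheory ProbabilityTheory

  -- Sketch of the structure suggested by the fields listed above;
  -- the structure name `Algorithm` is an assumption.
  structure Algorithm (α R : Type*) [MeasurableSpace α] [MeasurableSpace R] where
    /-- Policy or sampling rule: distribution of the next action given the history. -/
    policy (n : ℕ) : Kernel ({ x : ℕ // x ∈ Finset.Iic n } → α × R) α
    h_policy (n : ℕ) : IsMarkovKernel (policy n)
    /-- Distribution of the first action. -/
    p0 : Measure α
    hp0 : IsProbabilityMeasure p0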
A stochastic environment.
- feedback (n : ℕ) : ProbabilityTheory.Kernel (({ x : ℕ // x ∈ Finset.Iic n } → α × R) × α) R
Distribution of the next observation as a function of the past history and the current action.
- h_feedback (n : ℕ) : ProbabilityTheory.IsMarkovKernel (self.feedback n)
- ν0 : ProbabilityTheory.Kernel α R
Distribution of the first observation given the first action.
- hp0 : ProbabilityTheory.IsMarkovKernel self.ν0
Instances For
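In the same sketch, the environment collects the four fields above. The structure name Environment is again an assumption; the type of feedback is the one forced by its use in the composition-product below.

  -- Sketch, in the same context as the Algorithm sketch above.
  structure Environment (α R : Type*) [MeasurableSpace α] [MeasurableSpace R] where
    /-- Distribution of the next observation given the past history and the current action. -/
    feedback (n : ℕ) : Kernel (({ x : ℕ // x ∈ Finset.Iic n } → α × R) × α) R
    h_feedback (n : ℕ) : IsMarkovKernel (feedback n)
    /-- Distribution of the first observation given the first action. -/
    ν0 : Kernel α R
    hp0 : IsMarkovKernel ν0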
Kernel describing the distribution of the next action-reward pair given the history up to n.
Equations
- Learning.stepKernel alg env n = (alg.policy n).compProd (env.feedback n)
Instances For
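Spelled out via Kernel.compProd_apply from Mathlib: the step kernel first samples an action from the policy and then a reward from the feedback kernel at that action. For a history h and a measurable set s of action-reward pairs,

  Learning.stepKernel alg env n h s
      = ∫⁻ a, env.feedback n (h, a) {r | (a, r) ∈ s} ∂(alg.policy n h)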
Kernel sending a partial trajectory of the bandit interaction Iic n → α × ℝ to a measure on ℕ → α × ℝ, supported on full trajectories that start with the partial one.
Equations
- Learning.traj alg env n = ProbabilityTheory.Kernel.traj (Learning.stepKernel alg env) n
Instances For
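Informally, this is the Ionescu-Tulcea construction underlying ProbabilityTheory.Kernel.traj: a trajectory ω drawn from Learning.traj alg env n applied to a partial trajectory h satisfies

  ω k = h k  for every k ≤ n,

and for m ≥ n the next pair ω (m + 1), conditionally on ω 0, …, ω m, has law Learning.stepKernel alg env m applied to the restriction of ω to Iic m. So the given prefix is kept and every later action-reward pair is generated by the step kernel from the history so far.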
Measure on the sequence of actions and observations generated by the algorithm/environment.
Equations
- Learning.trajMeasure alg env = ProbabilityTheory.Kernel.trajMeasure (alg.p0.compProd env.ν0) (Learning.stepKernel alg env)
Instances For
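Unfolding the definition, trajMeasure is the law of the interaction process (A n, R n) for n ∈ ℕ: the first action-reward pair is drawn from the composition product of the initial distributions, and each subsequent pair from the step kernel given the history,

  (A 0, R 0) ~ alg.p0.compProd env.ν0,
  (A (n+1), R (n+1)) ∣ (A 0, R 0), …, (A n, R n) ~ Learning.stepKernel alg env n.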
action n is the action pulled at time n. This is a random variable on the measurable space ℕ → α × ℝ.
Equations
- Learning.action n h = (h n).1
Instances For
reward n is the reward at time n. This is a random variable on the measurable space ℕ → α × R.
Equations
- Learning.reward n h = (h n).2
Instances For
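Both projections are measurable for the product σ-algebra on ℕ → α × R, which is what makes them random variables. A quick standalone check in the same context as the sketches above (illustration only, not the library's own lemmas):

  example {α R : Type*} [MeasurableSpace α] [MeasurableSpace R] (n : ℕ) :
      Measurable fun h : ℕ → α × R => (h n).1 :=
    (measurable_pi_apply n).fst

  example {α R : Type*} [MeasurableSpace α] [MeasurableSpace R] (n : ℕ) :
      Measurable fun h : ℕ → α × R => (h n).2 :=
    (measurable_pi_apply n).snd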
Filtration of the algorithm interaction.
Equations
Instances For
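The equation was not rendered here. Presumably (an assumption, not taken from the page) this is the natural filtration of the interaction: at time n, the σ-algebra on ℕ → α × R generated by the action-reward pairs observed so far,

  𝓕 n = σ(Learning.action k, Learning.reward k : k ≤ n).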
A deterministic algorithm.
Equations
- One or more equations did not get rendered due to their size.
Instances For
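The omitted equation presumably packages a measurable map from histories to actions as a point-mass policy. A minimal sketch on top of the Algorithm structure sketched earlier; the definition name, the argument names, and the Dirac first action are assumptions.

  noncomputable def deterministicAlgorithm {α R : Type*}
      [MeasurableSpace α] [MeasurableSpace R]
      (f : (n : ℕ) → ({ x : ℕ // x ∈ Finset.Iic n } → α × R) → α)
      (hf : ∀ n, Measurable (f n)) (a0 : α) : Algorithm α R where
    policy n := Kernel.deterministic (f n) (hf n)  -- point mass at the action chosen by f n
    h_policy n := by infer_instance
    p0 := Measure.dirac a0                          -- deterministic first action
    hp0 := by infer_instance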
A stationary environment, in which the distribution of the next reward depends only on the last action.
Equations
- Learning.stationaryEnv ν = { feedback := fun (x : ℕ) => ProbabilityTheory.Kernel.prodMkLeft ({ x_1 : ℕ // x_1 ∈ Finset.Iic x } → α × R) ν, h_feedback := ⋯, ν0 := ν, hp0 := inst✝ }
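Kernel.prodMkLeft simply discards the history component, so the feedback of a stationary environment at every time step is ν evaluated at the current action. A small standalone check in the same context as the sketches above (illustrative statement, only Mathlib names are used):

  example {α R γ : Type*} [MeasurableSpace α] [MeasurableSpace R] [MeasurableSpace γ]
      (ν : Kernel α R) (h : γ) (a : α) :
      Kernel.prodMkLeft γ ν (h, a) = ν a := rfl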