Algorithms and environments #
We define structures for stochastic, sequential algorithms and environments, and the notion of an algorithm-environment sequence, which is a sequence of actions and rewards generated by an algorithm interacting with an environment.
Main definitions #
- Algorithm α R: a stochastic, sequential algorithm.
- Environment α R: a stochastic environment.
- IsAlgEnvSeq A R' alg env P: an algorithm-environment sequence. That is, a sequence of actions A and feedback R' that have the correct conditional distributions to be generated by an algorithm alg interacting with an environment env, defined on a probability space (Ω, P).
- IsAlgEnvSeqUntil A R' alg env P N: A and R' form an algorithm-environment sequence until time N.
Main statements #
- isAlgEnvSeq_unique: the law of the sequence of actions and observations generated by an algorithm-environment pair is unique: it does not depend on the probability space used. If A₁, R₁ and A₂, R₂ are two algorithm-environment sequences generated by the same algorithm-environment pair on probability spaces (Ω, P) and (Ω', P'), then P.map (fun ω n ↦ (A₁ n ω, R₁ n ω)) = P'.map (fun ω n ↦ (A₂ n ω, R₂ n ω)).
Notes #
The ANCHOR comments are used to mark code that appears in the tutorials.
A stochastic, sequential algorithm.
- policy (n : ℕ) : ProbabilityTheory.Kernel (↥(Finset.Iic n) → α × R) α
Policy or sampling rule: distribution of the next action given the history up to time n.
- h_policy (n : ℕ) : ProbabilityTheory.IsMarkovKernel (self.policy n)
- p0 : MeasureTheory.Measure α
Distribution of the first action.
- hp0 : MeasureTheory.IsProbabilityMeasure self.p0
Instances For
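As an illustration, a history-independent algorithm can be built from a single probability measure by using a constant kernel at every step. This is only a sketch: the import path for the Learning library is hypothetical, and constAlgorithm is a name introduced here for illustration.

```lean
import Mathlib.Probability.Kernel.Basic
-- import Learning.AlgEnv  -- hypothetical import path for the `Learning` library

open MeasureTheory ProbabilityTheory

variable {α R : Type*} [MeasurableSpace α] [MeasurableSpace R]

/-- A hypothetical "oblivious" algorithm that ignores the history and draws
every action from the same fixed probability measure `μ`. -/
noncomputable def constAlgorithm (μ : Measure α) [IsProbabilityMeasure μ] :
    Learning.Algorithm α R where
  policy _ := Kernel.const _ μ   -- same distribution regardless of history
  h_policy _ := inferInstance    -- a constant kernel of a probability measure is Markov
  p0 := μ
  hp0 := inferInstance
```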
A stochastic environment.
- feedback (n : ℕ) : ProbabilityTheory.Kernel ((↥(Finset.Iic n) → α × R) × α) R
Distribution of the next observation as a function of the past history and the current action.
- h_feedback (n : ℕ) : ProbabilityTheory.IsMarkovKernel (self.feedback n)
- ν0 : ProbabilityTheory.Kernel α R
Distribution of the first observation given the first action.
- hp0 : ProbabilityTheory.IsMarkovKernel self.ν0
Instances For
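Dually, an environment whose feedback is independent of both the history and the action can be built from a single probability measure on the observation space. Again a sketch under the same assumptions as above (hypothetical import path; constEnvironment is a name introduced here for illustration).

```lean
import Mathlib.Probability.Kernel.Basic
-- import Learning.AlgEnv  -- hypothetical import path for the `Learning` library

open MeasureTheory ProbabilityTheory

variable {α R : Type*} [MeasurableSpace α] [MeasurableSpace R]

/-- A hypothetical environment in which every observation is drawn from the
fixed probability measure `ν`, ignoring the history and the action. -/
noncomputable def constEnvironment (ν : Measure R) [IsProbabilityMeasure ν] :
    Learning.Environment α R where
  feedback _ := Kernel.const _ ν
  h_feedback _ := inferInstance
  ν0 := Kernel.const _ ν
  hp0 := inferInstance
```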
Kernel describing the distribution of the next action-reward pair given the history
up to n.
Equations
- Learning.stepKernel alg env n = (alg.policy n).compProd (env.feedback n)
Instances For
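The composition-product above can be unfolded: given a history h, the step kernel first draws an action from the policy, then an observation from the feedback kernel given the history and that action. A sketch of the resulting measure on α × R, writing κₙ = alg.policy n and ηₙ = env.feedback n:

```latex
(\mathrm{stepKernel}\ n)(h)(S)
  = \int_{\alpha} \eta_n\bigl((h, a)\bigr)\bigl(\{r \mid (a, r) \in S\}\bigr)
      \,\mathrm{d}\bigl(\kappa_n(h)\bigr)(a)
```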
Equations
- ⋯ = ⋯
History of the algorithm-environment sequence up to time n.
Equations
- Learning.IsAlgEnvSeq.hist A R' n ω i = (A (↑i) ω, R' (↑i) ω)
Instances For
An algorithm-environment sequence: a sequence of actions and rewards generated by an algorithm interacting with an environment.
- measurable_A (n : ℕ) : Measurable (A n)
- measurable_R (n : ℕ) : Measurable (R' n)
- hasLaw_action_zero : ProbabilityTheory.HasLaw (fun (ω : Ω) => A 0 ω) alg.p0 P
- hasCondDistrib_reward_zero : ProbabilityTheory.HasCondDistrib (R' 0) (A 0) env.ν0 P
- hasCondDistrib_action (n : ℕ) : ProbabilityTheory.HasCondDistrib (A (n + 1)) (hist A R' n) (alg.policy n) P
- hasCondDistrib_reward (n : ℕ) : ProbabilityTheory.HasCondDistrib (R' (n + 1)) (fun (ω : Ω) => (hist A R' n ω, A (n + 1) ω)) (env.feedback n) P
Instances For
An algorithm-environment sequence until time N: a sequence of actions and rewards generated, up to time N, by an algorithm interacting with an environment.
- measurable_A (n : ℕ) : Measurable (A n)
- measurable_R (n : ℕ) : Measurable (R' n)
- hasLaw_action_zero : ProbabilityTheory.HasLaw (fun (ω : Ω) => A 0 ω) alg.p0 P
- hasCondDistrib_reward_zero : ProbabilityTheory.HasCondDistrib (R' 0) (A 0) env.ν0 P
- hasCondDistrib_action (n : ℕ) (hn : n < N) : ProbabilityTheory.HasCondDistrib (A (n + 1)) (IsAlgEnvSeq.hist A R' n) (alg.policy n) P
- hasCondDistrib_reward (n : ℕ) (hn : n < N) : ProbabilityTheory.HasCondDistrib (R' (n + 1)) (fun (ω : Ω) => (IsAlgEnvSeq.hist A R' n ω, A (n + 1) ω)) (env.feedback n) P
Instances For
Filtration generated by the history up to time n.
Equations
- Learning.IsAlgEnvSeq.filtration hA hR' = { seq := fun (i : ℕ) => MeasurableSpace.comap (Learning.IsAlgEnvSeq.hist A R' i) inferInstance, mono' := ⋯, le' := ⋯ }
Instances For
Filtration generated by the history at time n-1 together with the action at time n.
Equations
- One or more equations did not get rendered due to their size.
Instances For
Measure on the sequence of actions and observations generated by the algorithm/environment.
Equations
- Learning.trajMeasure alg env = ProbabilityTheory.Kernel.trajMeasure (alg.p0.compProd env.ν0) (Learning.stepKernel alg env)
Instances For
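By the Ionescu-Tulcea construction underlying ProbabilityTheory.Kernel.trajMeasure, the finite-dimensional laws of trajMeasure are obtained by chaining the initial distribution with the step kernels. A sketch of the law of the first n + 1 action-observation pairs, writing h_k for the history (a_0, r_0), …, (a_k, r_k):

```latex
\mathrm{d}\mathbb{P}(a_0, r_0, \dots, a_n, r_n)
  = p_0(\mathrm{d}a_0)\,\nu_0(a_0)(\mathrm{d}r_0)
    \prod_{k=0}^{n-1} (\mathrm{stepKernel}\ k)(h_k)(\mathrm{d}a_{k+1}, \mathrm{d}r_{k+1})
```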
Equations
- ⋯ = ⋯
The law of the sequence of actions and observations generated by an algorithm-environment pair is unique: it does not depend on the probability space used.