Bookkeeping definitions for finite action space sequential learning problems #
If the number of actions is finite, it makes sense to define the number of times each action was chosen, the time at which an action was chosen for the n-th time, the value of the reward at that time, the sum of rewards obtained for each action, the empirical mean reward for each action, etc.
For each definition that takes as arguments a time t : ℕ, a history h : ℕ → α × R, and possibly
other parameters, we put the time and the history last, in that order, so that the definition can
be seen as a stochastic process indexed by the time t on the measurable space ℕ → α × R.
Number of times action a was chosen up to time t (excluding t).
Equations
- Learning.pullCount A a t ω = {s ∈ Finset.range t | A s ω = a}.card
Instances For
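As an illustration of this definition (and of the argument-order convention described at the top of this file), here is a minimal standalone sketch; the names are hypothetical and this is not the library's actual declaration.

```lean
import Mathlib

/-- Hypothetical standalone version of the counting operation: the number of
times the action process `A` selects `a` strictly before time `t`, at sample
point `ω`. -/
def pullCountSketch {Ω α : Type*} [DecidableEq α]
    (A : ℕ → Ω → α) (a : α) (t : ℕ) (ω : Ω) : ℕ :=
  ((Finset.range t).filter (fun s => A s ω = a)).card

-- Deterministic toy example on `Ω = Unit`: the action alternates 0, 1, 0, 1, 0.
#eval pullCountSketch (fun n (_ : Unit) => n % 2) 0 5 ()  -- expected: 3
```

Partially applying the static arguments, `pullCountSketch A a : ℕ → Ω → ℕ` is a time-indexed family of functions on the sample space, which is the process view mentioned above.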
Number of pulls of arm a up to (and including) time n.
This is the number of entries in h in which the arm is a.
Equations
- Learning.pullCount' n h a = {s : ↥(Finset.Iic n) | (h s).1 = a}.card
Instances For
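A minimal standalone sketch of this history-based count, assuming hypothetical names and using `Finset.filter` in place of the set-builder form printed above:

```lean
import Mathlib

/-- Hypothetical standalone version of the history-based count: the number of
entries of the history `h` up to and including time `n` whose action is `a`. -/
def pullCountSketch' {α R : Type*} [DecidableEq α]
    (n : ℕ) (h : ℕ → α × R) (a : α) : ℕ :=
  ((Finset.Iic n).filter (fun s => (h s).1 = a)).card

-- History: action `s % 2` with reward `1` at every step; arm 0 is played at times 0, 2, 4.
#eval pullCountSketch' 4 (fun s => (s % 2, (1 : ℤ))) 0  -- expected: 3
```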
Number of steps until action a was pulled exactly m times.
Equations
Instances For
If we pull action a at time 0, the first time at which it is pulled once is 0.
stepsUntil a m is a stopping time with respect to the filtration filtrationAction.
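Since no equation is printed for `stepsUntil` above, the following is only a hedged sketch of the intended behaviour, restricted to a finite horizon `T` so that it stays computable; the names are hypothetical and the library's actual definition may differ.

```lean
import Mathlib

/-- Hypothetical sketch of "first time at which arm `a` has been pulled `m` times":
the least `n` such that the number of pulls of `a` among times `0, …, n` equals `m`,
or `none` if this does not happen before the horizon `T`. -/
def stepsUntilSketch {Ω α : Type*} [DecidableEq α]
    (A : ℕ → Ω → α) (a : α) (m : ℕ) (T : ℕ) (ω : Ω) : Option ℕ :=
  (List.range T).find? (fun n =>
    ((Finset.range (n + 1)).filter (fun s => A s ω = a)).card == m)

-- Arm 0 is chosen at even times; it has been pulled twice for the first time at time 2.
#eval stepsUntilSketch (fun n (_ : Unit) => n % 2) 0 2 10 ()  -- expected: some 2
-- It is pulled at time 0, so the first time at which it has been pulled once is 0.
#eval stepsUntilSketch (fun n (_ : Unit) => n % 2) 0 1 10 ()  -- expected: some 0
```

The second evaluation illustrates the statement above: if arm `0` is pulled at time `0`, the first time at which it has been pulled once is `0`.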
Reward obtained when pulling action a for the m-th time.
If it is never pulled m times, the reward is given by the second component of ω, which in
applications will be independent, with the same law.
Equations
- Learning.rewardByCount A R' a m ω = match Learning.stepsUntil A a m ω.1 with | none => ω.2 m a | some n => R' n ω.1
Instances For
The value at m = 0 does not matter (it would be the "zeroth" reward); it should be considered a junk value.
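A hedged standalone sketch of this bookkeeping, with hypothetical names, a finite search horizon in place of `stepsUntil`, and rewards specialised to `ℚ`:

```lean
import Mathlib

/-- Hypothetical sketch of "reward at the `m`-th pull of `a`", on a finite horizon `T`.
The sample point is a pair: a base point `ω.1` driving actions and rewards, and a
fallback table `ω.2` used when the arm is never pulled `m` times. -/
def rewardByCountSketch {Ω α : Type*} [DecidableEq α]
    (A : ℕ → Ω → α) (R' : ℕ → Ω → ℚ) (T : ℕ)
    (a : α) (m : ℕ) (ω : Ω × (ℕ → α → ℚ)) : ℚ :=
  match (List.range T).find? (fun n =>
      ((Finset.range (n + 1)).filter (fun s => A s ω.1 = a)).card == m) with
  | none => ω.2 m a     -- arm `a` never reaches `m` pulls: use the fallback stream
  | some n => R' n ω.1  -- reward observed at the step of the `m`-th pull

-- Arm 0 is pulled at times 0, 2, 4, … and the reward at time `n` is `n`.
-- Its 2nd pull happens at time 2, so the associated reward is 2.
#eval rewardByCountSketch (fun n (_ : Unit) => n % 2) (fun n _ => (n : ℚ)) 10 0 2
  ((), fun _ _ => 0)  -- expected: 2
```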
Sum of rewards obtained when pulling action a up to time t (exclusive).
Equations
- Learning.sumRewards A R' a t ω = ∑ s ∈ Finset.range t, if A s ω = a then R' s ω else 0
Instances For
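A minimal standalone sketch of this sum, with hypothetical names and rewards specialised to `ℚ`:

```lean
import Mathlib

/-- Hypothetical standalone sketch: total reward collected from arm `a` strictly
before time `t`. -/
def sumRewardsSketch {Ω α : Type*} [DecidableEq α]
    (A : ℕ → Ω → α) (R' : ℕ → Ω → ℚ) (a : α) (t : ℕ) (ω : Ω) : ℚ :=
  ∑ s ∈ Finset.range t, if A s ω = a then R' s ω else 0

-- Arm 0 is pulled at times 0, 2, 4 with reward equal to the time, so 0 + 2 + 4 = 6.
#eval sumRewardsSketch (fun n (_ : Unit) => n % 2) (fun n _ => (n : ℚ)) 0 6 ()  -- expected: 6
```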
Sum of rewards of arm a up to (and including) time n.
Equations
- Learning.sumRewards' n h a = ∑ s : ↥(Finset.Iic n), if (h s).1 = a then (h s).2 else 0
Instances For
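The history-based sum admits the same kind of sketch (hypothetical names, rewards specialised to `ℚ`, summing over the `Finset` rather than the coerced subtype):

```lean
import Mathlib

/-- Hypothetical standalone sketch: total reward recorded for arm `a` in the
history `h` up to and including time `n`. -/
def sumRewardsSketch' {α : Type*} [DecidableEq α]
    (n : ℕ) (h : ℕ → α × ℚ) (a : α) : ℚ :=
  ∑ s ∈ Finset.Iic n, if (h s).1 = a then (h s).2 else 0

-- History: arm `s % 2` with reward `s` at step `s`; arm 0 collects 0 + 2 + 4 = 6 up to time 5.
#eval sumRewardsSketch' 5 (fun s => (s % 2, (s : ℚ))) 0  -- expected: 6
```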
Empirical mean reward obtained when pulling action a up to time t (exclusive).
Equations
- Learning.empMean A R' a t ω = Learning.sumRewards A R' a t ω / ↑(Learning.pullCount A a t ω)
Instances For
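A minimal standalone sketch (hypothetical names, rewards specialised to `ℚ`): the empirical mean is the sum of rewards divided by the cast pull count.

```lean
import Mathlib

/-- Hypothetical standalone sketch: empirical mean reward of arm `a` strictly
before time `t`, i.e. total reward divided by the number of pulls. -/
def empMeanSketch {Ω α : Type*} [DecidableEq α]
    (A : ℕ → Ω → α) (R' : ℕ → Ω → ℚ) (a : α) (t : ℕ) (ω : Ω) : ℚ :=
  (∑ s ∈ Finset.range t, if A s ω = a then R' s ω else 0) /
    (((Finset.range t).filter (fun s => A s ω = a)).card : ℚ)

-- Arm 0 is pulled 3 times before time 6 with rewards 0, 2, 4, so its empirical mean is 2.
#eval empMeanSketch (fun n (_ : Unit) => n % 2) (fun n _ => (n : ℚ)) 0 6 ()  -- expected: 2
```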
Empirical mean of arm a at time n.
Equations
- Learning.empMean' n h a = Learning.sumRewards' n h a / ↑(Learning.pullCount' n h a)
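A sketch of the history-based empirical mean under the same assumptions; the second evaluation shows that, at least for `ℚ`-valued rewards, an arm that never appears in the history gets empirical mean `0`, since division by zero is zero.

```lean
import Mathlib

/-- Hypothetical standalone sketch of the history-based empirical mean
(rewards specialised to `ℚ`). -/
def empMeanSketch' {α : Type*} [DecidableEq α]
    (n : ℕ) (h : ℕ → α × ℚ) (a : α) : ℚ :=
  (∑ s ∈ Finset.Iic n, if (h s).1 = a then (h s).2 else 0) /
    (((Finset.Iic n).filter (fun s => (h s).1 = a)).card : ℚ)

-- Arm 0 appears at times 0, 2, 4 with rewards 0, 2, 4: empirical mean 2.
#eval empMeanSketch' 5 (fun s => (s % 2, (s : ℚ))) 0  -- expected: 2
-- An arm that never appears gets `0 / 0 = 0` under `ℚ`'s division convention.
#eval empMeanSketch' 5 (fun s => (s % 2, (s : ℚ))) 7  -- expected: 0
```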