2. Defining an Algorithm
This tutorial explains the structures used in the Lean Machine Learning library to describe algorithms, the environment they interact with, and how to state theorems about their interaction. We then illustrate them with the UCB bandit algorithm.
2.1. Algorithm and environment
In LML, we prove theorems about the interaction of an algorithm with an environment.
An algorithm takes actions, to which the environment responds with feedback (e.g., rewards for the bandit case, the gradient of a function in optimization problems).
In general, both action and feedback can depend on the entire history up to the current time and can be randomized.
The Algorithm structure is defined as follows:
structure Algorithm (α R : Type*) [MeasurableSpace α] [MeasurableSpace R] where
/-- Policy or sampling rule: distribution of the next action. -/
policy : (n : ℕ) → Kernel (Iic n → α × R) α
[h_policy : ∀ n, IsMarkovKernel (policy n)]
/-- Distribution of the first action. -/
p0 : Measure α
[hp0 : IsProbabilityMeasure p0]
This structure refers to two types, the type of actions α and the type of feedback R.
Both are measurable spaces, since we consider stochastic algorithms and environments.
The interaction will start with the algorithm playing a first action, which is in general random with distribution p0.
The field hp0 registers that p0 is a probability measure (and it is in square brackets to tell Lean to infer it automatically whenever possible).
After time n, there is a history of actions and feedbacks of type Iic n → α × R (n + 1 pairs of an action and a feedback).
So after time 0 (the processes are 0-indexed) the history contains the action at time 0 and the feedback that followed.
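As a concrete (hypothetical) illustration, suppose there are two arms and real-valued rewards, so α = Fin 2 and R = ℝ. A history after time 1 could then look like this, assuming Iic n denotes the coerced finite set of times:

```lean
-- Hypothetical illustration: a history after time 1, in which arm 0 gave
-- reward 1.5 at time 0 and arm 1 gave reward -0.3 at time 1.
example : (Iic 1 → Fin 2 × ℝ) :=
  fun i ↦ if (i : ℕ) = 0 then (0, 1.5) else (1, -0.3)
```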
The policy field contains, for each time n, a kernel from that history to the action space.
That is, it maps every possible history to a random next action (and that map is measurable).
The h_policy field records that the measure describing the next action is a probability measure.
If the algorithm's actions are not random, we can use the detAlgorithm definition to build an algorithm from the data of a measurable function giving the next action and a choice of first action.
def detAlgorithm (nextAction : (n : ℕ) → (Iic n → α × R) → α)
(h_next : ∀ n, Measurable (nextAction n)) (action0 : α) :
Algorithm α R where
policy n := Kernel.deterministic (nextAction n) (h_next n)
p0 := Measure.dirac action0
We can see here that we did not need to prove that the kernels satisfy IsMarkovKernel, nor that the distribution of the first action is a probability measure.
Lean knows that deterministic kernels are Markov.
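For instance, one could sketch an algorithm that ignores the history and always plays a fixed action. The name constAlgorithm below is hypothetical (it is not part of the library); it only illustrates the interface of detAlgorithm:

```lean
-- Hypothetical example: an algorithm that always plays the fixed action `a₀`,
-- built from `detAlgorithm` with a constant next-action function.
def constAlgorithm (a₀ : α) : Algorithm α R :=
  detAlgorithm (fun _ _ ↦ a₀) (fun _ ↦ measurable_const) a₀
```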
The Environment structure is the mirror of the Algorithm structure, with a kernel for the feedback instead of the actions and a kernel for the first feedback instead of the first action.
structure Environment (α R : Type*) [MeasurableSpace α] [MeasurableSpace R] where
/-- Distribution of the next observation as function of the past history. -/
feedback : (n : ℕ) → Kernel ((Iic n → α × R) × α) R
[h_feedback : ∀ n, IsMarkovKernel (feedback n)]
/-- Distribution of the first observation given the first action. -/
ν0 : Kernel α R
[hp0 : IsMarkovKernel ν0]
ν0 gives the distribution of the first feedback given the first action, and feedback gives the distribution of the next feedback given the history and the next action.
In many applications the feedback depends only on the last action and not on the prior history.
We provide a stationaryEnv definition that builds an environment for those cases.
def stationaryEnv (ν : Kernel α R) [IsMarkovKernel ν] : Environment α R where
feedback _ := ν.prodMkLeft _
ν0 := ν
ν.prodMkLeft _ is the kernel ν seen as a Kernel ((Iic n → α × R) × α) R by ignoring the history.
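As an example, one could sketch an environment whose reward is a deterministic function of the last action, using Mathlib's Kernel.deterministic (the name detStationaryEnv is hypothetical, not part of the library):

```lean
-- Hypothetical example: a stationary environment in which action `a` always
-- yields the deterministic reward `f a`. `Kernel.deterministic f hf` is a
-- Markov kernel, so `stationaryEnv` applies.
def detStationaryEnv (f : α → ℝ) (hf : Measurable f) : Environment α ℝ :=
  stationaryEnv (Kernel.deterministic f hf)
```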
2.2. Sequences of actions and feedback, probability space
Once algorithm and environment are defined, we can introduce sequences of actions and feedback and assume that they are generated by the interaction of the algorithm with the environment.
This is done by the IsAlgEnvSeq structure.
structure IsAlgEnvSeq
[StandardBorelSpace α] [Nonempty α] [StandardBorelSpace R] [Nonempty R]
(A : ℕ → Ω → α) (R' : ℕ → Ω → R) (alg : Algorithm α R) (env : Environment α R)
(P : Measure Ω) [IsFiniteMeasure P] : Prop where
measurable_A n : Measurable (A n) := by fun_prop
measurable_R n : Measurable (R' n) := by fun_prop
hasLaw_action_zero : HasLaw (fun ω ↦ (A 0 ω)) alg.p0 P
hasCondDistrib_reward_zero : HasCondDistrib (R' 0) (A 0) env.ν0 P
hasCondDistrib_action n :
HasCondDistrib (A (n + 1)) (IsAlgEnvSeq.hist A R' n) (alg.policy n) P
hasCondDistrib_reward n :
HasCondDistrib (R' (n + 1)) (fun ω ↦ (IsAlgEnvSeq.hist A R' n ω, A (n + 1) ω))
(env.feedback n) P
This structure takes as input two sequences of random variables (two stochastic processes), A and R', which represent the actions and feedback generated by the interaction of the algorithm with the environment.
It states that those sequences are measurable and that they have the correct conditional distributions given by the algorithm and environment.
The measurable space Ω and the measure P are not imposed: they can be chosen as we want, as long as the conditions of IsAlgEnvSeq are satisfied.
This definition requires α and R to be nonempty standard Borel spaces, because Mathlib's theory about conditional distributions requires those assumptions.
All spaces of interest in machine learning are standard Borel, so this is not a restriction.
Given any algorithm and environment, there always exists a sequence of actions and feedback that satisfies IsAlgEnvSeq by the Ionescu-Tulcea theorem.
However, other constructions of such sequences are possible, and it is easier to work with generic sequences satisfying IsAlgEnvSeq than with a specific construction.
Which sequence we choose does not matter for the results we prove, since all such sequences are equal in distribution, as stated by the following theorem.
theorem isAlgEnvSeq_unique (h1 : IsAlgEnvSeq A₁ R₁ alg env P)
(h2 : IsAlgEnvSeq A₂ R₂ alg env P') :
P.map (fun ω n ↦ (A₁ n ω, R₁ n ω)) = P'.map (fun ω n ↦ (A₂ n ω, R₂ n ω))
2.3. Example: the UCB algorithm and a stochastic bandit environment
We now illustrate the use of Algorithm, Environment, and IsAlgEnvSeq by defining the UCB bandit algorithm, a bandit environment, and stating a theorem about the regret of UCB in that environment.
In a stochastic bandit, an algorithm chooses at each time an action from a finite set (here Fin K, the type of natural numbers less than K) and receives a reward drawn from a distribution that depends only on the action, not on the prior history.
The environment is thus simply stationaryEnv ν for some kernel ν : Kernel (Fin K) ℝ.
2.3.1. Algorithm
The UCB algorithm chooses at time n + 1 the action that maximizes the sum of the empirical mean reward and an exploration bonus.
It starts by choosing each action once and then chooses \arg\max_a (\hat{\mu}_{n,a} + \sqrt{\frac{2c \log (n + 2)}{N_{n,a}}}), in which \hat{\mu}_{n,a} is the empirical mean reward of action a at time n (empMean' in the code), N_{n,a} is the number of times action a has been chosen up to time n (pullCount' in the code), and c is a parameter of the algorithm.
To define the algorithm, we first define the exploration bonus and the next action function, and then we use detAlgorithm to build the algorithm.
We also need to prove that the next action function is measurable, which is done by the measurable_nextArm lemma.
Note that we are careful to use a measurable version of the argmax function, measurableArgmax.
/-- The exploration bonus of the UCB algorithm, which corresponds to the width of
a confidence interval. -/
noncomputable def ucbWidth' (c : ℝ) (n : ℕ) (h : Iic n → Fin K × ℝ) (a : Fin K) : ℝ :=
√(2 * c * log (n + 2) / pullCount' n h a)
open Classical in
/-- Arm pulled by the UCB algorithm at time `n + 1`. -/
noncomputable
def UCB.nextArm (hK : 0 < K) (c : ℝ) (n : ℕ) (h : Iic n → Fin K × ℝ) : Fin K :=
have : Nonempty (Fin K) := Fin.pos_iff_nonempty.mp hK
if n < K - 1 then ⟨(n + 1) % K, Nat.mod_lt _ hK⟩ else
measurableArgmax (fun h a ↦ empMean' n h a + ucbWidth' c n h a) h
@[fun_prop]
lemma UCB.measurable_nextArm (hK : 0 < K) (c : ℝ) (n : ℕ) :
    Measurable (nextArm hK c n) := by
  refine Measurable.ite (MeasurableSet.const _) measurable_const ?_
  -- The remaining goal is measurability of the argmax of `empMean' + ucbWidth'`,
  -- which reduces to measurability in the history for each fixed arm.
  fun_prop
/-- The UCB algorithm. -/
noncomputable
def ucbAlgorithm (hK : 0 < K) (c : ℝ) : Algorithm (Fin K) ℝ :=
detAlgorithm (UCB.nextArm hK c) (by fun_prop) ⟨0, hK⟩
The last line builds the algorithm using detAlgorithm and the function UCB.nextArm.
Its measurability is proved by the fun_prop tactic, which proves measurability of functions by using lemmas tagged with @[fun_prop].
The last argument ⟨0, hK⟩ is the first action of the algorithm, which is 0 as an element of Fin K.
2.3.2. A theorem about UCB
We can now state a theorem about the regret of UCB in a stochastic bandit environment (which we won't prove here). Let's first define the regret, which for stochastic bandits is the difference between the sum of mean rewards the algorithm would have obtained by always playing the best action and the sum of mean rewards of the actions it actually played.
def regret (ν : Kernel α ℝ) (A : ℕ → Ω → α) (t : ℕ) (ω : Ω) : ℝ :=
t * (⨆ a, (ν a)[id]) - ∑ s ∈ range t, (ν (A s ω))[id]
The quantity (ν a)[id] is the mean reward of action a in the environment defined by ν.
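In the usual bandit notation, writing \mu_a for the mean reward of action a, the Lean definition above corresponds to

```latex
R_t(\omega) = t \, \sup_{a} \mu_a \;-\; \sum_{s=0}^{t-1} \mu_{A_s(\omega)} .
```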
We can now state the regret bound for UCB.
lemma regret_le [Nonempty (Fin K)]
(h : IsAlgEnvSeq A R (ucbAlgorithm hK (c * σ2)) (stationaryEnv ν) P)
(hν : ∀ a, HasSubgaussianMGF (fun x ↦ x - (ν a)[id]) σ2 (ν a))
(hσ2 : σ2 ≠ 0) (hc : 0 < c) (n : ℕ) :
P[regret ν A n] ≤
∑ a, (8 * c * σ2 * log (n + 1) / gap ν a + gap ν a * (2 + 2 * (constSum c n).toReal))
The arguments of the theorem are the following:
- h states that the sequence of actions and rewards we are considering is generated by the interaction of UCB with parameter c * σ2 with the stationary environment defined by ν.
- hν states that the reward distribution of each arm is subgaussian with variance proxy σ2.
- hσ2 and hc state that the parameters σ2 and c are positive.
- n is the time horizon.
The theorem gives an upper bound on the expected regret of UCB at time n.
2.4. Building vs analyzing algorithms
When building an algorithm, we describe it with functions from the history (Iic n → α × R) to the action space α.
Thus, to construct UCB, we used the following empirical mean function.
def empMean' (n : ℕ) (h : Iic n → α × ℝ) (a : α) :=
(sumRewards' n h a) / (pullCount' n h a)
When analyzing an algorithm, we work with sequences of actions and rewards A : ℕ → Ω → α and R' : ℕ → Ω → R that satisfy IsAlgEnvSeq.
For the analysis, the empirical mean is defined as a stochastic process on the same probability space Ω.
def empMean (A : ℕ → Ω → α) (R' : ℕ → Ω → ℝ) (a : α) (t : ℕ) (ω : Ω) : ℝ :=
sumRewards A R' a t ω / pullCount A a t ω
empMean A R' a is a stochastic process with type ℕ → Ω → ℝ that gives the empirical mean of action a at each time.