Lean 4: A functional programming language and interactive theorem prover used for formalizing mathematics
CoT: Chain-of-Thought—a reasoning technique where the model generates intermediate steps before the final answer
GRPO: Group Relative Policy Optimization—an RL algorithm that optimizes policies based on relative rewards of a group of samples, eliminating the need for a critic model
subgoal: An intermediate lemma or proposition that serves as a stepping stone to proving the main theorem
tactic: A command or instruction in a theorem prover that advances the proof state
expert iteration: A training method where a model generates new data (proofs) which are filtered for correctness and used to retrain/improve the model
Pass@K: A metric measuring the probability that at least one correct solution is found in K generated attempts
cold start: Initial training phase using high-quality synthetic data to establish basic capabilities before reinforcement learning
RL: Reinforcement Learning—training models by rewarding desired behaviors (correct proofs) and penalizing incorrect ones