Expert Iteration: An iterative algorithm in which the model generates samples, the correct ones are kept via rejection sampling, and the model is fine-tuned on those correct samples
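One round of the loop can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `toy_policy` stand-in and the exact-match correctness check are assumptions for the example.

```python
def expert_iteration_round(policy, prompts, answers, n_samples=8):
    """One round of expert iteration: sample from the model, keep only
    correct samples (rejection sampling), return them as SFT data."""
    sft_data = []
    for prompt, gold in zip(prompts, answers):
        for _ in range(n_samples):
            completion = policy(prompt)   # sample a candidate solution
            if completion == gold:        # filter: keep correct ones only
                sft_data.append((prompt, completion))
    return sft_data  # in practice: fine-tune on this set, then repeat

# Hypothetical stand-in for an LLM policy, for illustration only.
toy_policy = lambda prompt: "2"
data = expert_iteration_round(toy_policy, ["1+1="], ["2"])
```

A full run would alternate this sampling/filtering step with supervised fine-tuning on the collected data.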
PPO: Proximal Policy Optimization—an online RL algorithm that updates a policy while limiting how much it changes from the previous version to ensure stability
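The "limited change" in PPO comes from its clipped surrogate objective. A per-token sketch (the full objective also adds value-function and entropy terms, which are omitted here):

```python
import math

def ppo_clip_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO clipped surrogate for one token: caps how much the probability
    ratio between the new and old policy can move the update."""
    ratio = math.exp(logp_new - logp_old)   # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps) * advantage
    return min(unclipped, clipped)          # pessimistic (lower) bound
```

When the ratio drifts outside [1 - eps, 1 + eps], the clipped branch caps the objective, so the gradient stops encouraging further movement away from the old policy.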
SFT: Supervised Fine-Tuning—training a model on a fixed dataset of correct examples using standard cross-entropy loss
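Concretely, the SFT objective is the average negative log-likelihood of the gold tokens:

```python
import math

def sft_loss(token_logprobs):
    """Cross-entropy of the gold tokens, averaged over the sequence:
    the standard SFT objective (negative log-likelihood)."""
    return -sum(token_logprobs) / len(token_logprobs)
```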
maj@1: Accuracy when checking the model's single greedily decoded output
pass@96: The probability that at least one solution is correct when sampling 96 times from the model
maj@96: Accuracy when sampling 96 times and taking the majority vote of the final answers
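The three sampling-based metrics above can be computed from a set of model samples. A minimal sketch: maj@k is a majority vote over final answers, and pass@k uses the standard unbiased estimator (given n samples of which c are correct).

```python
from collections import Counter
from math import comb

def maj_at_k(answers, correct):
    """maj@k: majority vote over the k sampled final answers."""
    winner, _ = Counter(answers).most_common(1)[0]
    return winner == correct

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn (without replacement) from n, c of them correct,
    is correct: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For pass@96 one would typically draw n >= 96 samples and evaluate `pass_at_k(n, c, 96)`.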
rerank@96: Accuracy when sampling 96 times and selecting the best answer using a trained reward model (ORM)
ORM: Outcome-Based Reward Model—a model trained to predict if a partial or full solution will lead to a correct answer
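Reranking with an ORM reduces to scoring each sample and keeping the best one. A minimal sketch; the dict-backed `reward_model` below is a hypothetical stand-in for a trained ORM that scores (question, solution) pairs:

```python
def rerank(samples, reward_model):
    """rerank@k: score each sampled solution with a reward model and
    return the highest-scoring one."""
    return max(samples, key=reward_model)

# Hypothetical ORM scores for three sampled solutions (illustration only).
scores = {"x=2": 0.9, "x=3": 0.4, "x=5": 0.1}
best = rerank(list(scores), reward_model=scores.get)
```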
RCRL: Return-Conditioned RL—training a model conditioned on a desired return (reward) token, then prompting it with the high-reward token at inference
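In RCRL, conditioning is implemented purely through the data format. A minimal sketch; the `<|good|>` / `<|bad|>` token names are hypothetical, chosen for illustration:

```python
GOOD, BAD = "<|good|>", "<|bad|>"  # hypothetical reward-condition tokens

def make_rcrl_example(question, solution, is_correct):
    """Training example: prepend a token encoding the observed return."""
    tag = GOOD if is_correct else BAD
    return f"{tag} {question} {solution}"

def rcrl_prompt(question):
    """Inference prompt: condition on the high-reward token so the model
    imitates its correct-solution distribution."""
    return f"{GOOD} {question}"
```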
LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes the main model weights and trains small adapter matrices
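The adapter math can be sketched in a few lines of NumPy. This shows only one linear layer; the sizes and scaling factor below are illustrative, not values from the source:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 8, 2, 16              # hidden size, LoRA rank, scaling

W = rng.normal(size=(d, d))         # frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.01  # trainable down-projection
B = np.zeros((d, r))                # trainable up-projection, zero-init
                                    # so the adapter starts as a no-op

def lora_forward(x):
    """y = W x + (alpha / r) * B A x; only A and B receive gradients."""
    return W @ x + (alpha / r) * (B @ (A @ x))
```

Because B starts at zero, the adapted model initially matches the frozen base model exactly, and training only updates the small A and B matrices.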
KL penalty: A regularizer used in RL to prevent the trained policy from diverging too far from a reference model
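In RLHF-style training the KL penalty is usually folded into the per-token reward. A minimal sketch, using the common per-token log-ratio approximation of the KL term:

```python
def kl_shaped_reward(reward, logp_policy, logp_ref, beta=0.1):
    """Per-token shaped reward: subtract beta times the log-probability
    ratio between the trained policy and the frozen reference model,
    penalizing tokens where the policy has drifted from the reference."""
    return reward - beta * (logp_policy - logp_ref)
```

When the policy assigns a token higher log-probability than the reference does, the penalty reduces the reward, pulling the policy back toward the reference model.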