PPO: Proximal Policy Optimization—an RL algorithm that stabilizes training by clipping its policy-update objective, keeping each update close to the previous policy
LIM: Learning Impact Measurement—the authors' proposed method for scoring training samples based on how well their individual reward curves align with the model's global learning curve
SFT: Supervised Fine-Tuning—training on labeled examples (input-output pairs) using standard cross-entropy loss
RL: Reinforcement Learning—training models by rewarding correct outputs rather than just mimicking target text
alignment score: A calculated value measuring the correlation between a specific sample's reward trajectory and the model's average reward trajectory
OpenRLHF: An open-source framework for high-performance RLHF training
vLLM: A high-throughput and memory-efficient inference engine for LLMs
MATH500: A 500-problem subset of the MATH benchmark used for evaluation
AIME24: American Invitational Mathematics Examination 2024—a challenging math competition benchmark
AMC23: American Mathematics Competitions 2023—a math competition benchmark
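The alignment score defined above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it uses Pearson correlation as the trajectory-similarity measure and a hypothetical keep-threshold of 0.5; the authors' exact normalization may differ.

```python
from math import sqrt

def alignment_scores(sample_rewards):
    """Score each training sample by how closely its reward trajectory
    tracks the model's average (global) learning curve.

    sample_rewards: list of equal-length reward trajectories, one per
    sample, recorded at successive training checkpoints.
    """
    n_steps = len(sample_rewards[0])
    # Global learning curve: mean reward across all samples at each checkpoint.
    global_curve = [
        sum(traj[t] for traj in sample_rewards) / len(sample_rewards)
        for t in range(n_steps)
    ]

    def pearson(a, b):
        ma, mb = sum(a) / len(a), sum(b) / len(b)
        cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
        na = sqrt(sum((x - ma) ** 2 for x in a))
        nb = sqrt(sum((y - mb) ** 2 for y in b))
        if na == 0 or nb == 0:
            return 0.0  # flat trajectory carries no learning signal
        return cov / (na * nb)

    return [pearson(traj, global_curve) for traj in sample_rewards]

rewards = [
    [0.1, 0.3, 0.6, 0.8],  # improves with training: aligned
    [0.5, 0.5, 0.5, 0.5],  # flat: contributes no signal
    [0.8, 0.6, 0.4, 0.2],  # degrades: anti-aligned
]
scores = alignment_scores(rewards)
keep = [i for i, s in enumerate(scores) if s > 0.5]  # retain aligned samples
```

Samples whose scores fall below the threshold would be dropped from the RL training set, which is the selection step the LIM method performs.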