VAPO: An actor-critic RL framework introduced in this paper to stabilize training
DAPO: A critic-free policy-gradient RL framework introduced in this paper for stable optimization
MoE: Mixture-of-Experts—a model architecture that activates only a subset of parameters (experts) per token to save compute
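A minimal single-token routing sketch of the idea, assuming a tiny dense-expert layer with top-k softmax gating (the shapes, `moe_forward` helper, and random weights are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, top_k = 8, 4, 2

# Hypothetical router and expert weights for one MoE layer.
W_router = rng.normal(size=(d, n_experts))
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]

def moe_forward(x: np.ndarray) -> np.ndarray:
    logits = x @ W_router                 # router score per expert
    top = np.argsort(logits)[-top_k:]     # indices of the top-k experts
    gates = np.exp(logits[top])
    gates /= gates.sum()                  # softmax over the selected experts only
    # Only the selected experts run, so compute scales with top_k, not n_experts.
    return sum(g * (x @ experts[i]) for g, i in zip(gates, top))

y = moe_forward(rng.normal(size=d))
```

The compute saving comes from the last line: the unselected experts are never evaluated, so the per-token FLOPs depend on `top_k` rather than the total expert count.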
CoT: Chain-of-Thought—a prompting technique where the model generates intermediate reasoning steps before the final answer
SFT: Supervised Fine-Tuning—training the model on labeled data before RL
Pass@k: A metric estimating the fraction of problems for which at least one of k sampled attempts is correct
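The standard unbiased estimator for this metric draws n samples per problem, counts the c correct ones, and computes the probability that a random size-k subset contains at least one success (the `pass_at_k` name and arguments here are illustrative):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate from n samples of which c are correct.

    P(at least one of k samples drawn without replacement is correct)
        = 1 - C(n - c, k) / C(n, k)
    """
    if n - c < k:  # fewer than k incorrect samples: every k-subset has a hit
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples, 3 correct, k = 1
print(pass_at_k(10, 3, 1))  # 0.3
```

Averaging this value over all problems gives the reported Pass@k score.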
Process Reward Model: A reward model that scores the intermediate steps of the reasoning process, rather than only the final output (in contrast to an outcome reward model)
WoN: Worst of N; a metric used here for data cleaning: a problem is removed when even the model's worst of N attempts is correct (i.e., all N attempts succeed), since such problems are too easy to provide a training signal
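A minimal sketch of this filtering rule, assuming each problem maps to a list of boolean attempt outcomes (the `won_filter` helper and data layout are hypothetical):

```python
def won_filter(results: dict[str, list[bool]]) -> dict[str, list[bool]]:
    """Drop problems whose worst attempt is still correct, i.e. all N attempts pass."""
    return {pid: attempts for pid, attempts in results.items()
            if not all(attempts)}

# Example: p1 is solved in every attempt and gets filtered out; p2 is kept.
kept = won_filter({"p1": [True, True, True], "p2": [True, False, True]})
```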
Elo Score: A comparative rating system often used in competitive programming; this paper reports direct pass rates instead because Elo estimates are noisy