AceCode-87K: The curated dataset of 87K coding questions and 1.38M validated test cases created by this paper
Bradley-Terry loss: A probabilistic model used to train reward models by predicting the probability that one response is preferred over another based on their score difference
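A minimal sketch of how the Bradley-Terry loss turns a score difference into a preference probability (function names are illustrative, not from the paper):

```python
import math

def bradley_terry_loss(score_chosen: float, score_rejected: float) -> float:
    """Negative log-likelihood that the chosen response beats the rejected one.

    The Bradley-Terry model sets P(chosen > rejected) =
    sigmoid(score_chosen - score_rejected), so the training loss is
    -log sigmoid(score_chosen - score_rejected).
    """
    diff = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-diff)))
```

A wider positive score gap yields a smaller loss, pushing the reward model to score preferred responses higher.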
Best-of-N sampling: A test-time inference strategy where N solutions are generated, and a reward model selects the best one
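Best-of-N reduces to "generate N, keep the highest-scoring"; a minimal sketch with placeholder callables for the generator and reward model:

```python
def best_of_n(generate, reward_model, prompt, n=8):
    """Sample n candidate solutions for a prompt, then return the one
    the reward model scores highest. `generate` and `reward_model` are
    assumed callables standing in for the policy and the trained RM."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=reward_model)
```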
Reinforce++: A variant of the REINFORCE algorithm that eliminates the need for a separate value model during RL, using KL-divergence and rewards directly for advantage estimation
PPO: Proximal Policy Optimization—an RL algorithm that updates policies within a trust region to ensure stability
Pass@k: A metric measuring the fraction of problems for which at least one of k sampled solutions is correct
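Pass@k is usually computed with the standard unbiased estimator: from n samples of which c pass, it gives the probability that a random subset of k contains at least one correct solution (a sketch; this estimator is the conventional one, not stated in the glossary itself):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
    the chance that k samples drawn from n (with c correct)
    include at least one correct solution."""
    if n - c < k:
        # Every size-k subset must contain a correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```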
SFT: Supervised Fine-Tuning—training a model on labeled (question, code) pairs
KL-divergence: A statistical measure of how one probability distribution differs from a second, reference probability distribution
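For discrete distributions, the definition above reduces to a short sum; a minimal sketch (in nats, over probability vectors):

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) = sum_x p(x) * log(p(x) / q(x)).

    Zero when the distributions match; grows as P places mass
    where Q does not. Terms with p(x) = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

In RL fine-tuning, this quantity is typically used as a penalty keeping the updated policy close to the reference model.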