RLVR: Reinforcement Learning from Verifiable Rewards—using binary pass/fail signals from code execution as rewards.
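A minimal sketch of such a verifiable reward, assuming the candidate solution and its unit tests are plain Python strings (real systems run this in a sandbox; `verifiable_reward` and the toy `add` examples are illustrative, not from the source):

```python
def verifiable_reward(solution_code: str, test_code: str) -> float:
    """Return 1.0 if the solution passes all tests, else 0.0."""
    namespace: dict = {}
    try:
        exec(solution_code, namespace)  # define the candidate's functions
        exec(test_code, namespace)      # assertions raise on failure
        return 1.0
    except Exception:
        return 0.0

good = "def add(a, b):\n    return a + b"
bad = "def add(a, b):\n    return a - b"
tests = "assert add(2, 3) == 5"
print(verifiable_reward(good, tests))  # 1.0
print(verifiable_reward(bad, tests))   # 0.0
```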
Best-of-N (BoN): A test-time scaling strategy where N solutions are generated and the best one is selected based on a ranking method.
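The selection step can be sketched as an argmax over N samples; here `generate` and `score` are hypothetical stand-ins for a sampled model call and a ranking method such as a reward model:

```python
def best_of_n(generate, score, n: int):
    """Draw n candidates and keep the one the scorer ranks highest."""
    candidates = [generate() for _ in range(n)]
    return max(candidates, key=score)

# Toy example: "generation" yields numbers; the scorer prefers larger ones.
vals = iter([3, 9, 1])
print(best_of_n(lambda: next(vals), lambda x: x, n=3))  # 9
```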
GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes advantages within a group of samples for the same input to reduce variance.
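The group normalization can be sketched numerically: for one prompt, each sampled completion's reward is centered by the group mean and scaled by the group standard deviation (a common formulation; the helper name is mine):

```python
def group_relative_advantages(rewards, eps=1e-8):
    """Normalize a group's rewards to zero-mean, unit-variance advantages."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Binary pass/fail rewards for 4 samples of the same prompt:
advs = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
# mean 0.5, std 0.5 → advantages ≈ [1, -1, 1, -1]
```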
Bradley-Terry loss: A loss function used to train reward models by maximizing the likelihood of the preferred response having a higher score than the rejected one.
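In its standard pairwise form this is -log σ(s_chosen − s_rejected), minimized when the preferred response scores higher; a small sketch:

```python
import math

def bradley_terry_loss(score_chosen: float, score_rejected: float) -> float:
    """Negative log-likelihood that the chosen response beats the rejected one."""
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A larger margin in favor of the chosen response gives a smaller loss:
print(bradley_terry_loss(2.0, 0.0))  # ~0.127
print(bradley_terry_loss(0.0, 2.0))  # ~2.127
```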
Reward Hacking: When an RL policy exploits flaws in the reward model to get high scores without actually improving task performance.
AST: Abstract Syntax Tree—a tree representation of the syntactic structure of source code, used here to verify code validity.
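A validity check of this kind can be as simple as attempting a parse, since `ast.parse` succeeds only on syntactically valid Python (the helper name is illustrative):

```python
import ast

def is_valid_python(source: str) -> bool:
    """True iff the source parses into an AST without a SyntaxError."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False

print(is_valid_python("def f(x): return x + 1"))  # True
print(is_valid_python("def f(x) return x + 1"))   # False (missing colon)
```

Note this checks syntax only, not whether the code runs or is correct.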
On-policy rollouts: Data generated by the current version of the policy model during training, as opposed to static offline data.
Test-Time Scaling (TTS): Techniques applied during inference (like generating multiple samples) to improve performance without retraining the model.