Best-of-N: A strategy where a model generates N candidate solutions, and a separate mechanism (verifier/reward model) selects the best one
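As a minimal sketch of Best-of-N with unit tests acting as the verifier (the candidate strings, the toy problem, and the `score` helper below are illustrative, not from the source):

```python
from typing import List, Tuple

# Hypothetical candidates for a toy problem ("return x squared").
# In practice these would be N samples drawn from a code model.
candidates: List[str] = [
    "def f(x): return x + x",   # wrong
    "def f(x): return x * x",   # correct
    "def f(x): return x ** 3",  # wrong
]

# Unit tests: (input, expected output) pairs.
unit_tests: List[Tuple[int, int]] = [(2, 4), (3, 9), (0, 0)]

def score(candidate: str) -> int:
    """Verifier: count how many unit tests the candidate passes."""
    namespace: dict = {}
    try:
        exec(candidate, namespace)
        f = namespace["f"]
        return sum(1 for x, y in unit_tests if f(x) == y)
    except Exception:
        return 0  # candidates that crash score zero

# Best-of-N selection: keep the candidate with the highest verifier score.
best = max(candidates, key=score)
print(best)
```

With a reward model instead of unit tests, only `score` changes; the selection step stays the same.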
Unit Test: An input paired with its expected output, used to verify that a piece of code behaves correctly
SFT: Supervised Fine-Tuning—training a model on a specific dataset to adapt it for a particular task
Pass@1: The percentage of problems where the model's single generated solution is correct
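Concretely, Pass@1 is just the fraction of problems whose single sample passes (the counts below are made up for illustration):

```python
# One generated solution per problem; True if it passed all unit tests.
# Toy numbers: 124 of 200 problems solved on the first sample.
results = [True] * 124 + [False] * 76

pass_at_1 = 100.0 * sum(results) / len(results)
print(f"Pass@1 = {pass_at_1:.1f}%")
```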
Probe: A lightweight classifier trained on the internal hidden states of a model to predict a specific property (here, problem difficulty)
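A linear probe can be as simple as logistic regression on hidden-state vectors. The sketch below uses synthetic stand-ins for hidden states and difficulty labels (the data, dimensions, and training loop are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for a model's hidden states: 64-dim vectors with a
# synthetic "difficulty" label determined by one linear direction.
H = rng.normal(size=(500, 64))
w_true = rng.normal(size=64)
y = (H @ w_true > 0).astype(float)  # 1 = hard, 0 = easy (synthetic)

# Linear probe: logistic regression trained by gradient descent.
w = np.zeros(64)
b = 0.0
lr = 0.5
for _ in range(300):
    p = 1 / (1 + np.exp(-(H @ w + b)))   # predicted P(hard)
    w -= lr * H.T @ (p - y) / len(y)     # gradient step on weights
    b -= lr * np.mean(p - y)             # gradient step on bias

acc = np.mean((1 / (1 + np.exp(-(H @ w + b))) > 0.5) == y)
print(f"train accuracy: {acc:.2f}")
```

The probe is "lightweight" in exactly this sense: a single linear layer on top of frozen hidden states, trained without touching the model's weights.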
Greedy Algorithm: An optimization strategy that makes the locally optimal choice at each step (here, allocating budget to the problem where it yields the highest expected reward increase)
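The greedy allocation can be sketched as follows. Assuming a perfect verifier, a problem with per-sample solve probability p has expected reward 1 - (1-p)^n after n samples, so one extra sample adds (1-p)^n * p; the greedy loop repeatedly spends one unit of budget where that marginal gain is largest (the problem names and probabilities below are hypothetical):

```python
import heapq

# Estimated per-sample solve probability for each problem
# (e.g., from a difficulty probe); toy numbers for illustration.
p = {"prob_A": 0.8, "prob_B": 0.3, "prob_C": 0.05}
budget = 10  # total candidate generations to distribute

alloc = {k: 0 for k in p}

def marginal_gain(k: str) -> float:
    # Expected reward of n samples is 1 - (1-p)^n under a perfect
    # verifier, so the (n+1)-th sample adds (1-p)^n * p.
    return (1 - p[k]) ** alloc[k] * p[k]

# Max-heap over problems, keyed on negated marginal gain.
heap = [(-marginal_gain(k), k) for k in p]
heapq.heapify(heap)

for _ in range(budget):
    _, k = heapq.heappop(heap)
    alloc[k] += 1  # spend one unit where the expected gain is largest
    heapq.heappush(heap, (-marginal_gain(k), k))

print(alloc)
```

Note how the diminishing returns of (1-p)^n naturally shift budget away from easy problems (already nearly solved) and hopeless ones (gain stays tiny) toward mid-difficulty problems.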
Test-time computation: Spending more computational resources during inference (e.g., generating more candidates or tests) to improve performance
HumanEval Plus: A rigorously enhanced version of the HumanEval code generation benchmark with more comprehensive test cases to prevent false positives