pass@k: The probability that at least one of k generated samples is correct
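A common way to estimate pass@k in practice (the source does not specify its estimator, so this is an assumption) is the unbiased combinatorial estimator popularized by the Codex evaluation setup: draw n ≥ k samples, count c correct, and compute 1 − C(n−c, k)/C(n, k). A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k from n samples, of which c are correct.

    Equals the probability that a random size-k subset of the n samples
    contains at least one correct sample.
    """
    if n - c < k:
        # Every size-k subset must include a correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with n = 2 samples and c = 1 correct, pass@1 is estimated at 0.5, since a randomly chosen single sample is correct half the time.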
GRPO: Group Relative Policy Optimization—an RL algorithm that updates policies based on the relative rewards of a group of samples for the same prompt
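The "relative rewards" in GRPO are typically computed by normalizing each sample's reward against the statistics of its own group (same prompt): subtract the group mean and divide by the group standard deviation. A minimal sketch of that advantage computation (the epsilon guard is an implementation detail, not from the source):

```python
def grpo_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Group-relative advantages: z-score each reward within its group.

    Samples that beat the group average get positive advantage and are
    reinforced; below-average samples get negative advantage.
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

Because the baseline is the group mean rather than a learned value function, no critic network is needed.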
elliptical bonus: A novelty score from the linear bandit/regression literature, computed from the inverse (regularized) covariance matrix of previously seen feature vectors; it is large when a new feature vector is poorly covered by past data and small when it is well covered
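One standard construction of such a bonus (an assumption here, drawn from the linear bandit literature rather than the source) maintains a regularized covariance matrix A = λI + Σ φφᵀ over seen feature vectors and scores a new vector φ as √(φᵀA⁻¹φ). A minimal sketch:

```python
import numpy as np

class EllipticalBonus:
    """Elliptical novelty bonus over feature vectors.

    Maintains A = lam * I + sum of outer products of seen features;
    bonus(phi) = sqrt(phi^T A^{-1} phi) shrinks as similar features recur.
    """

    def __init__(self, dim: int, lam: float = 1.0):
        self.A = lam * np.eye(dim)

    def bonus(self, phi) -> float:
        phi = np.asarray(phi, dtype=float)
        # Solve A x = phi instead of forming the inverse explicitly.
        return float(np.sqrt(phi @ np.linalg.solve(self.A, phi)))

    def update(self, phi) -> None:
        phi = np.asarray(phi, dtype=float)
        self.A += np.outer(phi, phi)
```

Repeatedly observing the same direction in feature space drives its bonus toward zero, while orthogonal (novel) directions keep a high score.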
verifier efficiency: A measure of how economically the verifier is used, i.e., how many verifier queries (samples checked) are needed to find a correct solution; higher efficiency means fewer queries
diversity collapse: A phenomenon where RL-trained models lose the ability to generate diverse responses, leading to worse performance when many samples are allowed (high k)
coreset: A small, representative subset of data points that approximates the properties of the full dataset
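One common way to build such a subset (an assumed illustration, not necessarily the source's method) is greedy k-center selection: repeatedly add the point farthest from everything already chosen, so the coreset spreads out to cover the dataset. A minimal sketch:

```python
import numpy as np

def kcenter_greedy(points, m: int) -> list[int]:
    """Greedy k-center coreset: indices of m well-spread points.

    Starts from point 0, then repeatedly picks the point with the largest
    distance to its nearest already-selected point.
    """
    pts = np.asarray(points, dtype=float)
    chosen = [0]
    # Distance of every point to the nearest selected point so far.
    dists = np.linalg.norm(pts - pts[0], axis=1)
    while len(chosen) < m:
        nxt = int(np.argmax(dists))
        chosen.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(pts - pts[nxt], axis=1))
    return chosen
```

Selecting on distances between hidden-state vectors, for instance, yields a small set of maximally dissimilar responses.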
hidden states: Internal vector representations of the input text within the neural network layers, before the final output generation
sharpening: The tendency of RL to simply increase the probability of already-known high-likelihood behaviors rather than discovering new ones