RLVR: Reinforcement Learning with Verifiable Rewards—training LLMs using outcome-based rewards from a verifier (usually a script or another model)
RLPR: Reinforcement Learning with Reference Probability Reward—the proposed framework using intrinsic token probabilities of reference answers as reward
PR: Probability-based Reward—the scalar reward computed as the mean of the probabilities the policy assigns to the tokens of the reference answer
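The PR definition above can be sketched in a few lines. The function name and the assumption that per-token probabilities have already been extracted from the policy are illustrative, not taken from the paper:

```python
import numpy as np

def probability_reward(ref_token_probs):
    """Probability-based Reward sketch: the mean of the probabilities
    the policy assigns to each token of the reference answer.
    ref_token_probs is assumed to be a precomputed list of per-token
    probabilities (e.g. gathered from the model's softmax outputs)."""
    return float(np.mean(ref_token_probs))

# A policy that is confident in the reference answer scores higher
# than one that is not.
confident = probability_reward([0.9, 0.95, 0.85])  # mean = 0.9
unsure = probability_reward([0.2, 0.1, 0.3])       # mean = 0.2
```

Because the reward is a mean over tokens rather than a binary verifier output, it is dense and defined for any free-form reference answer.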
GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes rewards within a group of samples for the same prompt to reduce variance
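The group normalization in GRPO can be sketched as follows; the epsilon term and exact normalization details are assumptions for numerical stability, not the paper's specification:

```python
import numpy as np

def group_normalized_advantages(rewards, eps=1e-6):
    """GRPO-style normalization sketch: for a group of rewards sampled
    for the same prompt, advantage_i = (r_i - mean) / (std + eps).
    Subtracting the group mean reduces variance without a learned critic."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Four responses to the same prompt: above-mean rewards yield
# positive advantages, below-mean rewards yield negative ones.
adv = group_normalized_advantages([0.2, 0.8, 0.5, 0.5])
```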
CoT: Chain-of-Thought—a reasoning technique where the model generates intermediate steps before the final answer
MMLU-Pro: A massive multitask language understanding benchmark designed to be more challenging and reasoning-intensive than standard MMLU
TheoremQA: A benchmark assessing the ability to apply theorems to solve complex science problems
Minerva: A benchmark dataset specifically for evaluating mathematical reasoning capabilities
standard deviation filtering: A technique that removes training prompts whose sampled rewards have too low a standard deviation, indicating the prompt is either trivially easy or impossibly hard for the current policy
exponential moving average: A running average that weights recent values more heavily than older ones, used here to smoothly update the filtering threshold over the course of training
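The threshold update can be sketched as a one-line EMA; the smoothing factor here is an illustrative choice, not the paper's reported value:

```python
def ema_update(threshold, observed, alpha=0.9):
    """Exponential moving average sketch for the filtering threshold:
    new_threshold = alpha * threshold + (1 - alpha) * observed.
    A large alpha keeps the threshold stable while still tracking
    the recently observed reward statistics."""
    return alpha * threshold + (1 - alpha) * observed

t = 0.5
t = ema_update(t, 1.0)  # moves slightly toward 1.0 -> 0.55
```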