RLVR: Reinforcement Learning with Verifiable Rewards—training LLMs on tasks with clear success criteria (e.g., math, code) using RL
GRPO: Group Relative Policy Optimization—an RL algorithm that estimates baselines using the average reward of a group of outputs rather than a separate critic model
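The group-relative baseline in this definition can be sketched as follows; this is a minimal illustration of the idea (mean-centering, optionally std-normalizing, a group of rewards), not the full GRPO objective:

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages: each sample's reward minus the mean
    reward of its group, normalized by the group std (a common variant).
    The group mean replaces the separate learned critic used in PPO."""
    r = np.asarray(rewards, dtype=float)
    baseline = r.mean()          # baseline = average reward of the group
    scale = r.std() + 1e-8       # small epsilon avoids division by zero
    return (r - baseline) / scale

# Four rollouts for one prompt, binary verifiable rewards:
adv = grpo_advantages([1.0, 0.0, 1.0, 0.0])
```

Advantages are zero-mean by construction, so successful rollouts get positive advantage and failed ones negative, with no critic network needed.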
Generalist Value Model (V0): A pre-trained model that estimates the expected score of a prompt via in-context learning, without needing gradient updates during RL
Shrinkage Estimator: A statistical method that combines two estimates (here, prior and empirical mean) to minimize total error (MSE)
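A precision-weighted combination is the standard way to realize this; the sketch below (names and interface are our own) blends a prior estimate with the empirical mean of n samples, weighting each by its inverse variance, which minimizes MSE when both estimates are unbiased and independent:

```python
def shrink(prior_mean, prior_var, sample_mean, sample_var, n):
    """Shrinkage estimate: precision-weighted blend of a prior and the
    empirical mean of n samples. Illustrative sketch only."""
    se2 = sample_var / n               # variance of the empirical mean
    w = prior_var / (prior_var + se2)  # weight on the empirical data
    return w * sample_mean + (1 - w) * prior_mean
```

With few samples (large se2) the estimate shrinks toward the prior; as n grows, the empirical mean dominates.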
OSLA: One-Step-Look-Ahead—a decision strategy that calculates whether the expected benefit of taking one more sample outweighs the cost
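The decision rule can be sketched as a simple cost-benefit comparison. All parameter names here are illustrative assumptions: we price the expected reduction in estimator variance from one more i.i.d. sample against a fixed per-sample cost:

```python
def take_another_sample(sample_var, n, cost_per_sample, value_per_unit_var):
    """One-step-look-ahead: sample again only if the expected drop in
    the variance of the mean estimate, priced at value_per_unit_var,
    exceeds the cost of one more rollout. Illustrative sketch."""
    var_now = sample_var / n            # variance of current mean estimate
    var_next = sample_var / (n + 1)     # variance after one more sample
    expected_gain = value_per_unit_var * (var_now - var_next)
    return expected_gain > cost_per_sample
```

Because var_now - var_next shrinks like 1/(n(n+1)), the rule naturally stops sampling once additional rollouts are no longer worth their compute.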
Sparse Rollouts: Generating very few samples (e.g., 4) per prompt to save compute, at the cost of high variance in the per-prompt reward estimate
Hallucination (in Value Models): When the value model confidently predicts an incorrect expected return due to out-of-distribution inputs
PPO: Proximal Policy Optimization—standard RL algorithm using a clipped objective and a separate learned value function (critic)
MSE: Mean Squared Error—a measure of the quality of an estimator, combining both its variance (noise) and bias (systematic error)
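The bias-variance decomposition named in this entry, MSE = bias^2 + variance, can be checked numerically with a deliberately biased, noisy estimator (the constants below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
true_value = 1.0
# A biased (+0.5) and noisy (std 2.0) estimator, sampled many times:
estimates = true_value + 0.5 + rng.normal(0.0, 2.0, size=100_000)

bias = estimates.mean() - true_value        # systematic error
variance = estimates.var()                  # noise (population variance)
mse = np.mean((estimates - true_value) ** 2)
# Identity: mse == bias**2 + variance (exactly, up to float precision)
```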