RLVR: Reinforcement Learning with Verifiable Rewards—enhancing reasoning by training models to maximize rewards checked by a verifier (e.g., code execution)
GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing multiple outputs for the same prompt against their group mean
verifier-independent: Training settings where no ground-truth checker (like a math solver or unit test) is available to score the model's outputs
intrinsic confidence: A metric derived from the model itself (specifically, normalized negative entropy) indicating how certain it is about its generation
curriculum learning: Training strategy starting with easy examples and gradually increasing difficulty
action variance: Variance in the gradient estimator arising from the stochasticity of the policy's actions (sampling different tokens)
problem variance: Variance in the gradient estimator arising from the diversity of prompts (different difficulty levels)
bias-variance trade-off: The balance between introducing a systematic error (bias) to reduce random noise (variance) in estimation; VI-CuRL accepts bias early on to lower variance
SFT: Supervised Fine-Tuning—initial training phase on labeled data before RL
KL regularization: Kullback-Leibler divergence penalty used to keep the RL policy from drifting too far from the reference model
stop-gradient: Operation preventing backpropagation through specific variables; used here for the curriculum weights and masks
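Two of the quantities above are simple enough to sketch directly: the GRPO advantage (reward minus group mean, scaled by group standard deviation) and intrinsic confidence (normalized negative entropy mapped to [0, 1]). This is an illustrative sketch, not the paper's implementation; the function names and the epsilon in the std normalization are my assumptions.

```python
import math

def group_relative_advantages(rewards):
    """GRPO-style advantage estimate: each sampled output for the same
    prompt is scored against the group mean, scaled by the group std.
    (The epsilon guard is an assumed detail, not from the paper.)"""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + 1e-8) for r in rewards]

def intrinsic_confidence(token_probs):
    """Normalized negative entropy of a token distribution, mapped so that
    1.0 = fully certain (zero entropy) and 0.0 = uniform (max entropy)."""
    entropy = -sum(p * math.log(p) for p in token_probs if p > 0)
    max_entropy = math.log(len(token_probs))
    return 1.0 - entropy / max_entropy

# Binary verifier rewards for 4 sampled completions of one prompt:
# correct completions get positive advantage, incorrect get negative.
advs = group_relative_advantages([1.0, 0.0, 1.0, 0.0])

# A peaked distribution yields confidence near 1, a uniform one near 0.
c_sure = intrinsic_confidence([1.0, 0.0, 0.0, 0.0])
c_unsure = intrinsic_confidence([0.25, 0.25, 0.25, 0.25])
```

In a verifier-independent setting, a signal like `intrinsic_confidence` can stand in for the verifier reward fed to `group_relative_advantages`; in that case a stop-gradient would be applied to the confidence values so the reward signal does not itself receive policy gradients.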