SFT: Supervised Fine-Tuning—training a model to mimic expert demonstrations using cross-entropy loss
RLVR: Reinforcement Learning with Verifiable Rewards—an RL phase using ground-truth verifiers (e.g., math answers) to optimize reasoning
Gibbs distribution: A probability distribution in which the probability of a state is proportional to the exponential of its negative energy (equivalently, its reward) divided by temperature, i.e., p(x) ∝ exp(−E(x)/T) = exp(β·r(x))
GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes advantages within a group of sampled outputs for the same prompt
OOD: Out-of-Distribution—tasks or data significantly different from the training set, used to test generalization
KL divergence: Kullback-Leibler divergence—an asymmetric measure of how one probability distribution differs from a second, reference probability distribution
Inverse temperature (β): A parameter controlling the sharpness of a distribution; high β (low temperature) makes the distribution peaky (deterministic), while low β flattens it
Pass@k: An evaluation metric measuring the probability that at least one of the k generated solutions is correct
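The Gibbs distribution and inverse temperature entries can be made concrete with a small sketch: sampling weights proportional to exp(β·r) and observing how β controls sharpness. The function name `gibbs` is illustrative, not from the source.

```python
import math

def gibbs(rewards, beta):
    """Gibbs (softmax) distribution over states: p_i ∝ exp(beta * r_i)."""
    weights = [math.exp(beta * r) for r in rewards]
    z = sum(weights)  # partition function (normalizer)
    return [w / z for w in weights]

rewards = [1.0, 2.0, 3.0]
# Low beta (high temperature): nearly uniform distribution.
print(gibbs(rewards, beta=0.1))
# High beta (low temperature): almost all mass on the highest-reward state.
print(gibbs(rewards, beta=10.0))
```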
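GRPO's group-relative normalization can be sketched as follows: rewards for several sampled outputs of the same prompt are standardized against the group's own mean and standard deviation. This is a minimal illustration of the advantage computation only, not a full GRPO implementation; the function name is hypothetical.

```python
def group_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: A_i = (r_i - mean(r)) / (std(r) + eps),
    computed within one group of outputs sampled for the same prompt."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Two correct (reward 1) and two incorrect (reward 0) samples:
print(group_advantages([1.0, 0.0, 1.0, 0.0]))
```

Correct samples get positive advantages and incorrect ones negative, without training a separate value function.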
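The KL divergence entry corresponds to the standard discrete formula D_KL(P‖Q) = Σ p(x)·log(p(x)/q(x)), sketched here for finite distributions to show its asymmetry:

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) over a finite support; terms with p(x) = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]
# KL is asymmetric: D_KL(P||Q) differs from D_KL(Q||P).
print(kl_divergence(p, q))
print(kl_divergence(q, p))
```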
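Pass@k is commonly computed with the unbiased estimator 1 − C(n−c, k)/C(n, k), where n samples are drawn and c of them are correct (naively taking only k samples gives a high-variance estimate). A minimal sketch:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: probability that at least one of k
    solutions drawn from n samples (c of them correct) is correct."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples: some draw must succeed
    return 1.0 - comb(n - c, k) / comb(n, k)

# 1 correct out of 2 samples, drawing 1:
print(pass_at_k(2, 1, 1))
```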