Policy Entropy: A measure of randomness in the policy's action selection; high entropy means high uncertainty/exploration, low entropy means high confidence
Policy Gradient: An RL algorithm that updates the policy parameters in the direction of higher expected reward
GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by normalizing rewards across a group of outputs for the same prompt
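A minimal sketch of the group-relative advantage estimate described above: rewards for a group of outputs sampled for the same prompt are normalized by the group's mean and standard deviation. The function name is illustrative, not from any particular library.

```python
from statistics import mean, stdev

def group_relative_advantages(rewards, eps=1e-8):
    """Advantage of each output = (reward - group mean) / group std."""
    mu = mean(rewards)
    sigma = stdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: four sampled answers to one prompt, reward 1 if correct else 0.
# Correct answers get positive advantages, incorrect ones negative,
# and the advantages sum to zero across the group.
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```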
RLOO: REINFORCE Leave-One-Out—a policy-gradient variant whose baseline for each sampled output is the mean reward of the other outputs sampled for the same prompt
PRIME: Process Reinforcement through Implicit Rewards—an RL method mentioned as a baseline
PPO: Proximal Policy Optimization—an RL algorithm that constrains policy updates to ensure stability
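The constraint PPO places on policy updates can be illustrated with its clipped surrogate objective (per token/action), where `ratio` is the new policy's probability of the action divided by the old policy's. This is a sketch of the standard formula, not any specific implementation.

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A).

    Clipping removes the incentive to move the policy ratio far
    outside [1 - eps, 1 + eps], keeping updates small and stable.
    """
    clipped_ratio = max(min(ratio, 1 + eps), 1 - eps)
    return min(ratio * advantage, clipped_ratio * advantage)

# Inside the trust region the objective is just ratio * advantage;
# outside it, the gain is capped at (1 + eps) * advantage.
inside = ppo_clip_objective(ratio=1.0, advantage=1.0)
capped = ppo_clip_objective(ratio=2.0, advantage=1.0)
```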
Scaling Laws: Empirical relationships describing how model performance scales with parameters, data, or compute
KL Divergence: A statistical distance measuring how one probability distribution differs from a second, reference distribution
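For discrete distributions the definition above is a short formula; note that KL divergence is asymmetric and always non-negative, reaching zero only when the two distributions coincide.

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) = sum_i p_i * log(p_i / q_i), in nats.

    Asymmetric: KL(P || Q) != KL(Q || P) in general. Terms with
    p_i == 0 contribute nothing, so they are skipped.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]
kl_pq = kl_divergence(p, q)  # positive: the distributions differ
kl_pp = kl_divergence(p, p)  # zero: a distribution vs. itself
```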
Covariance: In this context, the statistical relationship between an action's probability and the change in its logit (which, under policy gradient, is proportional to its advantage)
Logit: A raw, unnormalized score output by the final layer of a neural network before the softmax is applied
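The Logit and Policy Entropy entries connect directly: logits pass through softmax to become action probabilities, whose Shannon entropy measures the policy's uncertainty. A minimal sketch (standard formulas, no particular framework):

```python
import math

def softmax(logits):
    """Map raw logits to a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats: maximal for uniform, zero for one-hot."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Equal logits -> uniform distribution -> high entropy (exploration);
# one dominant logit -> peaked distribution -> low entropy (confidence).
h_uniform = entropy(softmax([0.0, 0.0, 0.0]))
h_peaked = entropy(softmax([10.0, 0.0, 0.0]))
```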