PG: Policy Gradient—an RL algorithm that optimizes a policy by following the gradient of expected reward
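To make the definition concrete, here is a minimal sketch of the score-function (REINFORCE) form of policy gradient on a toy two-armed bandit. The bandit, its rewards, and the learning rate are illustrative assumptions, not part of this glossary.

```python
import math
import random

def reinforce_bandit(steps=2000, lr=0.1, seed=0):
    """REINFORCE on a 2-armed bandit: arm 1 pays 1.0, arm 0 pays 0.2 (toy values)."""
    rng = random.Random(seed)
    theta = 0.0  # logit for choosing arm 1; policy is pi(1) = sigmoid(theta)
    for _ in range(steps):
        p1 = 1.0 / (1.0 + math.exp(-theta))
        action = 1 if rng.random() < p1 else 0
        reward = 1.0 if action == 1 else 0.2
        # Score function: d log pi(action) / d theta = action - p1.
        # Ascend the gradient of expected reward: theta += lr * reward * score.
        theta += lr * reward * (action - p1)
    return 1.0 / (1.0 + math.exp(-theta))

print(reinforce_bandit())  # probability of the better arm after training
```

After training, the policy concentrates on the higher-reward arm, illustrating how following the gradient of expected reward shifts probability mass toward better outcomes.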
Outcome Reward: Feedback provided only at the very end of a generated sequence (e.g., correct/incorrect final answer)
Process Reward: Feedback provided at intermediate steps (e.g., per-token or per-step correctness)
Likelihood Quantile: A proposed theoretical property of the base model that determines the sample complexity of post-training; essentially, how well the base distribution 'covers' the correct solution
Margin Condition: A geometric assumption stating that the correct class/token is separated from the others by a gap of at least γ in the feature space
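In symbols, a common way to write such a condition (the score notation here is assumed for illustration, not taken from this glossary): if $f_{y}(x)$ denotes the model's score for class $y$ on input $x$ and $y^{\star}$ is the correct class, the margin condition requires

```latex
f_{y^{\star}}(x) \;-\; \max_{y \neq y^{\star}} f_{y}(x) \;\ge\; \gamma
```

i.e., the correct class wins by at least the margin $\gamma$ rather than merely winning.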
SGD: Stochastic Gradient Descent—the standard optimization algorithm used for pre-training the base model
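As a toy sketch of the SGD update rule (not the actual pre-training setup), here is SGD minimizing a one-dimensional quadratic from noisy samples; the target value, noise scale, and step-size schedule are assumptions chosen for illustration.

```python
import random

def sgd_quadratic(steps=500, lr=0.1, seed=0):
    """Minimize E[(theta - x)^2] where x ~ target + Gaussian noise (toy example)."""
    rng = random.Random(seed)
    target = 3.0  # the true optimum (assumed for this sketch)
    theta = 0.0
    for t in range(steps):
        x = target + rng.gauss(0.0, 0.5)       # one stochastic sample
        grad = 2.0 * (theta - x)               # gradient of (theta - x)^2 at theta
        theta -= (lr / (1.0 + 0.01 * t)) * grad  # descend with a decaying step size
    return theta

print(sgd_quadratic())  # converges near the target
```

Each step follows the gradient of a single noisy sample rather than the full objective, which is what makes the method "stochastic" and cheap enough for large-scale training.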
Base Model Barrier: The theoretical finding that outcome-based RL cannot efficiently learn to produce solutions that have negligible probability under the base model
SFT: Supervised Fine-Tuning—training on labeled demonstrations
VC dimension: Vapnik–Chervonenkis dimension—a measure of the capacity or complexity of a space of functions