SFT: Supervised Fine-Tuning—training a model to mimic specific target outputs (like reasoning traces) given inputs
RL: Reinforcement Learning—training a model to maximize a reward signal (e.g., correct answer) rather than just mimic text
GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes rewards within a group of sampled outputs to stabilize training
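The group-relative normalization at the heart of GRPO can be sketched in a few lines. This is an illustrative simplification, not the paper's implementation; real GRPO pipelines compute these advantages per token and combine them with a clipped policy-gradient objective:

```python
import statistics

def group_relative_advantages(rewards):
    """Turn raw rewards for a group of outputs sampled from the same prompt
    into advantages by subtracting the group mean and dividing by the group
    standard deviation (population std here; some implementations differ)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # All outputs scored the same: no learning signal from this group
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

# Four sampled answers to one prompt, scored 1 if correct, 0 otherwise
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Because the baseline is the group's own mean, correct answers get positive advantages and incorrect ones negative, without needing a separate learned value model.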
Transferability Index: A metric proposed in this paper to quantify how well improvements in one domain (math) translate to gains in others (coding, general QA)
PCA shift: A measure of how much the principal directions of a model's hidden-state representations rotate in feature space after training
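One simple way to quantify such a shift (a hypothetical sketch, not necessarily the paper's exact metric) is to compare the top principal direction of the hidden states before and after training:

```python
import numpy as np

def top_principal_direction(hidden_states):
    """First principal component of a (samples x features) activation matrix."""
    centered = hidden_states - hidden_states.mean(axis=0)
    # Rows of Vt are the principal directions, sorted by explained variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]

def pca_direction_shift(before, after):
    """1 - |cos(angle)| between top principal directions.
    0.0 means the dominant direction is unchanged; values near 1.0
    mean it has rotated to be nearly orthogonal."""
    v1 = top_principal_direction(before)
    v2 = top_principal_direction(after)
    return 1.0 - abs(float(np.dot(v1, v2)))
```

Feeding in hidden states collected on the same prompts before and after training gives a single scalar per layer that can be compared across SFT and RL runs.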
KL divergence: Kullback-Leibler divergence—an asymmetric statistical measure of how much one probability distribution (e.g., a model's token predictions) differs from a reference distribution
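For discrete distributions, KL(P || Q) = Σᵢ pᵢ log(pᵢ / qᵢ). A minimal sketch over a toy vocabulary:

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) for discrete distributions given as probability lists.
    Asymmetric: KL(P || Q) generally differs from KL(Q || P).
    Terms with p_i == 0 contribute nothing; q_i must be > 0 where p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Two next-token distributions over a 3-word vocabulary
p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]
divergence = kl_divergence(p, q)  # > 0; zero only when p == q
```

In RL fine-tuning, a KL term against the pre-trained reference model is often added to the reward to keep the policy's token predictions from drifting too far.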
CoT: Chain-of-Thought—a prompting method where the model generates intermediate reasoning steps before the final answer
On-policy: Training where the model learns from data generated by its current version (common in RL)
Off-policy: Training where the model learns from static data generated by a previous or different model (common in SFT)
OlympiadBench: A challenging benchmark consisting of Olympiad-level mathematics and physics problems