SFT: Supervised Fine-Tuning—training a model to minimize loss on a labeled dataset of correct examples
RL-FT: Reinforcement Learning Fine-Tuning—optimizing a model to maximize a reward signal (e.g., correct answer) using algorithms like PPO
OOD: Out-of-Distribution—tasks or data variations not seen during fine-tuning; performance on them tests general reasoning capabilities
ID: In-Distribution—the specific task format used during fine-tuning; performance on it measures how well the fine-tuned task itself was learned
SVD: Singular Value Decomposition—a method to factorize a matrix into singular vectors (directions) and singular values (magnitudes/importance)
PPO: Proximal Policy Optimization—a reinforcement learning algorithm that updates the model policy while preventing drastic changes to maintain stability
Principal Angles: A geometric measure of how much two subspaces (e.g., defined by weight matrices) have rotated relative to each other
GeneralPoints: A benchmark card game task requiring arithmetic reasoning to reach a target number (like the 24 game)
Singular Vectors: The directional components in SVD (the columns of the U and V matrices) that define the output and input directions along which the weight matrix acts on data
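
To make the SVD, Singular Vectors, and Principal Angles entries concrete, here is a minimal NumPy sketch that measures how far a weight matrix's top singular subspace rotates after an update. The function name `principal_angles`, the choice of left singular vectors, the cutoff `k`, and the random matrices standing in for weights before and after fine-tuning are all illustrative assumptions, not a procedure taken from the source.

```python
import numpy as np

def principal_angles(w_a, w_b, k):
    """Principal angles (radians) between the top-k left singular subspaces."""
    # Thin SVD: columns of u_a / u_b are the left singular vectors,
    # ordered by decreasing singular value.
    u_a, _, _ = np.linalg.svd(w_a, full_matrices=False)
    u_b, _, _ = np.linalg.svd(w_b, full_matrices=False)
    # The cosines of the principal angles are the singular values of
    # the k-by-k matrix of inner products between the two bases.
    cosines = np.linalg.svd(u_a[:, :k].T @ u_b[:, :k], compute_uv=False)
    # Clip to guard against rounding slightly outside [0, 1].
    return np.arccos(np.clip(cosines, 0.0, 1.0))

# Hypothetical stand-ins for a weight matrix before and after fine-tuning.
rng = np.random.default_rng(0)
w_before = rng.standard_normal((8, 6))
w_after = w_before + 0.1 * rng.standard_normal((8, 6))

angles_same = principal_angles(w_before, w_before, k=3)
angles_shift = principal_angles(w_before, w_after, k=3)
print("unchanged weights:", angles_same)
print("perturbed weights:", angles_shift)
```

Angles near 0 mean the subspace kept its orientation; angles near pi/2 mean the two subspaces are close to orthogonal. Rescaling a matrix changes its singular values but not its singular vectors, so the angles are unaffected by a pure change in magnitude.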