Structured Pruning: Removing entire architectural components (layers, heads, neurons) rather than individual weights, resulting in a smaller dense model that runs faster on standard hardware
Knowledge Distillation: A training process where a small 'student' model learns to mimic the output distribution of a larger 'teacher' model (the probabilities obtained by applying softmax to the teacher's logits)
Logits: The raw, unnormalized prediction scores generated by a neural network before applying the softmax function
GRPO: Group Relative Policy Optimization—a reinforcement learning algorithm that scores each sampled response relative to a group of responses to the same prompt, removing the need for a separate value model; used to align the model's behavior with human preferences
DPO: Direct Preference Optimization—a stable method for fine-tuning LLMs on preference pairs (better/worse outputs) without a separate reward model
Block Influence: A metric used to determine which Transformer layers can be removed; layers that transform the input the least (high cosine similarity between input/output) are pruned
SFT: Supervised Fine-Tuning—training the model on high-quality instruction-response pairs
KL Divergence: Kullback-Leibler Divergence—a statistical measure quantifying how much one probability distribution (student) differs from another (teacher)
H200: NVIDIA H200 Tensor Core GPU—high-performance hardware used for the training/distillation process in this paper
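To make the distillation-related entries concrete, here is a minimal sketch of how a student's output distribution can be compared against a teacher's using KL divergence over temperature-softened softmax probabilities. The logits, temperature value, and vocabulary size are illustrative assumptions, not values from the paper.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Turn raw logits into probabilities; a higher temperature softens the distribution."""
    z = logits / temperature
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q): how much the student distribution q diverges from the teacher distribution p."""
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

# Hypothetical logits over a tiny 4-token vocabulary (illustrative only).
teacher_logits = np.array([2.0, 1.0, 0.5, -1.0])
student_logits = np.array([1.5, 1.2, 0.3, -0.5])

T = 2.0  # assumed distillation temperature
p_teacher = softmax(teacher_logits, T)
q_student = softmax(student_logits, T)

distill_loss = kl_divergence(p_teacher, q_student)
```

In practice this KL term is the distillation objective the student minimizes, often combined with a standard cross-entropy loss on the ground-truth labels.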
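The Block Influence entry can likewise be sketched in a few lines: a layer whose output hidden state is nearly identical to its input (cosine similarity near 1) has low influence and is a pruning candidate. The function name and the toy vectors below are assumptions for illustration.

```python
import numpy as np

def block_influence(hidden_in, hidden_out, eps=1e-12):
    """1 - cosine similarity between a layer's input and output hidden states.
    Values near 0 mean the layer barely transforms its input (prune candidate)."""
    cos = np.dot(hidden_in, hidden_out) / (
        np.linalg.norm(hidden_in) * np.linalg.norm(hidden_out) + eps
    )
    return 1.0 - cos

# A near-identity layer: output almost equals input -> influence near 0.
x = np.array([1.0, 2.0, 3.0])
low_influence = block_influence(x, x * 1.001)

# A transformative layer: orthogonal output -> influence near 1.
high_influence = block_influence(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
```

Ranking all Transformer layers by this score and dropping the lowest-scoring ones yields the structured pruning described above.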