Pruning-aware pretraining: A method in which structural pruning is performed iteratively during pretraining on the large training corpus, rather than once after training using a small calibration set
Minimal parameter groups: The smallest units of the network (e.g., a specific attention head's coupled projections) that can be removed while maintaining a valid Transformer architecture
Saliency: A measure of the importance of a parameter or group of parameters to the model's loss, often estimated via gradients or Hessian matrices
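A minimal sketch of the gradient-based variant: first-order Taylor saliency scores a parameter group by |sum(w * dL/dw)|, the estimated loss change if the group were zeroed. The toy linear model, loss, and all variable names here are illustrative, not taken from any specific pruning implementation.

```python
import numpy as np

# Toy setup (assumed for illustration): linear model y = W @ x with
# squared-error loss L = 0.5 * ||W @ x - y_true||^2.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))        # 4 output units, 3 inputs
x = rng.normal(size=(3,))
y_true = rng.normal(size=(4,))

y = W @ x
grad = np.outer(y - y_true, x)     # analytic dL/dW for this loss

# Treat each output unit (a row of W) as one removable group.
# First-order Taylor saliency: |sum over the group of w * dL/dw|.
saliency = np.abs((W * grad).sum(axis=1))
least_important = int(np.argmin(saliency))   # candidate unit to prune first
```

Summing w * grad over a whole row scores the group jointly, which is what structured pruning needs; per-weight magnitudes alone would miss cancellation effects within the group.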
Hessian matrix: A matrix of second-order partial derivatives used to understand the curvature of the loss landscape; estimating it is key for accurate pruning but computationally expensive
Structured Pruning: Removing entire structural components (like neurons, heads, or channels) rather than individual weights, making the resulting model faster on standard hardware
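A small sketch of why structured pruning keeps standard hardware fast: dropping whole hidden neurons from a two-layer MLP removes matched rows of the first weight matrix and columns of the second, leaving smaller but still dense matrices. Shapes and the L1-norm selection rule here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 8, 16, 4
W1 = rng.normal(size=(d_hidden, d_in))
W2 = rng.normal(size=(d_out, d_hidden))

# Keep the top half of neurons by row L1 norm (a simple stand-in criterion).
keep = np.sort(np.argsort(np.abs(W1).sum(axis=1))[d_hidden // 2:])
W1_pruned = W1[keep]        # drop rows of W1 (the neurons' input weights)
W2_pruned = W2[:, keep]     # drop matching columns of W2 (their output weights)

x = rng.normal(size=(d_in,))
h = np.maximum(W1_pruned @ x, 0.0)   # ReLU on the smaller hidden layer
y = W2_pruned @ h                    # still a dense GEMM, just smaller
```

Unstructured pruning would instead scatter zeros through W1 and W2, which dense GEMM kernels cannot exploit without sparse-kernel support.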
GQA: Grouped-Query Attention, an attention mechanism in which several query heads share a single key-value head, shrinking the KV cache and reducing memory bandwidth at inference time
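The KV sharing can be sketched in a few lines: with n_q_heads query heads and n_kv_heads < n_q_heads key-value heads, query head h attends using KV head h // (n_q_heads // n_kv_heads). This is a bare single-sequence illustration (no masking, batching, or output projection), with assumed shapes.

```python
import numpy as np

def gqa(Q, K, V, n_kv_heads):
    """Q: (n_q_heads, seq, d); K, V: (n_kv_heads, seq, d)."""
    n_q_heads, seq, d = Q.shape
    group = n_q_heads // n_kv_heads
    out = np.empty_like(Q)
    for h in range(n_q_heads):
        kv = h // group                          # query heads in a group share one KV head
        scores = Q[h] @ K[kv].T / np.sqrt(d)
        scores -= scores.max(axis=-1, keepdims=True)   # numerically stable softmax
        attn = np.exp(scores)
        attn /= attn.sum(axis=-1, keepdims=True)
        out[h] = attn @ V[kv]
    return out

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 5, 8))   # 4 query heads
K = rng.normal(size=(2, 5, 8))   # only 2 KV heads to cache
V = rng.normal(size=(2, 5, 8))
out = gqa(Q, K, V, n_kv_heads=2)
```

With n_kv_heads = n_q_heads this reduces to standard multi-head attention, and with n_kv_heads = 1 to multi-query attention; the KV cache shrinks by the factor n_q_heads / n_kv_heads.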
Common Sense Reasoning: A category of benchmarks (like HellaSwag, ARC) that test a model's ability to use background knowledge to solve problems