IFPruning: Instruction-Following Pruning—the proposed method, in which a predictor selects which model parameters to activate based on the input prompt
SoftTopK: A differentiable operator that approximates the Top-K selection, allowing gradients to flow back to the mask predictor during training
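One common way to relax Top-K selection is to replace the hard indicator with a sigmoid around a threshold between the k-th and (k+1)-th largest scores; as the temperature shrinks, the soft mask approaches the hard one while remaining differentiable. This is a hypothetical sketch of such an operator, not necessarily the exact SoftTopK used in the paper:

```python
import numpy as np

def soft_topk(scores, k, temperature=0.1):
    """Smooth approximation of a binary top-k mask over `scores`.

    Illustrative relaxation: a sigmoid centered at the midpoint between
    the k-th and (k+1)-th largest scores. As temperature -> 0 this
    approaches the hard top-k indicator, but gradients w.r.t. `scores`
    stay well-defined everywhere.
    """
    sorted_scores = np.sort(scores)[::-1]
    threshold = (sorted_scores[k - 1] + sorted_scores[k]) / 2.0
    return 1.0 / (1.0 + np.exp(-(scores - threshold) / temperature))

scores = np.array([3.0, 1.0, 0.5, 2.5, -1.0])
mask = soft_topk(scores, k=2, temperature=0.05)
# entries for the two largest scores are near 1, the rest near 0
```

At train time the soft mask multiplies the pruned units so gradients reach the mask predictor; at inference it can be replaced by exact (hard) Top-K.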
Structured Pruning: Removing entire structural components (like neurons or channels) rather than individual weights, leading to actual speedups on hardware
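The hardware benefit comes from the fact that dropping whole neurons lets you physically shrink the weight matrices, rather than storing scattered zeros. A minimal sketch for an FFN (matrix names and shapes are illustrative, not the paper's notation):

```python
import numpy as np

def prune_ffn_neurons(W_in, W_out, keep_idx):
    """Structured pruning of an FFN layer: drop entire hidden neurons.

    W_in:  (d_model, d_ff) up-projection
    W_out: (d_ff, d_model) down-projection
    Keeping only the columns of W_in and rows of W_out in `keep_idx`
    yields genuinely smaller matmuls, so the speedup is real on
    standard hardware, unlike unstructured per-weight sparsity.
    """
    return W_in[:, keep_idx], W_out[keep_idx, :]

rng = np.random.default_rng(0)
W_in, W_out = rng.normal(size=(8, 32)), rng.normal(size=(32, 8))
keep = np.arange(16)  # keep half the hidden neurons
W_in_p, W_out_p = prune_ffn_neurons(W_in, W_out, keep)
# W_in_p: (8, 16), W_out_p: (16, 8)
```

The pruned forward pass is exactly equivalent to running the full layer with the dropped neurons masked to zero, which is why mask-based training transfers to physically pruned inference.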
Contextual Sparsity: The phenomenon where different inputs utilize different sub-networks within a model
FFN: Feed-Forward Network—the dense layers within a Transformer block, often the target of pruning due to their size
HardConcrete: A relaxation of discrete distributions used in prior work (like Sheared LLaMA) to make binary masks differentiable
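For contrast with SoftTopK, here is a sketch of HardConcrete sampling as introduced by Louizos et al., using the commonly cited constants; the gate can be exactly 0 or 1 (thanks to the stretch-and-clip) while staying differentiable with respect to its logit almost everywhere:

```python
import numpy as np

def hard_concrete_sample(log_alpha, beta=2.0 / 3.0, gamma=-0.1, zeta=1.1,
                         rng=None):
    """Sample relaxed binary gates from the HardConcrete distribution.

    log_alpha: per-gate logits (learned parameters).
    A logistic-noise sigmoid is stretched to (gamma, zeta) and clipped
    to [0, 1], so samples hit exactly 0 or 1 with nonzero probability
    yet remain differentiable w.r.t. log_alpha almost everywhere.
    """
    if rng is None:
        rng = np.random.default_rng()
    u = rng.uniform(1e-6, 1 - 1e-6, size=np.shape(log_alpha))
    s = 1.0 / (1.0 + np.exp(-(np.log(u) - np.log(1 - u) + log_alpha) / beta))
    return np.clip(s * (zeta - gamma) + gamma, 0.0, 1.0)
```

A large positive `log_alpha` drives the gate toward 1 (keep the unit), a large negative one toward 0 (prune it); the key practical difference from a Top-K-style operator is that HardConcrete does not directly enforce a fixed sparsity budget per input.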
SFT: Supervised Fine-Tuning
Time-to-first-token: The latency between sending a request and receiving the first generated token
Per-task pruning: Generating a single mask for a specific task definition and reusing it for all inputs of that task type