
Surgical Repair of Collapsed Attention Heads in ALiBi Transformers

Palmer Schallon
Independent researcher
arXiv (2026)
Pretraining Factuality Benchmark

📝 Paper Summary

Topics: Model Compression, Model Repair, Interpretability
Collapsed attention heads in BLOOM models, caused by ALiBi positional penalties, can be revived through surgical reinitialization and focused retraining rather than being pruned as redundant.
Core Problem
ALiBi positional encoding creates a systematic pathology where 31–44% of attention heads in BLOOM models collapse into 'BOS sinks' (attending solely to the first token) due to steep slope penalties.
Why it matters:
  • Standard compression techniques assume these collapsed heads are redundant and prune them, potentially discarding recoverable model capacity
  • The pathology scales predictably across model sizes (560M to 7.1B), indicating a fundamental architectural flaw in how ALiBi slopes interact with pretraining dynamics
  • Pretrained attention configurations often represent suboptimal local minima rather than necessary functional structures
Concrete Example: In a 16-head BLOOM model, head H15 receives an ALiBi slope of approximately 0.0039, a per-token distance penalty that makes attending to distant tokens unfavorable. Consequently, H15 learns to place roughly 99% of its attention mass on the Beginning-of-Sequence (BOS) token, becoming functionally inert.
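The quoted slope value can be reproduced from the standard ALiBi slope schedule (a geometric sequence in the head index); this is a minimal sketch assuming the usual power-of-two schedule and H0..H15 head indexing:

```python
# Standard ALiBi slope schedule: head i of n gets slope 2^(-8 * (i + 1) / n).
# The attention-logit bias for head i at token distance d is -slope[i] * d,
# so the penalty grows linearly with distance.

def alibi_slopes(n_heads: int) -> list:
    return [2 ** (-8 * (i + 1) / n_heads) for i in range(n_heads)]

slopes = alibi_slopes(16)
print(f"H15 slope: {slopes[15]:.6f}")  # 2^-8 = 0.003906, matching the ~0.0039 above
```

Under this schedule, H15's slope is exactly 2⁻⁸ ≈ 0.0039, consistent with the figure quoted in the example.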
Key Novelty
Surgical Reinitialization and Reoptimization
  • Identifies 'sick' heads using BOS-mass and attention-entropy metrics, then resets their Q/K/V weights to random initialization while zeroing their output projections, so the reset heads initially contribute nothing (avoiding an output shock)
  • Freezes all 'healthy' parameters via gradient masks and trains only the reinitialized heads, letting them escape the local minimum without damaging the rest of the model
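The two-step procedure can be sketched as follows. This is a hypothetical illustration, not the paper's code: the thresholds, weight shapes, and the decision to build only the output-projection gradient mask (the full method would mask the Q/K/V slices as well) are all assumptions made for brevity.

```python
import numpy as np

BOS_MASS_THRESHOLD = 0.9  # assumed cutoff: head is a 'BOS sink' above this
ENTROPY_THRESHOLD = 0.5   # assumed cutoff (nats): head is collapsed below this

def triage_heads(attn):
    """attn: (n_heads, seq, seq) attention probabilities, rows sum to 1."""
    bos_mass = attn[:, :, 0].mean(axis=1)                       # avg mass on BOS
    ent = -(attn * np.log(attn + 1e-12)).sum(-1).mean(-1)       # mean row entropy
    return (bos_mass > BOS_MASS_THRESHOLD) | (ent < ENTROPY_THRESHOLD)

def surgical_reset(qkv, out_proj, sick, head_dim, rng):
    """qkv: (d_model, n_heads*head_dim) per-head columns;
    out_proj: (n_heads*head_dim, d_model) per-head rows."""
    for h in np.flatnonzero(sick):
        sl = slice(h * head_dim, (h + 1) * head_dim)
        qkv[:, sl] = rng.normal(0.0, 0.02, qkv[:, sl].shape)  # random reinit
        out_proj[sl, :] = 0.0  # zeroed output: reset head starts silent
    # Gradient mask: 1 only on reset slices. During retraining, multiply
    # gradients elementwise by this mask so healthy parameters stay frozen.
    grad_mask = np.zeros_like(out_proj)
    for h in np.flatnonzero(sick):
        grad_mask[h * head_dim:(h + 1) * head_dim, :] = 1.0
    return grad_mask
```

A head that places nearly all its mass on token 0 trips the BOS-mass test (and, having near-zero entropy, the entropy test too), while a head with spread-out attention passes both.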
Evaluation Highlights
  • Recovers 98.7% of operational head capacity in BLOOM-1b7 (increasing healthy heads from 242 to 379 of 384)
  • Surgical model trained on C4 improves validation perplexity on C4 data by 9.6% compared to the stock model (29.30 vs 32.42)
  • Reinitializing healthy heads alongside collapsed ones transiently outperforms the stock model by 25% on training perplexity (12.70 vs 16.99), suggesting pretrained weights are suboptimal
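The relative-improvement percentages above follow directly from the quoted perplexities; a quick arithmetic check:

```python
# Relative perplexity improvement of the surgical model over stock, in percent.
def rel_improvement(stock_ppl: float, new_ppl: float) -> float:
    return (stock_ppl - new_ppl) / stock_ppl * 100

print(f"{rel_improvement(32.42, 29.30):.1f}%")  # validation: 9.6%
print(f"{rel_improvement(16.99, 12.70):.1f}%")  # training: ~25%, the figure above
```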
Breakthrough Assessment
8/10
Challenges the dominant 'pruning' paradigm by demonstrating that 'dead' heads are repairable and useful. The finding that even healthy heads sit in suboptimal local minima is significant for understanding training dynamics.