ALiBi: Attention with Linear Biases—a positional encoding scheme that penalizes each attention score in proportion to the query–key distance, using a static, head-specific slope
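A minimal sketch of the ALiBi bias (function names here are illustrative, not from any library). For a power-of-two head count, the slopes form a geometric sequence starting at 2^(-8/n), and the bias added to each score is minus the slope times the query–key distance:

```python
import math

def alibi_slopes(n_heads):
    # Geometric sequence of head-specific slopes: 2^(-8/n), 2^(-16/n), ...
    # (the scheme ALiBi uses for power-of-two head counts)
    start = 2.0 ** (-8.0 / n_heads)
    return [start ** (i + 1) for i in range(n_heads)]

def alibi_bias(slope, seq_len):
    # Static bias added to causal attention scores: -slope * (query - key),
    # so keys farther back are penalized linearly.
    return [[-slope * (q - k) for k in range(q + 1)] for q in range(seq_len)]
```

Because the bias depends only on relative distance, no learned positional embeddings are needed.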
BOS sink: An attention head that directs the vast majority of its attention mass to the Beginning-of-Sequence (BOS) token, often rendering the head functionally useless for context processing
BOS mass: The fraction of a head's total attention weight assigned to the token at position 0 (the BOS token)
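BOS mass reduces to reading off column 0 of a head's attention rows and averaging over queries. A minimal sketch with hypothetical helper names:

```python
def head_bos_mass(attn):
    # attn: one head's attention rows, one per query position;
    # each row is a distribution over key positions and sums to 1.
    # BOS mass = average weight placed on key position 0.
    return sum(row[0] for row in attn) / len(attn)
```

A head whose BOS mass approaches 1.0 is behaving as a BOS sink.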
Gradient masking: A technique where gradients for specific parameters are zeroed out during backpropagation, effectively freezing those weights while allowing others to train
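A minimal, framework-agnostic sketch of gradient masking (the SGD step and mask layout here are illustrative assumptions): the gradient is multiplied elementwise by a 0/1 mask before the update, so masked parameters receive no update while the rest train normally.

```python
def masked_sgd_step(params, grads, mask, lr=0.1):
    # mask[i] == 0 freezes params[i] by zeroing its gradient;
    # mask[i] == 1 lets it update as usual.
    return [p - lr * g * m for p, g, m in zip(params, grads, mask)]
```

In a framework like PyTorch the same effect is typically achieved by zeroing `.grad` tensors (or via hooks) between backward and optimizer steps.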
Entropy: Shannon entropy measured on the attention distribution; low entropy indicates the head is focusing on very few tokens (often just the BOS)
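The entropy of an attention row can be computed directly from its definition; this sketch (in nats) assumes the row is a valid probability distribution:

```python
import math

def attention_entropy(attn_row):
    # Shannon entropy of one query's attention distribution:
    # 0 when all mass sits on a single token (e.g. a BOS sink),
    # log(len(attn_row)) when attention is uniform.
    return -sum(p * math.log(p) for p in attn_row if p > 0)
```

Near-zero entropy across many queries is thus a practical signal that a head has collapsed onto one token.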
Xavier normal initialization: A method of initializing neural network weights with random values drawn from a normal distribution scaled by the layer size, used here to reset collapsed heads
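A stdlib-only sketch of Xavier (Glorot) normal initialization; the function name is illustrative. The standard deviation is sqrt(2 / (fan_in + fan_out)), which keeps activation variance roughly constant across layers:

```python
import math
import random

def xavier_normal(fan_in, fan_out, seed=0):
    # Draw weights from N(0, std^2) with std = sqrt(2 / (fan_in + fan_out)).
    rng = random.Random(seed)
    std = math.sqrt(2.0 / (fan_in + fan_out))
    return [[rng.gauss(0.0, std) for _ in range(fan_out)] for _ in range(fan_in)]
```

Resetting a collapsed head means overwriting its projection weights with fresh draws like these.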
PPL: Perplexity—a metric measuring how well a probability model predicts a sample; lower values indicate better prediction
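Perplexity follows directly from per-token log-probabilities: it is the exponential of the mean negative log-likelihood. A minimal sketch:

```python
import math

def perplexity(token_log_probs):
    # token_log_probs: natural-log probability the model assigned to each
    # observed token. PPL = exp(mean negative log-likelihood).
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)
```

Intuitively, a PPL of k means the model is, on average, as uncertain as if it were choosing uniformly among k tokens.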