
Lost in Backpropagation: The LM Head is a Gradient Bottleneck

Nathan Godey, Yoav Artzi
arXiv (2026)
Pretraining Benchmark

📝 Paper Summary

Language Model Optimization · Neural Network Training Dynamics
The dimension mismatch between hidden states and vocabulary size creates a gradient bottleneck that destroys over 95% of the backpropagation signal, severely hampering optimization efficiency independent of model expressivity.
Core Problem
The standard language model output layer projects a small hidden dimension D to a large vocabulary size V, forcing high-dimensional gradients to be compressed through a low-rank linear mapping during backpropagation.
Why it matters:
  • Backpropagating V-dimensional gradients through a rank-D layer induces unavoidable lossy compression, altering feedback for the vast majority of parameters
  • This bottleneck persists regardless of backbone architecture (Transformer, RNN, SSM), potentially causing training inefficiencies at scale
  • Prior work focused on the 'softmax bottleneck' only as a limit on output expressivity (probability ranks), ignoring its destructive effect on optimization dynamics
Concrete Example: In a synthetic language task where patterns are trivial but the vocabulary is large, a model with a standard low-rank LM head fails to learn the patterns because the gradient update to the logits is constrained to rank 2D, while the optimal update is high-rank (near rank V).
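The rank-2D constraint above can be checked directly: with logits = H Wᵀ, a first-order update perturbs H and W by rank-≤D terms, so the logit change is a sum of two rank-≤D matrices. A minimal NumPy sketch with toy sizes (all values assumed for illustration, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, V = 64, 8, 256  # sequence positions, hidden dim, vocab size (toy sizes)

H = rng.standard_normal((N, D))   # hidden states
W = rng.standard_normal((V, D))   # LM head weights
logits = H @ W.T                  # (N, V)

# One optimizer step perturbs H and W; to first order the resulting change in
# the logits is dH @ W.T + H @ dW.T, a sum of two matrices of rank <= D each,
# hence rank <= 2D overall -- far below min(N, V) when V >> D.
dH = 1e-3 * rng.standard_normal((N, D))
dW = 1e-3 * rng.standard_normal((V, D))
delta_logits = dH @ W.T + H @ dW.T

print(np.linalg.matrix_rank(delta_logits))  # <= 2*D = 16
```

Generic random perturbations hit the bound exactly, so the update can only move the logits within a 2D-dimensional slice of the near-V-rank directions the task may demand.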
Key Novelty
The LM Head as a Gradient Bottleneck
  • Reformulates the 'softmax bottleneck' problem: instead of just limiting what probability distributions the model can represent (expressivity), it limits the rank of the gradient update applied during training (optimization)
  • Demonstrates theoretically that the gradient update matrix is constrained to rank 2D, whereas the ideal logit gradient is full-rank (near V), leading to massive signal loss via the Eckart-Young-Mirsky theorem
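The Eckart–Young–Mirsky theorem referenced above says the closest rank-k matrix in Frobenius norm is the truncated SVD, which gives a way to quantify the signal loss: how much of a generic full-rank gradient survives the best possible rank-D cut. A hedged sketch with assumed toy sizes (not the paper's measurements):

```python
import numpy as np

rng = np.random.default_rng(0)
V, N, D = 256, 64, 8  # vocab size, positions, hidden dim (toy sizes)

G = rng.standard_normal((V, N))  # a generic, effectively full-rank gradient
U, s, Vt = np.linalg.svd(G, full_matrices=False)

# Eckart-Young-Mirsky: keeping only the top-D singular values yields the
# best rank-D approximation of G in Frobenius norm.
G_lowrank = (U[:, :D] * s[:D]) @ Vt[:D]

retained = np.linalg.norm(G_lowrank) / np.linalg.norm(G)
print(f"fraction of gradient norm surviving a rank-{D} cut: {retained:.2f}")
```

Even this best-case rank-D approximation discards a large share of the gradient's norm; an actual LM head is not even guaranteed to realize the optimal truncation.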
Evaluation Highlights
  • Empirical measurements show the output layer suppresses 95-99% of the gradient norm during backpropagation
  • The bottleneck reduces LLM training efficiency by up to 16x for the same backbone, due to suboptimal update directions
  • Strong bottlenecks make trivial patterns in synthetic languages unlearnable, even when the model is theoretically expressive enough to solve them
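The norm-suppression figures above have a simple geometric source: the backbone only receives Wᵀg, so any component of the V-dimensional logit gradient g outside the D-dimensional column space of W is discarded. A toy sketch of that projection loss (sizes and random values assumed for illustration, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)
D, V = 8, 256  # hidden dim, vocab size (toy sizes)

W = rng.standard_normal((V, D))  # LM head weights
g = rng.standard_normal(V)       # gradient of the loss w.r.t. the logits

# The backbone only ever sees W.T @ g (a D-vector); the component of g
# orthogonal to the D-dimensional column space of W is lost entirely.
Q, _ = np.linalg.qr(W)           # orthonormal basis for span(W)
g_seen = Q @ (Q.T @ g)           # projection of g onto span(W)

suppressed = 1 - np.linalg.norm(g_seen) / np.linalg.norm(g)
print(f"fraction of gradient norm suppressed: {suppressed:.2f}")
```

For a random gradient, the expected surviving fraction of squared norm is only D/V, which is why the suppression grows with vocabulary size regardless of how expressive the backbone is.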
Breakthrough Assessment
9/10
Identifies a fundamental, overlooked flaw in the standard design of almost all modern language models. Shifting the view of the softmax bottleneck from expressivity to optimization could spur major architectural changes.