
Marginals Before Conditionals

Mihir Sahasrabudhe
School of Information Sciences, University of Illinois Urbana-Champaign
arXiv (2026)
Tags: Pretraining · Factuality · Reasoning

📝 Paper Summary

Topics: Training Dynamics · Mechanistic Interpretability · Phase Transitions in Learning
Neural networks learn simple marginal statistics before complex conditional relationships, getting stuck in a plateau where gradient noise stabilizes the suboptimal solution until a collective transition occurs.
Core Problem
Neural networks often exhibit delayed generalization, learning simple heuristics (marginals) long before grasping the full task structure (conditionals), but the mechanisms controlling this delay are poorly understood.
Why it matters:
  • Models often fail to exploit information present in their inputs (as seen in the reversal curse), getting stuck on partial solutions
  • Understanding these delays is crucial for explaining 'grokking' and other sudden capabilities that emerge only after long training plateaus
  • Current optimization theories do not fully explain why models linger on saddle points that have clear escape directions
Concrete Example: In a task where a base string B maps ambiguously to K different targets A, while a selector token z resolves which target is correct, the model first learns the ambiguous marginal mapping B->A (high loss), ignoring z entirely. It plateaus there for thousands of steps before suddenly exploiting z to predict the exact target.
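A minimal sketch of a dataset with this structure, assuming a simple tokenization (the names `B`, `z`, `A`, and `K` follow the summary; the exact construction is an illustrative guess, not the paper's code):

```python
import random

def make_dataset(num_bases=100, K=4, seed=0):
    """Build an ambiguous-selector task: each base B has K candidate
    targets, and the selector token z picks exactly one of them."""
    rng = random.Random(seed)
    data = []
    for b in range(num_bases):
        targets = [f"A{b}_{k}" for k in range(K)]
        for z in range(K):
            # Input: (base token, selector token); label: resolved target.
            data.append(((f"B{b}", f"z{z}"), targets[z]))
    rng.shuffle(data)
    return data

data = make_dataset(num_bases=2, K=3)
# Predicting from B alone leaves a K-way ambiguity (the marginal solution);
# conditioning on z as well makes the mapping deterministic (the conditional
# solution the model eventually snaps to).
```

Training a small model on such pairs would let one observe the marginal plateau (uniform probability over the K targets) and the later conditional transition directly.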
Key Novelty
Entropic Stabilization of Marginal Plateaus
  • Constructs a minimal 'wind tunnel' task separating marginal learning (easy, ambiguous) from conditional learning (hard, exact) to isolate the transition mechanism
  • Demonstrates that gradient noise acts as an entropic force that stabilizes the suboptimal marginal solution, actively opposing the model's escape to the conditional solution
  • Identifies a 'Type 2' directional asymmetry: the model represents the conditional logic internally (via a routing head) long before the loss metric reflects it
Evaluation Highlights
  • Plateau duration scales with dataset size D (power law ~D^1.19), not ambiguity K, showing the bottleneck is optimization work rather than task complexity
  • Increasing learning rate from 3e-4 to 2e-3 delays the transition by 3.6x (token-normalized), confirming noise stabilizes the plateau
  • Internal 'selector-routing' heads form and lead the loss transition by ~50% of the waiting time, showing hidden progress before the sudden snap
Breakthrough Assessment
8/10
Provides a clean, controlled experimental setup that isolates a fundamental learning dynamic. The evidence for entropic stabilization (noise delaying learning) is counter-intuitive and theoretically significant.