
Marginals Before Conditionals

Mihir Sahasrabudhe
School of Information Sciences, University of Illinois Urbana-Champaign
arXiv (2026)
Tags: Pretraining · Factuality · Reasoning

📝 Paper Summary

Topics: Training Dynamics · Mechanistic Interpretability · Phase Transitions in Learning
Neural networks learn simple marginal statistics before complex conditional relationships, getting stuck in a plateau where gradient noise stabilizes the suboptimal solution until a collective transition occurs.
Core Problem
Neural networks often exhibit delayed generalization, learning simple heuristics (marginals) long before grasping the full task structure (conditionals), but the mechanisms controlling this delay are poorly understood.
Why it matters:
  • Models often fail to exploit information present in their inputs (as seen in the reversal curse), getting stuck on partial solutions
  • Understanding these delays is crucial for explaining 'grokking' and other sudden capabilities that emerge only after long training plateaus
  • Current optimization theories do not fully explain why models linger on saddle points that have clear escape directions
Concrete Example: In a task where a base string B maps ambiguously to K different targets A, while a selector token z resolves which target is correct, the model first learns the ambiguous marginal mapping B->A (high loss), ignoring z entirely. It plateaus there for thousands of steps before suddenly exploiting z to predict the exact target.
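A minimal sketch of a dataset with this structure, assuming a simple tokenization (the names `B`, `z`, `A`, and `K` follow the summary; the exact construction is an illustrative guess, not the paper's code):

```python
import random

def make_dataset(num_bases=100, K=4, seed=0):
    """Build an ambiguous-selector task: each base B has K candidate
    targets, and the selector token z picks exactly one of them."""
    rng = random.Random(seed)
    data = []
    for b in range(num_bases):
        targets = [f"A{b}_{k}" for k in range(K)]
        for z in range(K):
            # Input: (base token, selector token); label: resolved target.
            data.append(((f"B{b}", f"z{z}"), targets[z]))
    rng.shuffle(data)
    return data

data = make_dataset(num_bases=2, K=3)
# Predicting from B alone leaves a K-way ambiguity (the marginal solution);
# conditioning on z as well makes the mapping deterministic (the conditional
# solution the model eventually snaps to).
```

Training a small model on such pairs would let one observe the marginal plateau (uniform probability over the K targets) and the later conditional transition directly.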
Key Novelty
Entropic Stabilization of Marginal Plateaus
  • Constructs a minimal 'wind tunnel' task separating marginal learning (easy, ambiguous) from conditional learning (hard, exact) to isolate the transition mechanism
  • Demonstrates that gradient noise acts as an entropic force that stabilizes the suboptimal marginal solution, actively opposing the model's escape to the conditional solution
  • Identifies a 'Type 2' directional asymmetry: the model represents the conditional logic internally (via a routing head) long before the loss metric reflects it
Evaluation Highlights
  • Plateau duration scales with dataset size D (power law ~D^1.19), not ambiguity K, showing the bottleneck is optimization work rather than task complexity
  • Increasing learning rate from 3e-4 to 2e-3 delays the transition by 3.6x (token-normalized), confirming noise stabilizes the plateau
  • Internal 'selector-routing' heads form and lead the loss transition by ~50% of the waiting time, showing hidden progress before the sudden snap
Breakthrough Assessment
8/10
Provides a clean, controlled experimental setup that isolates a fundamental learning dynamic. The evidence for entropic stabilization (noise delaying learning) is counter-intuitive and theoretically significant.