Tracing the Representation Geometry of Language Models from Pretraining to Post-training

📝 Paper Summary

LLM Representation Learning, Mechanistic Interpretability, Training Dynamics

LLM representations evolve through three distinct geometric phases during pretraining—warmup collapse, entropy-seeking expansion, and compression-seeking consolidation—which correlate with specific capabilities like memorization and generalization.

Core Problem

Standard training metrics like loss decrease monotonically and fail to explain the qualitative shifts in capabilities and internal structure that occur during LLM training.

Why it matters:

Practitioners rely on loss curves that mask internal structural changes, making it hard to diagnose when specific capabilities (like reasoning vs. memorization) emerge
Understanding how high-dimensional representations evolve is crucial for optimizing training recipes to target specific behaviors (e.g., robustness vs. diversity)
Current post-training methods (RLVR, DPO) alter model behavior, but their impact on the underlying representational geometry remains largely unmapped

Concrete Example: During the 'warmup' phase, models may exhibit 'echolalia' (repetitive outputs) despite decreasing loss, because representations have collapsed onto a few dominant directions. Later, RLVR improves math scores but reduces generative diversity, a trade-off invisible to simple loss metrics but visible in geometric compression.

Key Novelty

Spectral Geometric Phase Analysis

Identifies a consistent three-phase evolution in representation geometry using spectral metrics (RankMe, alpha-ReQ): 'warmup' (collapse), 'entropy-seeking' (expansion/memorization), and 'compression-seeking' (consolidation/generalization)
Links these geometric phases to specific downstream capabilities: expansion correlates with n-gram memorization, while compression correlates with long-range dependency learning and reasoning
Demonstrates that post-training methods drive opposing geometric shifts: SFT and DPO expand the manifold ('entropy-seeking'), while RLVR contracts it ('compression-seeking')

Architecture

Conceptual diagram of the three geometric phases during pretraining: Warmup (Collapse), Entropy-Seeking (Expansion), and Compression-Seeking (Consolidation).

Evaluation Highlights

Observed a consistent three-phase geometric evolution across the OLMo (1B-7B) and Pythia (160M-12B) model families during pretraining
SFT on Anthropic-HH caused a monotonic increase in effective rank (RankMe), which coincided with a drop in win-rate against the Alpaca Farm reference from 14% to 9%, attributed to overfitting
RLVR training improved pass@16 accuracy but degraded pass@256 accuracy compared to the base model, directly tracking with a geometric contraction (compression-seeking)

Breakthrough Assessment

8/10

Provides a fundamental, mechanistic characterization of LLM learning dynamics that goes beyond loss curves. The identification of distinct geometric phases linking to specific capabilities is a significant theoretical advance.

⚙️ Technical Details

Problem Definition

Setting: Analysis of the covariance matrix of last-token representations in autoregressive language models

Inputs: Input sequence of discrete tokens s = (t_1, ..., t_N)

Outputs: Spectral metrics (Effective Rank, Eigenspectrum decay) derived from the feature covariance matrix

Pipeline Flow

Input Processing: Tokenize text sequence
Model Forward Pass: Compute activations through LLM layers
Feature Extraction: Extract last-token representation y_N
Spectral Analysis: Compute Covariance Matrix → Eigendecomposition → Calculate RankMe and alpha-ReQ
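
The spectral-analysis step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names (`covariance_eigenvalues`, `rankme`, `alpha_req`) are hypothetical, the effective rank is computed here from covariance eigenvalues (the original RankMe formulation uses singular values of the feature matrix), alpha is estimated with a plain least-squares fit in log-log space, and the synthetic Gaussian features stand in for real last-token representations.

```python
import numpy as np

def covariance_eigenvalues(features: np.ndarray) -> np.ndarray:
    """Eigenvalues (descending) of the feature covariance matrix.

    features: (num_sequences, hidden_dim), one last-token vector per row.
    """
    centered = features - features.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / features.shape[0]
    eigvals = np.linalg.eigvalsh(cov)[::-1]  # eigvalsh returns ascending order
    return np.clip(eigvals, 0.0, None)       # clamp tiny negatives from round-off

def rankme(eigvals: np.ndarray, eps: float = 1e-12) -> float:
    """Effective rank: exponential of the entropy of the normalized spectrum."""
    p = eigvals / (eigvals.sum() + eps)
    p = p[p > 0]
    return float(np.exp(-(p * np.log(p)).sum()))

def alpha_req(eigvals: np.ndarray, min_eig: float = 1e-10) -> float:
    """Power-law decay rate: negative slope of log(eigenvalue) vs. log(rank index)."""
    kept = eigvals[eigvals > min_eig]
    ranks = np.arange(1, kept.size + 1)
    slope, _ = np.polyfit(np.log(ranks), np.log(kept), 1)
    return float(-slope)

# Synthetic stand-ins for extracted features (assumption: no real model is loaded).
rng = np.random.default_rng(0)
isotropic = rng.standard_normal((2000, 32))
# Per-dimension std decays as 1/i, so variance decays roughly as i^-2.
anisotropic = isotropic * np.arange(1, 33, dtype=float) ** -1.0

for name, feats in [("isotropic", isotropic), ("anisotropic", anisotropic)]:
    eig = covariance_eigenvalues(feats)
    print(name, "RankMe:", round(rankme(eig), 2), "alpha:", round(alpha_req(eig), 2))
```

In this toy setting the isotropic features give a RankMe close to the hidden dimension and a small alpha, while the anisotropic features give a much lower RankMe and a larger alpha, which is the contrast the two metrics are meant to capture.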

System Modules

Feature Extractor (Analysis Pipeline)

Extract high-dimensional vector representations of the last token from the final layer (or intermediate layers)

Model or implementation: Target LLM (e.g., OLMo, Pythia)

Spectral Analyzer (Analysis Pipeline)

Compute geometric metrics from feature covariance

Model or implementation: Analytical formulas (RankMe, alpha-ReQ)

Novel Architectural Elements

Application of spectral analysis (RankMe, alpha-ReQ) specifically to trace the *temporal evolution* of LLM geometry across distinct training stages (Pretraining, SFT, DPO, RLVR)

Modeling

Base Model: OLMo-2 (1B, 7B), Pythia (160M-12B), Tülu-3.1 (Llama-3.1-8B base)

Training Method: Analysis of existing checkpoints from standard training runs (SFT, DPO, RLVR)

Objective Functions:

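Purpose: SFT standard objective.

Formally: Minimize the negative log-likelihood of target responses

Purpose: DPO preference alignment.

Formally: Minimize $\mathcal{L}_{\mathrm{DPO}} = -\mathbb{E}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]$, where $y_w$ is the preferred and $y_l$ the dispreferred response to prompt $x$

Purpose: RLVR reward maximization.

Formally: Maximize $J(\theta) = \mathbb{E}\left[\sum_t \gamma^t R_t\right]$, where $R_t$ is a verifiable reward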
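
To make the DPO objective above concrete, here is a small sketch of the per-batch loss computed from sequence log-probabilities. It follows the standard DPO formulation and is not taken from the paper's code. The function and argument names are hypothetical, and `beta=0.1` is only an illustrative default.

```python
import math

def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Mean DPO loss over a batch.

    Each argument is a list of summed sequence log-probabilities, one entry
    per preference pair (chosen = preferred, rejected = dispreferred).
    """
    losses = []
    for pc, pr, rc, rr in zip(policy_chosen_logp, policy_rejected_logp,
                              ref_chosen_logp, ref_rejected_logp):
        # Difference of policy-vs-reference log-ratios for the two responses.
        margin = (pc - rc) - (pr - rr)
        losses.append(-log_sigmoid(beta * margin))
    return sum(losses) / len(losses)

# When the policy equals the reference, the margin is zero and the loss is log(2).
baseline = dpo_loss([-5.0], [-7.0], [-5.0], [-7.0])
# Raising the chosen response's log-probability relative to the reference lowers the loss.
improved = dpo_loss([-4.0], [-8.0], [-5.0], [-7.0])
```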

Adaptation: Full fine-tuning (implied by context of standard recipes like Tülu)

Trainable Parameters: All parameters (standard for the analyzed suites)

Training Data:

Pretraining: FineWeb (for analysis), Pile (Pythia)
SFT/DPO/RLVR: Tülu-3.1 recipe datasets, Anthropic-HH, Alpaca Farm

Key Hyperparameters:

pass@k sample count (N): 512
pass@k maximum k: 256
RLVR training steps: 2400 (observed in Figure 6C)
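
For context on how a sample count of 512 and k up to 256 fit together, the sketch below uses the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n is the number of samples drawn per problem and c the number that are correct. Whether the paper uses exactly this estimator is an assumption. The function name `pass_at_k` and the example counts are illustrative.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of the probability that at least one of k samples is correct.

    n: total samples drawn for the problem
    c: number of correct samples among the n
    k: evaluation budget, with 1 <= k <= n
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist,
    # in which case every size-k subset contains a correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative only: one problem with a single correct answer among 512 samples.
rare_success_small_k = pass_at_k(512, 1, 16)
rare_success_large_k = pass_at_k(512, 1, 256)
```

A problem that is solved only rarely contributes little at small k but substantially at large k, which is why pass@256 is sensitive to lost diversity in a way that small-k scores are not.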

Compute: Not reported in the paper

Comparison to Prior Work

vs. Loss curves: Identifies non-monotonic phases (collapse, expansion, compression) invisible to loss metrics
vs. Downstream eval: Disentangles memorization (entropy-seeking) from generalization (compression-seeking) before task performance saturates

Limitations

Analysis relies on publicly available checkpoints, limiting granularity to the saving frequency of those models
Links between geometry and performance are correlational or derived from toy models; causality is not established in large-scale runs
Focuses primarily on last-token representations; although the phases are consistent across layers, other token positions or mechanisms may show different dynamics
Toy model analysis assumes linear models and specific data conditions (skewed frequencies, bottleneck) to replicate phases

Reproducibility

Analyzes publicly available model checkpoints (OLMo, Pythia, Tülu-3.1). The text does not state whether code is available. The methodology relies on standard spectral analysis metrics.

📊 Experiments & Results

Evaluation Setup

Spectral analysis of model checkpoints on validation data sequences

Benchmarks:

FineWeb (Pretraining data text sequences)
TriviaQA (Factual knowledge / Memorization probe)
Anthropic-HH (Preference dataset for SFT/DPO analysis)
Alpaca Farm (OOD Instruction following / Win-rate evaluation)
AMC-23 (Math problems for RLVR pass@k evaluation)

Metrics:

RankMe (Effective Rank)
alpha-ReQ (Eigenspectrum decay rate)
Distributional Memorization (Spearman correlation with ∞-gram probabilities)
pass@k
Win-rate (AlpacaEval)
Statistical methodology: Not explicitly reported in the paper
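
The distributional memorization metric listed above rests on a Spearman rank correlation. The sketch below shows that correlation in plain Python, with tied values given average ranks. The per-token log-probabilities are made-up illustrative numbers, and how the paper aggregates tokens or sequences before correlating is not specified here.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed inputs."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical log-probabilities assigned to the same tokens by the LLM
# and by an n-gram model built from the pretraining corpus.
model_logp = [-0.2, -1.5, -3.0, -0.7, -2.2]
ngram_logp = [-0.1, -1.9, -2.5, -0.9, -4.0]
rho = spearman(model_logp, ngram_logp)
```

A value of `rho` near 1 would indicate that the model ranks tokens much as corpus n-gram statistics do, which is the sense in which the metric tracks memorization.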

Key Results

Post-training experiments reveal divergent geometric trends: SFT/DPO expand the manifold (entropy-seeking) while RLVR contracts it (compression-seeking).

Benchmark	Metric	Baseline	After Post-training	Δ
Anthropic-HH (SFT)	Win-rate vs Alpaca Farm ref	0.14	0.09	-0.05
AMC-23 (RLVR)	pass@256	0.42	0.38	-0.04

Experiment Figures

Evolution of RankMe and alpha-ReQ metrics over pretraining tokens for OLMo-2 and Pythia models.

Main Takeaways

Pretraining follows three geometric phases: (1) Warmup (collapse), (2) Entropy-seeking (expansion, peak memorization), (3) Compression-seeking (consolidation, generalization).
Geometric phases are consistent across model scales (160M to 12B) and families (OLMo, Pythia).
The 'entropy-seeking' phase correlates strongly with n-gram memorization, while 'compression-seeking' aligns with reasoning improvements.
RLVR acts as a strong compressor, improving reward-specific metrics (pass@1) at the cost of diversity/exploration (pass@256), unlike the expansive nature of SFT and DPO.

📚 Prerequisite Knowledge

Prerequisites

Linear Algebra (Eigendecomposition, Covariance Matrices)
Language Model Training (Pretraining, SFT, DPO, RLVR)
Information Theory (Entropy)

Key Terms

RankMe: An effective-rank metric computed as the exponential of the entropy of a representation matrix's normalized singular values (a von Neumann-style entropy); higher values indicate higher-dimensional, more isotropic representations

alpha-ReQ: The power-law decay rate alpha of the eigenvalues of the representation covariance matrix, where the i-th largest eigenvalue scales as i^(-alpha); higher alpha indicates faster decay and more compressed, anisotropic representations

RLVR: Reinforcement Learning from Verifiable Rewards—optimizing a policy to maximize rewards based on objective verification (e.g., math correctness) rather than a learned reward model

DPO: Direct Preference Optimization—a method to align language models to preferences by optimizing the relative log-probability of preferred vs. dispreferred responses

SFT: Supervised Fine-Tuning—adapting a pretrained model using labeled instruction-response pairs via maximum likelihood estimation

pass@k: An evaluation metric that estimates the probability of generating at least one correct solution given k independent samples

distributional memorization: The correlation between an LLM's output probabilities and the n-gram frequencies in its pretraining corpus, measured using an ∞-gram (infini-gram) language model

echolalia: A failure mode where the model repeats inputs or generates repetitive, non-contextual text, observed during the initial representational collapse

isotropic: Uniformity in all directions; in this context, a representation space where variance is spread out across many dimensions (high RankMe)

anisotropic: Directionally dependent; in this context, a representation space where information is compressed along specific principal axes (low RankMe)