
Mousse: Rectifying the Geometry of Muon with Curvature-Aware Preconditioning

Yechen Zhang, Shuhao Xing, Junhao Huang, Kai Lv, Yunhua Zhou, Xipeng Qiu, Qipeng Guo, Kai Chen
Shanghai Jiao Tong University, Shanghai AI Laboratory, Fudan University
arXiv (2026)
Pretraining

📝 Paper Summary

Spectral Optimization · Second-Order Preconditioning · Large Language Model Training
Mousse improves the Muon optimizer by applying spectral updates within a whitened coordinate system derived from Shampoo's curvature statistics, aligning the update geometry with the neural network's anisotropic landscape.
Core Problem
The Muon optimizer implicitly assumes an isotropic (uniform) loss landscape: its orthogonalized update applies the same magnitude along every singular direction, ignoring the highly ill-conditioned, heavy-tailed curvature of deep neural networks.
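To make the fixed-magnitude behavior concrete, here is a minimal PyTorch sketch of Muon's Newton-Schulz orthogonalization. The quintic coefficients follow the widely circulated Muon reference implementation; the synthetic test matrix is illustrative only.

```python
import torch

def newton_schulz(G: torch.Tensor, steps: int = 10) -> torch.Tensor:
    # Quintic Newton-Schulz iteration used by Muon to approximately
    # orthogonalize the update matrix (coefficients from the public Muon
    # reference implementation; 10 steps for a tight band, Muon uses ~5).
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)          # Frobenius norm >= spectral norm
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

# Build a matrix whose singular values span two orders of magnitude.
U, _ = torch.linalg.qr(torch.randn(64, 64))
V, _ = torch.linalg.qr(torch.randn(64, 64))
G = U @ torch.diag(torch.logspace(-1, 1, 64)) @ V.T

sv = torch.linalg.svdvals(newton_schulz(G))
print(sv.min(), sv.max())  # both near 1: the 100x spread is erased
```

Whatever the conditioning of the input, the output's singular values land in a narrow band around 1, which is exactly the isotropy assumption described above.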
Why it matters:
  • Deep neural networks have vastly different curvature scales across dimensions; treating them equally risks instability in sharp directions and slow progress in flat directions
  • Standard spectral methods like Muon sacrifice sample efficiency by failing to adapt to local geometry, requiring more training steps to reach a given loss
  • Existing solutions either lack spectral constraints (Shampoo) or introduce high memory overhead (SOAP), leaving a gap for efficient, curvature-aware spectral optimization
Concrete Example: In a loss landscape where one direction is very sharp (high curvature) and another is very flat, standard Muon updates both with the same spectral magnitude, causing oscillations in the sharp direction while making little progress in the flat one. Mousse whitens ('spheres') the landscape first, so the spectral update moves appropriately in both.
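A diagonal caricature of this picture (illustrative only: Mousse operates on full matrices, but for a diagonal quadratic a unit-spectral-norm update degenerates to per-coordinate sign steps, and whitening to a 1/sqrt(curvature) rescaling; all numbers are synthetic):

```python
import torch

# Sharp direction (curvature 100) vs. flat direction (curvature 0.01),
# with the iterate starting far out along the flat direction.
h = torch.tensor([100.0, 0.01])
loss = lambda x: 0.5 * (h * x * x).sum()

lr = 0.1
x_fixed = torch.tensor([1.0, 100.0])   # fixed-magnitude ("Muon-like") steps
x_white = x_fixed.clone()              # steps taken in whitened coordinates

for _ in range(200):
    x_fixed = x_fixed - lr * torch.sign(h * x_fixed)             # same size everywhere
    x_white = x_white - lr * torch.sign(h * x_white) / h.sqrt()  # 1/sqrt(curvature) scale

print(loss(x_fixed), loss(x_white))  # roughly 32 vs. 0.01 on this toy
```

The fixed-magnitude iterate rings around the sharp axis while crawling along the flat one; the whitened iterate shrinks both coordinates at rates matched to their curvature.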
Key Novelty
Muon Optimization Utilizing Shampoo’s Structural Estimation (Mousse)
  • Combines spectral optimization (Muon) with second-order preconditioning (Shampoo) by performing the Newton-Schulz spectral update inside a coordinate system whitened by Kronecker-factored curvature statistics (see the sketch after this list)
  • Eliminates the need for Adam-style second moment states found in SOAP by relying on the spectral constraint for step size regulation, reducing memory overhead while retaining geometric adaptivity
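The summary specifies the mechanism but not the exact update rule, so the following is a minimal sketch of one plausible reading: accumulate Shampoo's Kronecker-factored statistics, whiten the gradient with their inverse fourth roots, run Muon's Newton-Schulz orthogonalization inside that whitened basis, and map the result back. The class name MousseSketch, the EMA coefficient beta, the damping eps, and the per-side -1/4 power are assumptions for illustration, not confirmed details of the paper.

```python
import torch

def newton_schulz(G, steps=10):
    # Muon's quintic Newton-Schulz orthogonalization (same as the earlier sketch).
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

class MousseSketch:
    """One plausible Mousse-style step for a single weight matrix (hypothetical)."""

    def __init__(self, m, n, beta=0.95, eps=1e-8):
        self.L = torch.zeros(m, m)   # Shampoo left factor: EMA of G @ G.T
        self.R = torch.zeros(n, n)   # Shampoo right factor: EMA of G.T @ G
        self.beta, self.eps = beta, eps

    def _inv_fourth_root(self, M):
        # Symmetric inverse fourth root via eigendecomposition; the damping
        # (eps) and the per-side -1/4 power are assumptions, not confirmed
        # details of the paper.
        d, Q = torch.linalg.eigh(M)
        return Q @ torch.diag(d.clamp_min(self.eps) ** -0.25) @ Q.T

    def step(self, G, lr=0.02):
        # 1. Accumulate Kronecker-factored curvature statistics (Shampoo).
        self.L = self.beta * self.L + (1 - self.beta) * (G @ G.T)
        self.R = self.beta * self.R + (1 - self.beta) * (G.T @ G)
        # 2. Whitening transforms; a real implementation would amortize
        #    these eigendecompositions across many steps.
        Wl = self._inv_fourth_root(self.L)
        Wr = self._inv_fourth_root(self.R)
        # 3. Muon's spectral update, performed in the whitened coordinates.
        O = newton_schulz(Wl @ G @ Wr)
        # 4. Map the orthogonalized direction back to parameter space. Note
        #    there is no Adam-style second-moment state anywhere: the spectral
        #    constraint in the whitened basis regulates the step size.
        return -lr * (Wl @ O @ Wr)

# Hypothetical usage on a stand-in gradient:
opt = MousseSketch(m=8, n=4)
W, G = torch.randn(8, 4), torch.randn(8, 4)
W = W + opt.step(G)
```

Because the Newton-Schulz step bounds the whitened update's singular values, no Adam-style second-moment buffer is needed, which is where the memory savings over SOAP would come from.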
Evaluation Highlights
  • Reduces training steps by ~12% to reach the same target loss as standard Muon on 800M-parameter models
  • Lowers final validation loss by 0.012 on an 800M-parameter model compared to the best Muon baseline
  • Incurs only ~3% wall-clock overhead relative to Muon, with substantially higher throughput than SOAP
Breakthrough Assessment
8/10
Offers a theoretically grounded unification of spectral and second-order methods that yields strict Pareto improvements in efficiency (lower loss, faster convergence) with negligible computational cost.