
HTMuon: Improving Muon via Heavy-Tailed Spectral Correction

Tianyu Pang, Yujie Fang, Zihang Liu, Shenyang Deng, Lei Hsiung, Shuhua Yu, Yaoqing Yang
Dartmouth College, Microsoft, International Computer Science Institute, University of California, Berkeley, Meta
arXiv (2026)
Pretraining Benchmark

📝 Paper Summary

LLM Pretraining · Optimization · Matrix-based Optimizers · Heavy-Tailed Self-Regularization
HTMuon improves the Muon optimizer by raising singular values of the momentum matrix to a power p < 1, inducing heavy-tailed weight spectra associated with better generalization.
Core Problem
The Muon optimizer's orthogonalization step sets all singular values to one, which suppresses heavy-tailed weight spectra (correlated with better generalization) and over-emphasizes noise-dominated directions.
Why it matters:
  • Muon's performance often degrades with model scale and training duration compared to vector-based optimizers like Adam
  • HT-SR (Heavy-Tailed Self-Regularization) theory suggests well-trained networks naturally exhibit heavy-tailed weight spectra; suppressing this limits model quality
  • Uniformly weighting all singular-vector directions gives full weight even to noise-dominated directions with small singular values, potentially hurting convergence stability
Concrete Example: In Muon, if a singular value of the momentum matrix is very small (a noise-dominated direction), the orthogonalization step boosts it to 1.0, giving noise the same weight as signal. HTMuon instead raises it to the power p, which leaves it well below the leading singular values, damping noise while preserving a heavy-tailed spectrum.
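To see this numerically, here is a minimal NumPy sketch (illustrative, not from the paper; the toy matrix and the choice p = 0.5 are assumptions) contrasting Muon-style orthogonalization with a power-p spectral correction:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy "momentum" matrix: one strong signal direction plus small noise.
M = np.outer(rng.standard_normal(8), rng.standard_normal(8))
M += 0.01 * rng.standard_normal((8, 8))

U, s, Vt = np.linalg.svd(M, full_matrices=False)
print("momentum singular values:", np.round(s, 3))

# Muon: orthogonalize, i.e. replace every singular value with 1 (update = U V^T).
muon_update = U @ Vt

# HTMuon-style correction: raise singular values to a power 0 < p < 1.
p = 0.5  # illustrative value, not the paper's tuned setting
ht_update = U @ np.diag(s**p) @ Vt

print("Muon spectrum:  ", np.round(np.linalg.svd(muon_update, compute_uv=False), 3))
print("HTMuon spectrum:", np.round(np.linalg.svd(ht_update, compute_uv=False), 3))
# Muon lifts even the tiny, noise-dominated singular values to 1.0;
# the power-p correction keeps them well below the leading ones.
```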
Key Novelty
Heavy-Tailed Spectral Correction for Matrix Optimizers
  • Modifies Muon's update rule by raising the momentum's singular values to a power p (0 < p < 1) instead of setting them all to 1 (orthogonalization); see the sketch after this list
  • Acts as a bridge between Muon (p=0) and SGDM (p=1), preserving matrix-based parameter coupling while allowing heavy-tailed updates
  • Theoretically corresponds to steepest descent under a Schatten-q norm constraint rather than Muon's Schatten-infinity norm
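Putting the pieces together: writing the momentum matrix as M = U Σ Vᵀ, the update direction becomes U Σᵖ Vᵀ. Below is a minimal PyTorch-style sketch of one such step (the function name ht_muon_step, the heavy-ball momentum form, and the hyperparameter defaults are assumptions for illustration, not the paper's Algorithm 3 verbatim):

```python
import torch

@torch.no_grad()
def ht_muon_step(weight: torch.Tensor, grad: torch.Tensor,
                 momentum: torch.Tensor, lr: float = 0.02,
                 beta: float = 0.95, p: float = 0.5) -> None:
    """One hypothetical HTMuon step for a 2-D weight matrix.

    p -> 0 recovers Muon (every singular value -> 1, update = U V^T);
    p = 1 leaves the momentum spectrum untouched (SGD with momentum).
    """
    # Heavy-ball momentum accumulation.
    momentum.mul_(beta).add_(grad)

    # Spectral correction: raise the momentum's singular values to the
    # power p instead of orthogonalizing (setting them all to 1).
    U, s, Vh = torch.linalg.svd(momentum, full_matrices=False)
    update = U @ torch.diag(s.pow(p)) @ Vh

    weight.add_(update, alpha=-lr)
```

Note that production Muon implementations typically approximate U Vᵀ with Newton-Schulz iterations rather than an explicit SVD; how HTMuon computes Σᵖ efficiently is a detail of Algorithm 3 not reproduced here.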
Architecture
Figure (Algorithm 3): the core logic of the HTMuon optimizer.
Evaluation Highlights
  • Reduces perplexity by 0.98 compared to Muon when pretraining LLaMA-135M on the C4 dataset
  • Outperforms state-of-the-art optimizers (AdamW, SOAP, MARS, COSMOS) across LLaMA sizes (60M, 135M, 350M, 1B)
  • Improves accuracy on ImageNet-1K with ViT-Tiny (+0.14% vs Muon) and CIFAR-100 with ResNet50 (+0.31% vs Muon)
Breakthrough Assessment
8/10
Strong empirical results that consistently outperform Muon and AdamW across scales. Theoretical grounding in HT-SR and Schatten norms provides a solid justification for the modification.