Evaluation Setup
Setting: pretraining language models and vision encoders from scratch
Benchmarks:
- C4 (Colossal Clean Crawled Corpus): language modeling (LLaMA)
- OpenWebText: language modeling (GPT-2)
- ImageNet-1K: image classification (ViT)
- CIFAR-10/100: image classification (ResNet)
Metrics:
- Perplexity (PPL)
- Top-1 Accuracy
- Statistical methodology: not explicitly reported in the paper
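For reference, perplexity is the exponential of the mean per-token negative log-likelihood on held-out text. A minimal sketch (the helper name is ours, not the paper's):

```python
import numpy as np

def perplexity(token_nlls):
    """Perplexity = exp(mean per-token negative log-likelihood),
    using natural logs. Lower is better."""
    return float(np.exp(np.mean(token_nlls)))

# Sanity check: if every token gets probability 1/28,
# each NLL is log(28) and the perplexity is 28.
nlls = np.full(1000, np.log(28.0))
print(perplexity(nlls))  # approximately 28.0
```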
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| C4 (LLaMA-60M) | Perplexity | 28.80 | 27.88 | -0.92 |
| C4 (LLaMA-135M) | Perplexity | 22.23 | 21.25 | -0.98 |
| C4 (LLaMA-350M) | Perplexity | 16.81 | 16.79 | -0.02 |
| OpenWebText (GPT-2 Small) | Perplexity | 22.46 | 22.20 | -0.26 |
| CIFAR-100 (ResNet50) | Top-1 Accuracy (%) | 79.85 | 80.16 | +0.31 |
| ImageNet-1K (ViT-Tiny) | Top-1 Accuracy (%) | 71.02 | 71.16 | +0.14 |
| C4 (LLaMA-60M) | Perplexity | 28.17 | 27.88 | -0.29 |

(Lower is better for perplexity, higher for accuracy; every row above favors this paper's method.)
Main Takeaways
- HTMuon consistently outperforms Muon and AdamW across language and vision tasks, with gains particularly notable in the smaller LLaMA models (60M/135M).
- Analysis confirms that HTMuon produces weight matrices with lower power-law (PL) exponents, i.e. heavier-tailed spectra, than Muon, consistent with the HT-SR motivation.
- Approximations like HTMuon_NS (Newton-Schulz) and interval-based updates successfully reduce computational overhead while maintaining performance gains over Muon.
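The PL-exponent analysis behind the second takeaway can be approximated with a Hill-style fit to the tail of a layer's eigenspectrum. A minimal sketch, assuming the continuous power-law MLE applied to the top quarter of the eigenvalues of W^T W (the paper's exact fitting procedure may differ, and `hill_pl_exponent` is our name):

```python
import numpy as np

def hill_pl_exponent(W, k=None):
    """Estimate the power-law (PL) exponent alpha of the tail of the
    empirical spectral density of W^T W via the Hill / continuous
    power-law MLE on the top-k eigenvalues. In HT-SR analyses, lower
    alpha (heavier tail) indicates a more heavy-tailed layer.
    The default k (top quarter) is a simplifying assumption."""
    eigs = np.sort(np.linalg.svd(W, compute_uv=False) ** 2)[::-1]
    if k is None:
        k = max(2, len(eigs) // 4)
    tail = eigs[:k]
    # alpha = 1 + k / sum(log(lambda_i / lambda_min_of_tail))
    return 1.0 + k / np.sum(np.log(tail / tail[-1]))

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 128)) / np.sqrt(128)
print(hill_pl_exponent(W))
```

On a synthetic spectrum drawn from a classical Pareto with pdf exponent 3, the estimate recovers roughly 3, which is how such estimators are usually sanity-checked.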
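The Newton-Schulz approximation in the last takeaway replaces exact SVD-based orthogonalization of the update matrix with a few matrix multiplications. A sketch using the classic cubic iteration (Muon-family optimizers use tuned variants, e.g. a quintic polynomial; this is illustrative, not the paper's exact HTMuon_NS):

```python
import numpy as np

def newton_schulz_orthogonalize(M, steps=25):
    """Approximate the orthogonal polar factor U V^T of M with the
    cubic Newton-Schulz iteration X <- 1.5 X - 0.5 X X^T X.
    Normalizing by the Frobenius norm puts all singular values in
    (0, 1], inside the iteration's convergence region, so the
    singular values are driven toward 1 without an explicit SVD."""
    X = M / np.linalg.norm(M)  # Frobenius norm bounds the spectral norm
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

rng = np.random.default_rng(0)
M = rng.standard_normal((8, 8))
Q = newton_schulz_orthogonalize(M)
# Q @ Q.T should be close to the identity:
print(np.max(np.abs(Q @ Q.T - np.eye(8))))
```

The appeal is that the loop is just matmuls, which are cheap and GPU-friendly compared to an SVD, which is why such approximations cut overhead while preserving the optimizer's behavior.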