2 OLMo 2 Furious

Team OLMo, Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Arora, Akshita Bhagia, Yuling Gu, Shengyi Huang, Matt Jordan, Nathan Lambert, Dustin Schwenk, Oyvind Tafjord, Taira Anderson, David Atkinson, Faeze Brahman, Christopher Clark, Pradeep Dasigi, Nouha Dziri, Michal Guerquin, Hamish Ivison, Pang Wei Koh, Jiacheng Liu, Saumya Malik, William Merrill, Lester James Validad Miranda, Jacob Daniel Morrison, Tyler C. Murray, Crystal Nam, Valentina Pyatkin, et al.
Allen Institute for AI, University of Washington, New York University
arXiv.org (2024)
Tags: Pretraining · RL · Benchmark · Reasoning

📝 Paper Summary

Topics: Open Source · Large Language Models · Language Model Pretraining · Curriculum Learning
OLMo 2 is a fully open language model family (7B–32B) that achieves competitive performance through improved training stability, a specialized mid-training curriculum, and a verifiable reinforcement learning post-training recipe.
Core Problem
Most 'open' models release only weights, obscuring the training data and recipes needed for scientific study, while fully open models often lag behind state-of-the-art performance due to training instabilities and suboptimal data mixing.
Why it matters:
  • Lack of transparency prevents researchers from studying critical behaviors like memorization, concept acquisition, and training dynamics
  • Training instabilities (loss spikes) at scale waste massive amounts of compute and hinder the development of larger open models
  • The gap between open-weights and fully open models limits the democratization of AI research capabilities
Concrete Example: During training, models like OLMo-0424 experienced frequent gradient norm spikes caused by repeated n-grams (e.g., 'g4ODg4OD...') in web data, leading to training divergence or requiring costly restarts, which OLMo 2 mitigates.
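The n-gram filtering intervention can be illustrated with a minimal sketch. The exact heuristic (n-gram length, repeat threshold) is an assumption for illustration, not the paper's precise recipe; `has_repeated_ngram` is a hypothetical helper:

```python
def has_repeated_ngram(text: str, n: int = 8, min_repeats: int = 4) -> bool:
    """Flag documents containing a character n-gram repeated
    back-to-back at least `min_repeats` times (e.g. 'g4ODg4OD...')."""
    window = n * min_repeats
    for i in range(max(len(text) - window + 1, 0)):
        gram = text[i : i + n]
        # True if the same n-gram tiles the next `min_repeats` slots
        if text[i : i + window] == gram * min_repeats:
            return True
    return False

print(has_repeated_ngram("g4ODg4OD" * 16))    # True: pathological repetition
print(has_repeated_ngram("a normal sentence"))  # False
```

Documents flagged this way would be dropped before pretraining, removing the degenerate sequences associated with gradient norm spikes.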
Key Novelty
Two-Stage Training with Stability-Focused Architecture and Mid-Training Annealing
  • Divides training into a massive pretraining stage on web data and a 'mid-training' annealing stage on high-quality STEM/math data (Dolmino Mix) to specialize capabilities
  • Implements specific architectural and data interventions (Q-K Norm, Z-Loss, n-gram filtering) to eliminate the loss spikes that plagued previous iterations
  • Uses 'Checkpoint Soups' during the mid-training phase, averaging models from multiple runs with different data orders to find better local minima
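The 'Checkpoint Soups' idea is uniform weight averaging across runs. A minimal sketch, assuming checkpoints with identical parameter shapes (plain float lists stand in for tensors here; in practice this averages full model state dicts):

```python
def checkpoint_soup(checkpoints: list[dict[str, list[float]]]) -> dict[str, list[float]]:
    """Uniformly average parameters element-wise across checkpoints
    from runs that differ only in data order."""
    n = len(checkpoints)
    return {
        name: [sum(vals) / n for vals in zip(*(ckpt[name] for ckpt in checkpoints))]
        for name in checkpoints[0]
    }

# Two hypothetical mid-training runs with different data orders
run_a = {"w": [1.0, 2.0], "b": [0.0]}
run_b = {"w": [3.0, 4.0], "b": [2.0]}
print(checkpoint_soup([run_a, run_b]))  # {'w': [2.0, 3.0], 'b': [1.0]}
```

Averaging several such checkpoints tends to land in a flatter region of the loss surface than any single run, which is the motivation the paper gives for souping.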
Evaluation Highlights
  • OLMo 2 7B (Base) scores 62.9% on MMLU, outperforming Llama 3.1 8B (61.8%) and Mistral 7B (58.9%)
  • OLMo 2 13B (Base) achieves 60.9% on GSM8K (math), significantly higher than Llama 2 13B (38.4%) and surpassing Qwen 2.5 7B (55.8%)
  • OLMo 2 7B-Instruct achieves 56.5% average across 6 diverse instruction benchmarks, competitive with Llama 3.1 8B Instruct (59.1%)
Breakthrough Assessment
8/10
While it does not establish a new SOTA at every model size, OLMo 2 closes the gap between 'fully open' models (data and code included) and 'open weights' models, providing a critical artifact for the research community.