
Tulu 3: Pushing Frontiers in Open Language Model Post-Training

Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V. Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, Yuling Gu, Saumya Malik, Victoria Graf, Jena D. Hwang, Jiangjiang Yang, Ronan Le Bras, Oyvind Tafjord, Chris Wilhelm, Luca Soldaini, Noah A. Smith, Yizhong Wang, Pradeep Dasigi, Hannaneh Hajishirzi
Allen Institute for AI, University of Washington
arXiv (2024)
Topics: RL, Reasoning, Benchmark

📝 Paper Summary

Keywords: Post-training recipes, Reinforcement Learning from Human Feedback (RLHF), Open-weight language models
Tülu 3 provides a fully open, state-of-the-art post-training recipe (SFT, DPO, and RLVR) that allows open models to outperform their base instruct counterparts and rival closed models like GPT-4o-mini.
Core Problem
While post-training is essential for frontier models, open recipes lag behind proprietary ones, and successful models rarely release their full data mixtures, code, or training details.
Why it matters:
  • Lack of transparency prevents the community from reproducing or understanding the "secret sauce" behind state-of-the-art model performance
  • Open-source counterparts often rely on outdated or simplified pipelines that fail to match the performance of closed models on core skills like math and coding
  • Discrepancies in data curation and contamination make it difficult to rigorously evaluate progress in post-training techniques
Concrete Example: When training open models, standard recipes often fail to improve specific skills like math or precise constraint following without degrading general chat. For instance, Llama 3.1 8B Instruct achieves only 83.4% on GSM8K, whereas the proposed recipe pushes this to 87.6% by integrating verifiable rewards.
Key Novelty
Reinforcement Learning with Verifiable Rewards (RLVR) within a full open recipe
  • Introduces a specific post-training stage (RLVR) that uses ground-truth verification (e.g., math solutions, format constraints) as a binary reward signal instead of a learned reward model
  • Scales preference tuning with on-policy data curation: the policy model's own completions are paired against completions from other models, creating a fresh comparison signal for DPO
  • Implements a rigorous 'development vs. unseen' evaluation suite with aggressive n-gram decontamination to prevent overfitting to benchmarks
Evaluation Highlights
  • Tülu 3 70B outperforms GPT-4o-mini and Claude 3.5 Haiku on the Tülu 3 Eval suite average
  • Achieves +8.8 point increase on GSM8K (93.5%) compared to Llama 3.1 70B Instruct (84.7%)
  • Tülu 3 405B achieves 95.5% on GSM8K, outperforming GPT-4o (11-24 snapshot) which scores 91.7%
Breakthrough Assessment
9/10
Significantly closes the gap between open and closed post-training by releasing the entire high-performance pipeline (data, code, recipe), including the novel RLVR implementation.