RLVR: Reinforcement Learning with Verifiable Rewards—an RL training stage using binary rewards based on ground-truth correctness (e.g., correct math answer) rather than a reward model
SFT: Supervised Finetuning—training the model on prompt-completion pairs to learn instruction following
DPO: Direct Preference Optimization—a method to align models to preferences without an explicit reward model loop, using pairs of preferred/rejected responses
On-policy data: Training data generated by the current version of the model being trained, as opposed to 'off-policy' data generated by other models
Decontamination: The process of removing training examples that overlap with evaluation benchmarks to ensure fair testing
PPO: Proximal Policy Optimization—an RL algorithm used here for the RLVR stage
IFEval: Instruction Following Evaluation—a benchmark testing a model's ability to follow verifiable constraints (e.g., 'no capitalization')
GSM8K: Grade School Math 8K—a benchmark of grade-school level math word problems
MMLU: Massive Multitask Language Understanding—a general knowledge benchmark covering 57 subjects
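The RLVR entry above describes a binary reward computed from ground-truth correctness rather than a learned reward model. A minimal sketch in Python, assuming a hypothetical convention where the model writes its final result after an "Answer:" marker (the extraction rule is illustrative, not the actual implementation):

```python
def verifiable_reward(completion: str, ground_truth: str) -> float:
    """Binary RLVR-style reward: 1.0 if the extracted final answer
    matches the ground truth exactly, else 0.0."""
    # Hypothetical extraction: take the text after the last "Answer:" marker.
    # If the marker is absent, the whole completion is compared (and fails).
    answer = completion.rsplit("Answer:", 1)[-1].strip()
    return 1.0 if answer == ground_truth.strip() else 0.0
```

In practice the matching step is usually more robust (e.g., normalizing numbers or parsing math expressions), but the reward stays binary: correct or not.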
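The DPO entry can be made concrete with its per-pair loss: DPO pushes the policy's log-probability margin between preferred and rejected responses above the reference model's margin, with no explicit reward model. A sketch with scalar sequence log-probabilities (variable names are illustrative):

```python
import math

def dpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * (policy margin - reference margin))."""
    margin = beta * ((policy_logp_chosen - ref_logp_chosen)
                     - (policy_logp_rejected - ref_logp_rejected))
    # -log(sigmoid(x)) == log(1 + exp(-x)), computed stably via log1p
    return math.log1p(math.exp(-margin))
```

When the policy matches the reference, the margin is zero and the loss is log 2; widening the policy's preference margin beyond the reference's drives the loss toward zero.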
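What makes an IFEval constraint "verifiable" is that it can be checked mechanically, with no judge model. A toy checker for the 'no capitalization' constraint mentioned above (the function name is an assumption, not IFEval's API):

```python
def check_no_capitalization(response: str) -> bool:
    """Verifiable IFEval-style constraint: the response must contain
    no uppercase letters anywhere."""
    return not any(ch.isupper() for ch in response)
```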