
Advancing LLM Reasoning Generalists with Preference Trees

Lifan Yuan, Ganqu Cui, Hanbin Wang, Ning Ding, Xingyao Wang, Jia Deng, Boji Shan, Huimin Chen, Ruobing Xie, Yankai Lin, Zhenghao Liu, Bowen Zhou, Hao Peng, Zhiyuan Liu, Maosong Sun
Tsinghua University, University of Illinois Urbana-Champaign, Northeastern University, Renmin University of China, Tencent
arXiv.org (2024)
Tags: Reasoning · RL · Agent · Benchmark

📝 Paper Summary

Topics: Complex Reasoning (Math, Coding) · Preference Learning (RLHF/DPO/KTO)
Eurus improves LLM reasoning by training on UltraInteract, a new tree-structured alignment dataset of multi-turn interaction trajectories, and by identifying that KTO and NCA outperform DPO for reasoning alignment.
Core Problem
Open-source LLMs significantly lag behind proprietary models (like GPT-4) in complex reasoning because existing alignment data lacks diversity in planning/interaction and standard preference learning methods often fail on reasoning tasks.
Why it matters:
  • Complex reasoning requires sophisticated planning and error correction, which simple instruction-response pairs cannot capture
  • Standard preference learning algorithms (like DPO) developed for general chat can degrade performance in strict reasoning domains
  • High-quality, large-scale open resources for reasoning alignment are scarce compared to general chat data
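The DPO failure mode mentioned above stems from its loss depending only on the *gap* between the implicit rewards of the chosen and rejected responses, not their absolute values. A minimal numeric sketch (the `beta` value and log-ratio numbers are made up for illustration):

```python
import math

def dpo_loss(logratio_chosen: float, logratio_rejected: float, beta: float = 0.1) -> float:
    """DPO loss: -log sigmoid(beta * margin between policy/reference log-ratios)."""
    margin = beta * (logratio_chosen - logratio_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Both updates below yield the SAME loss, although in the second the
# chosen (correct) answer has become *less* likely than under the
# reference model -- the degradation pattern seen on reasoning data.
healthy  = dpo_loss(+2.0, -1.0)   # chosen up, rejected down
collapse = dpo_loss(-1.0, -4.0)   # both down, margin unchanged
```

Because the margin is identical in both cases, nothing in the objective itself discourages the second trajectory.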
Concrete Example: When solving a difficult LeetCode problem, a standard SFT model might generate a plausible but buggy solution and stop. It lacks the training data to simulate the process of running code, observing an error, and correcting it—a trajectory explicitly captured in this paper's UltraInteract dataset.
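The kind of multi-turn, error-correcting trajectory described above can be pictured as branches of a preference tree. A minimal sketch (the class and field names are illustrative, not taken from the paper's actual data schema):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    """One action in a reasoning trajectory (illustrative schema)."""
    action: str            # model output, e.g. code or a reasoning step
    correct: bool          # whether the action passed the checker/tests
    observation: str = ""  # environment feedback, e.g. an interpreter error
    children: List["Node"] = field(default_factory=list)

# Root = instruction; each depth level = one interaction turn.
root = Node(action="solve the LeetCode problem", correct=False)
buggy = Node(action="attempt v1", correct=False,
             observation="IndexError on test 3")
fixed = Node(action="attempt v2 (corrected bounds)", correct=True)
root.children = [buggy, fixed]

def paired_preferences(node: Node):
    """Yield (correct, incorrect) sibling pairs usable for preference learning."""
    good = [c for c in node.children if c.correct]
    bad = [c for c in node.children if not c.correct]
    pairs = [(g, b) for g in good for b in bad]
    for c in node.children:
        pairs += paired_preferences(c)
    return pairs
```

Pairing correct and incorrect siblings at every depth is what lets a single tree supply preference data for each turn of the interaction, not just the final answer.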
Key Novelty
UltraInteract Preference Trees & Reasoning-Aware Reward Modeling
  • Constructs a dataset (UltraInteract) where each instruction is the root of a 'preference tree' containing branching reasoning chains, multi-turn interactions with a code interpreter, and paired correct/incorrect nodes at every step.
  • Discovers that Direct Preference Optimization (DPO) actively harms reasoning performance due to reward collapse, whereas KTO and NCA succeed.
  • Proposes a new reward modeling objective that explicitly pushes the absolute rewards of correct reasoning paths higher, rather than just optimizing the margin between correct and incorrect.
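The idea of raising absolute rewards, rather than only the pairwise margin, can be sketched in plain Python (a loose illustration of the stated objective, not the paper's exact formulation or weighting):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def margin_loss(r_correct: float, r_wrong: float) -> float:
    """Bradley-Terry-style pairwise loss: only the margin matters."""
    return -math.log(sigmoid(r_correct - r_wrong))

def absolute_loss(r_correct: float, r_wrong: float) -> float:
    """Pushes correct rewards above zero and wrong rewards below zero."""
    return -math.log(sigmoid(r_correct)) - math.log(sigmoid(-r_wrong))

def reasoning_rm_loss(r_correct: float, r_wrong: float) -> float:
    """Combined objective: margin term plus absolute-reward term."""
    return margin_loss(r_correct, r_wrong) + absolute_loss(r_correct, r_wrong)

# The reward pairs (2, -1) and (-1, -4) have the SAME margin loss,
# but the combined loss prefers the one where the correct answer's
# absolute reward is positive.
```

The extra term breaks the margin-only symmetry: a reward model trained this way cannot satisfy the objective by merely pushing incorrect paths down while correct paths drift negative.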
Evaluation Highlights
  • Eurus-70B achieves 33.3% pass@1 on LeetCode (Hard), outperforming the best open-source baselines by a margin of more than 13.3%.
  • Eurus-70B attains 32.6% pass@1 on TheoremQA, matching GPT-3.5 Turbo's performance on this university-level STEM benchmark.
  • Eurus-RM-7B (Reward Model) achieves higher correlation with human experts than GPT-4 on the AutoJ benchmark.
Breakthrough Assessment
8/10
A significant contribution to alignment data (UltraInteract) and a critical finding about DPO's failure on reasoning tasks. The resulting 70B model sets a new open-source state of the art for reasoning.