
Advancing LLM Reasoning Generalists with Preference Trees

Lifan Yuan, Ganqu Cui, Hanbin Wang, Ning Ding, Xingyao Wang, Jia Deng, Boji Shan, Huimin Chen, Ruobing Xie, Yankai Lin, Zhenghao Liu, Bowen Zhou, Hao Peng, Zhiyuan Liu, Maosong Sun
Tsinghua University, University of Illinois Urbana-Champaign, Northeastern University, Renmin University of China, Tencent
arXiv.org (2024)
Tags: Reasoning · RL · Agent · Benchmark

📝 Paper Summary

Topics: Complex Reasoning (Math, Coding) · Preference Learning (RLHF/DPO/KTO)
Eurus improves LLM reasoning by training on UltraInteract, a new tree-structured alignment dataset of multi-turn interaction trajectories, and by identifying that KTO and NCA outperform DPO for reasoning alignment.
Core Problem
Open-source LLMs significantly lag behind proprietary models (like GPT-4) in complex reasoning because existing alignment data lacks diversity in planning/interaction and standard preference learning methods often fail on reasoning tasks.
Why it matters:
  • Complex reasoning requires sophisticated planning and error correction, which simple instruction-response pairs cannot capture
  • Standard preference learning algorithms (like DPO) developed for general chat can degrade performance in strict reasoning domains
  • High-quality, large-scale open resources for reasoning alignment are scarce compared to general chat data
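The DPO failure mode mentioned above stems from its loss depending only on the *gap* between the implicit rewards of the chosen and rejected responses, not their absolute values. A minimal numeric sketch (the `beta` value and log-ratio numbers are made up for illustration):

```python
import math

def dpo_loss(logratio_chosen: float, logratio_rejected: float, beta: float = 0.1) -> float:
    """DPO loss: -log sigmoid(beta * margin between policy/reference log-ratios)."""
    margin = beta * (logratio_chosen - logratio_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Both updates below yield the SAME loss, although in the second the
# chosen (correct) answer has become *less* likely than under the
# reference model -- the degradation pattern seen on reasoning data.
healthy  = dpo_loss(+2.0, -1.0)   # chosen up, rejected down
collapse = dpo_loss(-1.0, -4.0)   # both down, margin unchanged
```

Because the margin is identical in both cases, nothing in the objective itself discourages the second trajectory.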
Concrete Example: When solving a difficult LeetCode problem, a standard SFT model might generate a plausible but buggy solution and stop. It lacks the training data to simulate the process of running code, observing an error, and correcting it—a trajectory explicitly captured in this paper's UltraInteract dataset.
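The kind of multi-turn, error-correcting trajectory described above can be pictured as branches of a preference tree. A minimal sketch (the class and field names are illustrative, not taken from the paper's actual data schema):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    """One action in a reasoning trajectory (illustrative schema)."""
    action: str            # model output, e.g. code or a reasoning step
    correct: bool          # whether the action passed the checker/tests
    observation: str = ""  # environment feedback, e.g. an interpreter error
    children: List["Node"] = field(default_factory=list)

# Root = instruction; each depth level = one interaction turn.
root = Node(action="solve the LeetCode problem", correct=False)
buggy = Node(action="attempt v1", correct=False,
             observation="IndexError on test 3")
fixed = Node(action="attempt v2 (corrected bounds)", correct=True)
root.children = [buggy, fixed]

def paired_preferences(node: Node):
    """Yield (correct, incorrect) sibling pairs usable for preference learning."""
    good = [c for c in node.children if c.correct]
    bad = [c for c in node.children if not c.correct]
    pairs = [(g, b) for g in good for b in bad]
    for c in node.children:
        pairs += paired_preferences(c)
    return pairs
```

Pairing correct and incorrect siblings at every depth is what lets a single tree supply preference data for each turn of the interaction, not just the final answer.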
Key Novelty
UltraInteract Preference Trees & Reasoning-Aware Reward Modeling
  • Constructs a dataset (UltraInteract) where each instruction is the root of a 'preference tree' containing branching reasoning chains, multi-turn interactions with a code interpreter, and paired correct/incorrect nodes at every step.
  • Discovers that Direct Preference Optimization (DPO) actively harms reasoning performance due to reward collapse, whereas KTO and NCA succeed.
  • Proposes a new reward modeling objective that explicitly pushes the absolute rewards of correct reasoning paths higher, rather than just optimizing the margin between correct and incorrect.
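The idea of raising absolute rewards, rather than only the pairwise margin, can be sketched in plain Python (a loose illustration of the stated objective, not the paper's exact formulation or weighting):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def margin_loss(r_correct: float, r_wrong: float) -> float:
    """Bradley-Terry-style pairwise loss: only the margin matters."""
    return -math.log(sigmoid(r_correct - r_wrong))

def absolute_loss(r_correct: float, r_wrong: float) -> float:
    """Pushes correct rewards above zero and wrong rewards below zero."""
    return -math.log(sigmoid(r_correct)) - math.log(sigmoid(-r_wrong))

def reasoning_rm_loss(r_correct: float, r_wrong: float) -> float:
    """Combined objective: margin term plus absolute-reward term."""
    return margin_loss(r_correct, r_wrong) + absolute_loss(r_correct, r_wrong)

# The reward pairs (2, -1) and (-1, -4) have the SAME margin loss,
# but the combined loss prefers the one where the correct answer's
# absolute reward is positive.
```

The extra term breaks the margin-only symmetry: a reward model trained this way cannot satisfy the objective by merely pushing incorrect paths down while correct paths drift negative.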
Evaluation Highlights
  • Eurus-70B achieves 33.3% pass@1 on LeetCode (Hard), outperforming the best open-source baselines by a margin of more than 13.3%.
  • Eurus-70B attains 32.6% pass@1 on TheoremQA, matching GPT-3.5 Turbo's performance on this university-level STEM benchmark.
  • Eurus-RM-7B (Reward Model) achieves higher correlation with human experts than GPT-4 on the AutoJ benchmark.
Breakthrough Assessment
8/10
A significant contribution to alignment data (UltraInteract) and a critical finding about DPO's failure on reasoning tasks. The resulting 70B model sets a new open-source state of the art for reasoning.