
Rethinking Expert Trajectory Utilization in LLM Post-training

Bowen Ding, Yuhan Chen, Jiayang Lv, Jiyao Yuan, Qi Zhu, Shuangshuang Tian, Dantong Zhu, Futing Wang, Heyuan Deng, Fei Mi, Lifeng Shang, Tao Lin
Zhejiang University, Westlake University, Huawei Noah’s Ark Lab
arXiv (2025)
Reasoning RL Benchmark

📝 Paper Summary

LLM Post-training · Mathematical Reasoning
The sequential SFT-then-RL pipeline outperforms synchronized approaches: large-scale SFT establishes a strong performance foundation, which in turn maximizes the plasticity left for the subsequent Reinforcement Learning phase.
Core Problem
It is unclear how to best utilize expert trajectories (SFT data) to maximize reasoning performance: the evidence conflicts between the Synchronized SFT-RL paradigm (mixing an imitation loss into RL) and the Sequential SFT-then-RL paradigm.
Why it matters:
  • Synchronized methods claim efficiency gains but have mostly been validated on limited data (~46K samples), raising doubts about their robustness at scale
  • Practitioners rely on Sequential SFT-then-RL empirically without rigorous guidelines on the optimal timing for switching phases
  • The 'Less is More' data hypothesis suggests minimal SFT data is sufficient, but it is unknown if this limits the model's potential for subsequent RL scaling
Concrete Example: Synchronized methods like SRFT integrate imitation loss directly into the RL loop to boost efficiency. However, when scaled to large datasets (889K samples), these methods often exhibit instability or lower performance ceilings compared to simply fine-tuning on the data first and then running RL.
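The synchronized paradigm described above can be sketched as a single objective that blends an imitation (SFT) loss into the RL loss at every update. This is an illustrative sketch only: the function name and the mixing weight `lambda_imit` are assumptions for exposition, not SRFT's actual formulation.

```python
# Illustrative sketch of a synchronized SFT+RL objective (NOT SRFT's exact
# loss): each optimization step combines an RL policy loss with an imitation
# loss on expert trajectories. `lambda_imit` is a hypothetical mixing weight.

def synchronized_loss(rl_loss: float, imitation_loss: float,
                      lambda_imit: float = 0.5) -> float:
    """Combined objective: L = L_RL + lambda * L_imitation."""
    return rl_loss + lambda_imit * imitation_loss

# The sequential paradigm instead optimizes the two terms in separate phases:
# first pure imitation (SFT) on the full dataset, then pure RL on the
# fine-tuned model, which is the setup the paper argues scales better.
```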
Key Novelty
Plasticity-Ceiling Framework
  • Decomposes final performance into two measurable components: the realized SFT Performance (foundation) and the remaining RL Plasticity (potential for further growth)
  • Demonstrates that a robust SFT phase is necessary to maximize the starting foundation, which contradicts 'Less is More' by showing that more SFT data increases the final ceiling
  • Identifies 'mild overfitting' in SFT as the optimal signal to switch to RL, ensuring the foundation is maximized without destroying plasticity
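The 'mild overfitting' switch signal can be sketched as a simple heuristic over the SFT loss curves: switch to RL once validation loss has started rising while training loss is still falling. The function, `patience`, and `tolerance` below are illustrative assumptions, not the paper's actual criterion.

```python
# Hypothetical sketch of a "mild overfitting" SFT-to-RL switch heuristic.
# Thresholds (`patience`, `tolerance`) are illustrative, not from the paper.

def should_switch_to_rl(train_losses, val_losses,
                        patience: int = 2, tolerance: float = 0.0) -> bool:
    """Signal the SFT->RL switch once validation loss has risen for
    `patience` consecutive epochs while training loss keeps falling,
    i.e. the model has just begun to overfit the expert trajectories."""
    if len(val_losses) <= patience:
        return False
    val_rising = all(val_losses[-i] > val_losses[-i - 1] + tolerance
                     for i in range(1, patience + 1))
    train_falling = train_losses[-1] < train_losses[-patience - 1]
    return val_rising and train_falling

# Example: validation loss ticks up for two epochs while training loss
# keeps dropping -> the heuristic signals the switch.
print(should_switch_to_rl([1.0, 0.8, 0.6, 0.5],
                          [0.9, 0.7, 0.72, 0.75]))  # → True
```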
Evaluation Highlights
  • Benchmarked on 6 mathematical datasets (including GSM8K, MATH, and OlympiadBench) using Qwen2.5-7B and Llama3.2-3B
  • Constructed and evaluated on a large-scale SFT dataset of 889K distilled DeepSeek trajectories to test scaling limits
  • Refutes the 'Less is More' hypothesis for the final ceiling, showing that SFT data scale determines primary potential while difficulty acts as a multiplier
Breakthrough Assessment
8/10
Provides a theoretical framework and rigorous empirical scaling laws for the SFT-then-RL pipeline, resolving a major industry debate about post-training paradigms and data efficiency.