
Seed1.5-Thinking: Advancing Superb Reasoning Models with Reinforcement Learning

ByteDance Seed Jiaze Chen, Tiantian Fan, Xin Liu, Lingjun Liu, Zhiqiang Lin, Mingxuan Wang, Chengyi Wang, Xiangpeng Wei, Wenyuan Xu, Yufeng Yuan, Yu Yue, Lin Yan, Qiying Yu, Xiaochen Zuo, Chi Zhang, Ruofei Zhu, Zhecheng An, Zhihao Bai, Yu Bao, Xingyan Bin, Jiangjie Chen, Feng Chen, Hongmin Chen, Riwei Chen, Liangqiang Chen, Zixin Chen, Jinsong Chen, Siyan Chen, Kaiyuan Chen, Zhi Chen, et al.
arXiv.org (2025)
Reasoning RL Benchmark

📝 Paper Summary

Reasoning Models · Reinforcement Learning (RL) for LLMs
Seed1.5-Thinking is a Mixture-of-Experts reasoning model optimized via large-scale reinforcement learning with novel process verifiers, achieving state-of-the-art performance on math and coding benchmarks.
Core Problem
Training high-quality reasoning models is difficult due to the scarcity of high-quality Chain-of-Thought (CoT) data and the extreme instability of large-scale Reinforcement Learning (RL) training.
Why it matters:
  • Current reasoning models often rely on unstable RL training that crashes frequently, with run-to-run score differences of up to 10 points
  • Standard rule-based verifiers for math problems struggle with format variations (e.g., 2^19 vs 524288) and corner cases, leading to inaccurate reward signals
  • Existing benchmarks like AIME 2024 are becoming saturated and lack sufficient discrimination for top-tier models
Concrete Example: A standard verifier might reject a correct answer formatted as '2^{19}' if the reference is '524288', causing the model to learn incorrect behaviors. Seed1.5-Thinking uses a 'Thinking-Verifier' that reasons through the equivalence of these answers before judging.
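The failure mode above can be illustrated with a toy comparison between strict string matching and a normalization-based equivalence check. This is a minimal sketch, not the paper's verifier (which uses a reasoning model rather than rules); the function names and the power-notation normalization are assumptions for illustration only:

```python
import re

def exact_match(answer: str, reference: str) -> bool:
    """Rule-based check: strict string comparison (the failure mode described above)."""
    return answer.strip() == reference.strip()

def numeric_equivalent(answer: str, reference: str) -> bool:
    """Toy equivalence check: normalize simple power notation like '2^{19}'
    into a Python arithmetic expression and compare numeric values."""
    def evaluate(expr: str):
        expr = expr.strip().replace("^", "**").replace("{", "(").replace("}", ")")
        # Restrict to digits, whitespace, parentheses, and arithmetic operators.
        if not re.fullmatch(r"[\d\s()+\-*/.]+", expr):
            return None
        try:
            return eval(expr)  # acceptable here: input limited to arithmetic characters
        except Exception:
            return None
    a, b = evaluate(answer), evaluate(reference)
    return a is not None and a == b

# '2^{19}' vs '524288': exact match fails, numeric equivalence succeeds.
print(exact_match("2^{19}", "524288"))         # False
print(numeric_equivalent("2^{19}", "524288"))  # True
```

A reasoning-based verifier generalizes this idea far beyond arithmetic normalization, judging equivalence by working through the answer rather than pattern-matching it.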
Key Novelty
Seed1.5-Thinking with VAPO/DAPO RL and Seed-Thinking-Verifier
  • Integrates two novel RL frameworks (VAPO for actor-critic, DAPO for policy-gradient) to stabilize the notoriously unstable training of reasoning models
  • Employs a 'Seed-Thinking-Verifier' that generates its own reasoning path to judge student answers, reducing reward hacking and handling complex format variations better than rule-based checkers
  • Decouples the RL infrastructure into an asynchronous streaming rollout architecture with prioritized sample pools to improve iteration speed by 3x
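The prioritized sample pool in the asynchronous rollout architecture can be pictured as a priority queue that serves the most informative rollouts to the trainer first. The sketch below is an illustrative toy, assuming a max-priority ordering; the class name `SamplePool` and the priority heuristic are assumptions, not the paper's implementation:

```python
import heapq
import itertools

class SamplePool:
    """Toy prioritized sample pool: rollouts with higher priority
    (e.g. larger advantage or verifier disagreement) are trained on first."""
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker for equal priorities

    def add(self, rollout, priority: float):
        # heapq is a min-heap, so negate priority for max-first ordering.
        heapq.heappush(self._heap, (-priority, next(self._counter), rollout))

    def pop_batch(self, batch_size: int):
        """Drain up to batch_size rollouts in descending priority order."""
        batch = []
        while self._heap and len(batch) < batch_size:
            _, _, rollout = heapq.heappop(self._heap)
            batch.append(rollout)
        return batch

pool = SamplePool()
pool.add("rollout-a", priority=0.2)
pool.add("rollout-b", priority=0.9)
pool.add("rollout-c", priority=0.5)
print(pool.pop_batch(2))  # ['rollout-b', 'rollout-c']
```

In an asynchronous streaming setup, rollout workers would call `add` continuously while the trainer calls `pop_batch`, decoupling generation from optimization as the summary describes.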
Evaluation Highlights
  • Achieves 86.7% on AIME 2024, matching o3-mini-high and significantly outperforming DeepSeek R1 and o1
  • Surpasses DeepSeek R1 by 8.0% in user positive feedback on non-reasoning tasks, indicating strong generalization beyond just math/code
  • Attains 55.0% pass@1 on Codeforces (based on recent 12 contests), outperforming DeepSeek R1
Breakthrough Assessment
8/10
Strong performance matching or beating current SOTA (DeepSeek R1, o1) on key benchmarks with a smaller model (20B active params). Introduces significant infrastructure and verifier improvements.