
VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks

Yu Yue, Yufeng Yuan, Qiying Yu, Xiaochen Zuo, Ruofei Zhu, Wenyuan Xu, Jiaze Chen, Chengyi Wang, Tiantian Fan, Zhengyin Du, Xiang Wei, Xiangyu Yu, Gaohong Liu, Juncai Liu, Lingjun Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Ru Zhang, Xin Liu, Mingxuan Wang, Yonghui Wu, Lin Yan
ByteDance Seed
arXiv.org (2025)
RL Reasoning Benchmark

📝 Paper Summary

Reinforcement Learning for Reasoning · Chain-of-Thought (CoT) Optimization
VAPO stabilizes value-based reinforcement learning for long reasoning tasks by using length-adaptive advantage estimation to balance bias and variance across heterogeneous sequence lengths.
Core Problem
Training value models for long Chain-of-Thought tasks is unstable due to initialization bias, the difficulty of handling widely varying response lengths with fixed parameters, and sparse reward signals.
Why it matters:
  • Value-model-free methods (like GRPO/DAPO) are stable but lack precise credit assignment, limiting the optimization ceiling for complex reasoning
  • Standard advantage estimation (GAE) with fixed decay parameters fails when sequence lengths vary drastically, causing either high variance (short responses) or high bias (long responses)
  • Reasoning tasks require traversing long decision paths where a single error causes failure, necessitating finer-grained optimization than trajectory-level rewards can provide
Concrete Example: In a long mathematical proof, a standard value model using fixed GAE (lambda=0.95) discounts the final reward so heavily over a long sequence that early tokens receive near-zero signal, relying entirely on biased bootstrap estimates. Conversely, for very short responses, the same parameter yields high-variance estimates.
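The decay described above can be made concrete with a small numeric sketch (illustrative, not from the paper): under GAE, the terminal reward enters the first token's advantage with weight (gamma * lambda)^(T-1), so with gamma = 1 and a fixed lambda = 0.95 the early-token signal vanishes for long responses while remaining substantial for short ones.

```python
# Illustrative: weight of the terminal reward in the first token's GAE
# advantage is (gamma * lam) ** (T - 1). With gamma = 1 and fixed
# lam = 0.95, long responses leave early tokens with near-zero signal.
lam = 0.95
for T in (10, 500, 2000):
    w = lam ** (T - 1)  # contribution of the final reward at token 0
    print(f"T={T:4d}  first-token weight of terminal reward = {w:.2e}")
```

At T = 10 the weight is about 0.63; at T = 2000 it is effectively zero, leaving early tokens to rely entirely on biased bootstrap estimates, as described above.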
Key Novelty
Length-Adaptive Generalized Advantage Estimation (GAE)
  • Dynamically adjusts the GAE decay parameter (lambda) based on the length of the generated response, rather than using a fixed static value
  • Balances the bias-variance trade-off: reduces variance for short responses and mitigates accumulated bootstrapping bias for long responses
  • Integrates specific regularization techniques (Clip-Higher, Token-level Loss) into a unified value-based framework to stabilize training
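The length-adaptive rule can be sketched as follows. The functional form lambda = 1 - 1/(alpha * T) follows the paper's length-adaptive lambda for the policy; the alpha value, the clamping, and the zero terminal bootstrap below are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def length_adaptive_gae(rewards, values, alpha=0.05, gamma=1.0):
    """Sketch of Length-Adaptive GAE.

    lambda = 1 - 1/(alpha * T): the decay parameter approaches 1 as the
    response length T grows, curbing accumulated bootstrapping bias for
    long responses while keeping variance low for short ones.
    The clamp to [0, 1] and alpha=0.05 are illustrative choices.
    """
    T = len(rewards)
    lam = min(1.0, max(0.0, 1.0 - 1.0 / (alpha * T)))
    adv = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        next_v = values[t + 1] if t + 1 < T else 0.0  # zero terminal bootstrap
        delta = rewards[t] + gamma * next_v - values[t]
        gae = delta + gamma * lam * gae
        adv[t] = gae
    return adv, lam
```

With alpha = 0.05 a 20-token response gets lambda = 0 (pure one-step TD, low variance), while a 1000-token response gets lambda = 0.98, letting the sparse terminal reward propagate much further back through the sequence.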
Evaluation Highlights
  • Achieves a score of 60.4 on the AIME 2024 benchmark using a Qwen 32B model, setting a new state of the art for this size class
  • Outperforms value-model-free baselines (DAPO and DeepSeek-R1-Zero-Qwen-32B) by over 10 points under identical settings
  • Improves AIME 2024 performance from around 5 (vanilla PPO) to 60.4 (VAPO) while training stably with zero crashes
Breakthrough Assessment
8/10
Successfully rehabilitates value-model-based RL for reasoning tasks, a domain recently dominated by value-free methods like GRPO, showing that value models can yield a higher optimization ceiling once their stability issues are solved.