
Klear-Reasoner: Advancing Reasoning Capability via Gradient-Preserving Clipping Policy Optimization

Zhenpeng Su, Leiyu Pan, Xue Bai, Dening Liu, Guanting Dong, Jiaming Huang, Wenping Hu, Fuzheng Zhang, Kun Gai, Guorui Zhou
Kuaishou Technology
arXiv.org (2025)
Tags: RL · Reasoning · Benchmark

📝 Paper Summary

Topics: Large Language Model Reasoning · Reinforcement Learning from Human Feedback (RLHF)
Klear-Reasoner improves reasoning performance at the 8B scale by replacing standard RL clipping with Gradient-Preserving Clipping Policy Optimization (GPPO), which keeps bounded gradients flowing for clipped high-entropy tokens, improving exploration and speeding convergence.
Core Problem
Standard clipping in reinforcement learning objectives such as PPO and GRPO indiscriminately discards gradients for tokens that fall outside the trust region, suppressing valuable exploration signals from high-entropy tokens and slowing learning from negative samples.
Why it matters:
  • High-entropy tokens often correspond to critical 'Aha!' moments or decision branches in reasoning chains; clipping them prevents the model from reinforcing these exploratory breakthroughs
  • When suboptimal trajectories are clipped (importance ratio < 1-epsilon), the model receives no gradient signal to correct them, forcing it to repeatedly sample bad outputs before learning
  • Reproducing high-performance reasoning models (like O1 or DeepSeek-R1) remains difficult due to these training instabilities and lack of public details
Concrete Example: In a math problem, if the model attempts a novel but correct step that drastically changes the policy distribution (high importance ratio), standard PPO clips the gradient to zero, effectively ignoring this 'lightbulb moment'. Similarly, if it generates a clearly wrong step with low probability, clipping prevents the model from receiving the 'don't do that' signal.
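The zeroed gradient can be read directly off the clipped surrogate. A minimal sketch in plain Python (not the paper's code) of a per-token PPO-clip objective and its gradient with respect to the token log-probability, illustrating both failure cases above:

```python
def ppo_clip_objective(ratio, adv, eps=0.2):
    """Per-token PPO-clip surrogate and its gradient w.r.t. the token
    log-probability under the new policy.

    ratio = pi_new(token) / pi_old(token) = exp(logp_new - logp_old), so
    d(ratio)/d(logp_new) = ratio: the unclipped branch has gradient
    ratio * adv, while the clipped branch is constant and contributes 0.
    """
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    unclipped = ratio * adv
    clipped = clipped_ratio * adv
    if unclipped <= clipped:      # min(...) selects the unclipped branch
        return unclipped, ratio * adv
    return clipped, 0.0           # clipped branch active: zero gradient

# A 'surprising' good step (adv > 0) with a large policy shift is clipped:
print(ppo_clip_objective(1.5, 1.0))    # (1.2, 0.0) -> no learning signal
# A low-probability bad step (adv < 0) is clipped too:
print(ppo_clip_objective(0.5, -1.0))   # (-0.8, 0.0) -> no 'don't do that'
```

In both printed cases the surrogate's value is finite but its gradient is exactly zero, which is the training pathology GPPO targets.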
Key Novelty
Gradient-Preserving Clipping Policy Optimization (GPPO)
  • Instead of zeroing out gradients when the ratio between new and old policies deviates too far (clipping), GPPO keeps the gradients active but bounds them within a safe range
  • For positive advantages (good actions), it allows updates even if the ratio is high, preserving exploration signals from 'surprising' good moves
  • For negative advantages (bad actions), it ensures gradients flow back even if the ratio is low, allowing the model to quickly learn 'what not to do' without waiting for repeated sampling
Evaluation Highlights
  • 90.5% accuracy on AIME 2024 benchmark with Klear-Reasoner-8B
  • 83.2% accuracy on AIME 2025 benchmark
  • 66.0% pass rate on LiveCodeBench V5 (coding tasks)
Breakthrough Assessment
8/10
Achieves state-of-the-art reasoning performance for 8B models, reportedly outperforming DeepSeek-R1-Distill-8B on difficult benchmarks like AIME via a principled improvement to the RL objective.