Evaluation Setup
Zero-shot evaluation on mathematical reasoning and code generation benchmarks
Benchmarks:
- AIME 2024 (Challenging Mathematical Reasoning)
- AIME 2025 (Challenging Mathematical Reasoning)
- LiveCodeBench V5 (Code Generation / Competition Coding)
- LiveCodeBench V6 (Code Generation / Competition Coding)
Metrics:
- Accuracy (Math)
- Pass@1 (Code)
- Statistical methodology: Not explicitly reported in the paper
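Since the paper does not spell out its estimator, a common way Pass@1 is computed in code-generation evaluations is the unbiased pass@k estimator of Chen et al. (2021), which for k=1 reduces to the fraction of correct samples. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).

    n: total samples generated per problem
    c: number of samples that passed all tests
    k: budget being estimated
    """
    if n - c < k:
        # Fewer than k incorrect samples: at least one of any k is correct.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# For k=1 this is simply c / n, e.g. 4 correct out of 16 samples:
print(pass_at_k(16, 4, 1))  # 0.25
```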
Main Takeaways
- High-quality data is more effective than large, diverse data for Long CoT SFT; retaining errors in 'difficult' samples aids exploration, while retaining errors in 'simple' samples harms performance.
- GPPO addresses the 'delayed convergence' problem of negative samples in RL by preserving gradients from suboptimal trajectories whose tokens standard PPO's clipping would zero out.
- Klear-Reasoner-8B achieves high performance on AIME (90.5% on AIME 2024), demonstrating that 8B models can reach strong reasoning capability with careful RL fine-tuning.
- Soft rewards (scaled by the fraction of test cases passed) are crucial for code RL because they mitigate reward sparsity, in contrast to the binary rewards used for math.
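The gradient-preserving idea behind GPPO can be sketched with a straight-through detach trick: the forward value matches PPO's clipped surrogate, but the backward pass still propagates a (rescaled) gradient through the ratio. This is an assumption about the mechanism, not the paper's exact formulation, and a single symmetric `eps` is used for brevity:

```python
import torch

def ppo_token_obj(ratio, adv, eps=0.2):
    # Standard PPO clipped surrogate: min(r*A, clip(r)*A).
    # Once the clamped branch is selected, d(clip)/d(ratio) = 0,
    # so the token contributes no gradient.
    return torch.minimum(ratio * adv, torch.clamp(ratio, 1 - eps, 1 + eps) * adv)

def gppo_token_obj(ratio, adv, eps=0.2):
    # Gradient-preserving sketch: same forward value as the clipped
    # surrogate, but gradients still flow through `ratio` via
    # clipped.detach() * ratio / ratio.detach().
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
    preserved = clipped.detach() * (ratio / ratio.detach())
    return torch.minimum(ratio * adv, preserved * adv)

# A negative sample (adv < 0) whose ratio has fallen below 1 - eps:
r = torch.tensor([0.5], requires_grad=True)
(-gppo_token_obj(r, torch.tensor([-1.0]))).sum().backward()
print(r.grad)  # nonzero, whereas standard PPO would give exactly 0
```

The design point: with a negative advantage and a clipped-low ratio, PPO silently drops the learning signal from that token; the detach trick keeps a gradient proportional to `clipped / ratio`, which is the "delayed convergence" fix described above.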
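The soft-versus-binary reward contrast is simple to state in code (function names are illustrative, not from the paper):

```python
def soft_code_reward(passed: int, total: int) -> float:
    # Soft reward: fraction of test cases passed, so a solution that
    # passes 7/10 tests still receives a 0.7 learning signal.
    return passed / total if total else 0.0

def binary_reward(passed: int, total: int) -> float:
    # Binary reward: credit only when every test case passes,
    # which is sparse for hard competition-coding problems.
    return 1.0 if total and passed == total else 0.0

print(soft_code_reward(7, 10))   # 0.7
print(binary_reward(7, 10))      # 0.0
```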