Evaluation Setup
Code generation is evaluated for functional correctness via test cases.
Benchmarks:
- LiveCodeBench v5 (Code Generation, recent problems)
- HumanEval(+) (Python Code Generation)
- MBPP(+) (Python Code Generation)
- BigCodeBench (Complex Code Generation)
- LCB-RB (Reasoning Quality Evaluation) [New]
Metrics:
- Pass@1
- Reward Model Accuracy
- Statistical methodology: a chi-square test of association between reasoning quality and answer correctness (p < 0.001).
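The chi-square analysis above can be sketched in pure-stdlib Python. The 2x2 contingency table (reasoning quality high/low vs. answer correct/incorrect) uses illustrative counts, not the paper's data; the function names are my own.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test of association for a 2x2 table
    [[a, b], [c, d]] with 1 degree of freedom (stdlib only)."""
    n = a + b + c + d
    chi2 = 0.0
    # Expected count under independence: row_total * col_total / n
    for obs, row, col in [(a, a + b, a + c), (b, a + b, b + d),
                          (c, c + d, a + c), (d, c + d, b + d)]:
        exp = row * col / n
        chi2 += (obs - exp) ** 2 / exp
    # For 1 dof, the chi-square survival function reduces to erfc(sqrt(chi2/2))
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

# Illustrative counts: rows = high-/low-quality reasoning,
# columns = correct/incorrect final answer.
chi2, p = chi_square_2x2(120, 30, 45, 105)
print(f"chi2 = {chi2:.2f}, p = {p:.2e}")  # strong association -> p << 0.001
```

A significant result here means reasoning quality and correctness are associated, which is the relationship the paper's p < 0.001 figure reports.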
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Average (LiveCodeBench, HumanEval+, MBPP+, BigCodeBench) | Pass@1 | 54.9 | 57.4 | +2.5 |
| LiveCodeBench | Pass@1 | 39.1 | 46.2 | +7.1 |
| Average (GSM8K, MATH, OlympBench) | Accuracy | 73.9 | 79.3 | +5.4 |
| LCB-RB | Accuracy | 58.82 | 74.87 | +16.05 |
Main Takeaways
- P-GRPO consistently outperforms outcome-only baselines, demonstrating that rewarding reasoning quality (when correct) aids optimization.
- The Optimized-Degraded (OD) training method produces reward models that generalize well to other benchmarks (RewardBench) and significantly outperform general-purpose reward models on reasoning tasks.
- P-GRPO improves data efficiency: when all samples in a GRPO batch are correct, thinking rewards still provide gradient signal (unlike standard GRPO where advantage becomes zero).
- Qualitative analysis shows P-GRPO models handle edge cases (like negative numbers in square root problems) better due to more comprehensive reasoning.
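The data-efficiency point above can be sketched numerically: with outcome-only GRPO, a group in which every sample is correct yields identical rewards, so the group-normalized advantages are all zero and the batch contributes no gradient; a process (thinking) reward restores a signal. The reward-mixing weight and scores below are illustrative assumptions, not the paper's exact scheme.

```python
def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: (r - group mean) / (group std + eps)."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# All samples correct: outcome rewards are identical -> every advantage is 0,
# so standard GRPO gets no learning signal from this group.
outcome_only = [1.0, 1.0, 1.0, 1.0]
print(group_advantages(outcome_only))

# Same group with an added thinking-quality reward (illustrative scores):
# advantages become non-zero and rank samples by reasoning quality.
thinking = [0.9, 0.4, 0.7, 0.2]
mixed = [o + 0.5 * t for o, t in zip(outcome_only, thinking)]
print(group_advantages(mixed))
```

This is the degenerate case the takeaway refers to: the thinking reward differentiates otherwise-identical correct samples, so fully-correct batches still drive optimization.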