Evaluation Setup
Models are evaluated on mathematical reasoning capability and LLM-as-a-Judge capability across self-rewarding iterations (M0 to M3).
Benchmarks:
- GSM8K (Grade School Math)
- MATH (Challenging Math Problems)
- Gaokao2023En (Chinese College Entrance Exam Math, English version)
- OlympiadBench (Math Olympiad Problems)
- AIME2024 (Math Competition)
- AMC2023 (Math Competition)
Metrics:
- Accuracy (Math)
- Accuracy (Judge consistency with human/oracle)
- Statistical methodology: Not explicitly reported in the paper
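The judge metric above measures how often the model's judging decisions agree with a human or oracle label. The paper does not spell out the exact computation, so the function below is an illustrative assumption: plain agreement accuracy over paired verdicts.

```python
def judge_consistency(model_verdicts, oracle_verdicts):
    """Fraction of judging decisions that match the human/oracle label.

    Hypothetical helper: inputs are parallel lists of verdicts
    (e.g. "good"/"bad" per reasoning step or per solution).
    """
    assert len(model_verdicts) == len(oracle_verdicts)
    matches = sum(m == o for m, o in zip(model_verdicts, oracle_verdicts))
    return matches / len(model_verdicts)

# Toy usage: the model agrees with the oracle on 3 of 4 judgments.
print(judge_consistency(["good", "bad", "good", "good"],
                        ["good", "bad", "bad", "good"]))  # 0.75
```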
Key Results
Performance gains of the Process-based Self-Rewarding method (M3 iteration) compared to the base Qwen2.5-Math-7B-Instruct model:

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| GSM8K | Accuracy | 82.9 | 89.7 | +6.8 |
| MATH | Accuracy | 73.2 | 76.9 | +3.7 |
| Gaokao2023En | Accuracy | 63.0 | 70.1 | +7.1 |
| OlympiadBench | Accuracy | 39.5 | 44.1 | +4.6 |
| AIME2024 | Accuracy | 13.3 | 16.7 | +3.4 |
| AMC2023 | Accuracy | 50.0 | 52.5 | +2.5 |
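As a quick consistency check, the Δ column can be recomputed from the reported baseline and M3 accuracies. A throwaway sketch, with the values copied from the table above:

```python
# (baseline accuracy, M3 accuracy) per benchmark, from the main table.
scores = {
    "GSM8K":         (82.9, 89.7),
    "MATH":          (73.2, 76.9),
    "Gaokao2023En":  (63.0, 70.1),
    "OlympiadBench": (39.5, 44.1),
    "AIME2024":      (13.3, 16.7),
    "AMC2023":       (50.0, 52.5),
}

# Δ = M3 accuracy minus baseline accuracy, rounded to one decimal place.
deltas = {b: round(ours - base, 1) for b, (base, ours) in scores.items()}
print(deltas)  # GSM8K: 6.8, MATH: 3.7, ..., AMC2023: 2.5
```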
Ablation study demonstrating the effectiveness of the proposed components compared to standard self-rewarding:

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| MATH | Accuracy | 70.0 | 76.9 | +6.9 |
Main Takeaways
- Iterative improvement: Accuracy consistently increases from M0 to M3, validating the self-rewarding loop.
- Process vs. Outcome: Standard outcome-based self-rewarding fails in math (performance degrades or stagnates), while process-based self-rewarding succeeds.
- Joint Capability: The model improves its ability to *judge* reasoning steps alongside its ability to *generate* them.
- Scaling: Effective at both the 7B and 72B parameter scales; the 72B model also gains, though less dramatically than the 7B model because its baseline is already high.
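The iterative loop behind the M0-to-M3 progression can be sketched at a very high level: at each iteration the current model both generates candidate solutions and judges its own reasoning, and the resulting preferences drive the next round of tuning. Everything below is an illustrative assumption; `generate`, `judge`, and `tune` are hypothetical stand-ins, not the paper's actual training procedure.

```python
def self_rewarding_loop(model, problems, generate, judge, tune, n_iters=3):
    """One pass per iteration (M0 -> M1 -> M2 -> M3):
    generate two candidates per problem, self-judge to pick the
    preferred one, then tune the model on the preference pairs."""
    for _ in range(n_iters):
        candidates = [(p, generate(model, p), generate(model, p))
                      for p in problems]
        prefs = [(p, a, b) if judge(model, a) >= judge(model, b)
                 else (p, b, a)
                 for p, a, b in candidates]
        model = tune(model, prefs)  # e.g. a DPO-style preference update
    return model

# Toy usage: "model" is a single skill number that each tuning step nudges up.
m = self_rewarding_loop(
    model=0.0,
    problems=[1, 2, 3],
    generate=lambda m, p: m + p * 0.1,
    judge=lambda m, a: a,
    tune=lambda m, prefs: m + 0.1,
)
print(round(m, 1))  # 0.3 -- skill increased across three iterations
```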