Evaluation Setup
Post-training on math datasets (MATH, DAPO, OpenRS); evaluation on math, code, and general instruction-following benchmarks.
Benchmarks:
- MATH500 (Mathematical reasoning)
- GSM8K (Grade school math)
- AMC / AIME24 (Competition math)
- LiveCodeBench / CRUX (Code generation)
- MMLU-Pro / IFEval (General multi-task knowledge & instruction following)
Metrics:
- Pass@1
- Statistical methodology: Not explicitly reported in the paper
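Pass@1 presumably follows the standard unbiased pass@k estimator used in code-generation evaluation (the Codex-style formula); the function below is a minimal sketch of that estimator, not the paper's own implementation.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n = samples generated per problem,
    c = samples that are correct, k = evaluation budget.
    pass@k = 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # too few failures to fill a k-sample draw
    return 1.0 - comb(n - c, k) / comb(n, k)

# For k = 1 this reduces to the empirical accuracy c / n,
# e.g. 3 correct out of 10 samples gives pass@1 ≈ 0.3.
print(pass_at_k(10, 3, 1))
```

Per-benchmark scores are then averaged over all problems.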
Key Results
Co-rewarding outperforms self-rewarding baselines and approaches ground-truth performance on mathematical reasoning benchmarks.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Average across 4 Math Benchmarks (Table 1) | Pass@1 | Not reported | Not reported | +4.42% |
| Average across 4 Math Benchmarks (Table 2) | Pass@1 | Not reported | Not reported | +12.90% |
| Llama-3.2-3B-Instruct (multiple benchmarks) | Pass@1 | Not reported | Not reported | +7.49% |
| GSM8K | Pass@1 | Not reported | 94.01 | Not reported |
| Average Performance | Relative Gain | Not reported | Not reported | +1.72% |
Main Takeaways
- Co-rewarding mitigates training collapse: Unlike baselines that plateau or degrade due to reward hacking, Co-rewarding maintains stable improvements.
- Cross-view supervision is effective: Both data-side analogy (Co-rewarding-I) and model-side teacher distillation (Co-rewarding-II) provide robust signals.
- Can surpass ground-truth supervision: On easier tasks such as GSM8K, self-generated signals allow better exploration than strict GT supervision.
- Generalization: Improvements in math reasoning transfer to code generation (CRUX) without degrading general instruction following capabilities (IFEval, MMLU-Pro).
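To make the cross-view idea concrete, here is a minimal sketch of a data-side cross-view reward in the spirit of Co-rewarding-I: completions sampled from one paraphrased view of a question are rewarded for agreeing with the majority-vote answer derived from the other view. The function names and the 0/1 reward scheme are illustrative assumptions, not the paper's exact implementation.

```python
from collections import Counter

def majority_answer(answers: list[str]) -> str:
    """Most frequent final answer among sampled completions."""
    return Counter(answers).most_common(1)[0][0]

def cross_view_rewards(answers_view_a: list[str],
                       answers_view_b: list[str]) -> list[float]:
    """Illustrative cross-view reward (assumed Co-rewarding-I style):
    the majority answer from paraphrased view B serves as a pseudo-label,
    and each view-A completion gets reward 1.0 if it matches, else 0.0.
    Symmetric rewards for view B would use view A's majority answer."""
    pseudo_label = majority_answer(answers_view_b)
    return [1.0 if a == pseudo_label else 0.0 for a in answers_view_a]

# View B's majority answer ("42") labels view A's three samples.
print(cross_view_rewards(["42", "41", "42"], ["42", "42", "13"]))
# → [1.0, 0.0, 1.0]
```

Because the pseudo-label comes from a different view of the same question, a policy cannot trivially reward-hack by collapsing to a single degenerate answer on one prompt, which is consistent with the stability behavior noted above.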