Main results on RewardBench show significant gains in the Reasoning (Math/Code) category when merging with domain-specific models (a minimal merging sketch follows the table).

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| RewardBench (Math Subset) | Accuracy | 0.5847 | 0.6987 | +0.1140 |
| RewardBench (Math Subset) | Accuracy | 0.5847 | 0.7547 | +0.1700 |
| RewardBench (Code Subset) | Accuracy | 0.7259 | 0.7799 | +0.0540 |
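
The exact merge recipe is not specified in this table; as a minimal sketch, assume a simple linear weight-space interpolation between a general reward model and a domain-specific one (the names `merge_state_dicts`, `base_sd`, `domain_sd`, and the mixing weight `alpha` are illustrative, not the paper's API):

```python
import torch

def merge_state_dicts(base_sd, domain_sd, alpha=0.5):
    """Linearly interpolate two state dicts with matching keys and shapes.

    merged = (1 - alpha) * base + alpha * domain  (an assumed recipe; the
    paper's actual method may weight or select parameters differently).
    """
    merged = {}
    for name, base_param in base_sd.items():
        domain_param = domain_sd[name]
        if base_param.shape != domain_param.shape:
            raise ValueError(f"shape mismatch for parameter {name!r}")
        if not torch.is_floating_point(base_param):
            # Integer buffers (e.g. position ids) cannot be interpolated;
            # keep the base model's copy unchanged.
            merged[name] = base_param.clone()
        else:
            merged[name] = (1.0 - alpha) * base_param + alpha * domain_param
    return merged

# Hypothetical usage with two reward models of identical architecture:
# merged = merge_state_dicts(base_rm.state_dict(), math_rm.state_dict(), alpha=0.5)
# base_rm.load_state_dict(merged)
```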

Downstream task performance with Best-of-16 sampling on GSM8K shows that the improved reward accuracy translates into better selection among sampled generations (a Best-of-N sketch follows the table).

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| GSM8K (Best-of-16) | Accuracy | 0.490 | 0.540 | +0.050 |
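
Best-of-16 means drawing 16 candidate solutions from the generator and keeping the one the reward model scores highest. A minimal sketch, assuming a Hugging Face-style causal LM and a sequence-classification reward model with a scalar head (`best_of_n`, `generator`, and `reward_model` are illustrative names, not the paper's code):

```python
import torch

@torch.no_grad()
def best_of_n(prompt, tokenizer, generator, reward_model, n=16, max_new_tokens=256):
    """Sample n candidates for one prompt and return the highest-reward one."""
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = generator.generate(
        **inputs,
        do_sample=True,
        num_return_sequences=n,
        max_new_tokens=max_new_tokens,
    )
    # Decoded sequences include the prompt, so the reward model scores the
    # full (question, solution) pair, as is standard for Best-of-N reranking.
    candidates = tokenizer.batch_decode(outputs, skip_special_tokens=True)

    scores = []
    for cand in candidates:
        enc = tokenizer(cand, return_tensors="pt", truncation=True)
        scores.append(reward_model(**enc).logits.squeeze().item())  # scalar reward
    return candidates[max(range(n), key=lambda i: scores[i])]
```

A more accurate reward model shifts this argmax toward correct solutions, which is how the +0.050 accuracy gain on GSM8K arises without changing the generator itself.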

Generalizability results using a Mistral-based architecture confirm that the improvement transfers to a second backbone.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| RewardBench (Math Subset) | Accuracy | 0.4468 | 0.7568 | +0.3100 |