Evaluation Setup
Task: pairwise preference prediction across general reward benchmarks and downstream tasks.
Benchmarks:
- RewardBench (General Reward Modeling)
- RewardBench-v2 (General Reward Modeling)
- RMB (General Reward Modeling)
- RM-Bench (General Reward Modeling)
- PPE (General Reward Modeling)
Metrics:
- Pairwise Accuracy
- Win Rate (for Offline RL/DPO)
- Best-of-N Accuracy (for Test-time Scaling)
- Statistical methodology: Not explicitly reported in the paper
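
The two core metrics above can be sketched in a few lines. This is a minimal illustration, assuming a reward model that assigns a scalar score to each (prompt, response) pair; the function names and data layout are illustrative, not from the paper.

```python
# Illustrative sketch of the evaluation metrics (not the paper's code).

def pairwise_accuracy(chosen_scores, rejected_scores):
    """Fraction of preference pairs where the human-preferred
    response receives the higher reward score."""
    correct = sum(c > r for c, r in zip(chosen_scores, rejected_scores))
    return correct / len(chosen_scores)

def best_of_n_accuracy(candidate_scores, candidate_correct):
    """For each prompt, select the highest-scoring of N candidates
    and check whether that candidate is actually correct."""
    hits = 0
    for scores, correct in zip(candidate_scores, candidate_correct):
        best_idx = max(range(len(scores)), key=lambda i: scores[i])
        hits += int(correct[best_idx])
    return hits / len(candidate_scores)

# Toy example: 3 preference pairs, then 2 prompts with N=4 candidates each.
print(pairwise_accuracy([0.9, 0.4, 0.7], [0.2, 0.6, 0.1]))  # 2/3
print(best_of_n_accuracy(
    [[0.1, 0.8, 0.3, 0.5], [0.9, 0.2, 0.4, 0.7]],
    [[0, 1, 0, 1], [0, 1, 1, 0]],
))  # 0.5
```

Best-of-N accuracy is the test-time-scaling measure: it rewards a verifier that can pick the one correct answer out of N samples, not just score answers in isolation.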
Key Results
Mix-GRM outperforms baselines on general reward benchmarks, with RLVR providing significant amplification.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Average of 5 Benchmarks | Average Score | 76.9 | 79.4 | +2.5 |
| Average of 5 Benchmarks | Average Score | 70.1 | 75.1 | +5.0 |
| Average of 5 Benchmarks | Average Score | 65.2 | 75.1 | +9.9 |

Downstream utility experiments show Mix-GRM excels as a verifier and supervisor.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| MATH | Best-of-N Accuracy (N=10) | 37.7 | 43.2 | +5.5 |
| Instruction Following | Win Rate | 12.0 | 12.1 | +0.1 |
| GSM8K | Accuracy | 75.1 | 77.6 | +2.5 |
Main Takeaways
- Reasoning mechanisms must align with task type: Breadth-CoT excels at subjective preference but harms objective correctness; Depth-CoT does the reverse.
- RLVR acts as a "switching amplifier": the model spontaneously polarizes its reasoning style (Breadth vs. Depth) to match the demands of the task at hand.
- Optimizing the structure of thought (Breadth/Depth) is more data-efficient than brute-force scaling of CoT length or dataset size.