Evaluation Setup
Qwen3 base models are post-trained and then evaluated across 5 diverse domains (Science, Instruction Following, Writing, Medical, Chat)
Benchmarks:
- HealthBench (Medical QA)
- Arena-Hard-V2 (General Chat)
- IFEval (Instruction Following)
- ResearchQA (Science QA)
- GPQA-Diamond (Science QA)
Metrics:
- Accuracy
- Score (0-100 or 0-1 scale, depending on benchmark)
- Statistical methodology: Not explicitly reported in the paper
Key Results
Main results comparing post-trained Qwen3-14B against proprietary SOTA models:

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| HealthBench | Score | 67.2 | 69.3 | +2.1 |
| HealthBench | Score | 63.5 | 69.3 | +5.8 |
| IFEval | Score | 88.7 | 92.6 | +3.9 |
| Arena-Hard-V2 | Score | 5.2 | 74.4 | +69.2 |

Ablation comparing RubricHub rubrics vs. RaR rubrics using Qwen3-14B:

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| HealthBench | Score | 47.7 | 62.1 | +14.4 |
Main Takeaways
- RubricHub unlocks SOTA performance on specialized domains (Medical) even for smaller 14B models, beating GPT-5.
- The coarse-to-fine generation strategy prevents score saturation; evolved criteria remain challenging even for 200B+ models.
- Positive-only criteria weights consistently outperform negative penalties due to grader inaccuracy on negative constraints.
- Performance hierarchy is consistent: Base < RuFT < RuRL < RuFT+RuRL, validating the two-stage post-training pipeline.
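The positive-only weighting takeaway can be sketched as a small aggregation function. This is a minimal illustration, not the paper's implementation: the criterion names, weights, and binary judge verdicts below are hypothetical, and the paper's exact aggregation formula is not reproduced here.

```python
def rubric_score(verdicts, weights):
    """Aggregate binary judge verdicts into a 0-100 score.

    Positive-only weighting: a criterion the response satisfies earns its
    weight; a failed criterion simply earns nothing. No negative penalties
    are applied, since (per the takeaway above) graders are less accurate
    on negative constraints.
    """
    total = sum(weights.values())
    earned = sum(w for name, w in weights.items() if verdicts.get(name, False))
    return 100.0 * earned / total


# Hypothetical medical-domain criteria and judge verdicts for one response.
weights = {
    "cites_relevant_guideline": 2.0,
    "addresses_contraindications": 3.0,
    "plain_language_summary": 1.0,
}
verdicts = {
    "cites_relevant_guideline": True,
    "addresses_contraindications": True,
    "plain_language_summary": False,
}
print(round(rubric_score(verdicts, weights), 1))  # 5 of 6 weight earned → 83.3
```

A negative-penalty variant would subtract weight for failed criteria; under the paper's finding, dropping those penalties yields a more reliable training signal because only the (more accurately judged) positive criteria contribute.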