| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| General RLHF efficiency comparison using Llama-3-8B. | | | | |
| General RLHF (Internal) | Reward Score | 46.8 | 46.7 | -0.1 |
| General RLHF (Internal) | Score per Token | 0.0544 | 0.0561 | +0.0017 |
| Overfitting analysis on mathematical reasoning (Qwen2.5-Math-Base), trained on AIME-24 and tested on AIME-25. | | | | |
| AIME-25 (Test Set) | Pass@1 | 0.0 | 2.5 | +2.5 |
| AIME-25 (Test Set) | Pass@16 | 0.5 | 40.0 | +39.5 |
| Agentic tool-use comparison (Qwen 2.5 Base 7B). | | | | |
| Average of 4 Math Benchmarks | Average@32 | 21.85 | 24.10 | +2.25 |
| Average of 4 Math Benchmarks | Average@32 | 22.58 | 24.10 | +1.52 |
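
The Δ column reads as "This Paper" minus "Baseline" for each row. The following is a minimal sketch that checks that reading against the values in the table; the names `ROWS`, `delta`, and `check_rows` are illustrative and do not come from the source.

```python
# Rows copied from the table above:
# (benchmark, metric, baseline, this_paper, reported_delta)
ROWS = [
    ("General RLHF (Internal)", "Reward Score", 46.8, 46.7, -0.1),
    ("General RLHF (Internal)", "Score per Token", 0.0544, 0.0561, 0.0017),
    ("AIME-25 (Test Set)", "Pass@1", 0.0, 2.5, 2.5),
    ("AIME-25 (Test Set)", "Pass@16", 0.5, 40.0, 39.5),
    ("Average of 4 Math Benchmarks", "Average@32", 21.85, 24.10, 2.25),
    ("Average of 4 Math Benchmarks", "Average@32", 22.58, 24.10, 1.52),
]


def delta(baseline, this_paper):
    """Assumed definition of the Δ column: this_paper - baseline."""
    return this_paper - baseline


def check_rows(rows, tol=1e-9):
    """Return the rows whose reported Δ differs from this_paper - baseline."""
    mismatches = []
    for benchmark, metric, baseline, this_paper, reported in rows:
        # Compare with a tolerance because the values are decimal floats.
        if abs(delta(baseline, this_paper) - reported) > tol:
            mismatches.append((benchmark, metric))
    return mismatches


if __name__ == "__main__":
    bad = check_rows(ROWS)
    print("all deltas consistent" if not bad else f"mismatches: {bad}")
```

An empty result from `check_rows(ROWS)` means every reported Δ agrees with the subtraction under this assumed definition.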