| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Main comparison: beta-GRPO outperforms standard GRPO and other baselines on average EM score. | ||||
| Average (7 datasets) | Exact Match (EM) | 48.62 | 50.55 | +1.93 |
| HotpotQA | Exact Match (EM) | 51.35 | 54.12 | +2.77 |
| Bamboogle | Exact Match (EM) | 46.12 | 49.80 | +3.68 |
| Efficiency analysis: reductions in sub-optimal search behaviors (over-search and under-search rates). | ||||
| Multi-hop Datasets | Over-search Rate | 21.10 | 19.89 | -1.21 |
| Multi-hop Datasets | Under-search Rate | 42.04 | 34.71 | -7.33 |
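
The Δ column appears to be the "This Paper" value minus the "Baseline" value, so a negative Δ on the search-rate rows corresponds to a reduction. A minimal sketch that recomputes Δ from the table's own numbers is shown below; the names `ROWS` and `delta` are hypothetical and introduced only for illustration, and the rounding to two decimals is an assumption based on how the table is printed.

```python
# Values copied from the table above:
# (benchmark, metric, baseline, this_paper, reported_delta)
ROWS = [
    ("Average (7 datasets)", "Exact Match (EM)", 48.62, 50.55, 1.93),
    ("HotpotQA", "Exact Match (EM)", 51.35, 54.12, 2.77),
    ("Bamboogle", "Exact Match (EM)", 46.12, 49.80, 3.68),
    ("Multi-hop Datasets", "Over-search Rate", 21.10, 19.89, -1.21),
    ("Multi-hop Datasets", "Under-search Rate", 42.04, 34.71, -7.33),
]


def delta(baseline: float, this_paper: float) -> float:
    """Absolute change (this_paper - baseline), rounded to two decimals.

    Assumption: the table's delta column is a plain difference.
    """
    return round(this_paper - baseline, 2)


# Recompute each delta and compare it with the value reported in the table.
for benchmark, metric, baseline, this_paper, reported in ROWS:
    computed = delta(baseline, this_paper)
    status = "matches" if computed == reported else "differs from"
    print(f"{benchmark} / {metric}: {computed:+.2f} ({status} reported {reported:+.2f})")
```

Under this reading, positive Δ is an improvement for Exact Match, while negative Δ is the improvement for the over-search and under-search rates.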