Goldilocks consistently improves validation accuracy over the GRPO baseline across different model sizes and families, even after normalizing for compute resources.

| Benchmark | Metric | Baseline (GRPO) | Goldilocks (This Paper) | Δ |
|---|---|---|---|---|
| OpenMathReasoning | Accuracy | 0.640 | 0.685 | +0.045 |
| OpenMathReasoning | Accuracy | 0.551 | 0.598 | +0.047 |
| OpenMathReasoning | Accuracy | 0.781 | 0.798 | +0.017 |
| OpenMathReasoning | Accuracy | 0.760 | 0.783 | +0.023 |