| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Main results comparing SPELL-trained models against their base and instruction-tuned counterparts across various sizes. | ||||
| Average (6 benchmarks, 16K context) | Average Score | 32.0 | 45.9 | +13.9 |
| Average (6 benchmarks, 16K context) | Average Score | 46.2 | 55.2 | +9.0 |
| Average (6 benchmarks, 16K context) | Average Score | 35.5 | 49.9 | +14.4 |
| Comparison with a strong RLVR baseline (trained on static synthetic data from DeepSeek-R1). | ||||
| Average (6 benchmarks, 16K context) | Average Score | 61.5 | 63.5 | +2.0 |
| Test-time scaling results using the pass@k metric. | ||||
| Average (6 benchmarks, 100K context) | pass@8 | 66.9 | 74.5 | +7.6 |
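
The pass@8 row reports the fraction of problems solved by at least one of several sampled answers. The table does not say how the paper computes it, so the sketch below is only an illustration: it uses the widely used unbiased estimator, 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them are correct. The function names `pass_at_k` and `mean_pass_at_k` and the `(n, c)` input format are hypothetical, not taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of them correct.

    Assumption: this is the standard unbiased estimator, which may differ
    from the exact procedure behind the table above.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k incorrect samples: every size-k subset has a correct one.
    if n - c < k:
        return 1.0
    # 1 minus the probability that a random size-k subset is all incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; results holds (n, c) per problem."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With k equal to n, the estimate for a problem reduces to 1 if any sample is correct and 0 otherwise, so the benchmark score is simply the share of problems with at least one correct sample.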