Results on MATH-500 show that spurious rewards yield large gains for Qwen2.5-Math-7B, nearly matching ground-truth rewards, but fail for Llama3.1.

| Model | Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|---|
| Qwen2.5-Math-7B | MATH-500 | Accuracy | 49.4 | 78.5 | +29.1 |
| Qwen2.5-Math-7B | MATH-500 | Accuracy | 49.4 | 70.8 | +21.4 |
| Qwen2.5-Math-7B | MATH-500 | Accuracy | 49.4 | 73.5 | +24.1 |
| Llama3.1 | MATH-500 | Accuracy | 36.8 | 30.4 | -6.4 |
| Llama3.1 | MATH-500 | Accuracy | 36.8 | 44.0 | +7.2 |

Prompting interventions show that forcing "code reasoning" improves Qwen models (which already have the latent skill) but hurts others.

| Model | Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|---|
| Qwen2.5-Math-7B | MATH-500 | Accuracy | 49.4 | 64.4 | +15.0 |
| Llama3.1 | MATH-500 | Accuracy | 36.8 | 15.2 | -21.6 |
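The Δ column is simply the trained (or intervened) accuracy minus the baseline accuracy. A minimal Python sketch, using only the numbers reported above (the per-row reward/intervention labels are not given in the source, so rows carry only the model name), reproduces it:

```python
# Reported MATH-500 accuracies: (model, baseline, after training/intervention).
# Rows mirror the tables above; which spurious reward or prompt each row
# corresponds to is not specified in the source, so rows stay anonymous.
results = [
    ("Qwen2.5-Math-7B", 49.4, 78.5),
    ("Qwen2.5-Math-7B", 49.4, 70.8),
    ("Qwen2.5-Math-7B", 49.4, 73.5),
    ("Llama3.1",        36.8, 30.4),
    ("Llama3.1",        36.8, 44.0),
]

for model, base, after in results:
    delta = round(after - base, 1)  # Δ = trained accuracy - baseline accuracy
    print(f"{model}: {base} -> {after} (Δ {delta:+.1f})")
```

Rounding to one decimal place matches the precision of the reported accuracies and avoids floating-point noise in the differences.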