| Benchmark | Metric | Baseline | This Paper | Δ (This Paper − Baseline) |
|---|---|---|---|---|
| Performance on a real-world software engineering benchmark (SWE-bench Verified). | | | | |
| SWE-bench Verified | Pass@1 | 24.6 | 41.0 | +16.4 |
| SWE-bench Verified | Pass@1 | 42.0 | 41.0 | -1.0 |
| SWE-bench Verified | Pass@1 | 35.2 | 41.0 | +5.8 |
| Generalization to out-of-domain reasoning tasks (Coding & Math). | | | | |
| HumanEval+ | Pass@1 | 73.2 | 79.5 | +6.3 |
| MATH | Accuracy | 63.2 | 66.4 | +3.2 |
| CRUXEval | Accuracy | 83.5 | 86.5 | +3.0 |
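
The Δ column appears to be the simple difference between the "This Paper" score and the baseline score. The short sketch below recomputes that difference for every row of the table as a consistency check. The names `ROWS`, `delta`, and `find_mismatches` are hypothetical helpers introduced only for this illustration, and rounding to one decimal place is an assumption based on the precision shown in the table.

```python
# Rows copied from the table above:
# (benchmark, metric, baseline, this_paper, reported_delta)
ROWS = [
    ("SWE-bench Verified", "Pass@1", 24.6, 41.0, 16.4),
    ("SWE-bench Verified", "Pass@1", 42.0, 41.0, -1.0),
    ("SWE-bench Verified", "Pass@1", 35.2, 41.0, 5.8),
    ("HumanEval+", "Pass@1", 73.2, 79.5, 6.3),
    ("MATH", "Accuracy", 63.2, 66.4, 3.2),
    ("CRUXEval", "Accuracy", 83.5, 86.5, 3.0),
]


def delta(baseline: float, this_paper: float) -> float:
    """Difference in score, rounded to the one-decimal precision of the table."""
    return round(this_paper - baseline, 1)


def find_mismatches(rows):
    """Return rows whose reported delta differs from the recomputed one."""
    return [
        (name, metric, reported, delta(baseline, ours))
        for name, metric, baseline, ours, reported in rows
        if delta(baseline, ours) != reported
    ]


mismatches = find_mismatches(ROWS)
print("all deltas consistent" if not mismatches else mismatches)
```

A negative value, as in the second SWE-bench Verified row, means that baseline scores higher than the reported system.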