| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Standard benchmarks (HumanEval/MBPP): Pass@1 gains of +4.9 to +7.3 points over the baseline. | | | | |
| HumanEval(+) | Pass@1 | 72.0 | 78.7 | +6.7 |
| MBPP(+) | Pass@1 | 63.0 | 67.9 | +4.9 |
| HumanEval(+) | Pass@1 | 73.2 | 80.5 | +7.3 |
| Harder, competition-level benchmark (LiveCodeBench): the gain holds on complex reasoning tasks. | | | | |
| LiveCodeBench (Hard) | Pass@1 | 11.9 | 17.0 | +5.1 |
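
The Δ column appears to be the simple difference between the "This Paper" and "Baseline" scores, in percentage points. A minimal sketch of that arithmetic, assuming this reading of Δ, is given below. The `rows` list is copied from the table, and the helper name `check_deltas` and its tolerance are illustrative rather than taken from the paper.

```python
# Rows copied from the table: (benchmark, baseline, this_paper, reported_delta).
rows = [
    ("HumanEval(+)", 72.0, 78.7, 6.7),
    ("MBPP(+)", 63.0, 67.9, 4.9),
    ("HumanEval(+)", 73.2, 80.5, 7.3),
    ("LiveCodeBench (Hard)", 11.9, 17.0, 5.1),
]


def check_deltas(rows, tol=0.05):
    """Return (name, computed, reported) for rows where the reported delta
    differs from this_paper - baseline by more than tol (hypothetical helper)."""
    mismatches = []
    for name, baseline, ours, reported in rows:
        # Round to one decimal place to match the table's precision.
        computed = round(ours - baseline, 1)
        if abs(computed - reported) > tol:
            mismatches.append((name, computed, reported))
    return mismatches


mismatches = check_deltas(rows)
print("mismatched rows:", mismatches)
```

An empty list means every reported Δ agrees with the recomputed difference at the table's one-decimal precision.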