Evaluation Setup
Code generation is evaluated for functional correctness via pass@1.
Benchmarks:
- HumanEval (function synthesis)
- MBPP (function synthesis)
Metrics:
- pass@1
- Statistical methodology: Not explicitly reported in the paper
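The paper reports pass@1 without spelling out the estimator. As a reference point, a minimal sketch of the standard unbiased pass@k estimator commonly used for HumanEval-style evaluation (which reduces to the empirical pass rate c/n at k = 1); whether this paper uses this exact estimator is an assumption:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate.

    n -- total samples generated per problem
    c -- samples that pass all unit tests
    k -- evaluation budget (k = 1 for pass@1)
    """
    if n - c < k:
        # Every size-k subset contains at least one passing sample.
        return 1.0
    # Probability that a random size-k subset contains >= 1 passing sample.
    return 1.0 - comb(n - c, k) / comb(n, k)

# At k = 1 this is just the empirical pass rate: 3 passes out of 10 -> ~0.3.
p1 = pass_at_k(10, 3, 1)
```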
Key Results
*Main results showing EGCA outperforms baselines on standard benchmarks.*

| Benchmark | Metric | Baseline | This Paper | Δ    |
|-----------|--------|----------|------------|------|
| HumanEval | pass@1 | 79.0     | 82.1       | +3.1 |
| MBPP      | pass@1 | 67.4     | 68.9       | +1.5 |
| HumanEval | pass@1 | 78.7     | 82.1       | +3.4 |
| HumanEval | pass@1 | 81.6     | 82.1       | +0.5 |

*Control experiments ruling out teacher distillation as the source of gains.*

| Benchmark | Metric | Baseline | This Paper | Δ    |
|-----------|--------|----------|------------|------|
| HumanEval | pass@1 | 70.7     | 78.9       | +8.2 |
Main Takeaways
- Precise credit assignment is the bottleneck in post-training, not just reward sparsity.
- Localizing the *first* semantic divergence is more effective than masking unexecuted code (StepCoder) or using uniform updates.
- The method is robust to the quality of the debugger model; even a weaker debugger (1.5B) provides sufficient signal for a stronger student (6.7B) to improve significantly.
- Gains saturate as debugger size increases, suggesting that once localization is reliable, further debugger capability yields diminishing returns.
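To make the localization idea above concrete, here is a hypothetical per-token update-weighting scheme. Everything in it (the function name, the weight values, and the choice to up-weight tokens from the first semantic divergence onward while keeping a small background weight elsewhere) is an illustrative assumption, not the paper's actual EGCA update rule:

```python
from typing import Optional
import numpy as np

def first_divergence_weights(num_tokens: int,
                             div_idx: Optional[int],
                             focus: float = 1.0,
                             background: float = 0.1) -> np.ndarray:
    """Hypothetical per-token credit weights for a policy-gradient update.

    Uniform updates spread credit equally over all tokens; this sketch
    instead concentrates it from the first semantically divergent token
    onward, leaving only a small background weight on earlier tokens.
    """
    w = np.full(num_tokens, background)
    if div_idx is not None:
        w[div_idx:] = focus  # tokens from the first divergence get full credit
    return w

uniform = np.ones(8) / 8                     # uniform baseline for comparison
localized = first_divergence_weights(8, 5)   # divergence localized at token 5
```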