| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Self-refinement (no feedback) yields little or no improvement, and sometimes a regression, for most models. | ||||
| RefineBench | Delta (Turn 5 - Turn 1) | 31.3 | 33.1 | +1.8 |
| RefineBench | Delta (Turn 5 - Turn 1) | Not reported in the paper | Not reported in the paper | -0.1 |
| RefineBench | Acc_t (Turn 1) | 29.1 | 29.1 | 0.0 |
| Guided refinement (with feedback) yields large improvements, indicating that models can refine when told what to fix. | ||||
| RefineBench | Pass_t (Turn 5) | 18.7 | 98.4 | +79.7 |
| RefineBench | Pass_t (Turn 5) | 1.4 | 30.1 | +28.7 |