| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| APIGen function calling: accuracy improves by 8.7 to 18.5 points from the vanilla baseline (Step 1) to the proposed Reflect-Retry method (Step 2) after training. | | | | |
| APIGen | Accuracy | 67.9 | 86.4 | +18.5 |
| APIGen | Accuracy | 78.0 | 88.2 | +10.2 |
| APIGen | Accuracy | 63.7 | 72.4 | +8.7 |
| Countdown math equations: accuracy improves by 17.7 to 34.7 points, particularly for Qwen2.5 models. | | | | |
| Countdown | Accuracy | 23.6 | 58.3 | +34.7 |
| Countdown | Accuracy | 46.8 | 66.8 | +20.0 |
| Countdown | Accuracy | 58.4 | 76.1 | +17.7 |
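
The Δ column appears to be the absolute difference between the two accuracy columns (This Paper minus Baseline), not a relative change. A minimal sketch that rechecks the arithmetic, assuming that reading; the `rows` list and the `delta` helper are illustrative names, with values copied from the table:

```python
# Values copied from the table: (benchmark, baseline, this_paper, reported_delta).
rows = [
    ("APIGen", 67.9, 86.4, 18.5),
    ("APIGen", 78.0, 88.2, 10.2),
    ("APIGen", 63.7, 72.4, 8.7),
    ("Countdown", 23.6, 58.3, 34.7),
    ("Countdown", 46.8, 66.8, 20.0),
    ("Countdown", 58.4, 76.1, 17.7),
]


def delta(baseline: float, this_paper: float) -> float:
    """Absolute gain in accuracy points (assumed definition of the delta column).

    Rounded to one decimal place to match the table's precision and to
    absorb floating-point noise from the subtraction.
    """
    return round(this_paper - baseline, 1)


# Recompute each gain and compare it with the value reported in the table.
for benchmark, baseline, this_paper, reported in rows:
    computed = delta(baseline, this_paper)
    status = "ok" if computed == reported else "MISMATCH"
    print(f"{benchmark:<10} {baseline:>5.1f} -> {this_paper:>5.1f}  "
          f"delta={computed:+.1f}  ({status})")
```

Under this reading, every reported Δ matches the recomputed difference.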