| Benchmark | Metric | Baseline (%) | This Paper (%) | Δ (points) |
|---|---|---|---|---|
| ReFT consistently outperforms SFT across different datasets and model architectures (Galactica, CodeLLAMA). | ||||
| GSM8K | Accuracy (N-CoT) | 43.59 | 53.30 | +9.71 |
| GSM8K | Accuracy (P-CoT) | 63.68 | 75.28 | +11.60 |
| SVAMP | Accuracy (P-CoT) | 75.40 | 79.19 | +3.79 |
| MathQA MCQ | Accuracy (N-CoT) | 56.01 | 60.13 | +4.12 |
| Inference-time strategies like Majority Voting and Reranking further boost ReFT's performance. | ||||
| GSM8K | Accuracy (P-CoT) | 77.0 | 81.2 | +4.2 |
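
The Δ column is the "This Paper" score minus the "Baseline" score, in accuracy points. As a minimal sketch, the rows above can be re-checked as follows; the `ROWS` list and the `check_deltas` helper are hypothetical names introduced here for illustration, and the numbers are copied from the table.

```python
# Rows copied from the table: (benchmark, metric, baseline, this_paper, reported_delta).
ROWS = [
    ("GSM8K", "Accuracy (N-CoT)", 43.59, 53.30, 9.71),
    ("GSM8K", "Accuracy (P-CoT)", 63.68, 75.28, 11.60),
    ("SVAMP", "Accuracy (P-CoT)", 75.40, 79.19, 3.79),
    ("MathQA MCQ", "Accuracy (N-CoT)", 56.01, 60.13, 4.12),
    ("GSM8K", "Accuracy (P-CoT)", 77.0, 81.2, 4.2),
]


def check_deltas(rows, tol=0.005):
    """Return the rows whose reported delta differs from (this_paper - baseline)."""
    mismatches = []
    for benchmark, metric, baseline, ours, reported in rows:
        # Round to two decimals to absorb floating-point noise in the subtraction.
        recomputed = round(ours - baseline, 2)
        if abs(recomputed - reported) > tol:
            mismatches.append((benchmark, metric, reported, recomputed))
    return mismatches


if __name__ == "__main__":
    # An empty list means every reported delta matches the recomputed one.
    print(check_deltas(ROWS))
```

An empty result means each reported Δ agrees with the difference of the two score columns to within rounding.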