| Benchmark | Metric | Baseline (%) | This Paper (%) | Δ (points) |
|---|---|---|---|---|
| Zero-shot performance comparison using ChatGPT backbone. RCoT consistently improves over standard CoT and Double-Check baselines. | | | | |
| GSM8K | Accuracy | 79.0 | 82.0 | +3.0 |
| AQuA | Accuracy | 51.3 | 55.5 | +4.2 |
| Date | Accuracy | 66.7 | 71.7 | +5.0 |
| SVAMP | Accuracy | 76.7 | 79.6 | +2.9 |
| Comparison with Self-Consistency (SC) and Self-Refine. RCoT outperforms SC on GSM8K with fewer samples. | | | | |
| GSM8K | Accuracy | 81.6 | 82.0 | +0.4 |
| GSM8K | Accuracy | 80.7 | 82.0 | +1.3 |