| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Performance comparison on Arithmetic datasets: C3oT matches Long CoT accuracy while cutting average length by more than half. | ||||
| GSM8K | Accuracy | 40.1 | 40.3 | +0.2 |
| GSM8K | Average Length | 125 | 53 | -72 |
| MathQA | Accuracy | 40.7 | 41.0 | +0.3 |
| MathQA | Average Length | 119 | 54 | -65 |
| Performance on Commonsense datasets. | ||||
| ECQA | Accuracy | 51.1 | 52.0 | +0.9 |
| ECQA | Average Length | 91 | 45 | -46 |
| Comparison against Implicit-CoT (baseline that removes CoT entirely). | ||||
| GSM8K | Accuracy | 31.7 | 40.3 | +8.6 |
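
The Δ column reports absolute differences, so the relative size of the length reduction is not visible at a glance. The following is a minimal sketch that derives it from the Average Length rows above. The variable and function names (`lengths`, `relative_reduction`, `reductions`) are illustrative and do not come from the paper, and the unit of "Average Length" is taken as given in the table.

```python
# (benchmark, baseline average length, this-paper average length),
# copied from the "Average Length" rows of the table above.
lengths = [
    ("GSM8K", 125, 53),
    ("MathQA", 119, 54),
    ("ECQA", 91, 45),
]


def relative_reduction(baseline, compressed):
    """Return the share of the baseline length that was removed, in percent."""
    return 100.0 * (baseline - compressed) / baseline


# Relative reduction per benchmark, rounded to one decimal place.
reductions = {
    name: round(relative_reduction(base, ours), 1)
    for name, base, ours in lengths
}

for name, pct in reductions.items():
    print(f"{name}: {pct}% shorter than baseline")
```

Under these numbers the reduction is roughly 58% on GSM8K, 55% on MathQA, and 51% on ECQA, while the accuracy differences stay below one point on all three benchmarks.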