**Main comparison.** CFT consistently outperforms SFT variants on the Qwen2.5-Math-7B base model across standard math benchmarks.

| Benchmark | Metric | SFT baseline | CFT (this paper) | Δ |
|---|---|---|---|---|
| Average (MATH, GSM8K, Minerva, AIME24, AMC23, OlympiadBench) | Accuracy (%) | 50.4 | 57.1 | +6.7 |
| MATH | Accuracy (%) | 73.9 | 80.6 | +6.7 |
| AIME 2024 | Accuracy (%) | 30.0 | 50.0 | +20.0 |

**Efficiency comparison.** CFT matches heavy-compute RL methods.

| Benchmark | Metric | RL baseline | CFT (this paper) | Δ |
|---|---|---|---|---|
| Average (5 math benchmarks) | Accuracy (%) | 60.4 | 60.4 | 0.0 |

**Data efficiency comparison.** CFT is competitive with official instruct models trained on millions of samples.

| Benchmark | Metric | Instruct baseline | CFT (this paper) | Δ |
|---|---|---|---|---|
| Average (all 9 STEM benchmarks) | Accuracy (%) | 47.7 | 48.1 | +0.4 |
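As a quick sanity check, the Δ column is simply the CFT score minus the baseline score, rounded to one decimal place. A minimal sketch recomputing it from the values in the tables above (the row labels are shortened here for brevity):

```python
# Recompute the Δ column from the baseline and CFT accuracies (%)
# copied from the tables above, rounding to one decimal place.
rows = [
    # (benchmark, baseline accuracy %, CFT accuracy %)
    ("Average (6 math benchmarks)", 50.4, 57.1),
    ("MATH", 73.9, 80.6),
    ("AIME 2024", 30.0, 50.0),
    ("Average (5 math benchmarks)", 60.4, 60.4),
    ("Average (9 STEM benchmarks)", 47.7, 48.1),
]
for name, baseline, cft in rows:
    delta = round(cft - baseline, 1)
    print(f"{name}: {baseline:.1f} -> {cft:.1f} (delta {delta:+.1f})")
```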