**32B comparison: AM-Distill-Qwen-32B vs. DeepSeek-R1-Distill-Qwen-32B.** Improvements on all four benchmarks, with the largest gains on GPQA-Diamond.

| Benchmark | Metric | Baseline (%) | This Paper (%) | Δ (pp) |
|---|---|---|---|---|
| AIME 2024 | Accuracy (Pass@1) | 72.6 | 72.7 | +0.1 |
| MATH-500 | Accuracy | 94.3 | 96.2 | +1.9 |
| GPQA-Diamond | Accuracy | 62.1 | 64.3 | +2.2 |
| LiveCodeBench | Accuracy | 57.2 | 59.1 | +1.9 |
**72B comparison: AM-Distill-Qwen-72B vs. DeepSeek-R1-Distill-Llama-70B.** Larger gains, particularly on the AIME 2024 math competition.

| Benchmark | Metric | Baseline (%) | This Paper (%) | Δ (pp) |
|---|---|---|---|---|
| AIME 2024 | Accuracy (Pass@1) | 70.0 | 76.5 | +6.5 |
| MATH-500 | Accuracy | 94.5 | 97.0 | +2.5 |
| LiveCodeBench | Accuracy | 57.5 | 59.7 | +2.2 |
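The Δ column is simply the "This Paper" score minus the baseline, in percentage points. A minimal sanity check of those differences, with the values transcribed from the tables above (the dictionary layout here is illustrative, not from the paper):

```python
# Verify each Δ as (This Paper - Baseline), in percentage points.
# Scores are transcribed from the two comparison tables above.
results = {
    "32B": {  # AM-Distill-Qwen-32B vs. DeepSeek-R1-Distill-Qwen-32B
        "AIME 2024": (72.6, 72.7),
        "MATH-500": (94.3, 96.2),
        "GPQA-Diamond": (62.1, 64.3),
        "LiveCodeBench": (57.2, 59.1),
    },
    "72B": {  # AM-Distill-Qwen-72B vs. DeepSeek-R1-Distill-Llama-70B
        "AIME 2024": (70.0, 76.5),
        "MATH-500": (94.5, 97.0),
        "LiveCodeBench": (57.5, 59.7),
    },
}

for size, benchmarks in results.items():
    for name, (baseline, ours) in benchmarks.items():
        # Round to one decimal to suppress binary float noise (e.g. 1.9000000000000057).
        delta = round(ours - baseline, 1)
        print(f"{size} {name}: {delta:+.1f} pp")
```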