| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Comparison of training strategies on LLaMA-33B shows that DMT achieves the best balance across all three metrics. | ||||
| GSM8K | Accuracy | 44.24 | 56.36 | +12.12 |
| MT-Bench | Score | 6.07 | 6.73 | +0.66 |
| HumanEval | Pass@1 | 18.9 | 25.00 | +6.10 |
| Scaling experiments show that data requirements differ across abilities. | ||||
| MT-Bench | Score | 4.5 | 6.5 | +2.0 |
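
The Δ column is the difference between the "This Paper" and "Baseline" columns. The following is a minimal sketch for re-checking that arithmetic, with the values copied from the rows above. The names `ROWS`, `delta`, and `check_rows` are illustrative assumptions and do not come from the paper.

```python
import math

# Rows copied from the table: (benchmark, metric, baseline, this_paper, reported_delta).
ROWS = [
    ("GSM8K", "Accuracy", 44.24, 56.36, 12.12),
    ("MT-Bench", "Score", 6.07, 6.73, 0.66),
    ("HumanEval", "Pass@1", 18.9, 25.00, 6.10),
    ("MT-Bench", "Score", 4.5, 6.5, 2.0),
]


def delta(baseline: float, this_paper: float) -> float:
    """Absolute improvement, rounded to the table's two-decimal precision."""
    return round(this_paper - baseline, 2)


def check_rows(rows) -> list:
    """Return the rows whose reported delta disagrees with the recomputed one."""
    mismatches = []
    for benchmark, metric, baseline, this_paper, reported in rows:
        # Compare with a small tolerance to sidestep floating-point noise.
        if not math.isclose(delta(baseline, this_paper), reported, abs_tol=1e-9):
            mismatches.append((benchmark, metric))
    return mismatches


if __name__ == "__main__":
    for benchmark, metric, baseline, this_paper, _ in ROWS:
        print(f"{benchmark} ({metric}): {delta(baseline, this_paper):+.2f}")
    print("mismatched rows:", check_rows(ROWS))
```

An empty mismatch list means every reported Δ agrees with the recomputed difference at two-decimal precision.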