Evaluation Setup
Greedy decoding with a maximum generation length of 32,768 tokens
Benchmarks:
- MATH-500 (high school competition math, five difficulty levels)
- AIME 2024 (competition math, complex problem solving)
- GPQA (PhD-level science questions)
Metrics:
- Accuracy (ACC)
- Compression Ratio (CR)
- Average Token Length (LEN)
- Statistical methodology: Not explicitly reported in the paper
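The summary does not define the Compression Ratio. A common formulation, assumed here for illustration, is the relative reduction in average output tokens versus the baseline, expressed as a percentage (so the baseline itself scores 0.0):

```python
def compression_ratio(baseline_len: float, new_len: float) -> float:
    """Relative token-length reduction vs. the baseline, in percent.

    Assumed definition: CR = (1 - new / baseline) * 100, so the
    baseline scores 0.0 and shorter outputs score higher.
    """
    return (1.0 - new_len / baseline_len) * 100.0

# Example: an average response shortened from 1000 to 606 tokens
print(round(compression_ratio(1000, 606), 1))  # 39.4
```

Under this definition, the reported CR of 39.4 on MATH-500 would correspond to outputs roughly 40% shorter than the baseline's on average.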
Key Results
Performance on DeepSeek-R1-Distill-Qwen-7B, showing DAST improves accuracy on hard tasks while compressing output:

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| AIME 2024 | Accuracy | 60.0 | 70.0 | +10.0 |
| MATH-500 | Accuracy | 82.8 | 83.6 | +0.8 |
| MATH-500 | Compression Ratio (CR) | 0.0 | 39.4 | +39.4 |

Performance on DeepSeek-R1-Distill-Qwen-32B, showing DAST maintains high accuracy while achieving massive compression:

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| MATH-500 | Accuracy | 96.0 | 96.0 | 0.0 |
| MATH-500 | Compression Ratio (CR) | 0.0 | 47.9 | +47.9 |
| AIME 2024 | Accuracy | 46.7 | 60.0 | +13.3 |
Main Takeaways
- DAST effectively navigates the trade-off between conciseness and performance, often outperforming 'shortest-is-better' baselines on hard tasks (AIME 2024).
- The method demonstrates true difficulty adaptation: it compresses simple MATH Level 1 problems aggressively (-58.5% length) while preserving length for Level 5 problems.
- Ablation studies show that combining Dual-Correct Pairs (DCP) for conciseness and Dual-Incorrect Pairs (DICP) for deep thinking is essential; removing DICP hurts accuracy, while removing DCP hurts compression.
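The DCP/DICP ablation can be made concrete with a sketch of how the two pair types might be built from sampled responses. The selection rules below (shortest correct preferred for conciseness, longest incorrect preferred for deeper thinking) and all names are illustrative assumptions, not the paper's exact procedure:

```python
from typing import NamedTuple

class Response(NamedTuple):
    text: str
    correct: bool
    length: int  # token count

def build_preference_pairs(responses: list[Response]) -> list[tuple[Response, Response]]:
    """Sketch of Dual-Correct / Dual-Incorrect pair construction (assumed rules).

    DCP:  among correct responses, prefer the shorter one (conciseness).
    DICP: among incorrect responses, prefer the longer one (deep thinking).
    Returns (chosen, rejected) tuples usable for preference optimization.
    """
    correct = sorted((r for r in responses if r.correct), key=lambda r: r.length)
    incorrect = sorted((r for r in responses if not r.correct), key=lambda r: r.length)
    pairs = []
    if len(correct) >= 2:
        pairs.append((correct[0], correct[-1]))    # DCP: shortest over longest correct
    if len(incorrect) >= 2:
        pairs.append((incorrect[-1], incorrect[0]))  # DICP: longest over shortest incorrect
    return pairs
```

Removing the DICP branch would leave only conciseness pressure (hurting accuracy on hard problems), while removing the DCP branch would leave only length-encouraging pairs (hurting compression), matching the ablation's findings.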