| Benchmark | Metric | Baseline | This Paper | Δ (This Paper − Baseline) |
|---|---|---|---|---|
| Main comparison on Qwen3-VL-4B-Instruct: accuracy versus efficiency trade-off relative to the Explicit CoT baseline. | ||||
| GSM8k-Aug | Pass@1 | 79.3 | 55.4 | -23.9 |
| GSM8k-Aug | # L (Tokens) | 108.4 | 32.0 | -76.4 |
| MultiArith | Pass@1 | 95.5 | 93.4 | -2.1 |
| Comparison against other Latent Reasoning baselines on Qwen3-4B-Instruct. | ||||
| GSM8k-Aug | Pass@1 | 57.3 | 55.4 | -1.9 |
| Average (4 datasets) | Pass@1 | 47.3 | 55.4 | +8.1 |
| Inference speed analysis on GSM-Hard. | ||||
| GSM-Hard | Inference Time (s) | 8.55 | 1.84 | -6.71 |
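
The Δ column can be reproduced from the Baseline and This Paper columns, and the same numbers give relative changes that are easier to compare across metrics. The sketch below is illustrative only: the helper names `delta`, `relative_change`, and `speedup` are hypothetical and not from the paper, and the values are copied from the table rows above.

```python
# Rows copied from the table: (benchmark, metric, baseline, this_paper).
ROWS = [
    ("GSM8k-Aug", "Pass@1", 79.3, 55.4),
    ("GSM8k-Aug", "# L (Tokens)", 108.4, 32.0),
    ("MultiArith", "Pass@1", 95.5, 93.4),
    ("GSM-Hard", "Inference Time (s)", 8.55, 1.84),
]


def delta(baseline: float, ours: float) -> float:
    """Absolute change, matching the table's Δ column (ours minus baseline)."""
    return round(ours - baseline, 2)


def relative_change(baseline: float, ours: float) -> float:
    """Percentage change relative to the baseline, rounded to one decimal."""
    return round(100.0 * (ours - baseline) / baseline, 1)


def speedup(baseline_seconds: float, ours_seconds: float) -> float:
    """How many times faster 'ours' is than the baseline, given wall-clock times."""
    return round(baseline_seconds / ours_seconds, 2)


if __name__ == "__main__":
    for benchmark, metric, base, ours in ROWS:
        print(f"{benchmark:12s} {metric:20s} "
              f"delta={delta(base, ours):+.2f} "
              f"relative={relative_change(base, ours):+.1f}%")
    print("GSM-Hard speedup:", speedup(8.55, 1.84))
```

Read this way, the GSM8k-Aug row trades a 23.9-point drop in Pass@1 for roughly a 70% reduction in tokens, and the GSM-Hard inference time corresponds to a speedup of about 4.65×.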