| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Comparisons of the proposed LLaVA-CoT model (11B) using test-time scaling against larger open-source models (up to 90B) and closed-source proprietary models on reasoning-heavy benchmarks. | | | | |
| Average (6 benchmarks) | Average Score | 56.9 | 66.3 | +9.4 |
| Average (6 benchmarks) | Average Score | 62.3 | 66.3 | +4.0 |
| Average (6 benchmarks) | Average Score | 63.8 | 66.3 | +2.5 |
| Average (6 benchmarks) | Average Score | 63.6 | 66.3 | +2.7 |
| Ablation studies examining the impact of dataset quality and structured tags. | | | | |
| Average (6 benchmarks) | Average Score | 56.6 | 59.0 | +2.4 |
| Average (6 benchmarks) | Average Score | 60.9 | 62.4 | +1.5 |
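
The Δ column is the "This Paper" score minus the "Baseline" score, rounded to one decimal place. The following minimal sketch recomputes it from the values in the table as a consistency check. The names `rows`, `delta`, and `mismatches` are hypothetical and chosen only for illustration; the numbers are copied from the rows above, and nothing here comes from the paper's own code.

```python
# Each tuple is (baseline, this_paper, reported_delta), copied from the table.
# The variable and function names are illustrative assumptions.
rows = [
    (56.9, 66.3, 9.4),  # comparison rows
    (62.3, 66.3, 4.0),
    (63.8, 66.3, 2.5),
    (63.6, 66.3, 2.7),
    (56.6, 59.0, 2.4),  # ablation rows
    (60.9, 62.4, 1.5),
]


def delta(baseline: float, ours: float) -> float:
    """Return the improvement over the baseline, rounded to one decimal."""
    return round(ours - baseline, 1)


# Collect any row whose reported delta disagrees with the recomputed one.
mismatches = [(b, o, d) for b, o, d in rows if delta(b, o) != d]
print(mismatches)  # prints []
```

An empty list means every reported Δ agrees with the difference of the two score columns.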