| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| MathVision | Accuracy | 78.9 | 89.1 | +10.2 |
| Diagnostic experiments reveal the hierarchy of failure modes: reasoning (Trajectory) is strong, perception (VE) is weak. | | | | |
| Mini Benchmark (Internal) | Trajectory Accuracy | Not applicable | 90.0 | Not applicable |
| Mini Benchmark (Internal) | Visual Evidence (VE) Accuracy | Not applicable | 60.0 | Not applicable |
| Mini Benchmark (Error Subset) | VE Accuracy (Incorrect Answers) | Not applicable | 12.0 | Not applicable |
| Self-correction experiments show that single models cannot fix perception errors, even with strong hints. | | | | |
| Mini Benchmark (Error Subset) | Accuracy (2nd Round) | Not applicable | 30.0 | Not applicable |
| Mini Benchmark (Error Subset) | Accuracy (2nd Round) | 30.0 | 88.5 | +58.5 |
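
The last column can be checked with a short sketch. It assumes that the column is the difference between "This Paper" and "Baseline" in percentage points, which matches the two rows that report a baseline. The helper name `delta_pp` and the `rows` layout are illustrative and do not come from the paper.

```python
from typing import Optional

def delta_pp(baseline: Optional[float], ours: float) -> Optional[float]:
    """Difference in percentage points, or None when no baseline is reported."""
    if baseline is None:
        return None  # corresponds to "Not applicable" in the table
    # Round to one decimal to match the table's precision and avoid float noise.
    return round(ours - baseline, 1)

# (benchmark, metric, baseline, this paper), copied from the table above.
rows = [
    ("MathVision", "Accuracy", 78.9, 89.1),
    ("Mini Benchmark (Internal)", "Trajectory Accuracy", None, 90.0),
    ("Mini Benchmark (Error Subset)", "Accuracy (2nd Round)", 30.0, 88.5),
]

for benchmark, metric, baseline, ours in rows:
    d = delta_pp(baseline, ours)
    shown = "Not applicable" if d is None else f"{d:+.1f}"
    print(f"{benchmark} | {metric} | {shown}")
```

Rows without a baseline yield `None` rather than a number, so a missing comparison is not mistaken for a zero gain.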