Multi-SpatialMLLM consistently outperforms base InternVL models and proprietary SOTA models across diverse spatial tasks in the MultiSPA benchmark.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| MultiSPA | Average Accuracy | 28.87 | 56.11 | +27.24 |
| MultiSPA (Camera Translation) | Vector Accuracy | 0.00 | 18.00 | +18.00 |
| MultiSPA (Visual Correspondence) | Coordinate Accuracy | 1.67 | 49.00 | +47.33 |
| MultiSPA (Object Movement) | Vector Accuracy | 5.25 | 12.92 | +7.67 |
| BLINK | Visual Correspondence Accuracy | 39.0 | 89.5 | +50.5 |
| MultiSPA (Camera Vector) | Accuracy | 9.30 | 18.00 | +8.70 |
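
The Δ column appears to be the absolute difference (This Paper minus Baseline) in the same units as the scores, not a relative improvement. The sketch below recomputes it from the rows of the table above under that assumption. The names `ROWS` and `recompute_deltas` are illustrative and do not come from the paper.

```python
# Rows copied from the table above:
# (benchmark, metric, baseline, this_paper, reported_delta)
ROWS = [
    ("MultiSPA", "Average Accuracy", 28.87, 56.11, 27.24),
    ("MultiSPA (Camera Translation)", "Vector Accuracy", 0.00, 18.00, 18.00),
    ("MultiSPA (Visual Correspondence)", "Coordinate Accuracy", 1.67, 49.00, 47.33),
    ("MultiSPA (Object Movement)", "Vector Accuracy", 5.25, 12.92, 7.67),
    ("BLINK", "Visual Correspondence Accuracy", 39.0, 89.5, 50.5),
    ("MultiSPA (Camera Vector)", "Accuracy", 9.30, 18.00, 8.70),
]


def recompute_deltas(rows):
    """Return (benchmark, recomputed_delta, matches_reported) for each row.

    Assumes the delta is the absolute difference: this_paper - baseline.
    """
    results = []
    for benchmark, _metric, baseline, this_paper, reported in rows:
        # Round to two decimals to avoid floating-point noise in the comparison.
        delta = round(this_paper - baseline, 2)
        matches = abs(delta - reported) < 1e-9
        results.append((benchmark, delta, matches))
    return results


if __name__ == "__main__":
    for benchmark, delta, matches in recompute_deltas(ROWS):
        status = "ok" if matches else "MISMATCH"
        print(f"{benchmark}: {delta:+.2f} ({status})")
```

Under this reading, every reported Δ matches the recomputed value, so the gains are absolute differences between the two score columns.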