Evaluation Setup
The evaluation covers multimodal reasoning tasks that require visual perception, manipulation, and logic.
Benchmarks:
- Visual Spatial Planning (VSP) (Multi-step planning and perceptual grounding)
- Jigsaw (Visual compositionality/puzzle solving)
- GUIQA (WebMMU) (GUI understanding and agent acting)
- Visual Search (Perceptual search)
Metrics:
- Accuracy (success rate of final answer)
- Statistical methodology: Not explicitly reported in the paper
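As a minimal sketch of the accuracy metric above, the success rate can be computed as the share of final answers that match the reference. The function name `accuracy` and the case-insensitive exact-match rule are assumptions for illustration; the paper's actual answer-matching procedure may differ per benchmark.

```python
def accuracy(predictions, references):
    """Percentage of final answers that match the reference answer.

    Assumption: answers are compared by case-insensitive exact match
    after stripping whitespace. This is an illustrative rule, not
    necessarily the one used in the paper.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return 100.0 * correct / len(references)


# Toy data, not taken from the paper.
score = accuracy(["A", "b ", "c", "d"], ["a", "B", "x", "d"])
```

Because no statistical methodology is reported, this sketch returns a point estimate only, with no confidence interval.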
Key Results
Main results demonstrating AdaReasoner's performance against base models and proprietary SOTA.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| VSP | Accuracy | 31.64 | 97.64 | +66.00 |
| VSP | Accuracy | 80.10 | 96.60 | +16.50 |
| Jigsaw | Accuracy | 51.10 | 54.67 | +3.57 |
| VSP (Navigation) | Accuracy | 44.83 | 96.33 | +51.50 |
| Unseen Tasks (Average) | Accuracy | 46.50 | 75.81 | +29.31 |
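The Δ column is the "This Paper" accuracy minus the "Baseline" accuracy. A short sketch that recomputes it from the reported values follows. The variable and function names are illustrative, and the numbers are copied from the table.

```python
# (benchmark, baseline, this_paper, reported_delta), copied from the table.
ROWS = [
    ("VSP", 31.64, 97.64, 66.00),
    ("VSP", 80.10, 96.60, 16.50),
    ("Jigsaw", 51.10, 54.67, 3.57),
    ("VSP (Navigation)", 44.83, 96.33, 51.50),
    ("Unseen Tasks (Average)", 46.50, 75.81, 29.31),
]


def delta(baseline, this_paper):
    """Absolute accuracy gain, rounded to two decimals."""
    return round(this_paper - baseline, 2)


# Collect rows whose recomputed gain disagrees with the reported one.
mismatches = [
    (name, delta(base, ours), reported)
    for name, base, ours, reported in ROWS
    if abs(delta(base, ours) - reported) > 1e-9
]
```

All five reported gains agree with the recomputed differences, so `mismatches` is empty.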
Main Takeaways
- Visual tools shift the bottleneck from model scale to tool quality: 3B and 7B models achieve similar near-perfect accuracy on VSP when equipped with tools.
- The model exhibits self-adaptive behaviors: through RL signals it learns to adopt beneficial tools (such as A*) and to drop unhelpful tool uses (such as calling A* for verification).
- Generalization to unseen tools and tasks improves substantially under the 'Adaptive Learning' strategy, which randomizes tool names and descriptions during training to prevent overfitting to specific API signatures.
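The name and description randomization in the last takeaway can be illustrated with a rough sketch. Everything here is hypothetical: the tool names (`astar`, `point`), the paraphrase pool, the `tool_xxxxxx` alias format, and the function `randomize_tool_specs` are assumptions made for illustration, not the paper's implementation.

```python
import random
import string


def randomize_tool_specs(tools, paraphrases, rng):
    """Give each tool a random alias and a sampled description.

    tools:       list of {"name": str, "description": str} dicts
    paraphrases: dict mapping an original tool name to alternative descriptions
    rng:         random.Random instance, passed in for reproducibility
    Returns (renamed_tools, name_map), where name_map maps original -> alias.
    """
    renamed, name_map = [], {}
    for tool in tools:
        # Draw a fresh alias, retrying on the unlikely event of a collision.
        alias = "tool_" + "".join(rng.choices(string.ascii_lowercase, k=6))
        while alias in name_map.values():
            alias = "tool_" + "".join(rng.choices(string.ascii_lowercase, k=6))
        # Fall back to the original description if no paraphrase is available.
        options = paraphrases.get(tool["name"], [tool["description"]])
        renamed.append({"name": alias, "description": rng.choice(options)})
        name_map[tool["name"]] = alias
    return renamed, name_map


# Hypothetical tool set for illustration only.
TOOLS = [
    {"name": "astar", "description": "Find the shortest path on a grid."},
    {"name": "point", "description": "Locate an object in the image."},
]
PARAPHRASES = {
    "astar": ["Find the shortest path on a grid.", "Plan a route between two cells."],
}

renamed_tools, name_map = randomize_tool_specs(TOOLS, PARAPHRASES, random.Random(0))
```

Resampling aliases for each training episode means the model cannot rely on a fixed tool identifier and has to read the description to decide which tool to call. That is the intuition behind the reported robustness to unseen API signatures.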