Evaluation Setup
Zero-shot and few-shot evaluation on visual reasoning benchmarks
Benchmarks:
- ScienceQA (Multimodal science question answering)
- MathVista (Visual mathematical reasoning)
Metrics:
- Accuracy
- Statistical methodology: Not explicitly reported in the paper
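As an illustration of the accuracy metric listed above, here is a minimal sketch that assumes exact-match scoring against a single reference answer per question. The function name `accuracy` and the sample answers are hypothetical; the paper's actual answer-extraction and matching rules are not described in this summary.

```python
def accuracy(predictions, references):
    """Return the percentage of predictions that exactly match the references."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot compute accuracy on an empty set")
    # Count exact matches; a real benchmark may normalize answers first.
    correct = sum(p == r for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)


# Hypothetical multiple-choice answers, for illustration only.
preds = ["A", "B", "C", "D"]
golds = ["A", "B", "C", "A"]
print(f"{accuracy(preds, golds):.2f}")
```

Reporting the score as a percentage matches the scale used in the results table.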
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| ScienceQA | Accuracy | 81.22 | 85.33 | +4.11 |
| MathVista | Accuracy | 45.7 | 51.6 | +5.9 |
| MathVista | Accuracy | 49.9 | 51.6 | +1.7 |
| MathVista | Accuracy | 51.5 | 51.6 | +0.1 |
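The Δ column is the reported score minus the baseline score, in accuracy points. The short sketch below recomputes it from the values in the Key Results table as a consistency check. The names `ROWS` and `deltas` are illustrative, and the three MathVista rows are kept separate because this summary does not name their baselines.

```python
# (benchmark, baseline accuracy, this paper's accuracy), copied from the table.
ROWS = [
    ("ScienceQA", 81.22, 85.33),
    ("MathVista", 45.7, 51.6),
    ("MathVista", 49.9, 51.6),
    ("MathVista", 51.5, 51.6),
]


def deltas(rows):
    """Return (benchmark, improvement in accuracy points) for each row."""
    # Round to two decimals to hide binary floating-point noise.
    return [(name, round(ours - base, 2)) for name, base, ours in rows]


for name, delta in deltas(ROWS):
    print(f"{name}: {delta:+.2f}")
```

The gains on MathVista range from 0.1 to 5.9 points depending on the baseline, so the size of the improvement is best read against the specific comparison rather than as a single figure.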
Main Takeaways
- Integrating visual context into the planning stage reduces decision hallucinations.
- Using MLLMs as high-level experts (e.g., comparing quantities) is more effective than low-level tools (e.g., detecting bounding boxes) for reasoning tasks.
- The framework generalizes across different backend models (Gemini, GPT-3.5) and benchmarks (ScienceQA, MathVista).