Evaluation Setup
Zero-shot evaluation across diverse multimodal benchmarks covering reasoning, perception, VQA, and hallucination.
Benchmarks:
- MME (Reasoning subset) (Multimodal Reasoning)
- HR-Bench 4K (Fine-grained Perception)
- VStarBench (Fine-grained Perception)
- ScienceQA (General VQA)
- HallusionBench (Hallucination Evaluation)
Metrics:
- Accuracy
- Score (Standard benchmark metrics)
- CIDEr (for COCO Caption)
- Statistical methodology: Not explicitly reported in the paper
Key Results
Simple o3 demonstrates superior performance on reasoning and fine-grained perception benchmarks compared to the base model and proprietary baselines.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| MME (Reasoning) | Score | 138.6 | 188.6 | +50.0 |
| HR-Bench 4K | Accuracy | 69.1 | 76.5 | +7.4 |
| VStarBench | Accuracy | 59.2 | 72.1 | +12.9 |
| ScienceQA | Accuracy | 88.7 | 90.0 | +1.3 |
| MMVet | Score | 65.3 | 66.8 | +1.5 |
| MME | Score | 157.4 | 188.6 | +31.2 |
| VStarBench | Accuracy | 65.2 | 72.1 | +6.9 |
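The Δ column is the "This Paper" value minus the "Baseline" value, rounded to one decimal place. A minimal sketch that recomputes it from the reported numbers is shown here; the tuple layout and the `delta` helper are illustrative assumptions, not part of the paper.

```python
# Rows copied from the Key Results table:
# (benchmark, metric, baseline, this_paper, reported_delta)
ROWS = [
    ("MME (Reasoning)", "Score", 138.6, 188.6, 50.0),
    ("HR-Bench 4K", "Accuracy", 69.1, 76.5, 7.4),
    ("VStarBench", "Accuracy", 59.2, 72.1, 12.9),
    ("ScienceQA", "Accuracy", 88.7, 90.0, 1.3),
    ("MMVet", "Score", 65.3, 66.8, 1.5),
    ("MME", "Score", 157.4, 188.6, 31.2),
    ("VStarBench", "Accuracy", 65.2, 72.1, 6.9),
]


def delta(baseline: float, ours: float) -> float:
    """Absolute improvement, rounded to one decimal as in the table."""
    return round(ours - baseline, 1)


def check_rows(rows) -> bool:
    """Return True if every reported delta matches the recomputed one."""
    return all(delta(base, ours) == reported
               for _, _, base, ours, reported in rows)


if __name__ == "__main__":
    for name, metric, base, ours, reported in ROWS:
        print(f"{name:<16} {metric:<9} {base:>6} -> {ours:>6}  delta={delta(base, ours):+.1f}")
    print("all deltas consistent:", check_rows(ROWS))
```

The two rows that repeat a benchmark name (MME and VStarBench) share the same "This Paper" value but use different baselines, so they are kept as separate entries.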
Main Takeaways
- The 'reuse' tool, which re-inputs the original image, substantially boosts reasoning by introducing additional visual tokens, supporting the 'thinking with images' approach.
- The 'focus_area' tool (cropping) is essential for fine-grained perception tasks where target objects are small relative to the image.
- Including diverse training data (specifically MathV360K) greatly enhances logical reasoning capabilities, even if the specific math tasks are excluded.
- Simple o3 outperforms RL-based 'thinking with images' approaches (DeepEyes, Chain-of-Focus) on perception benchmarks without complex RL training.
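To make the two tools named in the takeaways concrete, here is a toy sketch of what cropping and re-inputting could look like. Everything in it is an assumption for illustration: the function signatures, the `(left, top, right, bottom)` box convention, the list-of-rows image representation, and the `context` list standing in for the model's input sequence are hypothetical and do not describe the paper's actual implementation.

```python
from typing import List, Tuple

# Toy grayscale image: a list of rows, each a list of pixel values (assumption).
Image = List[List[int]]
Box = Tuple[int, int, int, int]  # (left, top, right, bottom), right/bottom exclusive


def focus_area(image: Image, box: Box) -> Image:
    """Hypothetical cropping tool: return the sub-image inside `box`.

    The box is clamped to the image bounds; an empty region raises ValueError.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    left, top, right, bottom = box
    left, right = max(0, left), min(width, right)
    top, bottom = max(0, top), min(height, bottom)
    if left >= right or top >= bottom:
        raise ValueError("empty crop region")
    return [row[left:right] for row in image[top:bottom]]


def reuse(context: List[Image], original: Image) -> List[Image]:
    """Hypothetical reuse tool: append a copy of the original image to the context.

    This mimics re-inputting the image so that it appears again as extra visual input.
    """
    return context + [[row[:] for row in original]]


if __name__ == "__main__":
    # 4 rows x 6 columns; pixel value encodes its position as row * 10 + col.
    image = [[r * 10 + c for c in range(6)] for r in range(4)]
    crop = focus_area(image, (1, 1, 3, 3))
    print(crop)  # [[11, 12], [21, 22]]
    context = reuse([image], image)
    print(len(context))  # 2
```

In this sketch the crop isolates a small region so that it fills the whole returned image, which is the intuition behind using cropping when the target object is small relative to the full image.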