Evaluation Setup
Multimodal evaluation across perception and reasoning benchmarks
Benchmarks:
- CV-Bench (Visual Perception)
- BLINK (Visual Perception, Hard)
- MMVP (Visual Perception)
- MMStar (General Multimodal)
Metrics:
- Accuracy
- Statistical methodology: results are reported as the average of three runs
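The metric described above can be sketched in a few lines: score each run as the percentage of exactly matching answers, then average the per-run scores. This is a minimal illustration, not the paper's evaluation harness. The function names `accuracy` and `mean_over_runs`, the exact-match scoring rule, and the toy answers are all assumptions introduced here.

```python
from statistics import mean


def accuracy(predictions, answers):
    """Percentage of predictions that exactly match the reference answers."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot score an empty benchmark")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)


def mean_over_runs(run_accuracies):
    """Average accuracy across repeated runs of the same benchmark."""
    return mean(run_accuracies)


# Hypothetical toy data (not from the paper): three runs on four questions.
answers = ["A", "C", "B", "D"]
runs = [
    ["A", "C", "B", "A"],  # 3 of 4 correct
    ["A", "C", "D", "D"],  # 3 of 4 correct
    ["A", "C", "B", "D"],  # 4 of 4 correct
]

per_run = [accuracy(run, answers) for run in runs]
avg = mean_over_runs(per_run)
print(f"per-run accuracy: {per_run}, mean over runs: {avg:.2f}")
```

Averaging per-run accuracies rather than pooling all answers gives the same number when every run covers the same questions, and it keeps the per-run scores available for inspecting run-to-run variation.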
Key Results
ReVPT consistently outperforms the base instruct models, particularly on perception-heavy benchmarks like CV-Bench.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| CV-Bench | Accuracy improvement over base | 66.25 | 76.07 | +9.82 |
| CV-Bench | Accuracy improvement over base | 60.65 | 69.30 | +8.65 |
| BLINK (Relation subset) | Accuracy | 55.83 | 60.83 | +5.00 |
| MMVP | Accuracy | 63.33 | 70.33 | +7.00 |

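The Δ column is the absolute difference, in percentage points, between the ReVPT accuracy and the baseline accuracy. The sketch below recomputes it from the reported values as a consistency check. The relative gain it also computes is not reported in the source and is included only as an illustration; the helper names `absolute_gain` and `relative_gain` are hypothetical.

```python
# (benchmark, baseline accuracy, ReVPT accuracy, reported delta),
# copied from the Key Results table.
rows = [
    ("CV-Bench", 66.25, 76.07, 9.82),
    ("CV-Bench", 60.65, 69.30, 8.65),
    ("BLINK (Relation subset)", 55.83, 60.83, 5.00),
    ("MMVP", 63.33, 70.33, 7.00),
]


def absolute_gain(baseline, ours):
    """Improvement in percentage points, rounded to two decimals."""
    return round(ours - baseline, 2)


def relative_gain(baseline, ours):
    """Improvement as a percentage of the baseline accuracy (not in the source)."""
    return round(100.0 * (ours - baseline) / baseline, 2)


mismatches = []
for name, base, ours, reported in rows:
    gain = absolute_gain(base, ours)
    # Tolerance absorbs floating-point noise in the subtraction.
    if abs(gain - reported) > 0.005:
        mismatches.append((name, gain, reported))
    print(f"{name}: +{gain:.2f} points, {relative_gain(base, ours):.2f}% relative")

print("all reported deltas consistent:", not mismatches)
```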
Main Takeaways
- Reinforcement Learning (RL) significantly boosts visual tool usage compared to Supervised Fine-Tuning (SFT) or text-based RL alone.
- The 'Cold Start' phase is essential; without it, models struggle to learn tool syntax effectively.
- Object Detection is the most impactful tool in the suite; removing it causes significant performance drops.
- Perception improves drastically, but at some cost to general capabilities; including general data (TACO) during training mitigates this trade-off.