Evaluation Setup
Monkey is evaluated on 18 datasets covering captioning, general VQA, text-centric VQA, and document VQA.
Benchmarks:
- MME (Perception and Cognition Benchmark)
- TextVQA (Scene Text VQA)
- DocVQA (Document VQA)
- VQAv2 (General VQA)
Metrics:
- Accuracy
- CIDEr (for captioning)
- Score (MME)
- Statistical methodology: Not explicitly reported in the paper
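As a rough illustration of the accuracy metric listed above, the following is a minimal sketch of exact-match accuracy reported as a percentage. The function name `exact_match_accuracy` and the lowercase/strip normalization are assumptions made for illustration; each benchmark defines its own official scoring rules, which are not reproduced here.

```python
def exact_match_accuracy(predictions, references):
    """Percentage of predictions equal to their reference after light normalization.

    Hypothetical helper: the normalization (strip + lowercase) is an assumption,
    not the official scoring rule of any benchmark listed above.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("need at least one example")
    hits = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return 100.0 * hits / len(references)


# Toy example with made-up answers: two of three predictions match.
score = exact_match_accuracy(["Stop", "blue", "7"], ["stop", "red", "7"])
print(f"{score:.1f}")
```

Official evaluation scripts should be used when comparing against published numbers, since answer normalization can shift scores noticeably.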
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| MME | Perception Score | 1487.5 | 1505.3 | +17.8 |
| DocVQA | Accuracy | 62.6 | 66.5 | +3.9 |
| ChartQA | Accuracy | 66.3 | 67.6 | +1.3 |
| DeepForm | Accuracy | 59.0 | 68.3 | +9.3 |
| TextVQA | Accuracy | 61.5 | 77.5 | +16.0 |
| VizWiz | Accuracy | 47.7 | 53.6 | +5.9 |

Monkey shows significant improvements in document-oriented VQA tasks due to its high-resolution processing capabilities.
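The Δ column is the paper's score minus the baseline score. A minimal sketch that recomputes it from the values reported above (the dictionary layout and the helper name `delta` are illustrative choices, not part of the paper):

```python
# (baseline, this paper) pairs copied from the results table above.
results = {
    "MME": (1487.5, 1505.3),
    "DocVQA": (62.6, 66.5),
    "ChartQA": (66.3, 67.6),
    "DeepForm": (59.0, 68.3),
    "TextVQA": (61.5, 77.5),
    "VizWiz": (47.7, 53.6),
}


def delta(baseline, ours):
    """Absolute improvement, rounded to one decimal to match the table."""
    return round(ours - baseline, 1)


deltas = {name: delta(b, o) for name, (b, o) in results.items()}
for name, d in deltas.items():
    print(f"{name}: {d:+.1f}")
```

Note that the MME row is a raw score while the other rows are accuracies, so the deltas are not directly comparable across those two kinds of rows.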
Main Takeaways
- High resolution is critical for text-centric and document tasks (DocVQA, TextVQA), yielding large gains over lower-resolution baselines.
- Monkey's combination of image patching and LoRA adapters enables high-resolution processing without the massive cost of retraining the vision encoder from scratch.
- Multi-level description generation improves model performance by providing richer training signals than standard short captions, especially when combined with high-resolution inputs.
- Ablation studies show that using LoRA with the patching strategy is more effective than simple interpolation, and multiple LoRAs can help spatial understanding.
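The patching idea in the takeaways above can be pictured as tiling a high-resolution image into fixed-size windows, each small enough for a standard vision encoder. The sketch below only computes the tile coordinates. The function name `split_into_patches`, the 448-pixel window, and the 1344x896 example size are assumptions chosen for illustration, and the LoRA adapters themselves are not modeled.

```python
def split_into_patches(width, height, window=448):
    """Return (left, top, right, bottom) boxes that tile a width x height image.

    Hypothetical helper: assumes the image size is an exact multiple of the
    window size, so no padding or resizing logic is needed.
    """
    if width % window or height % window:
        raise ValueError("image size must be a multiple of the window size")
    boxes = []
    for top in range(0, height, window):        # walk rows top to bottom
        for left in range(0, width, window):    # walk columns left to right
            boxes.append((left, top, left + window, top + window))
    return boxes


boxes = split_into_patches(1344, 896)
print(len(boxes))  # 6 patches: 3 columns x 2 rows
```

Each box could then be cropped and passed to the shared vision encoder, while simple interpolation would instead resize the whole image to a single encoder-sized input.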