Evaluation Setup
Ten large vision-language models (LVLMs) were evaluated on detailed caption generation and fine-grained VQA using the CompreCap dataset (560 images derived from COCO panoptic segmentation).
Benchmarks:
- CompreCap Captioning (Detailed Image Captioning) [New]
- CompreQA-P / CompreQA-Cap (Fine-grained Visual Question Answering on Tiny Objects) [New]
Metrics:
- S_object (Object Coverage %)
- S_attribute (Attribute Score 0-5)
- S_relation (Relation Score 0-5)
- S_unified (Weighted Average 0-100)
- S-Cov (Pixel Coverage %)
- Statistical methodology: Reported mean and standard deviation across 3 evaluation runs.
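To make the metric list concrete, here is a minimal sketch of how a 0-100 weighted average could be formed from components on different scales, and how a mean and standard deviation over runs could be reported. The weights, the inclusion of S-Cov in the average, the rescaling of the 0-5 scores, and the use of the sample standard deviation are all assumptions for illustration; the summary above does not state them. The names `ASSUMED_WEIGHTS`, `SCALE_MAX`, `unified_score`, and `summarize_runs` are hypothetical.

```python
from statistics import mean, stdev

# Hypothetical equal weights: the source says only "weighted average"
# and does not give the actual values.
ASSUMED_WEIGHTS = {"object": 0.25, "attribute": 0.25, "relation": 0.25, "coverage": 0.25}

# Upper bound of each component's native scale, as given in the metric list.
SCALE_MAX = {"object": 100.0, "attribute": 5.0, "relation": 5.0, "coverage": 100.0}


def unified_score(components, weights=ASSUMED_WEIGHTS):
    """Rescale each component to 0-100, then take the weighted average."""
    total_weight = sum(weights.values())
    score = 0.0
    for name, weight in weights.items():
        normalized = 100.0 * components[name] / SCALE_MAX[name]
        score += weight * normalized
    return score / total_weight


def summarize_runs(scores):
    """Return (mean, sample standard deviation) over repeated evaluation runs."""
    return mean(scores), stdev(scores)


if __name__ == "__main__":
    example = {"object": 80.0, "attribute": 2.5, "relation": 2.5, "coverage": 60.0}
    print(unified_score(example))
    print(summarize_runs([60.0, 61.0, 62.0]))
```

Rescaling before averaging keeps a 0-5 score from being swamped by the percentage-valued components; a different weighting would change the absolute numbers but not the structure of the computation.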
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Performance of LVLMs on detailed captioning (Unified Score). LLaVA-Next-34B and GPT-4o lead, but humans still outperform all models. |
| CompreCap Captioning | S_unified | 62.99 | 60.05 | -2.94 |
| CompreCap Captioning | S_unified | 60.05 | 58.85 | -1.20 |
| CompreCap Captioning | S_unified | 50.32 | 58.48 | +8.16 |
| Evaluation of fine-grained perception (tiny objects < 5% pixels). InternVL excels here. |
| CompreQA-P (Presence) | Accuracy (%) | 35.28 | 91.67 | +56.39 |
| CompreQA-Cap (Caption Selection) | Accuracy (%) | 96.83 | 94.33 | -2.50 |
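The Δ column in the table above is consistent with "This Paper" minus "Baseline". A short sketch that recomputes it from the reported values follows; the tuple layout and the names `ROWS`, `delta`, and `mismatched_rows` are illustrative choices, not part of the source.

```python
# Rows copied from the results table: (benchmark, metric, baseline, this_paper, reported_delta).
ROWS = [
    ("CompreCap Captioning", "S_unified", 62.99, 60.05, -2.94),
    ("CompreCap Captioning", "S_unified", 60.05, 58.85, -1.20),
    ("CompreCap Captioning", "S_unified", 50.32, 58.48, 8.16),
    ("CompreQA-P (Presence)", "Accuracy (%)", 35.28, 91.67, 56.39),
    ("CompreQA-Cap (Caption Selection)", "Accuracy (%)", 96.83, 94.33, -2.50),
]


def delta(baseline, this_paper):
    """Difference 'This Paper' minus 'Baseline', rounded to two decimals."""
    return round(this_paper - baseline, 2)


def mismatched_rows(rows, tol=0.005):
    """Return the rows whose reported delta disagrees with the recomputed one."""
    return [row for row in rows if abs(delta(row[2], row[3]) - row[4]) > tol]


if __name__ == "__main__":
    for name, metric, base, ours, _ in ROWS:
        print(f"{name:35s} {metric:13s} {delta(base, ours):+.2f}")
    print("mismatches:", mismatched_rows(ROWS))
```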
Main Takeaways
- Caption length does not imply quality: MiniGPT4-v2 generates very long captions (350 words) but scores poorly (42.28) because of hallucinations and inaccuracies.
- Most LVLMs struggle with tiny objects (<5% of image pixels), often omitting them entirely from captions or failing the presence test (CompreQA-P).
- The unified metric (S_unified) aligns better with human judgment than traditional n-gram metrics or CLIPScore, which fail to capture structural details in long texts.
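The tiny-object criterion and the presence accuracy mentioned in the takeaways can be sketched as follows. This assumes each object comes with a binary segmentation mask stored as a list of rows, and that "tiny" means strictly below 5% of the image's pixels; the strictness of the comparison and the function names `pixel_fraction`, `is_tiny`, and `accuracy_percent` are assumptions, not the benchmark's actual code.

```python
def pixel_fraction(mask):
    """Fraction of image pixels covered by a binary mask (list of rows of 0/1)."""
    total = sum(len(row) for row in mask)
    covered = sum(1 for row in mask for pixel in row if pixel)
    return covered / total if total else 0.0


def is_tiny(mask, threshold=0.05):
    """True if the object occupies strictly less than `threshold` of the image."""
    return pixel_fraction(mask) < threshold


def accuracy_percent(predictions, labels):
    """Percentage of predictions that match the ground-truth labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("at least one example is required")
    correct = sum(1 for p, t in zip(predictions, labels) if p == t)
    return 100.0 * correct / len(labels)


if __name__ == "__main__":
    # 10x10 image with a 2x2 object: 4% of the pixels, so it counts as tiny.
    mask = [[0] * 10 for _ in range(10)]
    for r in range(2):
        for c in range(2):
            mask[r][c] = 1
    print(pixel_fraction(mask), is_tiny(mask))
    print(accuracy_percent([True, False, True, True], [True, False, False, True]))
```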