| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Video Understanding Results: mPLUG-2 improves over the baseline on video QA, captioning, and retrieval (+0.6 to +5.7 points), which the paper reports as new state-of-the-art results even against larger models. | ||||
| MSRVTT-QA | Top-1 Accuracy | 47.4 | 48.0 | +0.6 |
| MSRVTT Caption | CIDEr | 75.9 | 80.3 | +4.4 |
| LSMDC Retrieval | Recall@1 | 28.7 | 34.4 | +5.7 |
| Image-Text Results: mPLUG-2 remains competitive on established image-text benchmarks, with modest gains over the baseline on both rows below. | ||||
| COCO Caption | CIDEr | 136.7 | 137.7 | +1.0 |
| VQA v2 | test-std Accuracy | 80.50 | 81.13 | +0.63 |
| Vision-Only and Language-Only Results: The model also generalizes to uni-modal tasks, matching or slightly exceeding the baseline on both rows below. | ||||
| ImageNet-1K | Top-1 Accuracy | 87.8 | 88.5 | +0.7 |
| GLUE (Average) | Score | 92.6 | 92.7 | +0.1 |
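
The Δ column is the "This Paper" score minus the "Baseline" score. As a minimal sketch for re-checking that arithmetic, the snippet below recomputes each difference from the values in the table. The names `ROWS`, `delta`, and `DELTAS` are illustrative, not from the paper, and `Decimal` is used only so that differences such as 48.0 - 47.4 come out exact rather than as floating-point approximations.

```python
from decimal import Decimal

# (benchmark, baseline, this paper), copied verbatim from the table above.
# Scores are kept as strings so Decimal preserves their printed precision.
ROWS = [
    ("MSRVTT-QA", "47.4", "48.0"),
    ("MSRVTT Caption", "75.9", "80.3"),
    ("LSMDC Retrieval", "28.7", "34.4"),
    ("COCO Caption", "136.7", "137.7"),
    ("VQA v2", "80.50", "81.13"),
    ("ImageNet-1K", "87.8", "88.5"),
    ("GLUE (Average)", "92.6", "92.7"),
]


def delta(baseline: str, ours: str) -> str:
    """Return the signed difference ours - baseline as a string."""
    diff = Decimal(ours) - Decimal(baseline)
    return f"{diff:+}"


# Map each benchmark name to its recomputed delta.
DELTAS = {name: delta(base, ours) for name, base, ours in ROWS}

if __name__ == "__main__":
    for name, value in DELTAS.items():
        print(f"{name}: {value}")
```

Because the metrics differ by row (accuracy, CIDEr, Recall@1, GLUE score), the deltas are only comparable within a row, not across rows.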