Evaluation Setup
Zero-shot inference on video and image benchmarks using pre-trained MLLMs.
Benchmarks:
- VideoMME (Comprehensive video understanding)
- MLVU (Long video understanding)
- Egoschema (Ego-centric video reasoning)
- MVBench (Temporal action understanding)
- GQA (Image-based visual reasoning)
Metrics:
- Accuracy/Score
- FLOPs (TFLOPs)
- Prefill Time (seconds)
- Statistical methodology: Not explicitly reported in the paper
Key Results
Efficiency comparisons demonstrate massive FLOPs reduction with minimal accuracy loss on the VideoMME benchmark.

| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| VideoMME | FLOPs (T) | 99.63 | 14.76 | -84.87 |
| VideoMME | Score | 52.8 | 51.6 | -1.2 |

Long video understanding results showing that AIM allows processing more frames within the same compute budget, leading to performance gains.

| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| MLVU | Score | 53.6 | 58.2 | +4.6 |
| VideoMME | Score | 52.8 | 53.9 | +1.1 |
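The Δ column reports absolute differences. As a reading aid, the short sketch below restates the same reported numbers as relative changes and as a FLOPs reduction factor. The setting labels ("efficiency", "more frames") and the helper names are shorthand introduced here, not terms from the paper.

```python
# (baseline, this paper) pairs copied from the results table.
# The labels are informal shorthand for the two groups of rows.
REPORTED = {
    "VideoMME FLOPs (T), efficiency": (99.63, 14.76),
    "VideoMME score, efficiency": (52.8, 51.6),
    "MLVU score, more frames": (53.6, 58.2),
    "VideoMME score, more frames": (52.8, 53.9),
}


def relative_change_pct(baseline: float, new: float) -> float:
    """Signed change of `new` relative to `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


def reduction_factor(baseline: float, new: float) -> float:
    """How many times smaller `new` is than `baseline`."""
    return baseline / new


if __name__ == "__main__":
    for name, (baseline, new) in REPORTED.items():
        delta = new - baseline
        rel = relative_change_pct(baseline, new)
        print(f"{name}: {delta:+.2f} absolute, {rel:+.1f}% relative")

    flops_baseline, flops_new = REPORTED["VideoMME FLOPs (T), efficiency"]
    factor = reduction_factor(flops_baseline, flops_new)
    print(f"FLOPs reduction factor: {factor:.2f}x")
```

Read this way, the efficiency rows correspond to roughly 85% fewer FLOPs (about a 6.75x reduction) for a score drop of a little over 2% relative to the baseline.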
Main Takeaways
- Visual data contains massive redundancy; 75% of visual tokens can often be removed with minimal impact on accuracy.
- For long videos, the bottleneck is often the number of frames; reducing tokens per frame so that more frames fit in the same budget (higher temporal resolution) yields better results than keeping high-fidelity tokens for only a few frames.
- Multi-modal LLMs require full visual information in early layers for cross-modal fusion, but are robust to aggressive visual pruning in later layers where text reasoning dominates.
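The takeaways above can be illustrated with a minimal sketch. It makes two simplifying assumptions that are not from the paper: compute is approximated by a fixed visual-token budget (real FLOPs also depend on attention cost, so this is only a proxy), and the layer-wise schedule is a plain linear decay. The frame count, tokens per frame, layer count, and function names are all hypothetical; only the 75% removal figure comes from the text.

```python
def tokens_after_pruning(tokens_per_frame: int, removed_fraction: float) -> int:
    """Visual tokens left per frame after removing `removed_fraction` of them."""
    if not 0.0 <= removed_fraction < 1.0:
        raise ValueError("removed_fraction must be in [0, 1)")
    return max(1, round(tokens_per_frame * (1.0 - removed_fraction)))


def frames_within_budget(token_budget: int, tokens_per_frame: int) -> int:
    """Number of whole frames that fit in a fixed visual-token budget."""
    if tokens_per_frame <= 0:
        raise ValueError("tokens_per_frame must be positive")
    return token_budget // tokens_per_frame


def keep_schedule(num_layers: int, first_pruned_layer: int,
                  final_keep_fraction: float) -> list[float]:
    """Fraction of visual tokens kept at each layer (hypothetical schedule).

    Layers before `first_pruned_layer` keep everything; from that layer on,
    the kept fraction decays linearly to `final_keep_fraction` at the last layer.
    """
    if not 0 <= first_pruned_layer < num_layers:
        raise ValueError("first_pruned_layer must be in [0, num_layers)")
    pruned_layers = num_layers - first_pruned_layer
    schedule = []
    for layer in range(num_layers):
        if layer < first_pruned_layer:
            schedule.append(1.0)  # early layers: full visual information
        else:
            progress = (layer - first_pruned_layer + 1) / pruned_layers
            schedule.append(1.0 - progress * (1.0 - final_keep_fraction))
    return schedule


if __name__ == "__main__":
    # Hypothetical baseline: 32 frames at 196 visual tokens each.
    baseline_frames, tokens_per_frame = 32, 196
    budget = baseline_frames * tokens_per_frame

    pruned = tokens_after_pruning(tokens_per_frame, removed_fraction=0.75)
    print("tokens per frame after pruning:", pruned)
    print("frames in the same budget:", frames_within_budget(budget, pruned))

    # Hypothetical 8-layer model that starts pruning at layer 4.
    print("keep schedule:", keep_schedule(8, 4, 0.0))
```

Under these assumptions, removing 75% of the tokens per frame lets four times as many frames fit in the same token budget, which is the trade-off the second takeaway describes. The schedule mirrors the third takeaway by leaving early layers untouched and pruning only later ones.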