Evaluation Setup
Evaluate VLM performance on standard benchmarks after pruning vision tokens.
Benchmarks:
- GQA (Visual Question Answering)
- MMBench (Multimodal Evaluation)
- POPE (Object Hallucination Evaluation)
- TextVQA (OCR-based VQA)
- ScienceQA (Multimodal Science Questions)
Metrics:
- Relative accuracy (pruned-model accuracy as a percentage of the unpruned vanilla model's accuracy)
- Inference Latency
- FLOPs
- Statistical methodology: Not explicitly reported in the paper
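As a concrete reading of the relative-accuracy metric, the sketch below divides a pruned model's score by the vanilla model's score on each benchmark and averages the resulting percentages. The function names and the choice to average per-benchmark ratios (rather than, say, taking the ratio of averages) are assumptions made for illustration; the paper's exact aggregation is not stated in this summary.

```python
def relative_accuracy(pruned_acc: float, vanilla_acc: float) -> float:
    """Pruned accuracy expressed as a percentage of the vanilla accuracy."""
    if vanilla_acc <= 0:
        raise ValueError("vanilla_acc must be positive")
    return 100.0 * pruned_acc / vanilla_acc


def mean_relative_accuracy(pruned: dict, vanilla: dict) -> float:
    """Average relative accuracy over benchmarks present in both dicts.

    Assumption: per-benchmark ratios are averaged with equal weight.
    """
    shared = sorted(set(pruned) & set(vanilla))
    if not shared:
        raise ValueError("no shared benchmarks")
    ratios = [relative_accuracy(pruned[b], vanilla[b]) for b in shared]
    return sum(ratios) / len(ratios)


# Placeholder scores for illustration only; these are NOT results from the paper.
example_vanilla = {"GQA": 80.0, "POPE": 80.0}
example_pruned = {"GQA": 60.0, "POPE": 80.0}
example_mean = mean_relative_accuracy(example_pruned, example_vanilla)
```

With these placeholder scores the two benchmarks retain 75% and 100% of the vanilla accuracy, so the mean relative accuracy is 87.5%.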
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Comparative performance on LLaVA-1.5 across varying token budgets, showing OC-VTP generally maintaining higher relative accuracy than baselines. | | | | |
| Average across 10 benchmarks (LLaVA-1.5) | Relative Accuracy (%) | 93.1 | 95.5 | +2.4 |
| Average across 10 benchmarks (LLaVA-1.5) | Relative Accuracy (%) | 97.4 | 97.7 | +0.3 |
| Efficiency metrics demonstrating significant computational savings. | | | | |
| LLaVA-NeXT | Prefill FLOPs (TFLOPs) | 33.76 | 1.95 | -31.81 |
| LLaVA-NeXT | Inference Latency (ms) | 811.8 | 287.3 | -524.5 |
| Ablation studies validating design choices like insertion layer and loss function. | | | | |
| LLaVA-1.5 (Average) | Relative Accuracy (%) | 93.5 | 94.6 | +1.1 |
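The efficiency rows report absolute differences only. The short sketch below converts the LLaVA-NeXT figures from the table into percentage reductions and speedup factors. The helper name `reduction_summary` is hypothetical; the input numbers are the ones reported above.

```python
def reduction_summary(baseline: float, pruned: float) -> dict:
    """Summarize a cost reduction (lower is better for both inputs)."""
    if baseline <= 0 or pruned <= 0:
        raise ValueError("costs must be positive")
    return {
        "delta": pruned - baseline,                           # same sign as the table's Δ
        "percent_reduction": 100.0 * (baseline - pruned) / baseline,
        "speedup": baseline / pruned,                         # how many times cheaper
    }


# Values taken from the LLaVA-NeXT rows of the table.
flops = reduction_summary(baseline=33.76, pruned=1.95)    # prefill TFLOPs
latency = reduction_summary(baseline=811.8, pruned=287.3)  # milliseconds

for name, stats in (("prefill FLOPs", flops), ("latency", latency)):
    print(f"{name}: {stats['percent_reduction']:.1f}% lower, "
          f"{stats['speedup']:.2f}x")
```

Read this way, prefill FLOPs fall by about 94% (roughly a 17x reduction), while end-to-end latency falls by about 65% (roughly a 2.8x speedup).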
Main Takeaways
- OC-VTP consistently outperforms state-of-the-art pruning methods (FastV, VisionZip, HiPrune) across multiple token budgets, especially in high-compression regimes (e.g., with only 11% of vision tokens retained).
- Demonstrates robust generalization: trained once on COCO, it works effectively on unrelated benchmarks like TextVQA and ScienceQA without fine-tuning.
- The Area-Weighted MSE loss is critical for performance, likely because it prevents the model from ignoring small but semantically important objects during the reconstruction training task.
- Interpretability: The selected tokens align well with object centers (cars, signs, animals), which is consistent with the 'object-centric' claim.
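To illustrate why area weighting could matter for the reconstruction loss mentioned above, the sketch below compares a plain MSE with one plausible area-weighted variant. The paper's exact formulation is not reproduced in this summary, so the weighting scheme here (each patch weighted by the inverse of its object's area in patches, so every object contributes equally) is an assumption, as are the function names and the toy inputs.

```python
def plain_mse(sq_errors: list) -> float:
    """Unweighted mean of per-patch squared reconstruction errors."""
    return sum(sq_errors) / len(sq_errors)


def area_weighted_mse(sq_errors: list, object_ids: list) -> float:
    """Hypothetical area-weighted MSE (NOT necessarily the paper's definition).

    Each patch is weighted by 1 / (number of patches in its object), so a
    small object carries the same total weight as a large one.
    """
    if len(sq_errors) != len(object_ids):
        raise ValueError("one object id is required per patch")
    area = {}
    for oid in object_ids:
        area[oid] = area.get(oid, 0) + 1
    weights = [1.0 / area[oid] for oid in object_ids]
    weighted_sum = sum(w * e for w, e in zip(weights, sq_errors))
    return weighted_sum / sum(weights)


# Toy example: object 0 covers 9 patches and is reconstructed perfectly;
# object 1 covers a single patch and is reconstructed badly.
toy_errors = [0.0] * 9 + [1.0]
toy_objects = [0] * 9 + [1]

toy_plain = plain_mse(toy_errors)
toy_weighted = area_weighted_mse(toy_errors, toy_objects)
```

In this toy case the plain MSE is 0.1, so the badly reconstructed small object barely registers, whereas the weighted variant gives about 0.5 because both objects count equally.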