| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| ViPE matches or outperforms token-based baselines on both short- and long-video benchmarks while using zero visual tokens at inference. | ||||
| EgoSchema | Accuracy | 38.4 | 51.2 | +12.8 |
| VideoMME | Score | 39.9 | 47.2 | +7.3 |
| Inference Latency | ms/sample | 391 | 132 | -259 |
| Computational Cost | TFLOPs | 8.4 | 1.3 | -7.1 |
| Ablation studies confirm the importance of hierarchical merging and optimal injection strategies. | ||||
| EgoSchema (ablation) | Accuracy | 49.8 | 51.2 | +1.4 |
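
The Δ column appears to be computed as "This Paper" minus "Baseline", so a negative value is an improvement for latency and compute. The following is a minimal sketch that rechecks each reported Δ from the table values and also derives a relative change. The relative-change figures are not reported in the source, and the helper names `delta` and `relative_change` are hypothetical, introduced only for illustration.

```python
# Rows copied from the table: (benchmark, metric, baseline, this_paper, reported_delta).
# The last row is the one listed under the ablation summary.
ROWS = [
    ("EgoSchema", "Accuracy", 38.4, 51.2, 12.8),
    ("VideoMME", "Score", 39.9, 47.2, 7.3),
    ("Inference Latency", "ms/sample", 391.0, 132.0, -259.0),
    ("Computational Cost", "TFLOPs", 8.4, 1.3, -7.1),
    ("EgoSchema (ablation row)", "Accuracy", 49.8, 51.2, 1.4),
]


def delta(baseline: float, ours: float) -> float:
    """Absolute change, assuming the table's convention: ours minus baseline."""
    return round(ours - baseline, 1)


def relative_change(baseline: float, ours: float) -> float:
    """Change as a percentage of the baseline (derived here, not in the source)."""
    return round(100.0 * (ours - baseline) / baseline, 1)


if __name__ == "__main__":
    for name, metric, base, ours, reported in ROWS:
        # Each recomputed delta should agree with the value printed in the table.
        assert delta(base, ours) == reported, name
        print(
            f"{name:26s} {metric:10s} "
            f"delta={delta(base, ours):+7.1f} "
            f"relative={relative_change(base, ours):+6.1f}%"
        )
```

Rounding to one decimal place matches the precision used in the table and avoids spurious floating-point mismatches when comparing against the reported Δ values.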