Evaluation Setup
Zero-shot or fine-tuned evaluation on a range of video understanding benchmarks.
Benchmarks:
- LVU (Long Video Understanding): Long-term Classification (Relationship, Speaking Style, etc.)
- Breakfast (Action Classification)
- COIN (Instructional Video Analysis)
- MSRVTT-QA / MSVD-QA / ActivityNet-QA (Open-ended Video QA)
- MSRVTT / MSVD / YouCook2 (Video Captioning)
- EpicKitchens-100 (Online Action Prediction)
Metrics:
- Top-1 Accuracy
- METEOR
- CIDEr
- Top-5 Accuracy
- Recall
- Statistical methodology: Not explicitly reported in the paper
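As an illustration of the accuracy metrics listed above, the following is a minimal sketch of Top-k accuracy for classification-style benchmarks. The function name `top_k_accuracy` and the toy scores are hypothetical and not taken from the paper; how the paper scores open-ended QA answers is not stated here, so this sketch covers only the case where each sample has per-class scores and one integer label.

```python
from typing import Sequence


def top_k_accuracy(
    scores: Sequence[Sequence[float]], labels: Sequence[int], k: int = 1
) -> float:
    """Percentage of samples whose true label is among the k highest-scoring classes."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("at least one sample is required")
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score, truncated to k.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += label in top_k
    return 100.0 * hits / len(labels)


# Toy example: 4 samples, 3 classes (hypothetical numbers).
scores = [
    [0.10, 0.70, 0.20],
    [0.50, 0.30, 0.20],
    [0.20, 0.30, 0.50],
    [0.40, 0.35, 0.25],
]
labels = [1, 2, 2, 1]

top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
print(top1, top2)  # 50.0 75.0
```

Top-5 accuracy is the same computation with `k=5`, which is only meaningful when there are more than five classes.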
Key Results
MA-LMM achieves State-of-the-Art performance on long-term video understanding benchmarks, outperforming specialized long-video models.

| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| LVU (Long Video Understanding) | Average Top-1 Accuracy | 56.9 | 60.7 | +3.8 |
| Breakfast | Top-1 Accuracy | 90.1 | 92.4 | +2.3 |
| COIN | Top-1 Accuracy | 89.6 | 92.0 | +2.4 |

MA-LMM outperforms recent Video-LLMs on Video QA, particularly on short video datasets, despite being designed for long videos.

| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| MSRVTT-QA | Top-1 Accuracy | 29.6 | 55.4 | +25.8 |
| MSVD-QA | Top-1 Accuracy | 46.1 | 64.2 | +18.1 |
| ActivityNet-QA | Top-1 Accuracy | 56.9 | 54.7 | -2.2 |

Ablation studies demonstrate the critical role of memory banks and their complementarity.

| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| LVU | Top-1 Accuracy | 46.0 | 60.7 | +14.7 |
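The Δ values above are absolute differences in accuracy points (This Paper minus Baseline), not relative improvements. A short sketch that recomputes them from the reported numbers; the row labels, including "LVU (ablation)", are shorthand chosen for this example:

```python
# (benchmark, baseline, this paper) as reported in the results above.
rows = [
    ("LVU", 56.9, 60.7),
    ("Breakfast", 90.1, 92.4),
    ("COIN", 89.6, 92.0),
    ("MSRVTT-QA", 29.6, 55.4),
    ("MSVD-QA", 46.1, 64.2),
    ("ActivityNet-QA", 56.9, 54.7),
    ("LVU (ablation)", 46.0, 60.7),
]

# Round to one decimal to match the precision of the reported scores.
deltas = {name: round(ours - base, 1) for name, base, ours in rows}

for name, delta in deltas.items():
    print(f"{name}: {delta:+.1f}")
```

ActivityNet-QA is the only row with a negative difference.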
Main Takeaways
- Visual Memory Bank and Query Memory Bank are complementary; using both raises LVU top-1 accuracy by 14.7 points (46.0 to 60.7) compared with using neither.
- Online processing with memory banks allows the model to handle arbitrarily long videos without exploding GPU memory or context length usage.
- The memory compression method successfully retains discriminative features while discarding temporal redundancy, as evidenced by high accuracy on long-form benchmarks like LVU and COIN.
- The model generalizes well to short-video tasks (QA and Captioning), often outperforming models explicitly designed for short clips (like Video-LLaMA).
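The takeaways on online processing and memory compression can be made concrete with a small sketch of a fixed-capacity memory bank. It assumes that compression merges the most similar pair of temporally adjacent entries by averaging whenever the bank exceeds its capacity; this section does not spell out the actual procedure, so treat that rule as an assumption. The `MemoryBank` class, the `cosine` helper, and the one-vector-per-frame simplification are all hypothetical and are not the authors' implementation.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class MemoryBank:
    """Fixed-capacity store of per-frame features, filled one frame at a time."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.items: List[Vector] = []

    def add(self, feature: Vector) -> None:
        """Append a new frame feature, compressing if the bank overflows."""
        self.items.append([float(x) for x in feature])
        if len(self.items) > self.capacity:
            self._compress()

    def _compress(self) -> None:
        # Assumed rule: find the most similar adjacent pair and average it,
        # which shortens the bank by one while keeping temporal order.
        sims = [
            cosine(self.items[i], self.items[i + 1])
            for i in range(len(self.items) - 1)
        ]
        i = max(range(len(sims)), key=sims.__getitem__)
        merged = [(x + y) / 2 for x, y in zip(self.items[i], self.items[i + 1])]
        self.items[i : i + 2] = [merged]


# Five toy "frames" streamed into a bank that holds at most three entries.
bank = MemoryBank(capacity=3)
for frame in [[1, 0], [1, 0], [0, 1], [0, 1], [1, 1]]:
    bank.add(frame)

print(bank.items)  # [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
```

In this toy run the two pairs of identical frames are merged and the distinct final frame is kept, so the bank length stays at three no matter how many frames arrive. That bounded size is what keeps memory use from growing with video length.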