
MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding

Bo He, Hengduo Li, Young Kyun Jang, Menglin Jia, Xuefei Cao, A. Shah, Abhinav Shrivastava, Ser-Nam Lim
University of Maryland, College Park, Meta, University of Central Florida
Computer Vision and Pattern Recognition (2024)
MM Memory QA Benchmark

📝 Paper Summary

Video-LLMs, Long-term Video Understanding, Memory-Augmented Neural Networks
MA-LMM processes video frames sequentially using a compressed memory bank to store past visual and query features, enabling long-term video understanding without exceeding LLM context limits.
Core Problem
Existing Large Multimodal Models (LMMs) cannot handle long videos because concatenating frame tokens exceeds LLM context limits and GPU memory, while simple pooling loses temporal dynamics.
Why it matters:
  • Current models fail on long-form content like movies or instructional videos because they are restricted to short clips (e.g., 32 frames)
  • Alternative solutions like Video-LLaMA add complex external modules (Video Q-Former) that are computationally expensive and unsuitable for online processing
  • Naive averaging of temporal features destroys the sequential information necessary for understanding actions and events over time
Concrete Example: When processing a movie, a standard model like LLaVA spends roughly 256 tokens on each frame, so only a few frames fit in the LLM context and the plot is missed. A naive pooling model like Video-ChatGPT averages features over time, losing the order of events. MA-LMM processes the movie sequentially and keeps a compressed history in memory, so it can still answer questions about the beginning while watching the end.
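The context-limit argument above comes down to simple arithmetic, sketched below. The function name `max_frames` and all numbers are illustrative assumptions, not values reported by the paper: a 2048-token context, roughly 256 tokens per frame for a LLaVA-style encoder, and 32 tokens per frame for a Q-Former-style encoder.

```python
def max_frames(context_tokens: int, tokens_per_frame: int,
               reserved_text_tokens: int = 0) -> int:
    """Return how many frames fit when every frame's tokens are concatenated.

    All arguments are hypothetical budget figures chosen for illustration.
    """
    if tokens_per_frame <= 0:
        raise ValueError("tokens_per_frame must be positive")
    available = context_tokens - reserved_text_tokens
    return max(0, available // tokens_per_frame)


# Assumed budgets: 2048-token context, 128 tokens reserved for the question.
llava_like = max_frames(2048, 256, reserved_text_tokens=128)
qformer_like = max_frames(2048, 32, reserved_text_tokens=128)
print(llava_like, qformer_like)
```

Under these assumed figures, even the more compact encoding covers only a few dozen frames, which is far short of a full movie.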
Key Novelty
Online Memory-Augmented Large Multimodal Model
  • Processes video frames one by one (online), storing historical features in a linear memory bank rather than feeding all frames to the LLM at once
  • Uses a 'Memory Bank Compression' technique that merges the most similar adjacent temporal tokens once the bank is full, so the memory size stays bounded regardless of video length while discriminative information is preserved
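The two points above can be illustrated with a minimal sketch. It is a simplification, not the authors' implementation: each frame is reduced to a single feature vector, adjacent entries are compared by cosine similarity, and the most similar pair is averaged. The names `compress_once`, `update_memory`, and `run_stream` are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def compress_once(bank: List[Vector]) -> List[Vector]:
    """Average the most similar adjacent pair, shrinking the bank by one entry."""
    sims = [cosine(bank[t], bank[t + 1]) for t in range(len(bank) - 1)]
    k = max(range(len(sims)), key=sims.__getitem__)
    merged = [(x + y) / 2.0 for x, y in zip(bank[k], bank[k + 1])]
    return bank[:k] + [merged] + bank[k + 2:]


def update_memory(bank: List[Vector], feature: Vector, max_len: int) -> List[Vector]:
    """Append one frame feature, then compress until the bank fits max_len."""
    if max_len < 1:
        raise ValueError("max_len must be at least 1")
    bank = bank + [feature]
    while len(bank) > max_len:
        bank = compress_once(bank)
    return bank


def run_stream(frames: List[Vector], max_len: int) -> List[Vector]:
    """Process frames one by one (online), keeping a bounded memory bank."""
    bank: List[Vector] = []
    for feature in frames:
        bank = update_memory(bank, feature, max_len)
    return bank


# Toy stream: the first two frames are near-duplicates and get merged.
frames = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [-1.0, 0.0]]
print(run_stream(frames, max_len=3))
```

Because merging only ever combines neighbours, the surviving entries stay in temporal order, and the bank never holds more than `max_len` entries however long the stream runs.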
Evaluation Highlights
  • Achieves 60.7% average top-1 accuracy on the LVU (Long Video Understanding) benchmark, 3.8 percentage points above the S5 baseline and a new state of the art
  • Outperforms Video-LLaMA on MSRVTT-QA (55.4% vs 29.6%) and MSVD-QA (64.2% vs 46.1%) despite using a simpler architecture without an extra video Q-Former
  • Reduces GPU memory usage significantly compared to offline processing methods, enabling analysis of arbitrarily long videos with fixed memory cost
Breakthrough Assessment
8/10
Significantly improves long-video understanding efficiency by solving the context window bottleneck via memory banks. The performance jumps on standard benchmarks are substantial, though the underlying components (Q-Former, ViT) are standard.