MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding

📝 Paper Summary

Video-LLMs, Long-term Video Understanding, Memory-Augmented Neural Networks

MA-LMM processes video frames sequentially using a compressed memory bank to store past visual and query features, enabling long-term video understanding without exceeding LLM context limits.

Core Problem

Existing Large Multimodal Models (LMMs) cannot handle long videos because concatenating frame tokens exceeds LLM context limits and GPU memory, while simple pooling loses temporal dynamics.

Why it matters:

Current models fail on long-form content like movies or instructional videos because they are restricted to short clips (e.g., 32 frames)
Alternative solutions like Video-LLaMA add complex external modules (Video Q-Former) that are computationally expensive and unsuitable for online processing
Naive averaging of temporal features destroys the sequential information necessary for understanding actions and events over time

Concrete Example: When processing a movie, a standard model like LLaVA is limited to ~256 tokens (very few frames), missing the plot. A naive pooling model like Video-ChatGPT averages features, losing the order of events. MA-LMM processes the movie sequentially, storing compressed history in memory to answer questions about the beginning while watching the end.
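
To make the context-limit argument concrete, the short sketch below counts the visual tokens an LLM would receive if every frame's tokens were concatenated, compared with a scheme that always passes a fixed number of tokens. It assumes 32 tokens per frame (the Q-Former setting listed under Key Hyperparameters) and a 2,048-token context window. The context size and the function names are illustrative assumptions, not figures from the paper.

```python
# Illustrative arithmetic only; the 2048-token context window is an assumption.
TOKENS_PER_FRAME = 32      # Q-Former query tokens per frame
CONTEXT_WINDOW = 2048      # assumed LLM context length


def concat_tokens(num_frames: int) -> int:
    """Visual tokens when every frame's tokens are concatenated."""
    return num_frames * TOKENS_PER_FRAME


def online_tokens(num_frames: int) -> int:
    """Visual tokens when a fixed set of queries is passed on (assumed)."""
    return TOKENS_PER_FRAME


def max_frames_by_concat() -> int:
    """Largest frame count whose concatenated tokens still fit the window."""
    return CONTEXT_WINDOW // TOKENS_PER_FRAME


for frames in (8, 64, 1000):
    print(frames, concat_tokens(frames), online_tokens(frames))
```

Under these assumptions the concatenation cost grows linearly with the number of frames, while the fixed-size scheme stays flat regardless of video length.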

Key Novelty

Online Memory-Augmented Large Multimodal Model

Processes video frames one by one (online), storing historical features in a linear memory bank rather than feeding all frames to the LLM at once
Uses a 'Memory Bank Compression' technique that merges similar adjacent temporal tokens, keeping the memory size constant regardless of video length while preserving discriminative information

Evaluation Highlights

Achieves 60.7% average top-1 accuracy on the LVU (Long Video Understanding) benchmark, 3.8 points above the S5 baseline, setting a new state of the art
Outperforms Video-LLaMA on MSRVTT-QA (55.4% vs 29.6%) and MSVD-QA (64.2% vs 46.1%) despite using a simpler architecture without an extra video Q-Former
Reduces GPU memory usage significantly compared to offline processing methods, enabling analysis of arbitrarily long videos with fixed memory cost

Breakthrough Assessment

8/10

Significantly improves long-video understanding efficiency by solving the context window bottleneck via memory banks. The performance jumps on standard benchmarks are substantial, though the underlying components (Q-Former, ViT) are standard.

⚙️ Technical Details

Problem Definition

Setting: Long-term video understanding where a model predicts text labels or answers questions based on a video sequence V of length T.

Inputs: Video sequence V = [v1, ..., vT] and optional text query/prompt.

Outputs: Generated text response (classification label, caption, or QA answer).

Pipeline Flow

Visual Encoder (extracts frame features)
Visual Memory Bank (stores raw frame features)
Query Memory Bank (stores processed query features)
Q-Former (aligns visual/memory features to text space)
LLM (generates text response)

System Modules

Visual Encoder

Extracts visual features from input video frames

Model or implementation: ViT-G/14 from EVA-CLIP (frozen)

Q-Former with Memory Banks

Aligns visual features to text space while attending to historical context stored in memory banks

Model or implementation: BERT-base initialized from InstructBLIP
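
One way to picture "attending to historical context" is cross-attention whose keys and values are the features of all stored time steps concatenated together, while the queries belong to the current step. The single-head sketch below, in plain Python, is an illustration under simplifying assumptions: it omits the learned projection matrices, treats keys and values as the same raw features, and uses hypothetical names (`softmax`, `attend_with_memory`).

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend_with_memory(queries, memory_bank):
    """Single-head scaled dot-product attention over a memory bank.

    queries:     list of query vectors for the current frame
    memory_bank: list of time steps, each a list of feature vectors
    Learned projections are omitted; keys and values are the raw features.
    """
    # Flatten all stored time steps into one key/value sequence.
    keys = [vec for step in memory_bank for vec in step]
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # Weighted sum of values (values equal keys in this sketch).
        out = [sum(w * k[d] for w, k in zip(weights, keys)) for d in range(dim)]
        outputs.append(out)
    return outputs


# Two past frames plus the current one, each with two 2-d features.
bank = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [0.5, 0.5]], [[0.0, 2.0], [2.0, 0.0]]]
result = attend_with_memory([[1.0, 0.0]], bank)
```

Because the number of output vectors equals the number of queries, the result keeps the same shape however many time steps the bank holds.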

Large Language Model

Generates final text output based on aligned video queries and text prompts

Model or implementation: Vicuna-7B (frozen)

Novel Architectural Elements

Dual Memory Bank system: Visual Memory Bank (stores raw visual features) and Query Memory Bank (stores Q-Former output queries) integrated into Q-Former's attention layers as keys/values
Memory Bank Compression (MBC): Online compression algorithm that averages temporally adjacent, highly similar tokens to maintain fixed memory size
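
The compression step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: each time step is reduced to a single feature vector, similarity is measured with cosine similarity, and the single most similar adjacent pair is averaged whenever the bank exceeds its maximum length. The function names (`cosine`, `compress`, `push`) and the choice to merge exactly one pair per step are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def compress(bank, max_len):
    """Merge the most similar adjacent pair until len(bank) <= max_len."""
    bank = [list(v) for v in bank]
    while len(bank) > max_len and len(bank) > 1:
        sims = [cosine(bank[i], bank[i + 1]) for i in range(len(bank) - 1)]
        i = sims.index(max(sims))
        # Average the two neighbours and keep the result in their place,
        # so the temporal order of the remaining entries is unchanged.
        merged = [(x + y) / 2 for x, y in zip(bank[i], bank[i + 1])]
        bank[i:i + 2] = [merged]
    return bank


def push(bank, feature, max_len):
    """Append the newest frame feature, then compress back to max_len."""
    return compress(bank + [feature], max_len)


bank = []
for feat in ([1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.0, 1.0]):
    bank = push(bank, feat, max_len=3)
```

In this toy run the two identical trailing features are merged, so the bank stays at three entries and its order still reflects the order in which the frames arrived.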

Modeling

Base Model: Vicuna-7B

Training Method: Supervised Fine-Tuning (SFT) of the Q-Former only

Objective Functions:

Purpose: Minimize the difference between generated text and ground truth.

Formally: Standard cross-entropy loss $L = -\sum_{i} \log P(w_i \mid w_{<i}, V)$
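
As a small numeric illustration of this objective, the sketch below sums the negative log-probabilities that a model assigns to each ground-truth token. The probabilities are made-up values and `sequence_nll` is a hypothetical helper name. A real training loop would obtain these probabilities from the LLM's output distribution.

```python
import math


def sequence_nll(token_probs):
    """Negative log-likelihood: -sum(log P(w_i | w_<i, V))."""
    return -sum(math.log(p) for p in token_probs)


# Hypothetical probabilities the model assigns to each ground-truth token.
probs = [0.9, 0.5, 0.8]
loss = sequence_nll(probs)
print(round(loss, 4))
```

The loss is zero only when every ground-truth token receives probability 1, and it grows as any token's probability shrinks.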

Adaptation: Fine-tunes Q-Former weights; keeps Visual Encoder and LLM frozen

Trainable Parameters: Q-Former parameters only (roughly 100M-200M, estimated from the BLIP-2 architecture; the paper text does not report an exact count, though the InstructBLIP config implies one)

Training Data:

WebVid-2M (video-text pairs)
CC3M (image-text pairs - converted to static video)

Key Hyperparameters:

Q_former_tokens_per_image: 32
memory_bank_threshold_M: 10 (inferred from typical settings; the exact value of M is not given in the main text)
visual_encoder: ViT-G/14

Compute: 4 A100 GPUs

Comparison to Prior Work

vs. Video-LLaMA: MA-LMM processes online and uses a memory bank instead of a heavy, separate Video Q-Former, reducing GPU memory and parameters
vs. Video-ChatGPT: MA-LMM preserves temporal order via memory instead of averaging, yielding better accuracy
vs. MeMViT: MA-LMM integrates memory into a generative LLM pipeline rather than a classification-only backbone
vs. MovieChat [not cited in paper]: MovieChat also uses memory for long videos, but MA-LMM introduces the specific compression technique based on token similarity

Limitations

Current large multimodal models (including MA-LMM) generally lack regression capability (e.g., predicting exact release year as a continuous variable)
Performance on very short videos (ActivityNet-QA) is slightly lower than models pre-trained on massive video-text datasets like VideoCoCa
Depends on frozen visual encoders and LLMs; end-to-end training might yield further gains but is computationally prohibitive

Reproducibility

Code: https://github.com/boheumd/MA-LMM

📊 Experiments & Results

Evaluation Setup

Zero-shot or fine-tuned evaluation on a range of video understanding benchmarks.

Benchmarks:

LVU (Long Video Understanding): long-term classification (relationship, speaking style, etc.)
Breakfast (Action Classification)
COIN (Instructional Video Analysis)
MSRVTT-QA / MSVD-QA / ActivityNet-QA (Open-ended Video QA)
MSRVTT / MSVD / YouCook2 (Video Captioning)
EpicKitchens-100 (Online Action Prediction)

Metrics:

Top-1 Accuracy
METEOR
CIDEr
Top-5 Accuracy
Recall
Statistical methodology: Not explicitly reported in the paper
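
For reference, top-1 and top-5 accuracy follow the standard top-k definition: a prediction counts as correct if the true label is among the k highest-scoring classes. The sketch below shows that definition in plain Python. The scores and labels are hypothetical, and `top_k_accuracy` is an illustrative helper, not code from the paper.

```python
def top_k_accuracy(scores, labels, k=1):
    """Fraction of samples whose true label is among the k highest scores.

    scores: list of per-class score lists, one per sample
    labels: list of integer class indices
    """
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Hypothetical scores for three samples over four classes.
scores = [[0.1, 0.7, 0.1, 0.1], [0.4, 0.3, 0.2, 0.1], [0.2, 0.2, 0.5, 0.1]]
labels = [1, 1, 3]
top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
```

Raising k can only keep the accuracy the same or increase it, which is why top-5 figures are always at least as high as top-1 figures on the same predictions.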

Key Results

MA-LMM achieves State-of-the-Art performance on long-term video understanding benchmarks, outperforming specialized long-video models.

Benchmark	Metric	Baseline	This Paper	Δ
LVU (Long Video Understanding)	Average Top-1 Accuracy	56.9	60.7	+3.8
Breakfast	Top-1 Accuracy	90.1	92.4	+2.3
COIN	Top-1 Accuracy	89.6	92.0	+2.4

MA-LMM outperforms recent Video-LLMs on Video QA, particularly on short video datasets, despite being designed for long videos.

Benchmark	Metric	Baseline	This Paper	Δ
MSRVTT-QA	Top-1 Accuracy	29.6	55.4	+25.8
MSVD-QA	Top-1 Accuracy	46.1	64.2	+18.1
ActivityNet-QA	Top-1 Accuracy	56.9	54.7	-2.2

Ablation studies demonstrate the critical role of memory banks and their complementarity.

Benchmark	Metric	Baseline	This Paper	Δ
LVU	Top-1 Accuracy	46.0	60.7	+14.7

Main Takeaways

Visual Memory Bank and Query Memory Bank are complementary; using both yields significantly better results than using neither (e.g., 60.7 vs. 46.0 top-1 accuracy on LVU, a gain of 14.7 points).
Online processing with memory banks allows the model to handle arbitrarily long videos without exploding GPU memory or context length usage.
The memory compression method successfully retains discriminative features while discarding temporal redundancy, as evidenced by high accuracy on long-form benchmarks like LVU and COIN.
The model generalizes well to short-video tasks (QA and Captioning), often outperforming models explicitly designed for short clips (like Video-LLaMA).

📚 Prerequisite Knowledge

Prerequisites

Transformer architecture (Attention mechanisms)
Vision-Language Models (CLIP, BLIP-2)
Large Language Models (LLM) fine-tuning

Key Terms

Q-Former: Querying Transformer—a lightweight transformer from BLIP-2 that aligns visual features with the LLM's text embedding space using learnable query vectors

LLM: Large Language Model—a generative text model (e.g., Vicuna) used here as the decoder for the multimodal inputs

Online Processing: Processing data sequentially (frame-by-frame) rather than all at once, allowing immediate outputs and lower memory usage

Token Merging: A compression technique where similar tokens (vectors) are averaged together to reduce the total number of tokens without losing much information

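Cross-Attention: An attention mechanism in which queries from one sequence (here, the learnable query vectors) attend to keys and values from another (here, the visual features), so each query gathers the input content most relevant to it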

Auto-regressive: A process where the output at the current time step depends on previous outputs or states, used here for updating memory