From Verbatim to Gist: Distilling Pyramidal Multimodal Memory via Semantic Information Bottleneck for Long-Horizon Video Agents

📝 Paper Summary

Layered Memory, Agentic AI, Long Video Understanding

MM-Mem is a hierarchical memory system for video agents that compresses fine-grained sensory data into abstract schemas using an Information Bottleneck objective, enabling efficient retrieval from gist to verbatim details.

Core Problem

Existing video agents either accumulate dense visual data, causing high latency and redundancy (vision-centric), or compress everything into text, causing detail loss and hallucination (text-centric).

Why it matters:

Current multimodal LLMs lack the dynamic memory management needed for long-horizon tasks, leading to context window overflow or forgetting
Vision-centric methods suffer from cognitive overload when processing hours of video, while text-centric methods lose the visual evidence needed for precise verification
Static memory mechanisms fail to mirror human cognitive efficiency, which balances abstract semantic understanding with specific perceptual recall

Concrete Example: In a long video, a text-centric agent might summarize a scene as 'a person walks a dog,' discarding the specific leash color. If later asked 'Was the leash red?', the agent hallucinates. A vision-centric agent stores every frame, overwhelming its context window.

Key Novelty

Fuzzy-Trace Theory-inspired Pyramidal Memory (MM-Mem)

Structures memory into three layers: Sensory Buffer (raw visual details), Episodic Stream (event summaries), and Symbolic Schema (abstract knowledge graph), mirroring human verbatim vs. gist traces
Uses SIB-GRPO (Semantic-Information Bottleneck Group Relative Policy Optimization) to train a memory manager that compresses sensory data into episodic traces, maximizing semantic retention while minimizing redundancy
Implements entropy-driven retrieval that starts with high-level schemas and only 'drills down' to low-level visual frames when the agent is uncertain

Architecture

Comparison of MM-Mem's pyramidal architecture against Vision-centric and Text-centric paradigms, showing the three memory layers.

Evaluation Highlights

Achieves state-of-the-art 63.8% accuracy on EgoSchema, outperforming proprietary Gemini 1.5 Pro (63.2%) and GPT-4o (61.9%)
+13.1 percentage points on LVBench over the strong open-source baseline LongVA (51.7% vs. 38.6%)
Maintains 4-5 percentage points higher accuracy than leading baselines (VideoAgent, LongVA) as video length increases from 600 to 3000 frames on LongVideoBench

Breakthrough Assessment

9/10

Proposes a cognitively grounded architecture with an Information Bottleneck training objective that addresses the context-fidelity trade-off in long-video understanding, and narrowly outperforms proprietary models on EgoSchema (63.8% vs. 63.2% for Gemini 1.5 Pro).

⚙️ Technical Details

Problem Definition

Setting: Long-horizon video understanding and question answering

Inputs: Long video stream V and a user question Q

Outputs: Answer A derived from hierarchical memory retrieval

Pipeline Flow

Memory Construction: Sensory Buffer → Episodic Stream (via SIB-GRPO) → Symbolic Schema
Inference: Symbolic Schema Query → (if uncertain) Episodic Stream Query → (if uncertain) Sensory Buffer Query
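
The construction flow above can be pictured as three containers that are filled in order. The following is a minimal sketch, not the authors' implementation: the class names, the `ingest` method, and the toy `toy_summarize` / `toy_extract` helpers are all hypothetical stand-ins for the learned memory manager and the schema builder.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class SensoryItem:
    """Verbatim trace: one frame embedding plus any subtitle text."""
    timestamp: float
    embedding: List[float]
    subtitle: str = ""


@dataclass
class Episode:
    """Gist trace: a short summary that keeps pointers to its evidence."""
    start: float
    end: float
    summary: str
    evidence: List[SensoryItem] = field(default_factory=list)


Triple = Tuple[str, str, str]  # (subject, relation, object)


@dataclass
class PyramidMemory:
    sensory: List[SensoryItem] = field(default_factory=list)
    episodic: List[Episode] = field(default_factory=list)
    schema: List[Triple] = field(default_factory=list)

    def ingest(
        self,
        chunk: List[SensoryItem],
        summarize: Callable[[List[SensoryItem]], str],
        extract: Callable[[str], List[Triple]],
    ) -> Episode:
        """Sensory Buffer -> Episodic Stream -> Symbolic Schema for one chunk."""
        if not chunk:
            raise ValueError("chunk must contain at least one item")
        self.sensory.extend(chunk)
        episode = Episode(chunk[0].timestamp, chunk[-1].timestamp,
                          summarize(chunk), list(chunk))
        self.episodic.append(episode)
        for triple in extract(episode.summary):
            if triple not in self.schema:  # naive de-duplication
                self.schema.append(triple)
        return episode


# Toy stand-ins (assumptions) for the learned policy and the LVLM extractor.
def toy_summarize(chunk: List[SensoryItem]) -> str:
    return " ".join(item.subtitle for item in chunk if item.subtitle)


def toy_extract(summary: str) -> List[Triple]:
    words = summary.split()
    return [(words[0], words[1], " ".join(words[2:]))] if len(words) >= 3 else []


memory = PyramidMemory()
chunk = [
    SensoryItem(0.0, [0.1, 0.2], "person walks"),
    SensoryItem(1.0, [0.1, 0.3], "a dog"),
]
episode = memory.ingest(chunk, toy_summarize, toy_extract)
```

Keeping `evidence` on each episode is one way to make a later drill-down from gist to verbatim cheap; the summary does not say how MM-Mem links its layers.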

System Modules

Sensory Buffer

Retain fine-grained visual evidence and subtitles

Model or implementation: CLIP-based visual encoder + text encoder

Memory Manager (Policy) (Memory Construction)

Compress sensory data into episodic traces

Model or implementation: LLM-based policy trained via SIB-GRPO

Symbolic Schema Builder (Memory Construction)

Abstract episodic events into a Knowledge Graph

Model or implementation: LVLM-based extractor + Unifier

Retrieval Controller

Decide whether to drill down to lower memory layers based on uncertainty

Model or implementation: Entropy thresholding mechanism
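
To make the thresholding idea concrete, here is a hedged sketch of a coarse-to-fine controller. It assumes that each layer yields a probability distribution over answer options and that uncertainty is the Shannon entropy of that distribution; the function names and the threshold value are illustrative, since the summary does not specify them.

```python
import math
from typing import Callable, List, Sequence, Tuple


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def answer_with_drill_down(
    layers: List[Tuple[str, Callable[[], Sequence[float]]]],
    threshold: float,
) -> Tuple[int, List[Tuple[str, float]]]:
    """Query layers from coarse to fine, stopping once entropy <= threshold.

    Returns the index of the most probable answer and a trace of
    (layer name, entropy) for every layer that was actually queried.
    """
    if not layers:
        raise ValueError("at least one memory layer is required")
    trace: List[Tuple[str, float]] = []
    probs: Sequence[float] = []
    for name, query_layer in layers:
        probs = query_layer()
        h = entropy(probs)
        trace.append((name, h))
        if h <= threshold:
            break  # confident enough, skip the more expensive layers
    best = max(range(len(probs)), key=lambda i: probs[i])
    return best, trace


# Toy layers (assumed values) that record which ones were touched.
calls: List[str] = []


def make_layer(name: str, probs: Sequence[float]):
    def query() -> Sequence[float]:
        calls.append(name)
        return probs
    return name, query


layers = [
    make_layer("schema", [0.3, 0.3, 0.2, 0.2]),        # vague gist
    make_layer("episodic", [0.9, 0.05, 0.03, 0.02]),   # confident summary
    make_layer("sensory", [0.99, 0.01, 0.0, 0.0]),     # never reached here
]
best, trace = answer_with_drill_down(layers, threshold=0.5)
```

In this toy run the schema answer is too uncertain, the episodic answer falls below the threshold, and the sensory buffer is never queried, which is the compute saving the design aims for.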

Novel Architectural Elements

Three-layer Pyramidal Memory (Sensory, Episodic, Symbolic) explicitly aligning visual/textual modalities with verbatim/gist traces
Top-down entropy-driven retrieval path that conditionally accesses raw visual buffers only when high-level schemas are insufficient

Modeling

Base Model: Qwen2-VL-7B-Instruct (for SIB-GRPO training and inference backbone)

Training Method: SIB-GRPO (Reinforcement Learning)

Objective Functions:

Purpose: Maximize semantic information about the target answer while minimizing memory length (compression).

Formally: J_SIB-GRPO = E[min(rho_i * A_i, clip(rho_i, 1-eps, 1+eps) * A_i)] - beta * D_KL(pi || pi_ref)

Purpose: Reward function for RL based on Information Bottleneck.

Formally: r(m) = log q_phi(y|m) - lambda * |m| + log p_ref(m)
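
The two formulas can be evaluated directly once their inputs are available. The sketch below is illustrative only: it assumes |m| is a token count, takes the log-probabilities as precomputed numbers, and uses made-up values; presumably the advantage A_i is derived from r(m), but the summary does not spell out that step.

```python
import math


def sib_reward(log_q_answer: float, memory_tokens: int,
               log_p_ref: float, lam: float) -> float:
    """r(m) = log q_phi(y|m) - lambda * |m| + log p_ref(m)."""
    return log_q_answer - lam * memory_tokens + log_p_ref


def clipped_term(ratio: float, advantage: float, clip_eps: float) -> float:
    """min(rho * A, clip(rho, 1 - eps, 1 + eps) * A) for a single sample."""
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped_ratio * advantage)


# Two hypothetical memory candidates for the same clip (values are made up).
# The verbose one predicts the answer slightly better but is five times longer.
verbose = sib_reward(math.log(0.90), memory_tokens=200, log_p_ref=-10.0, lam=0.01)
concise = sib_reward(math.log(0.85), memory_tokens=40, log_p_ref=-10.0, lam=0.01)

# Clipping limits how much a single sample can move the policy.
gain_capped = clipped_term(ratio=1.5, advantage=1.0, clip_eps=0.2)
loss_kept = clipped_term(ratio=1.5, advantage=-1.0, clip_eps=0.2)
```

With these numbers the concise candidate earns the higher reward because the length penalty outweighs its small loss in answer likelihood; a smaller lambda would tip the balance back toward the verbose one.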

Adaptation: LoRA (Low-Rank Adaptation)

Trainable Parameters: Visual encoder frozen; LLM backbone updated via LoRA
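
For readers unfamiliar with LoRA, the standard formulation is reproduced below as background; it is the general technique, not a detail taken from the paper, and the rank r and scaling factor alpha used for MM-Mem are not given in this summary.

```latex
h = W_0 x + \frac{\alpha}{r} B A x,
\qquad W_0 \in \mathbb{R}^{d \times k} \text{ (frozen)},
\quad B \in \mathbb{R}^{d \times r},
\quad A \in \mathbb{R}^{r \times k},
\quad r \ll \min(d, k)
```

Only A and B receive gradients, so the number of trainable parameters per adapted matrix drops from d·k to r·(d + k).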

Training Data:

Video-QA pairs from training sets

Key Hyperparameters:

learning_rate: Not reported in the paper
batch_size: Not reported in the paper
clip_epsilon: Not reported in the paper
beta (KL penalty weight in J_SIB-GRPO): Not reported in the paper

Compute: Not reported in the paper

Comparison to Prior Work

vs. LongVA: MM-Mem uses hierarchical retrieval to avoid processing all visual tokens, reducing redundancy while keeping details accessible
vs. Vgent: MM-Mem retains a 'Sensory Buffer' (verbatim trace) unlike Vgent which purely relies on text conversion (lossy)
vs. VideoAgent: MM-Mem uses a learned policy (SIB-GRPO) for memory construction rather than fixed heuristics
vs. MemGPT [not cited in paper]: MM-Mem is multimodal and strictly hierarchical (pyramidal) rather than just managing context window via paging

Limitations

Dependency on the quality of the base LVLM (Qwen2-VL) for initial feature extraction
Reinforcement learning stability can be sensitive to reward shaping (lambda parameter)
Computational cost of maintaining the Symbolic Schema (knowledge graph) for extremely long videos is not fully analyzed
No explicit failure mode analysis for when the Symbolic Schema is constructed incorrectly

Reproducibility

Code: https://github.com/EliSpectre/MM-Mem

Code is publicly available. Hyperparameters for RL training (learning rate, clipping epsilon) are not listed in the main text, and the availability of pre-trained weights and detailed prompts is not mentioned.

📊 Experiments & Results

Evaluation Setup

Offline Long-Video Understanding and Online Streaming Video QA

Benchmarks:

EgoSchema: long-form egocentric video QA (multiple choice)
LVBench: long video understanding (various tasks)
LongVideoBench: long-horizon video QA
EgoSchema-S: streaming video QA [New]

Metrics:

Accuracy (%)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (points)
Comparison against State-of-the-Art (SOTA) on standard offline benchmarks showing MM-Mem's superiority over both open-source and proprietary models.
EgoSchema	Accuracy	63.2	63.8	+0.6
EgoSchema	Accuracy	51.0	63.8	+12.8
LVBench	Accuracy	38.6	51.7	+13.1
LongVideoBench	Accuracy	51.9	58.6	+6.7
Performance on streaming settings (EgoSchema-S) where the agent must process unbounded video streams.
EgoSchema-S	Accuracy	39.6	61.3	+21.7
Ablation studies validating the contributions of specific memory components.
EgoSchema	Accuracy	59.2	63.8	+4.6
EgoSchema	Accuracy	60.4	63.8	+3.4
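
The Δ column is an absolute difference in accuracy, not a relative gain. A short check, using only the numbers from the rows above (the helper names are illustrative), recomputes each Δ and shows how different the relative figure is for the LVBench row.

```python
# (benchmark, baseline accuracy, this paper's accuracy, reported delta)
rows = [
    ("EgoSchema", 63.2, 63.8, 0.6),
    ("EgoSchema", 51.0, 63.8, 12.8),
    ("LVBench", 38.6, 51.7, 13.1),
    ("LongVideoBench", 51.9, 58.6, 6.7),
    ("EgoSchema-S", 39.6, 61.3, 21.7),
    ("EgoSchema", 59.2, 63.8, 4.6),
    ("EgoSchema", 60.4, 63.8, 3.4),
]


def delta_points(baseline: float, ours: float) -> float:
    """Absolute difference in accuracy, in percentage points."""
    return round(ours - baseline, 1)


def relative_gain(baseline: float, ours: float) -> float:
    """Improvement relative to the baseline, in percent."""
    return round(100.0 * (ours - baseline) / baseline, 1)


recomputed = [delta_points(b, o) for _, b, o, _ in rows]
lvbench_relative = relative_gain(38.6, 51.7)
```

All seven recomputed deltas match the table, and the 13.1-point LVBench gain corresponds to roughly a 33.9% relative improvement over the baseline.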

Experiment Figures

Accuracy trends on LongVideoBench as video duration increases.

Visualization of memory topology and retrieval path for a specific query.

Main Takeaways

MM-Mem balances compression and detail retention, mitigating the 'fading' issue in long videos where performance typically degrades as video length increases.
The pyramidal structure allows for 'cognitively efficient' retrieval: most queries are answered by the abstract schema, saving compute, while difficult queries access raw details.
SIB-GRPO successfully learns to compress memory without losing task-critical semantics, outperforming heuristic or greedy memory selection methods.
The gains hold in both offline pre-processed video tasks and online streaming scenarios.

📚 Prerequisite Knowledge

Prerequisites

Multimodal Large Language Models (MLLMs)
Information Bottleneck (IB) Theory
Reinforcement Learning (PPO/GRPO)
Knowledge Graphs

Key Terms

Fuzzy-Trace Theory: A cognitive theory proposing two parallel memory traces: 'gist' (fuzzy, meaning-based) and 'verbatim' (exact, detail-based)

SIB-GRPO: Semantic-Information Bottleneck Group Relative Policy Optimization—a training objective that balances memory compression with semantic preservation using RL

Sensory Buffer: The lowest memory layer storing fine-grained visual embeddings and raw subtitles for short durations

Episodic Stream: The middle memory layer containing chronological, compressed event summaries derived from the sensory buffer

Symbolic Schema: The highest memory layer organizing episodic events into a structured knowledge graph for abstract reasoning

Information Bottleneck: A technique to find a representation that compresses the input variable while preserving mutual information with the target variable
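
As background, the classic Information Bottleneck objective can be written as follows, with X the input, Z the compressed representation, and Y the target. The trade-off weight is written as gamma here to avoid confusion with the beta used for the KL term elsewhere in this summary.

```latex
\min_{p(z \mid x)} \; I(X; Z) \;-\; \gamma \, I(Z; Y), \qquad \gamma > 0
```

A standard variational bound explains why a log-likelihood term can stand in for the relevance term when the memory M plays the role of Z:

```latex
I(M; Y) \;\ge\; H(Y) + \mathbb{E}_{p(m, y)}\!\left[ \log q_\phi(y \mid m) \right]
```

Reading log q_phi(y|m) in the reward as this relevance proxy and the length |m| as a compression proxy is an interpretation, not a derivation stated in the summary.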

Entropy-driven retrieval: A strategy where the system only retrieves more detailed information if the uncertainty (entropy) of its current prediction is high

Verbatim trace: Memory representation that preserves exact surface details (e.g., specific visual frames)

Gist trace: Memory representation that captures the essential meaning or substance (e.g., text summary)

GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes advantages within a group of sampled outputs to stabilize training
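
The group normalization step can be sketched in a few lines. This is an illustrative version that assumes the population standard deviation and a small epsilon for numerical safety; implementations differ on both choices, and the reward values are made up.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against the mean and std of its own sampling group."""
    if not rewards:
        raise ValueError("rewards must not be empty")
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std (an assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical memory candidates sampled for the same video clip.
group_rewards = [1.0, 2.0, 3.0, 6.0]
advantages = group_relative_advantages(group_rewards)

# A group with identical rewards yields zero advantage for every sample.
flat = group_relative_advantages([2.0, 2.0, 2.0])
```

Because the baseline is the group mean, no separate value network is needed; samples that beat their siblings get a positive advantage and the rest get a negative one.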