
CoPE-VideoLM: Codec Primitives For Efficient Video Language Models

SD Sarkar, R Pautrat, O Miksik, M Pollefeys, I Armeni…
Stanford University, Microsoft Spatial AI Lab, ETH Zurich
arXiv, 2/2026
MM Pretraining QA

📝 Paper Summary

Video Language Models (VideoLMs) | Efficient Video Understanding | Compressed Video Analysis
CoPE-VideoLM leverages native video codec primitives (motion vectors and residuals) to represent inter-frame changes as lightweight tokens, drastically reducing token usage while preserving temporal fidelity.
Core Problem
Current VideoLMs process videos as sequences of full RGB images. This is highly redundant and computationally expensive, and it quickly fills the context window, forcing aggressive keyframe sampling that misses temporal details.
Why it matters:
  • Processing a full image for every frame is redundant and slow, and the resulting time-to-first-token (TTFT) latency is too high for real-time applications such as robotics
  • Limited context windows force models to drop most frames (e.g., using only 64 frames), losing critical macro-events and micro-details in longer videos
  • Proprietary models scale to long contexts but require massive compute; open-source models struggle to balance detail with efficiency
Concrete Example: A 30 FPS video of 8 seconds generates 240 frames. Standard models might sample just 8 keyframes to fit the budget, missing 232 frames of action. CoPE-VideoLM keeps 1 full keyframe and encodes the other 239 frames as compact 'delta' tokens, capturing the full motion without the data explosion.
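The frame counts in this example follow from simple arithmetic. The sketch below reproduces them; the function name and dictionary keys are illustrative and do not come from the paper.

```python
def frame_coverage(fps: int, seconds: int, sampled_keyframes: int) -> dict:
    """Compare sparse keyframe sampling with a 1-keyframe + delta-frame layout."""
    total = fps * seconds
    return {
        # Every frame the video contains.
        "total_frames": total,
        # Frames a sparse sampler never looks at.
        "skipped_by_sampling": total - sampled_keyframes,
        # Frames represented as delta tokens when one full keyframe is kept.
        "delta_frames_with_one_keyframe": total - 1,
    }


print(frame_coverage(fps=30, seconds=8, sampled_keyframes=8))
# {'total_frames': 240, 'skipped_by_sampling': 232, 'delta_frames_with_one_keyframe': 239}
```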
Key Novelty
Codec-Primitives Encoder (CoPE)
  • Leverage the structure of video codecs (I-frames vs. P-frames) to avoid redundant RGB encoding
  • Encode I-frames as standard image tokens but P-frames as lightweight 'delta tokens' derived from motion vectors and residuals
  • Use a specialized pre-training strategy to align these codec tokens with the standard image embedding space
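The I-frame/P-frame routing described above can be sketched as a token plan. This is a minimal illustration, not the authors' implementation: the per-frame token counts (196 and 8) and all names (`Frame`, `plan_tokens`) are hypothetical, since the summary does not state the real budgets.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical per-frame token budgets, chosen only for illustration.
TOKENS_PER_I_FRAME = 196
TOKENS_PER_P_FRAME = 8


@dataclass
class Frame:
    kind: str  # "I" = keyframe (full RGB), "P" = motion vectors + residuals


def plan_tokens(
    frames: List[Frame],
    i_tokens: int = TOKENS_PER_I_FRAME,
    p_tokens: int = TOKENS_PER_P_FRAME,
) -> List[Tuple[str, int]]:
    """Route each frame to an encoder type and record its token cost."""
    plan = []
    for frame in frames:
        if frame.kind == "I":
            plan.append(("image_tokens", i_tokens))
        elif frame.kind == "P":
            plan.append(("delta_tokens", p_tokens))
        else:
            raise ValueError(f"unsupported frame kind: {frame.kind!r}")
    return plan


# One keyframe followed by seven P-frames.
gop = [Frame("I")] + [Frame("P") for _ in range(7)]
plan = plan_tokens(gop)
total = sum(count for _, count in plan)
dense_total = len(gop) * TOKENS_PER_I_FRAME  # every frame encoded as full RGB

print(total, dense_total)  # 252 1568
```

With these assumed budgets, the mixed layout costs far fewer tokens than encoding all eight frames as images. The actual savings depend on the real token counts and keyframe spacing.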
Evaluation Highlights
  • Reduces time-to-first-token (TTFT) by up to 86.2% and token usage by up to 93% compared to standard VideoLMs (LLaVA-Video-7B)
  • Maintains or exceeds performance on 14 diverse benchmarks; e.g., +6.9% on PerceptionTest and +1.3% on NextQA over LLaVA-Video-7B
  • Enables processing of up to 8 hours of video (at 1 FPS) within a 1M token context window, an order-of-magnitude increase over dense RGB baselines
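As a rough sanity check on the last figure, the average token budget per frame can be derived from the numbers quoted above. This is a back-of-the-envelope sketch that assumes the whole context window is spent on video tokens and ignores the text prompt; the function name is illustrative.

```python
def avg_tokens_per_frame(context_tokens: int, hours: float, fps: float = 1.0) -> float:
    """Average tokens available per frame if the whole context holds video."""
    frames = hours * 3600 * fps
    return context_tokens / frames


# 8 hours at 1 FPS is 28,800 frames inside a 1M-token window.
print(round(avg_tokens_per_frame(1_000_000, hours=8), 1))  # 34.7
```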
Breakthrough Assessment
8/10
A smart architectural shift that exploits an intrinsic property of video (codec structure) rather than just engineering better transformers. It drastically cuts compute while improving temporal resolution.