CoPE-VideoLM: Codec Primitives For Efficient Video Language Models

📝 Paper Summary

Video Language Models (VideoLMs), Efficient Video Understanding, Compressed Video Analysis

CoPE-VideoLM leverages native video codec primitives (motion vectors and residuals) to represent inter-frame changes as lightweight tokens, drastically reducing token usage while preserving temporal fidelity.

Core Problem

Current VideoLMs process videos as sequences of full RGB images, which is highly redundant, computationally expensive, and quickly fills context windows, forcing aggressive keyframe sampling that misses temporal details.

Why it matters:

Encoding a full image for every frame is redundant and slow; the resulting time-to-first-token (TTFT) is too high for real-time applications such as robotics
Limited context windows force models to drop most frames (e.g., using only 64 frames), losing critical macro-events and micro-details in longer videos
Proprietary models scale to long contexts but require massive compute; open-source models struggle to balance detail with efficiency

Concrete Example: A 30 FPS video of 8 seconds generates 240 frames. Standard models might sample just 8 keyframes to fit the budget, missing 232 frames of action. CoPE-VideoLM keeps 1 full keyframe and encodes the other 239 frames as compact 'delta' tokens, capturing the full motion without the data explosion.
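
The frame counts in this example follow from simple arithmetic. Below is a minimal sketch, assuming a single GOP spans the whole clip; the function name `gop_frame_split` is hypothetical and does not come from the paper.

```python
def gop_frame_split(fps: int, seconds: int, sampled_keyframes: int) -> dict:
    """Count frames for one clip treated as a single GOP (illustrative assumption)."""
    total = fps * seconds
    return {
        "total_frames": total,
        # Frames a keyframe-sampling baseline never looks at.
        "missed_by_sampling": total - sampled_keyframes,
        # One I-frame opens the GOP; every other frame is a P-frame.
        "i_frames": 1,
        "p_frames": total - 1,
    }


split = gop_frame_split(fps=30, seconds=8, sampled_keyframes=8)
print(split)
```

With the values from the example, this gives 240 total frames, 232 frames missed by 8-keyframe sampling, and 239 P-frames to encode as delta tokens.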

Key Novelty

Codec-Primitives Encoder (CoPE)

Leverage the structure of video codecs (I-frames vs. P-frames) to avoid redundant RGB encoding
Encode I-frames as standard image tokens but P-frames as lightweight 'delta tokens' derived from motion vectors and residuals
Use a specialized pre-training strategy to align these codec tokens with the standard image embedding space

Evaluation Highlights

Reduces time-to-first-token (TTFT) by up to 86.2% and token usage by up to 93% compared to standard VideoLMs (LLaVA-Video-7B)
Maintains or exceeds performance on 14 diverse benchmarks; e.g., +6.9 points on PerceptionTest and +1.3 points on NextQA over LLaVA-Video-7B
Enables processing of up to 8 hours of video (at 1 FPS) within a 1M token context window, an order-of-magnitude increase over dense RGB baselines
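
To put the long-video claim in perspective, the sketch below computes how many frames 8 hours at 1 FPS contains and the average visual-token budget per sampled frame. It assumes "1M" means exactly 1,000,000 tokens and ignores text tokens, so the per-frame figure is only an upper bound on the average, not a number reported in the paper. The function name is hypothetical.

```python
def frames_and_budget(hours: float, fps: float, context_tokens: int):
    """Return (number of sampled frames, average tokens available per frame)."""
    frames = int(hours * 3600 * fps)
    return frames, context_tokens / frames


frames, avg_tokens = frames_and_budget(hours=8, fps=1, context_tokens=1_000_000)
print(frames, round(avg_tokens, 1))  # 28800 34.7
```

An average of roughly 35 tokens per frame is only feasible when most frames are represented by a handful of delta tokens rather than a full set of image tokens.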

Breakthrough Assessment

8/10

A smart architectural shift leveraging intrinsic video properties (codecs) rather than just engineering better transformers. Drastically cuts compute while improving temporal resolution.

⚙️ Technical Details

Problem Definition

Setting: Video Question Answering and Understanding under limited token budgets

Inputs: Video sequence V = (F(1), ..., F(T)) and textual instruction/question

Outputs: Generated textual response

Pipeline Flow

Video Re-encoding (standardize GOP structure)
I-frame encoding (standard Vision Encoder)
P-frame encoding (Delta-Encoder extracting motion/residual tokens)
Token Interleaving (I-tokens + Delta-tokens)
LLM Reasoning (generate answer)
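
The encoding and interleaving steps above can be sketched as follows. This is a toy illustration, not the authors' implementation: both encoders are stand-ins that emit placeholder string labels, and the token counts (`tokens_per_i_frame=4`, `delta_tokens_per_frame=2`) are arbitrary values chosen to keep the output short.

```python
from typing import List, Tuple

Frame = Tuple[str, int]  # (frame_type, frame_index); frame_type is "I" or "P"


def encode_i_frame(index: int, n_tokens: int) -> List[str]:
    """Stand-in for the vision encoder: emits placeholder image-token labels."""
    return [f"I{index}:{k}" for k in range(n_tokens)]


def encode_p_frame(index: int, n_tokens: int) -> List[str]:
    """Stand-in for the Delta-Encoder: emits placeholder delta-token labels."""
    return [f"D{index}:{k}" for k in range(n_tokens)]


def interleave(frames: List[Frame],
               tokens_per_i_frame: int = 4,
               delta_tokens_per_frame: int = 2) -> List[str]:
    """Concatenate I-frame and P-frame tokens in temporal order."""
    sequence: List[str] = []
    for frame_type, index in sorted(frames, key=lambda f: f[1]):
        if frame_type == "I":
            sequence.extend(encode_i_frame(index, tokens_per_i_frame))
        elif frame_type == "P":
            sequence.extend(encode_p_frame(index, delta_tokens_per_frame))
        else:
            raise ValueError(f"unknown frame type: {frame_type!r}")
    return sequence


gop = [("I", 0), ("P", 1), ("P", 2)]
visual_tokens = interleave(gop)
print(visual_tokens)
```

The resulting sequence would then be placed alongside the text instruction as input to the language model.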

System Modules

Vision Encoder (Input Processing)

Encode I-frames (full images) into dense feature tokens

Model or implementation: SigLIP (frozen)

Delta-Encoder (Input Processing)

Encode P-frames (motion vectors + residuals) into sparse, lightweight tokens aligned with image space

Model or implementation: Specialized Transformer-based encoder (Delta-Encoder)

Token Interleaver (Input Processing)

Construct the final sequence by concatenating I-frame tokens and P-frame delta tokens in temporal order

Model or implementation: Deterministic concatenation

Language Model

Process visual tokens and text instructions to generate answers

Model or implementation: Qwen2 (fine-tuned)

Novel Architectural Elements

Hybrid token stream combining dense RGB tokens (I-frames) with sparse Codec Primitive tokens (P-frames)
Delta-Encoder architecture: dual-branch transformer processing motion vectors and residuals separately before fusion
P-frame fusion strategy: Grouping multiple P-frames to trade off temporal resolution for token efficiency
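
The fusion trade-off can be illustrated with a small sketch that groups consecutive P-frames and counts the resulting delta tokens. It assumes, for illustration only, that each fused group yields the same 8 tokens as a single P-frame; the summary does not state how token counts change under fusion, and the function names are hypothetical.

```python
from typing import List


def group_p_frames(p_frame_indices: List[int], group_size: int) -> List[List[int]]:
    """Split consecutive P-frame indices into fusion groups of at most group_size."""
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    return [p_frame_indices[i:i + group_size]
            for i in range(0, len(p_frame_indices), group_size)]


def delta_token_count(num_p_frames: int, group_size: int,
                      tokens_per_group: int = 8) -> int:
    """Delta tokens for one GOP, assuming a fixed token count per fused group."""
    num_groups = -(-num_p_frames // group_size)  # ceiling division
    return num_groups * tokens_per_group


print(group_p_frames(list(range(1, 8)), group_size=3))  # [[1, 2, 3], [4, 5, 6], [7]]
print(delta_token_count(239, group_size=1))    # 1912
print(delta_token_count(239, group_size=30))   # 64
```

Larger groups cut the token count roughly in proportion to the group size, at the cost of a coarser temporal grid within each GOP.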

Modeling

Base Model: LLaVA-Video-7B (SigLIP vision encoder + Qwen2 LLM)

Training Method: Two-stage training: (1) Alignment pre-training of Delta-Encoder, (2) End-to-end instruction fine-tuning

Objective Functions:

Purpose: Align Delta-Encoder outputs with frozen vision encoder space (Stage 1).

Formally: L_align = MSE(X_P_hat, phi_RGB(I_hat)), where X_P_hat is the reconstructed feature from codec primitives.

Purpose: Standard autoregressive language modeling (Stage 2).

Formally: Next-token prediction loss on instruction tuning data.
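
As a rough illustration of the Stage 1 objective, the sketch below computes a mean squared error between two feature vectors. It assumes features are flat lists of floats; `delta_features` stands in for X_P_hat and `rgb_features` for the frozen vision encoder's output, and both are made-up placeholder values rather than real model outputs.

```python
from typing import Sequence


def mse(predicted: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equal-length feature vectors."""
    if len(predicted) != len(target) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)


delta_features = [1.0, 2.0, 3.0, 4.0]  # placeholder for X_P_hat
rgb_features = [1.0, 2.0, 3.0, 6.0]    # placeholder for the frozen encoder target
alignment_loss = mse(delta_features, rgb_features)
print(alignment_loss)  # 1.0
```

In an actual training setup the target would carry no gradient, since the vision encoder is frozen; only the Delta-Encoder would be updated.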

Training Data:

Pre-training: 0-30s videos from PerceptionTest training set
Fine-tuning: LLaVA-Video-178K dataset (1.39M QA pairs)

Key Hyperparameters:

learning_rate: 1e-5
batch_size: 128
pre_training_video_length: 0-30 seconds
video_fps: 30 FPS re-encoded
GOP_size: 240 frames
delta_tokens_per_frame: 8 (4 motion + 4 residual)

Compute: 64 A100-80G GPUs for 14 days (21K GPU hours)
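
The GPU-hour figure can be checked directly from the GPU count and the duration, assuming all 64 GPUs ran continuously for the full 14 days:

```python
gpus = 64
days = 14
gpu_hours = gpus * days * 24  # 24 hours per day
print(gpu_hours)  # 21504
```

This is about 21.5K GPU hours, consistent with the rounded 21K above.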

Comparison to Prior Work

vs. LLaVA-Video: Uses codec primitives instead of full RGB frames for most timesteps, reducing token count.
vs. Video-LaVIT: Treats codec data as continuous embeddings aligned to vision space, not discrete language tokens.
vs. EMA: Retains residuals and creates a variable-length sequence preserving temporal order, rather than a fixed summary.
vs. Chat-UniVi [not cited in paper]: Chat-UniVi uses learned clustering to merge tokens; CoPE uses structural codec sparsity defined by the video format itself.

Limitations

Requires access to the compressed video bitstream, or a re-encoding step, to obtain the required GOP structure (I/P frames)
Dependent on the quality of the underlying video codec; compression artifacts could theoretically propagate
P-frame fusion involves a trade-off: extremely sparse fusion might miss very fast sub-second actions
Current implementation standardizes on MPEG-4; generalization to other modern codecs (AV1, etc.) not explicitly explored

Reproducibility

Code: https://cope.github.io

Code is publicly available at https://cope.github.io, and training uses public datasets (PerceptionTest, LLaVA-Video-178K). Re-encoding videos to the fixed GOP structure is a required preprocessing step.

📊 Experiments & Results

Evaluation Setup

Zero-shot and Fine-tuned Video QA across varying video lengths and types

Benchmarks:

PerceptionTest (General Video QA)
NextQA (Causal/Temporal QA)
ActivityNet-QA (Long-form QA)
VideoMME (Comprehensive Video Understanding)
MVBench (Fine-grained temporal understanding)
Video-TT (Long-form understanding)
ScanQA (Spatial scene understanding)

Metrics:

Accuracy (%)
Score (0-5 or 0-100 depending on benchmark)
Time-to-first-token (TTFT)
Token count
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Comparison against the direct baseline LLaVA-Video-7B shows CoPE-VideoLM achieving higher accuracy with significantly fewer tokens.
PerceptionTest	Accuracy	62.4	69.3	+6.9
NextQA	Accuracy	78.4	79.7	+1.3
Efficiency metrics demonstrate the drastic reduction in computational cost.
Inference Latency	Time-to-first-token (TTFT) reduction (%)	0.0	86.2	+86.2
Inference Cost	Token usage reduction (%)	0.0	93.0	+93.0
Long-form video understanding results show scalability to extended contexts.
Video-TT	Score	59.3	64.5	+5.2
Video-MMMU	Score	47.7	50.1	+2.4
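
The Δ column and the efficiency figures can be sanity-checked with a few lines of arithmetic. The sketch below recomputes the deltas from the table and converts a percentage reduction into an equivalent "times fewer" factor. Because the reductions are reported as "up to" values, the factors are best-case figures; the helper name `reduction_to_factor` is hypothetical.

```python
# (benchmark, baseline score, this paper's score), copied from the table above
rows = [
    ("PerceptionTest", 62.4, 69.3),
    ("NextQA", 78.4, 79.7),
    ("Video-TT", 59.3, 64.5),
    ("Video-MMMU", 47.7, 50.1),
]

# Round to one decimal to avoid floating-point noise in the differences.
deltas = {name: round(ours - base, 1) for name, base, ours in rows}


def reduction_to_factor(reduction_percent: float) -> float:
    """Convert 'X% reduction' into 'the baseline is N times larger'."""
    return 1.0 / (1.0 - reduction_percent / 100.0)


ttft_factor = reduction_to_factor(86.2)   # about 7.2x lower TTFT
token_factor = reduction_to_factor(93.0)  # about 14.3x fewer tokens

print(deltas)
print(round(ttft_factor, 1), round(token_factor, 1))
```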

Experiment Figures

Scalability plot: Maximum video duration (hours) vs. Token Budget (M tokens) for different methods.

Main Takeaways

Codec primitives (motion/residuals) are a highly effective substitute for dense RGB frames, preserving necessary information for complex reasoning while shedding massive redundancy.
The approach scales much better than RGB baselines: token usage grows slowly with video length, allowing 8-hour videos (at 1 FPS) to fit in a 1M-token context window, roughly an order of magnitude more than dense RGB baselines can hold.
Performance improvements are consistent across general QA, temporal reasoning, and long-form tasks, suggesting the delta-tokens capture robust semantic features.
Pre-training the Delta-Encoder to align with the image encoder's space is a crucial step for effective integration into the VideoLM.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Transformer architectures and Vision Transformers (ViT)
Basics of Video Compression (I-frames, P-frames, GOP structure)
Multimodal LLMs (alignment, instruction tuning)

Key Terms

VideoLM: Video Language Model—AI systems that extend Large Language Models to perceive and reason about video content

TTFT: Time-to-first-token, the latency between sending a request and receiving the first generated token; for VideoLMs it is dominated by processing the video input

GOP: Group of Pictures—a specific arrangement of frames in video compression (e.g., one I-frame followed by many P-frames)

I-frame: Intra-coded frame—a fully specified image in a video stream, serving as a reference point (like a JPEG)

P-frame: Predictive frame—a video frame encoded only as changes (motion/residuals) relative to a previous frame

Motion Vectors: Data in compressed video describing how blocks of pixels move from one frame to the next (optical flow approximation)

Residuals: The error or difference between the predicted frame (moved by motion vectors) and the actual target frame

Delta-tokens: The novel lightweight tokens proposed by this paper, representing the information in P-frames (motion + residuals)

SigLIP: Sigmoid Loss for Language Image Pre-training—a vision encoder used to extract features from images

Qwen2: A specific family of Large Language Models used here as the reasoning backbone
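
The relationship between motion vectors and residuals defined above can be made concrete with a toy example. The sketch uses a one-dimensional "frame" of pixel values, integer motion vectors, and fixed-size blocks. Real codecs operate on 2D blocks with sub-pixel precision and transform-coded residuals, so this is a simplified illustration with hypothetical function names, not a description of MPEG-4 or of the paper's pipeline.

```python
from typing import List


def predict(reference: List[int], motion_vectors: List[int], block: int) -> List[int]:
    """Build the motion-compensated prediction, one block at a time."""
    predicted: List[int] = []
    for b, mv in enumerate(motion_vectors):
        start = b * block + mv  # where this block's content sits in the reference
        for offset in range(block):
            # Clamp to the frame edge so out-of-range reads reuse the border pixel.
            src = min(max(start + offset, 0), len(reference) - 1)
            predicted.append(reference[src])
    return predicted


def residual(actual: List[int], predicted: List[int]) -> List[int]:
    """What the prediction got wrong: actual minus predicted."""
    return [a - p for a, p in zip(actual, predicted)]


def reconstruct(predicted: List[int], res: List[int]) -> List[int]:
    """A decoder adds the residual back onto the prediction."""
    return [p + r for p, r in zip(predicted, res)]


reference = [10, 20, 30, 40, 50, 60]  # previous (reference) frame
actual = [20, 30, 40, 50, 60, 61]     # content shifted by one pixel, new value at the edge
mvs = [1, 1, 1]                       # one motion vector per 2-pixel block

pred = predict(reference, mvs, block=2)
res = residual(actual, pred)
print(pred)  # [20, 30, 40, 50, 60, 60]
print(res)   # [0, 0, 0, 0, 0, 1]
```

Most of the residual is zero because the motion vectors already explain the shift; this sparsity is what makes P-frame data much cheaper to represent than a full image.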