ViPE: Visual Perception in Parameter Space for Efficient Video-Language Understanding

📝 Paper Summary

Video-Language Models (Video-LLMs)
Parameter-Efficient Fine-Tuning (PEFT)

ViPE replaces long visual token sequences with learnable perceptual weights injected directly into the LLM's parameters, enabling efficient long-video understanding without visual tokens at inference.

Core Problem

Existing Video-LLMs concatenate visual tokens with text, causing computational costs to scale quadratically with video length due to the LLM's self-attention mechanism.
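To make the quadratic scaling concrete, the sketch below counts the query-key score entries that one full self-attention map has to compute when visual and text tokens share a sequence. The function name, the 100-token prompt, and the visual token counts are illustrative assumptions, not values from the paper.

```python
def attention_pairs(num_visual_tokens: int, num_text_tokens: int) -> int:
    """Number of query-key score entries in one full self-attention map."""
    seq_len = num_visual_tokens + num_text_tokens
    return seq_len * seq_len

# Hypothetical prompt of 100 text tokens; visual token counts are illustrative.
text_tokens = 100
for visual_tokens in (0, 1_000, 2_000, 4_000):
    print(visual_tokens, attention_pairs(visual_tokens, text_tokens))
```

Doubling the sequence length quadruples the number of score entries. Causal masking halves the count but does not change the quadratic growth.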

Why it matters:

Long videos (e.g., 10K frames) would require millions of visual tokens, exceeding current model context limits
Standard approaches face severe latency and memory bottlenecks during inference when processing dense visual inputs
Simple compression strategies often degrade temporal coherence and semantic richness for long-range video modeling

Concrete Example: Processing a video with 10K frames using LLaVA would generate over 5.7 million visual tokens (10,000 frames × 576 tokens per frame). A sequence this long makes real-time inference, or even fitting the input into the context window, impossible for standard LLMs.
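The token count in this example can be reproduced with a few lines of arithmetic. The sketch assumes 576 visual tokens per frame, which is what LLaVA-1.5 produces with a CLIP ViT-L/14 encoder at 336×336 input; the function name is hypothetical.

```python
# Assumption: 576 visual tokens per frame, as in LLaVA-1.5 with a
# CLIP ViT-L/14 encoder at 336x336 input (24 x 24 patches).
TOKENS_PER_FRAME = (336 // 14) ** 2

def visual_token_count(num_frames: int, tokens_per_frame: int = TOKENS_PER_FRAME) -> int:
    """Total visual tokens when every frame is tokenized independently."""
    return num_frames * tokens_per_frame

total = visual_token_count(10_000)
print(total)  # 5760000
```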

Key Novelty

Video-to-Parameter Alignment Paradigm

Instead of feeding visual tokens as input to the LLM, transform video features into low-rank weight updates (perceptual weights) added to the LLM's weights
Use a hierarchical merge strategy to compress redundant visual frames into compact queries before generating these weights
Allows the LLM to 'see' the video by modifying its internal processing logic rather than reading a long description of it
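The idea of moving visual information from the input sequence into the weights can be illustrated on a single linear layer. The sketch below is a toy NumPy example with random matrices and small dimensions chosen for illustration; it is not the paper's implementation. It shows that adding a low-rank update to a frozen weight changes the layer's output for a text token without any visual token appearing in the sequence.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank = 16, 16, 4   # toy sizes; the summary reports rank 64

W = rng.standard_normal((d_out, d_in))   # frozen LLM weight
A = rng.standard_normal((rank, d_in))    # low-rank factor (stands in for a video-dependent factor)
B = rng.standard_normal((d_out, rank))   # low-rank factor
x = rng.standard_normal(d_in)            # hidden state of one text token

delta_W = B @ A                          # low-rank update, rank at most 4
y_merged = (W + delta_W) @ x             # weights modified once per video
y_split = W @ x + B @ (A @ x)            # same result without forming delta_W

assert np.allclose(y_merged, y_split)
print(np.linalg.matrix_rank(delta_W))
```

Either form can be used at inference: merging the update into the weight once per video, or keeping the low-rank path separate. In both cases the sequence length is set by the text alone.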

Architecture

Overview of ViPE architecture showing the Visual Injection Module and Visual Perception Module integrating with the LLM via LoRA.

Evaluation Highlights

Reduces FLOPs by 85% and inference time by 65% compared to LLaVA-style baselines while maintaining comparable performance
Outperforms token-based Video-LLaVA on 5 long-video benchmarks (e.g., +12.8% on EgoSchema)
Maintains stable performance even when merging 60% of visual context, demonstrating high efficiency

Breakthrough Assessment

8/10

Significant efficiency breakthrough for long videos. Moving from token-based to parameter-based alignment is a novel and effective paradigm shift for multimodal LLMs.

⚙️ Technical Details

Problem Definition

Setting: Video-Language Understanding where video features modulate LLM parameters instead of serving as input tokens

Inputs: Video frames V and text prompt T

Outputs: Text response X generated by LLM

Pipeline Flow

Vision Encoder (extracts frame features)
Visual Injection Module (compresses features into queries)
Visual Perception Module (generates weight updates)
LLM Injection (adds weights to LLM layers)

System Modules

Vision Encoder

Extract visual features from sampled video frames

Model or implementation: CLIP ViT-L/14

Visual Injection Module

Compress video features into a small set of perceptual queries using attention and hierarchical merging

Model or implementation: Custom Transformer-like layers (Self-Attn, Cross-Attn, FFN)
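A minimal sketch of how a fixed set of queries can compress a variable number of frame features through cross-attention is given below. It is a single-head toy version without learned projections, self-attention, or FFN layers, so it only illustrates the shape behavior. The query count (64) and feature dimension (512) come from the hyperparameters listed in this summary; the frame and patch counts are made up.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, features):
    """Single-head cross-attention without learned projections (toy version)."""
    d = queries.shape[-1]
    scores = queries @ features.T / np.sqrt(d)   # (num_queries, num_features)
    weights = softmax(scores, axis=-1)
    return weights @ features                     # (num_queries, d)

rng = np.random.default_rng(0)
num_frames, patches_per_frame, dim = 8, 16, 512   # toy frame/patch counts
features = rng.standard_normal((num_frames * patches_per_frame, dim))
queries = rng.standard_normal((64, dim))          # 64 perceptual queries

compressed = cross_attend(queries, features)
print(compressed.shape)  # (64, 512)
```

The output size depends only on the number of queries, so adding frames increases the cross-attention cost linearly but leaves the downstream representation the same size.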

Visual Perception Module

Project queries into low-rank weight updates for the LLM

Model or implementation: Linear Projectors
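The summary does not spell out how the linear projectors turn queries into weight updates, so the sketch below shows only one plausible arrangement, stated as an assumption: each of the 64 queries is projected to one row of a factor A, and a second factor B is a learned, video-independent matrix initialized to zero as in LoRA. The names `P`, `A`, and `B` and the toy hidden size are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
num_queries, query_dim = 64, 512      # from the hyperparameters in this summary
d_in = d_out = 256                    # toy size; Vicuna-7B uses 4096

queries = rng.standard_normal((num_queries, query_dim))

# Hypothetical linear projector: each query becomes one row of A.
P = rng.standard_normal((query_dim, d_in)) / np.sqrt(query_dim)
A = queries @ P                       # (rank, d_in), depends on the video
# Hypothetical learned, video-independent factor (zero-initialized, as in LoRA).
B = np.zeros((d_out, num_queries))

delta_W = B @ A                       # (d_out, d_in), rank at most 64
print(A.shape, delta_W.shape)
```

Under this assumption the number of queries equals the rank of the update, which is consistent with the summary listing 64 for both.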

Large Language Model

Generate text response using weights modulated by visual information

Model or implementation: Vicuna-7B-v1.5

Novel Architectural Elements

Parameter-space visual injection: Visual information enters via weight modulation (ΔW) rather than input tokens
Hierarchical Context Merging (HCM): Layer-wise filtering of visual tokens based on relevance to queries to reduce cross-attention cost
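Layer-wise filtering by relevance can be sketched as follows. This is an illustrative NumPy version, not the paper's algorithm: scoring each visual token by its maximum cosine similarity over the queries and the per-layer keep ratios are both assumptions.

```python
import numpy as np

def filter_by_relevance(features, queries, keep_ratio):
    """Keep the visual tokens most similar (cosine) to any query."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    relevance = (f @ q.T).max(axis=1)                    # best match per visual token
    num_keep = max(1, int(round(keep_ratio * len(features))))
    keep = np.sort(np.argsort(-relevance)[:num_keep])    # preserve original token order
    return features[keep]

rng = np.random.default_rng(0)
features = rng.standard_normal((1000, 512))
queries = rng.standard_normal((64, 512))

# Hypothetical schedule: drop a share of the remaining tokens at each of three layers.
for keep_ratio in (0.8, 0.7, 0.7):
    features = filter_by_relevance(features, queries, keep_ratio)
    print(len(features))
```

With this schedule 392 of the 1,000 tokens remain, roughly a 60% reduction, and each later cross-attention layer attends over a shorter context than the one before it.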

Modeling

Base Model: Vicuna-7B-v1.5

Training Method: Two-stage training: Pre-training (weights only) and SFT (full model)

Objective Functions:

Purpose: Minimize the negative log-likelihood of the target text tokens.

Formally: Standard language modeling loss.
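Written out, the standard language modeling loss in this setting can be expressed as below. The notation is an assumption for illustration: $x_t$ are the tokens of the response $X$, $T$ is the text prompt, $W$ denotes the LLM weights, and $\Delta W(V)$ the perceptual weights generated from the video $V$.

```latex
\mathcal{L} = -\sum_{t=1}^{|X|} \log p\left(x_t \mid x_{<t},\, T;\; W + \Delta W(V)\right)
```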

Adaptation: LoRA-style injection (rank=64, alpha=64)
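To give a sense of scale for rank 64, the sketch below compares the size of a rank-64 update with a full square projection matrix at hidden size 4096 (the hidden size of the LLaMA-7B architecture that Vicuna-7B uses). The helper name is hypothetical, and `alpha / rank` is the conventional LoRA scaling factor rather than something the summary states explicitly.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two low-rank factors A (rank x d_in) and B (d_out x rank)."""
    return rank * d_in + d_out * rank

hidden = 4096            # hidden size of Vicuna-7B (LLaMA-7B architecture)
rank, alpha = 64, 64     # values reported in this summary

full = hidden * hidden
low_rank = lora_param_count(hidden, hidden, rank)
scaling = alpha / rank   # conventional LoRA scaling factor

print(full, low_rank, low_rank / full, scaling)
```

A rank-64 update for one 4096×4096 matrix has about 3% of the parameters of the full matrix, and with alpha equal to the rank the update is applied without rescaling.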

Training Data:

Pre-training: 4M image-text (CC3M, COCO, etc.) + 3M video-text (WebVid, VALOR)
SFT: LLaVA-665K (images) + LLaVA-Video-178K (videos)

Key Hyperparameters:

learning_rate: 2e-5
batch_size: 256
visual_embedding_dimension: 512
perceptual_query_count: 64
injection_interval: Every 4 layers
frames_sampled: 32 (SFT)
LoRA_rank: 64
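One reading of "every 4 layers" is that perceptual weights are injected at every fourth transformer layer of the LLM. Under that reading, and assuming injection starts at the first layer (the summary gives no offset), the injected layers for the 32-layer Vicuna-7B would be:

```python
NUM_LAYERS = 32          # Vicuna-7B has 32 transformer layers
INJECTION_INTERVAL = 4   # "every 4 layers" from the hyperparameter list

# Assumption: injection starts at layer 0; the summary does not give the offset.
injected_layers = list(range(0, NUM_LAYERS, INJECTION_INTERVAL))
print(injected_layers)   # [0, 4, 8, 12, 16, 20, 24, 28]
```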

Compute: Pre-training: 48 hours on 8x NVIDIA A800; Fine-tuning: 30 hours on 8x NVIDIA A800

Comparison to Prior Work

vs. Video-LLaVA: Injects visual info into weights instead of input tokens, reducing inference FLOPs by 85%
vs. LLaMA-VID: Eliminates visual tokens entirely during inference (0 visual tokens vs 2 per frame)
vs. Video-ChatGPT: Avoids information loss from simple pooling by using learnable perceptual queries and hierarchical merging

Limitations

Dependence on pre-trained visual encoders limits parameter-level alignment potential
Same learning strategy applied to all weight types (Q, K, V, O, M) despite different roles
Hierarchical merging still operates at token level rather than semantic level

Reproducibility

The paper text does not state whether code is released. Detailed hyperparameters and dataset compositions are listed in the Appendix. Pre-trained CLIP and Vicuna weights are public.

📊 Experiments & Results

Evaluation Setup

Zero-shot evaluation on video benchmarks

Benchmarks:

MSVD-QA (Short Video QA)
ActivityNet-QA (Long Video QA)
EgoSchema (Long Video Understanding)
VideoMME (Comprehensive Video Analysis)
MVBench (Multi-modal Video Understanding)

Metrics:

Accuracy (%)
Score (1-5 scale for generative tasks)
Statistical methodology: Not explicitly reported in the paper

Key Results

ViPE outperforms or matches token-based baselines on both short and long video benchmarks while using 0 visual tokens at inference.

Benchmark	Metric	Baseline	This Paper	Δ
EgoSchema	Accuracy	38.4	51.2	+12.8
VideoMME	Score	39.9	47.2	+7.3
Inference Latency	ms/sample	391	132	-259
Computational Cost	TFLOPs	8.4	1.3	-7.1

Ablation studies confirm the importance of hierarchical merging and optimal injection strategies.

Benchmark	Metric	Baseline	This Paper	Δ
EgoSchema	Accuracy	49.8	51.2	+1.4
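The latency and compute rows in the table above can be converted into relative reductions to check them against the headline figures of roughly 65% and 85%. The helper name is hypothetical; the numbers are taken from the table.

```python
def relative_change(baseline: float, ours: float) -> float:
    """Signed relative change in percent (negative means a reduction)."""
    return (ours - baseline) / baseline * 100.0

latency = relative_change(391, 132)   # ms/sample
flops = relative_change(8.4, 1.3)     # TFLOPs
print(round(latency, 1), round(flops, 1))  # -66.2 -84.5
```

Both values agree with the rounded figures quoted in the Evaluation Highlights.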

Experiment Figures

FLOPs comparison between ViPE and LLaVA-V1.5-Video as the number of frames increases.

Main Takeaways

Token-free alignment via parameter injection is highly efficient, reducing FLOPs by ~85% while maintaining or exceeding SOTA accuracy
Increasing frame count (4 to 32) consistently improves performance without the prohibitive cost increase seen in token-based models
Hierarchical Context Merging effectively filters redundant visual information, allowing stable performance even with 60% token reduction
Injecting visual weights into all parameter types (Q, K, V, O, M) yields the best results compared to subsets like Q/K only

📚 Prerequisite Knowledge

Prerequisites

Transformer architecture (Self-Attention, Cross-Attention)
Low-Rank Adaptation (LoRA)
Vision-Language Models (CLIP, LLaVA)

Key Terms

LoRA: Low-Rank Adaptation—a technique to fine-tune models by injecting small trainable rank-decomposition matrices into frozen weights

ViPE: Visual Perception in Parameter Space—the proposed method enabling token-free video understanding

Visual Perceptual Weights: Learnable weight offsets generated from video features and added to the LLM's weights to inject visual information

Hierarchical Context Merging: A strategy to progressively filter redundant visual tokens across layers based on cosine similarity to queries

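FLOPs: Floating-point operations, a count of arithmetic operations used as a measure of computational cost (not to be confused with FLOPS, floating-point operations per second, which measures hardware throughput)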

Q-Former: A module from BLIP-2 used to bridge vision and language modalities via learnable queries