Object-Centric Vision Token Pruning for Vision Language Models

📝 Paper Summary

Vision-Language Models (VLMs), Efficient Inference, Token Pruning

OC-VTP improves VLM efficiency by using a lightweight, pre-trained object-centric module to select the most representative vision tokens based on reconstruction error, without fine-tuning the VLM.

Core Problem

Vision tokens in VLMs are computationally expensive and information-redundant, but existing pruning methods rely on heuristic attention metrics that do not guarantee the selection of the most representative tokens.

Why it matters:

Vision tokens dominate inference cost (e.g., 2880 tokens in LLaVA-NeXT), creating a bottleneck for high-resolution or multi-image tasks.
Existing methods rely on indirect criteria (like attention averages) without optimality guarantees, potentially discarding important information.
Current approaches often require fine-tuning the VLM or accessing text tokens, limiting flexibility and increasing deployment complexity.

Concrete Example: In an image with a small fence, heuristic methods might discard the fence tokens because they have low attention scores compared to the background. Consequently, the VLM fails to answer questions about the fence (as seen in failure cases in Figure 4), whereas OC-VTP aims to preserve such distinct object features.

Key Novelty

Object-Centric Vision Token Pruning (OC-VTP)

Uses Slot Attention to aggregate vision tokens into object-centric slots, ensuring the selected tokens mathematically minimize the reconstruction error of the original features.
Introduces an Area-Weighted MSE loss during training to ensure small but important objects (which might be ignored by standard MSE) are preserved.
Designed as a plug-and-play module trained once on COCO, generalizable to various VLMs without requiring model-specific fine-tuning.

Architecture

The integration of the OC-Pruner into the VLM architecture.

Evaluation Highlights

Retaining only 11.1% of vision tokens on LLaVA-1.5 preserves 95.5% of the original performance, ahead of the second-best method, VisionZip, which preserves 93.1%.
Reduces inference latency by 65% (811.8 ms to 287.3 ms) on LLaVA-NeXT while preserving accuracy comparable to the full model.
Achieves a 17x reduction in prefill FLOPs (33.76 T to 1.95 T) for LLaVA-NeXT with negligible overhead from the pruner itself.
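The headline figures above can be sanity-checked with a few lines of arithmetic. This sketch uses only numbers quoted in this summary, plus two assumptions that the summary does not state: that LLaVA-1.5 emits 576 vision tokens (a 24 x 24 patch grid) and that the 11.1% budget corresponds to 64 kept tokens.

```python
# Numbers quoted in the highlights above.
latency_full_ms, latency_pruned_ms = 811.8, 287.3
flops_full_t, flops_pruned_t = 33.76, 1.95

latency_reduction = 1 - latency_pruned_ms / latency_full_ms
flops_ratio = flops_full_t / flops_pruned_t

# Assumptions (not stated in this summary): 576 vision tokens for
# LLaVA-1.5 and a budget of 64 kept tokens.
llava15_tokens = 576
kept_tokens = 64
retention = kept_tokens / llava15_tokens

print(f"latency reduction: {latency_reduction:.1%}")   # 64.6%
print(f"prefill FLOPs ratio: {flops_ratio:.1f}x")      # 17.3x
print(f"token retention: {retention:.1%}")             # 11.1%
```

Under these assumptions the rounded values agree with the reported 65%, 17x, and 11.1%.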

Breakthrough Assessment

8/10

Strong empirical results with a theoretically grounded approach (reconstruction minimization) rather than heuristics. The plug-and-play nature without VLM fine-tuning is highly practical.

⚙️ Technical Details

Problem Definition

Setting: Select a subset of vision tokens that minimizes information loss relative to the full set, to accelerate VLM inference.

Inputs: Original vision tokens V from the encoder.

Outputs: A reduced set of vision tokens V_p corresponding to the most representative features.
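One way to write this setting down, as a sketch rather than the paper's exact formulation: here K is the token budget, g is a hypothetical map that reconstructs the full token set from the kept subset, and the squared Frobenius norm stands in for "information loss". All three symbols are assumptions introduced for illustration.

```latex
V_p^{\star} \;=\; \operatorname*{arg\,min}_{V_p \subseteq V,\ |V_p| = K}
  \;\bigl\lVert V - g(V_p) \bigr\rVert_F^2
```

The training objective listed under Modeling replaces the plain squared error with an area-weighted variant.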

Pipeline Flow

Group 1: Feature Extraction (Vision Encoder)
Group 2: Pruning (OC-Pruner)
Group 3: Inference (LLM)

System Modules

Vision Encoder

Extract feature tokens from the input image.

Model or implementation: CLIP-ViT (for LLaVA) or Dynamic ViT (for Qwen)

Slot Attention Aggregator (Pruning)

Aggregate vision tokens into 'slots' based on feature similarity.

Model or implementation: Slot Attention module
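To make the aggregation step concrete, here is a minimal NumPy sketch of the Slot Attention update loop. It is deliberately simplified and is not the paper's module: the learned query/key/value projections, GRU, and MLP of standard Slot Attention are omitted, and the slot initialization, iteration count, and feature size are assumptions.

```python
import numpy as np

def slot_attention(tokens, num_slots, iters=3, seed=0):
    """Simplified Slot Attention (no learned projections, GRU, or MLP).

    tokens: (N, D) array of vision tokens.
    Returns (slots, attn) with shapes (num_slots, D) and (num_slots, N);
    attn is the attention map from the last iteration.
    """
    rng = np.random.default_rng(seed)
    n, d = tokens.shape
    slots = rng.normal(size=(num_slots, d))
    for _ in range(iters):
        logits = slots @ tokens.T / np.sqrt(d)        # (S, N) similarity
        logits -= logits.max(axis=0, keepdims=True)   # numerical stability
        attn = np.exp(logits)
        attn /= attn.sum(axis=0, keepdims=True)       # softmax over slots: slots compete per token
        weights = attn / (attn.sum(axis=1, keepdims=True) + 1e-8)  # normalize over tokens
        slots = weights @ tokens                      # each slot becomes a weighted mean of tokens
    return slots, attn

# Hypothetical sizes: 576 tokens of dimension 32, aggregated into 64 slots.
tokens = np.random.default_rng(1).normal(size=(576, 32))
slots, attn = slot_attention(tokens, num_slots=64)
print(slots.shape, attn.shape)  # (64, 32) (64, 576)
```

The softmax over the slot axis is what makes slots compete for tokens, so each token's attention is shared out across slots and sums to one.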

Token Selector (Pruning)

Select the single most representative token for each slot based on maximum attention.

Model or implementation: Argmax operation
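The selection step can be sketched as an argmax over each slot's attention row. How the method handles two slots that pick the same token is not covered in this summary, so the deduplication below is an assumption, and `select_tokens` is a hypothetical helper name.

```python
import numpy as np

def select_tokens(attn, tokens):
    """Pick, for each slot, the token with the highest attention weight.

    attn: (S, N) slot-to-token attention; tokens: (N, D).
    Returns the sorted unique token indices and the kept tokens.
    """
    idx = np.unique(attn.argmax(axis=1))  # assumption: deduplicate repeated picks
    return idx, tokens[idx]

# Toy attention map: 3 slots over 4 tokens.
attn = np.array([[0.1, 0.7, 0.2, 0.0],
                 [0.6, 0.1, 0.1, 0.2],
                 [0.2, 0.1, 0.1, 0.6]])
tokens = np.arange(8.0).reshape(4, 2)
idx, kept = select_tokens(attn, tokens)
print(idx)         # [0 1 3]
print(kept.shape)  # (3, 2)
```

Because only indices are produced, the kept tokens are original encoder features rather than slot vectors.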

Projector & LLM

Process multimodal tokens to generate text response.

Model or implementation: LLaVA / Qwen

Novel Architectural Elements

Insertion of a pre-trained Slot Attention module strictly for token selection (pruning) rather than feature replacement.
Dual-input strategy for pruning: uses middle-layer tokens for calculating attention (reference) but prunes final-layer tokens.
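The dual-input idea can be illustrated as "score on one tensor, gather from another". In this sketch raw dot-product scores stand in for Slot Attention weights, the slot vectors are assumed to be already computed, and all shapes and the helper name `prune_with_reference` are hypothetical.

```python
import numpy as np

def prune_with_reference(ref_tokens, final_tokens, slots):
    """Score tokens on ref_tokens, but gather from final_tokens.

    ref_tokens:   (N, D_ref) middle-layer features, used only for scoring.
    final_tokens: (N, D_out) final-layer features that are actually kept.
    slots:        (S, D_ref) slot vectors (assumed already computed).
    """
    scores = slots @ ref_tokens.T / np.sqrt(ref_tokens.shape[1])  # (S, N)
    idx = np.unique(scores.argmax(axis=1))  # one pick per slot, deduplicated
    return final_tokens[idx], idx

rng = np.random.default_rng(0)
ref = rng.normal(size=(576, 16))     # stand-in for middle-layer tokens
final = rng.normal(size=(576, 32))   # stand-in for final-layer tokens
slots = rng.normal(size=(64, 16))
kept, idx = prune_with_reference(ref, final, slots)
print(kept.shape[1], len(idx) <= 64)  # 32 True
```

The two token sets must share the same ordering, since the indices found on the reference features are applied unchanged to the final-layer features.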

Modeling

Base Model: Slot Attention (integrated into LLaVA-1.5, LLaVA-NeXT, Qwen2.5-VL)

Training Method: Auto-encoding reconstruction task on vision tokens only.

Objective Functions:

Purpose: Ensure selected slots can reconstruct original visual features, prioritizing small objects.

Formally: Area-Weighted MSE (AW-MSE), weighting squared error by inverse mask area.
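A small sketch shows why inverse-area weighting favors small objects. The exact form of AW-MSE is not given in this summary, so the per-token weight of 1 / (mask area), the normalization to unit sum, and the hard `mask_ids` assignment are all assumptions made for illustration.

```python
import numpy as np

def area_weighted_mse(target, recon, mask_ids):
    """Per-token squared error weighted by 1 / (area of the token's mask).

    target, recon: (N, D) arrays; mask_ids: (N,) integer mask label per token.
    Normalizing the weights to sum to 1 is an assumption.
    """
    _, inverse, counts = np.unique(mask_ids, return_inverse=True, return_counts=True)
    weights = 1.0 / counts[inverse]        # tokens in small masks get larger weights
    weights = weights / weights.sum()
    per_token = ((target - recon) ** 2).mean(axis=1)
    return float((weights * per_token).sum())

target = np.zeros((4, 2))
recon = np.zeros((4, 2))
recon[3] = 1.0                      # error only on the single-token "small object"
mask_ids = np.array([0, 0, 0, 1])   # hypothetical masks: one large, one small

plain_mse = float(((target - recon) ** 2).mean())
aw_mse = area_weighted_mse(target, recon, mask_ids)
print(round(plain_mse, 3), round(aw_mse, 3))  # 0.25 0.5
```

In this toy case the error on the one-token object counts twice as much under the weighted loss as under plain MSE.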

Training Data:

40,000 images randomly sampled from COCO dataset.
Images preprocessed into vision tokens via VLM encoders.

Key Hyperparameters:

slot_budgets: Sampled from {32, 64, 128, 192} during training
top_k: 1 (1 token selected per slot)
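As a rough illustration of these hyperparameters, the sketch below draws a slot budget per training step and maps each budget to a retention percentage. Uniform sampling and the 576-token baseline for LLaVA-1.5 are assumptions; `sample_budget` is a hypothetical helper.

```python
import random

SLOT_BUDGETS = (32, 64, 128, 192)  # slot_budgets listed above
TOP_K = 1                          # tokens kept per slot
FULL_TOKENS = 576                  # assumption: LLaVA-1.5 vision token count

def sample_budget(rng):
    """Draw one slot budget per training step (uniform sampling is an assumption)."""
    return rng.choice(SLOT_BUDGETS)

rng = random.Random(0)
budgets = [sample_budget(rng) for _ in range(5)]

# Retention in percent for each budget, given TOP_K tokens per slot.
retention = {b: round(100 * b * TOP_K / FULL_TOKENS, 1) for b in SLOT_BUDGETS}
print(retention)  # {32: 5.6, 64: 11.1, 128: 22.2, 192: 33.3}
```

With top_k fixed at 1, the slot budget equals the number of kept tokens, so the 64-slot setting matches the 11.1% retention quoted in the highlights.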

Compute: Single NVIDIA V100-32GB GPU for inference experiments.

Comparison to Prior Work

vs. FastV/VisionZip/HiPrune: Optimizes a direct reconstruction objective via Slot Attention rather than using heuristic attention proxies.
vs. SparseVLM: Does not require access to text tokens or the LLM decoder; operates purely on vision features.
vs. VisionZip: Does not require fine-tuning the VLM itself.
vs. Slot Attention (original) [not cited in paper]: Uses Slot Attention for selection/pruning indices rather than using the slots themselves as the representation.

Limitations

Training with a fixed slot count is less effective for dynamic-resolution models such as Qwen2.5-VL than for fixed-resolution models such as LLaVA.
May miss objects with low contrast or very small size if the slot attention fails to attend to them (failure cases).
Performance drops slightly compared to full models (trade-off for speed).
Requires an extra (though small) module inference compared to heuristic pruning methods.

Reproducibility

Code: https://github.com/GarryLarry010131/OC-VTP

Publicly available: code repository. Missing: exact training time and number of epochs for the OC-pruner, though it is described as 'lightweight'.

📊 Experiments & Results

Evaluation Setup

Evaluate VLM performance on standard benchmarks after pruning vision tokens.

Benchmarks:

GQA (Visual Question Answering)
MMBench (Multimodal Evaluation)
POPE (Object Hallucination Evaluation)
TextVQA (OCR-based VQA)
ScienceQA (Multimodal Science Questions)

Metrics:

Accuracy (relative to vanilla model)
Inference Latency
FLOPs
Statistical methodology: Not explicitly reported in the paper
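The relative-accuracy metric can be sketched as follows. Averaging per-benchmark ratios is an assumption about how the aggregate is formed, and the scores below are placeholders for illustration only, not results from the paper.

```python
def relative_accuracy(pruned_scores, vanilla_scores):
    """Mean of per-benchmark pruned/vanilla ratios, in percent."""
    ratios = [pruned_scores[name] / vanilla_scores[name] for name in vanilla_scores]
    return 100.0 * sum(ratios) / len(ratios)

# Placeholder scores on two made-up benchmarks (not from the paper).
vanilla = {"bench_a": 60.0, "bench_b": 80.0}
pruned = {"bench_a": 57.0, "bench_b": 78.0}
print(round(relative_accuracy(pruned, vanilla), 2))  # 96.25
```

An alternative is to divide the mean pruned score by the mean vanilla score, which lets benchmarks with larger raw scores dominate.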

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Comparative performance on LLaVA-1.5 across varying token budgets, showing OC-VTP generally maintaining higher relative accuracy than baselines.
Average across 10 benchmarks (LLaVA-1.5, 11.1% of tokens retained, baseline: VisionZip)	Relative Accuracy (%)	93.1	95.5	+2.4
Average across 10 benchmarks (LLaVA-1.5)	Relative Accuracy (%)	97.4	97.7	+0.3
Efficiency metrics demonstrating significant computational savings.
LLaVA-NeXT	Prefill FLOPs (Tera-FLOPs)	33.76	1.95	-31.81
LLaVA-NeXT	Inference Latency (ms)	811.8	287.3	-524.5
Ablation studies validating design choices like insertion layer and loss function.
LLaVA-1.5 (Average)	Relative Accuracy	93.5	94.6	+1.1

Experiment Figures

Comparison of inference latency (ms/image) across benchmarks for Vanilla, Random Pruner, HiPrune, and OC-VTP.

Visualization of selected vision tokens on example images.

Main Takeaways

Consistently outperforms state-of-the-art pruning methods (FastV, VisionZip, HiPrune) across multiple budgets, especially in high-compression regimes (e.g., 11% tokens).
Demonstrates robust generalization: trained once on COCO, it works effectively on unrelated benchmarks like TextVQA and ScienceQA without fine-tuning.
The Area-Weighted MSE loss is critical for performance, likely because it prevents the model from ignoring small but semantically important objects during the reconstruction training task.
Interpretability: The selected tokens align well with object centers (cars, signs, animals), confirming the 'object-centric' claim.

📚 Prerequisite Knowledge

Prerequisites

Vision Transformers (ViT)
Slot Attention mechanisms
Vision-Language Models (LLaVA architecture)
Token Pruning concepts

Key Terms

VTP: Vision Token Pruning—reducing the number of image patches processed by a model to save computation.

Slot Attention: An attention mechanism that aggregates input features into a set of discrete 'slots' or vectors, often used to represent distinct objects in a scene.

Object-Centric Learning: A learning paradigm where models are trained to represent scenes as compositions of distinct objects (slots).

AW-MSE: Area-Weighted Mean-Squared Error—a loss function proposed in this paper that weights reconstruction error by the inverse area of the object mask to prioritize small objects.

OC-pruner: The specific module proposed in this paper, consisting of Slot Attention and a selector, inserted between the vision encoder and the LLM.

MAC: Multiply-Accumulate operations—a measure of computational complexity.

AnyRes: A technique used in LLaVA-NeXT to handle high-resolution images by splitting them into sub-images (grids).
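The 2880-token figure quoted earlier for LLaVA-NeXT is consistent with a simple AnyRes count. The sketch assumes a 2 x 2 grid of sub-images plus one downscaled overview image, each encoded at 336 px with 14 px patches; these settings are assumptions not stated in this summary, and `anyres_token_count` is a hypothetical helper.

```python
def anyres_token_count(grid_rows, grid_cols, image_size=336, patch_size=14):
    """Vision tokens for a grid of sub-images plus one downscaled overview image."""
    tokens_per_image = (image_size // patch_size) ** 2   # 24 * 24 = 576
    return (grid_rows * grid_cols + 1) * tokens_per_image

print(anyres_token_count(2, 2))  # 2880
```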