VLM-Pruner: Buffering for Spatial Sparsity in an Efficient VLM Centrifugal Token Pruning Paradigm

📝 Paper Summary

Efficient Vision-Language Models (VLMs) Token Pruning

VLM-Pruner is a training-free method that selects visual tokens by starting from key object centers and expanding outward, prioritizing local details over distant background noise to balance efficiency and accuracy.

Core Problem

Existing token pruning methods either keep too many redundant tokens (importance-driven) or select scattered, incomplete tokens that miss object details (redundancy-reduction), degrading model performance.

Why it matters:

Visual tokens in high-resolution images can outnumber text tokens by thousands, and self-attention cost in the LLM grows quadratically with sequence length
Deployment on mobile devices is hindered by the high latency and memory usage of processing full visual sequences
Current methods fail to balance removing redundancy with preserving the dense local details needed for fine-grained tasks like OCR and grounding
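
To make the quadratic-cost point concrete, the short sketch below counts query-key pairs in full self-attention before and after pruning from 576 to 64 visual tokens (the budget reported in this summary). The text-prompt length of 64 tokens is a hypothetical value chosen only for illustration, and the count ignores causal masking and the MLP layers.

```python
def attention_pair_count(num_visual: int, num_text: int) -> int:
    """Number of query-key pairs in full self-attention over the joint sequence."""
    seq_len = num_visual + num_text
    return seq_len * seq_len


NUM_TEXT = 64  # hypothetical prompt length, not taken from the paper

full = attention_pair_count(576, NUM_TEXT)    # unpruned visual sequence
pruned = attention_pair_count(64, NUM_TEXT)   # after pruning to 64 visual tokens

pruning_rate = 1 - 64 / 576
print(f"pruning rate: {pruning_rate:.1%}")
print(f"attention pairs: {full} -> {pruned}")
print(f"reduction factor: {full / pruned:.1f}x")
```

Under these assumptions, removing about 89% of the visual tokens shrinks the attention pair count far more than linearly, which is why pruning early in the decoder pays off.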

Concrete Example: In an image of a truck, redundancy-reduction methods like DivPrune might select scattered background edges and miss the truck's body because the body parts look similar to each other. VLM-Pruner purposefully keeps the similar body tokens to preserve the truck's complete structure.

Key Novelty

Centrifugal Token Pruning with Buffering for Spatial Sparsity (BSS)

Frames selection as a 'centrifugal' process that starts with central pivot tokens and gradually expands outward, rather than picking scattered, diverse points
Uses a 'spatial buffering' rule that makes spatially distant tokens appear more 'redundant' (and thus less likely to be picked early), forcing the selection to fill in local object details first
Re-injects information from discarded tokens back into the selected ones using a weighted average, so that discarded content is not dropped outright

Architecture

The three-stage pipeline of VLM-Pruner: Pivot Initialization, Centrifugal Expansion, and Recovery

Evaluation Highlights

Maintains 95.61% of original LLaVA-1.5-7B performance while pruning 88.9% of visual tokens (reducing from 576 to 64 tokens)
Outperforms redundancy-reduction baseline DivPrune by 2.48 points and importance-based FastV by 7.93 points of average normalized performance on LLaVA-1.5-13B with 64 tokens
Achieves state-of-the-art results across 13 benchmarks and 5 different VLM architectures (including video models), showing consistent generalization

Breakthrough Assessment

8/10

Offers a smart, training-free heuristic that addresses a specific failure mode of prior pruning (scattered selection). Strong empirical results across many models, though the core innovation is a heuristic modification to greedy selection.

⚙️ Technical Details

Problem Definition

Setting: Post-hoc compression of visual token sequences in pre-trained VLMs without retraining

Inputs: A sequence of N visual tokens (feature vectors) from the vision encoder

Outputs: A reduced subset of R tokens (where R << N) that best preserves the image's semantic information

Pipeline Flow

Input Processing: Feature extraction and channel screening
Stage 1: Pivot Initialization (Max-Min Selection)
Stage 2: Centrifugal Expansion (Greedy Selection + BSS)
Stage 3: Information Recovery (Similarity-Weighted Aggregation, SWA)

System Modules

Channel Screener

Reduce feature dimensionality to speed up similarity calculations

Model or implementation: Statistical variance selection
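
A minimal sketch of variance-based channel screening, assuming that the channels with the highest variance across tokens are the ones kept (the summary only says "statistical variance selection"). The function name `screen_channels` and the random demo features are hypothetical; `q=256` is the value listed under Key Hyperparameters.

```python
import numpy as np


def screen_channels(tokens: np.ndarray, q: int) -> np.ndarray:
    """Keep the q channels with the highest variance across tokens.

    tokens: (N, D) array of visual token features.
    Returns an (N, q) array with the original channel order preserved.
    """
    variances = tokens.var(axis=0)                # per-channel variance, shape (D,)
    keep = np.sort(np.argsort(variances)[-q:])    # indices of the q largest variances
    return tokens[:, keep]


# Demo on synthetic features: 576 tokens, 1024 channels with varying scales.
rng = np.random.default_rng(0)
features = rng.normal(size=(576, 1024)) * rng.uniform(0.1, 2.0, size=1024)
screened = screen_channels(features, q=256)
print(screened.shape)
```

The screened features would only be used for the similarity computations during selection; the full-dimensional tokens are what the LLM ultimately consumes.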

Pivot Initializer (Selection)

Select initial anchor tokens that are semantically diverse

Model or implementation: Max-Min Distance algorithm
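
The max-min idea can be sketched as greedy farthest-point selection. This is an illustration under assumptions, not the paper's exact procedure: it uses cosine distance, starts from a caller-supplied first token, and assumes no token has zero norm. The names `cosine_distance_matrix` and `max_min_pivots` are hypothetical.

```python
import numpy as np


def cosine_distance_matrix(x: np.ndarray) -> np.ndarray:
    """Pairwise cosine distance (1 - cosine similarity) between rows of x."""
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    return 1.0 - unit @ unit.T


def max_min_pivots(tokens: np.ndarray, num_pivots: int, first: int = 0) -> list[int]:
    """Greedily pick tokens that maximise the minimum distance to those already picked."""
    dist = cosine_distance_matrix(tokens)
    selected = [first]
    min_dist = dist[first].copy()          # distance of each token to its nearest pivot
    for _ in range(num_pivots - 1):
        nxt = int(np.argmax(min_dist))     # token farthest from every current pivot
        selected.append(nxt)
        min_dist = np.minimum(min_dist, dist[nxt])
    return selected


toy = np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [-1.0, 0.0]])
print(max_min_pivots(toy, num_pivots=3))   # [0, 3, 2]
```

In the toy example, token 1 is nearly identical to token 0 and is therefore never chosen as a pivot, which is the semantic-diversity behaviour the module description asks for.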

BSS Selector (Selection)

Iteratively select new tokens, prioritizing those spatially close to existing ones

Model or implementation: Parallel Greedy Algorithm with Distance Penalty
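
The following sketch illustrates how a distance penalty can steer greedy selection toward neighbours of already-selected tokens. It is a simplified, sequential stand-in for the paper's parallel algorithm: the redundancy formula, the rescaling of cosine similarity to [0, 1], and the function names are all assumptions, and the threshold schedule (tau_0, delta_tau) is omitted. Only `lam=0.5` comes from the listed hyperparameters.

```python
import numpy as np


def grid_coords(height: int, width: int) -> np.ndarray:
    """(row, col) coordinates for tokens laid out row-major on a height x width grid."""
    rows, cols = np.divmod(np.arange(height * width), width)
    return np.stack([rows, cols], axis=1).astype(float)


def bss_greedy_select(tokens, coords, pivots, budget, lam=0.5):
    """Grow the selected set one token at a time, penalising spatially distant candidates."""
    unit = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    sim = (unit @ unit.T + 1.0) / 2.0                      # cosine similarity rescaled to [0, 1]
    spatial = np.linalg.norm(coords[:, None] - coords[None], axis=-1)
    spatial = spatial / spatial.max()                      # grid distance normalised to [0, 1]
    selected = list(pivots)
    while len(selected) < budget:
        max_sim = sim[:, selected].max(axis=1)             # similarity to closest-looking selected token
        nearest = spatial[:, selected].min(axis=1)         # distance to nearest selected token
        redundancy = max_sim * (1.0 + lam * nearest)       # far tokens look more redundant
        redundancy[selected] = np.inf                      # never re-pick a selected token
        selected.append(int(np.argmin(redundancy)))
    return selected


# Five equally similar tokens on a 1 x 5 strip; the pivot sits at the right end.
tokens = np.eye(5) + 0.5
coords = grid_coords(1, 5)
print(bss_greedy_select(tokens, coords, pivots=[4], budget=3))   # [4, 3, 2]
```

Because all tokens in the toy example are equally similar, the distance factor alone decides the order, and selection spreads outward from the pivot one neighbour at a time.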

SWA Aggregator

Fuse information from discarded tokens into their nearest selected neighbors

Model or implementation: Weighted Average
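
A hedged sketch of the aggregation step: each discarded token is assigned to its most similar selected token, and each selected token is blended with the similarity-weighted mean of the tokens assigned to it. Interpreting "nearest" as highest cosine similarity, clipping negative similarities to zero, and the exact blending formula are assumptions; `beta=0.3` is the fusion ratio listed under Key Hyperparameters, and `swa_recover` is a hypothetical name.

```python
import numpy as np


def swa_recover(tokens: np.ndarray, selected: list[int], beta: float = 0.3) -> np.ndarray:
    """Fuse discarded tokens into their most similar selected token."""
    unit = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    keep = np.asarray(selected)
    drop = np.setdiff1d(np.arange(len(tokens)), keep)       # indices of discarded tokens
    sim = np.clip(unit[drop] @ unit[keep].T, 0.0, None)     # (num_drop, num_keep)
    owner = sim.argmax(axis=1)                              # selected token each dropped one joins
    fused = tokens[keep].astype(float).copy()
    for k in range(len(keep)):
        members = owner == k
        weights = sim[members, k]
        if weights.sum() > 0:
            merged = (weights[:, None] * tokens[drop][members]).sum(axis=0) / weights.sum()
            fused[k] = (1.0 - beta) * fused[k] + beta * merged
    return fused


toy = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]])
print(swa_recover(toy, selected=[0, 1]))   # first row becomes [1.3, 0.0]
```

In the toy case the discarded token [2, 0] points the same way as the first selected token, so it is merged there with weight beta, while the second selected token is left unchanged.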

Novel Architectural Elements

BSS-modulated similarity matrix: Dynamically altering the similarity metric during greedy selection based on spatial distance to enforce local density
Centrifugal selection topology: A specific ordering of token selection (pivots -> neighbors -> outliers) explicitly designed to preserve object completeness

Modeling

Base Model: Evaluated on LLaVA-1.5 (7B/13B), LLaVA-Next-7B, Qwen2-VL-7B, LLaVA-Video-7B

Training Method: Training-free pruning algorithm applied at inference time

Objective Functions:

Purpose: Select tokens that maximize diversity while maintaining spatial closeness.

Formally: Maximize sum of min(Similarity(i, j) * DistanceFactor(i, S))

Key Hyperparameters:

lambda: 0.5 (BSS gain factor)
q: 256 (screened channels)
beta: 0.3 (SWA fusion ratio)
tau_0: 0.8 (initial similarity threshold)
delta_tau: 0.1 (threshold decay step)
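
For reference, the listed values can be gathered into a single configuration object. The class name `PrunerConfig`, the field names, and the `budget` field are hypothetical conveniences rather than the repository's actual API; the numeric defaults are the ones reported above, and 64 is one of the evaluated token budgets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PrunerConfig:
    lam: float = 0.5         # lambda: BSS gain factor
    q: int = 256             # number of screened channels
    beta: float = 0.3        # SWA fusion ratio
    tau_0: float = 0.8       # initial similarity threshold
    delta_tau: float = 0.1   # threshold decay step
    budget: int = 64         # retained visual tokens (assumed field, e.g. 64 of 576)


config = PrunerConfig()
print(config)
```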

Compute: Requires a standard GPU for inference; pruning adds minimal overhead (O(N*|S|) complexity per selection loop) but significantly reduces overall LLM decoding cost

Comparison to Prior Work

vs. FastV: FastV keeps high-attention tokens which are often redundant/clustered; VLM-Pruner balances coverage and detail
vs. DivPrune: DivPrune selects scattered tokens to maximize diversity; VLM-Pruner uses BSS to force local density (completeness) before expanding
vs. ToMe: ToMe merges tokens; VLM-Pruner selects a subset and aggregates discarded info, explicitly using spatial priors [not cited in paper]

Limitations

Relies on the assumption that spatial proximity correlates with object completeness, which might fail for disjoint objects
Adds a preprocessing step (channel screening + greedy search) which has its own computational cost, though amortized by shorter sequence length
Hyperparameters (lambda, beta) might need tuning for different visual resolutions or domains beyond those tested
Pruning is applied at the 2nd decoder layer; applying it elsewhere might yield different results

Reproducibility

Code: https://github.com/Casey-bit/VLMPruner

The paper provides all hyperparameters (lambda, beta, thresholds) and detailed algorithms. Base models (LLaVA, Qwen) are open source.

📊 Experiments & Results

Evaluation Setup

Zero-shot evaluation on diverse multimodal benchmarks using pre-trained VLMs with token pruning applied

Benchmarks:

GQA (Visual Reasoning)
TextVQA (OCR / Text Reading)
POPE (Object Hallucination)
VideoMME (Video Understanding)

Metrics:

Accuracy (Acc)
Score (weighted metrics for some benchmarks)
Statistical methodology: Not explicitly reported in the paper

Key Results

Performance on LLaVA-1.5-13B with extreme pruning (64 tokens retained out of 576, ~89% pruning rate):

Benchmark	Metric	Baseline	This Paper	Δ
Average (9 benchmarks)	Normalized Performance %	90.20 (DivPrune)	92.68	+2.48
Average (9 benchmarks)	Normalized Performance %	84.75 (FastV)	92.68	+7.93
GQA	Accuracy	56.4	59.3	+2.9

Performance on LLaVA-1.5-7B across varying token budgets:

Benchmark	Metric	Baseline	This Paper	Δ
Average (9 benchmarks)	Retention Rate	100.00 (unpruned)	95.61	-4.39
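
As a quick sanity check, the Δ column can be recomputed from the baseline and VLM-Pruner values reported above. The row labels follow the Evaluation Highlights, which attribute the +2.48 gap to DivPrune and the +7.93 gap to FastV; the baseline behind the GQA row is not named in this summary, so it is left unlabeled.

```python
rows = [
    ("Average vs. DivPrune (LLaVA-1.5-13B)", 90.20, 92.68),
    ("Average vs. FastV (LLaVA-1.5-13B)", 84.75, 92.68),
    ("GQA (LLaVA-1.5-13B)", 56.4, 59.3),
    ("Average vs. unpruned (LLaVA-1.5-7B)", 100.00, 95.61),
]

# Delta is simply (this paper) - (baseline), rounded to two decimals.
deltas = {name: round(ours - baseline, 2) for name, baseline, ours in rows}

for name, delta in deltas.items():
    print(f"{name}: {delta:+.2f}")
```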

Experiment Figures

Radar chart comparing VLM-Pruner against 5 baselines across 7 benchmarks

Qualitative visualization of selected tokens on images

Main Takeaways

Redundancy-reduction methods (like DivPrune) often drop too much detail, hurting OCR and fine-grained tasks; VLM-Pruner recovers this via spatial buffering
Importance-based methods (like FastV) fail to remove enough redundancy, leading to lower efficiency-per-token
The method generalizes well to video (LLaVA-Video) and dynamic resolution models (LLaVA-Next, Qwen2-VL) without modification
SWA (Similarity-Weighted Aggregation) is crucial for recovering information from discarded tokens, acting as a soft-merging mechanism

📚 Prerequisite Knowledge

Prerequisites

Understanding of Transformer architecture (especially self-attention complexity)
Vision-Language Models (connecting ViT to LLMs)
Greedy selection algorithms

Key Terms

visual tokens: Small patches of an image converted into vector embeddings that the language model processes like words

centrifugal paradigm: The paper's proposed strategy of selecting tokens starting from a central point of interest and expanding outward to neighbors

spatial sparsity: How widely spaced the selected tokens are in the 2D image grid, measured by the physical distance between them

BSS: Buffering for Spatial Sparsity—a criterion that modifies similarity scores based on distance to prioritize selecting neighbors of existing tokens

SWA: Similarity-Weighted Aggregation—a method to merge discarded tokens into selected ones by weighted averaging based on similarity

pivot tokens: The initial set of tokens selected to represent distinct subjects, serving as anchors for the expansion process

max-min distance: A selection strategy that picks points that are as far away from each other as possible to maximize coverage

LLaVA: Large Language and Vision Assistant—a popular open-source VLM architecture

ViT: Vision Transformer—a neural network that processes images by splitting them into patches (tokens)