PACT: Pruning and Clustering-Based Token Reduction for Faster Visual Language Models

M Dhouib, D Buscaldi, S Vanier, A Shabou
arXiv, 4/2025 (2025)
MM Memory

📝 Paper Summary

Efficient Vision-Language Models (VLMs), Visual Token Reduction
PACT accelerates Visual Language Models by pruning irrelevant visual tokens using a hidden-state-norm metric and merging redundant ones via a distance-bounded clustering algorithm, all without requiring additional training.
Core Problem
Visual Language Models suffer from high inference latency and memory costs because they process thousands of visual tokens, many of which contain redundant or unimportant information.
Why it matters:
  • High computational costs limit the deployment of powerful VLMs on resource-constrained devices.
  • Existing pruning methods rely on attention scores (incompatible with FlashAttention) or require specific architectures (like a [CLS] token), limiting their applicability to modern VLMs.
  • Prior methods typically address either irrelevance or redundancy, but rarely both simultaneously, leading to suboptimal reduction.
Concrete Example: In high-resolution models like LLaVA-OneVision, an image is split into many crops, producing up to 8,748 visual tokens. Processing all of them, including those that represent empty background or repetitive textures, wastes computation. Attention-based methods such as FastV cannot be used here because they need explicit attention scores, which FlashAttention does not return.
Key Novelty
PACT (Pruning and Clustering Tokens)
  • Identifies unimportant tokens using a metric based on hidden state norms and global query agreement, avoiding the need for explicit attention scores (unlike FastV).
  • Merges redundant tokens using Distance Bounded Density Peak Clustering (DBDPC), which guarantees that merged tokens are within a strict feature distance to prevent merging distinct visual concepts.
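The two steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the mean-query "global query", the softmax weighting, and the use of cosine distance are all hypothetical choices, and the greedy first-fit clustering is a simpler stand-in for DBDPC that does not compute density peaks. It only preserves the property highlighted in the summary, namely that every merged token lies within a fixed distance of its cluster center.

```python
import numpy as np


def importance_scores(hidden, keys, queries):
    """Hypothetical score: hidden-state norm weighted by agreement between
    each token's key and a 'global' query (assumed here to be the mean query).
    No attention matrix is needed, only one dot product per token."""
    q_global = queries.mean(axis=0)
    logits = keys @ q_global / np.sqrt(keys.shape[1])
    logits = logits - logits.max()  # numerical stability for the softmax
    agreement = np.exp(logits) / np.exp(logits).sum()
    return agreement * np.linalg.norm(hidden, axis=1)


def prune(scores, keep_ratio):
    """Return indices of the highest-scoring tokens, in their original order."""
    n_keep = max(1, int(round(keep_ratio * len(scores))))
    top = np.argsort(-scores, kind="stable")[:n_keep]
    return np.sort(top)


def distance_bounded_merge(tokens, max_dist):
    """Greedy stand-in for distance-bounded clustering (not DBDPC itself).
    A token joins its nearest existing center only if the cosine distance
    is at most max_dist; otherwise it becomes a new center."""
    norms = np.linalg.norm(tokens, axis=1, keepdims=True)
    unit = tokens / np.maximum(norms, 1e-12)  # avoid division by zero
    centers = []  # indices of tokens chosen as cluster centers
    assign = np.empty(len(tokens), dtype=int)
    for i in range(len(tokens)):
        if centers:
            dist = 1.0 - unit[centers] @ unit[i]
            j = int(np.argmin(dist))
            if dist[j] <= max_dist:
                assign[i] = j
                continue
        centers.append(i)
        assign[i] = len(centers) - 1
    # Each cluster is replaced by the mean of its member tokens.
    merged = np.stack(
        [tokens[assign == c].mean(axis=0) for c in range(len(centers))]
    )
    return merged, assign
```

A typical call would first compute `keep = prune(importance_scores(hidden, keys, queries), keep_ratio)` and then pass `hidden[keep]` to `distance_bounded_merge`, so that irrelevant tokens are dropped before redundant ones are merged.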
Evaluation Highlights
  • Achieves 50% visual token reduction on LLaVA-OneVision-7B with negligible performance loss.
  • Maintains 98.6% of original performance at a 71.3% reduction ratio (a drop of only 1.4%), whereas competing state-of-the-art methods lose at least 4.4%.
  • Combines pruning and clustering to outperform either technique used in isolation.
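To make the reduction ratios concrete, the arithmetic below applies them to the 8,748-token upper bound from the LLaVA-OneVision example. This is illustrative only: the helper name is hypothetical, and the reported ratios are benchmark-level figures that need not correspond to an image of exactly this size.

```python
def tokens_kept(n_tokens: int, reduction_ratio: float) -> int:
    """Number of visual tokens remaining after removing the given fraction."""
    return round(n_tokens * (1.0 - reduction_ratio))


# Illustrative inputs: 8,748 tokens at the 50% and 71.3% reduction ratios.
half = tokens_kept(8748, 0.50)
aggressive = tokens_kept(8748, 0.713)
```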
Breakthrough Assessment
8/10
Offers a training-free, FlashAttention-compatible solution that effectively combines pruning and merging. The high retention of performance at >70% reduction is a strong practical contribution for VLM efficiency.