PACT: Pruning and Clustering-Based Token Reduction for Faster Visual Language Models

📝 Paper Summary

Efficient Vision-Language Models (VLMs), Visual Token Reduction

PACT accelerates Visual Language Models by pruning irrelevant visual tokens using a hidden-state-norm metric and merging redundant ones via a distance-bounded clustering algorithm, all without requiring additional training.

Core Problem

Visual Language Models suffer from high inference latency and memory costs because they process thousands of visual tokens, many of which contain redundant or unimportant information.

Why it matters:

High computational costs limit the deployment of powerful VLMs on resource-constrained devices.
Existing pruning methods rely on attention scores (incompatible with FlashAttention) or require specific architectures (like a [CLS] token), limiting their applicability to modern VLMs.
Prior methods typically address either irrelevance or redundancy, but rarely both simultaneously, leading to suboptimal reduction.

Concrete Example: In high-resolution models like LLaVA-OneVision, an image is split into many crops, producing up to 8,748 visual tokens. Processing all of them, including those that represent empty background or repetitive textures, wastes computation. Attention-based methods such as FastV cannot be applied directly in this setting because they need explicit attention scores, which FlashAttention does not output.

Key Novelty

PACT (Pruning and Clustering Tokens)

Identifies unimportant tokens using a metric based on hidden state norms and global query agreement, avoiding the need for explicit attention scores (unlike FastV).
Merges redundant tokens using Distance Bounded Density Peak Clustering (DBDPC), which guarantees that merged tokens are within a strict feature distance to prevent merging distinct visual concepts.

Evaluation Highlights

Achieves 50% visual token reduction on LLaVA-OneVision-7B with negligible performance loss.
Maintains 98.6% of original performance even at a 71.3% reduction ratio (only 1.4% drop), significantly outperforming state-of-the-art methods which drop at least 4.4%.
Combines pruning and clustering to outperform either technique used in isolation.
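
To make the reduction ratios concrete, the short calculation below applies them to the 8,748-token worst case from the concrete example. It is illustrative arithmetic only: the reported ratios are summary figures rather than per-image guarantees, and `tokens_kept` is a hypothetical helper, not part of PACT.

```python
def tokens_kept(num_tokens: int, reduction_ratio: float) -> int:
    """Number of visual tokens left after removing `reduction_ratio` of them."""
    if not 0.0 <= reduction_ratio < 1.0:
        raise ValueError("reduction_ratio must be in [0, 1)")
    return round(num_tokens * (1.0 - reduction_ratio))

full_sequence = 8748  # worst-case visual token count quoted for LLaVA-OneVision
for ratio in (0.50, 0.713):
    kept = tokens_kept(full_sequence, ratio)
    print(f"{ratio:.1%} reduction keeps {kept} of {full_sequence} visual tokens")
```

At the 71.3% setting, roughly 2,500 of the 8,748 tokens would remain in this worst case.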

Breakthrough Assessment

8/10

Offers a training-free, FlashAttention-compatible solution that effectively combines pruning and merging. The high retention of performance at >70% reduction is a strong practical contribution for VLM efficiency.

⚙️ Technical Details

Problem Definition

Setting: Inference-time reduction of visual token sequences in Transformer-based Vision-Language Models.

Inputs: A sequence of n visual hidden states H, originating from the vision encoder/connector and taken at a specific layer L.

Outputs: A reduced sequence of n' hidden states H', where n' < n, that retains the critical visual information.

Pipeline Flow

Input Processing: Visual tokens enter Language Model
Pruning: EUTI Module identifies and removes unimportant tokens
Clustering: DBDPC Module clusters remaining tokens
Recovery: Re-integrates previously pruned tokens if they are close to cluster centers
Merging: Merges tokens within clusters and applies Proportional Attention

System Modules

EUTI (Pruning) (Token Reduction)

Identifies unimportant tokens without calculating full attention matrices.

Model or implementation: Algorithmic module (no trained weights)
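
A minimal sketch of a norm-and-query-agreement score may help fix the idea. The exact EUTI formula is not given in this summary, so the form below is an assumption: each token's score is its hidden-state norm multiplied by a softmax weight of its key against a single averaged ("global") query. The function names, the averaging of queries, and the `lam * max(score)` threshold are all hypothetical choices made for illustration.

```python
import math

def euti_like_scores(hidden, keys, queries):
    """Score each visual token without forming an n x n attention matrix.

    Assumed form (not confirmed by the source): score_i = ||h_i|| * a_i, where
    a_i is a softmax over tokens of k_i . q_global / sqrt(d), and q_global is
    the mean of the query vectors.
    """
    d = len(keys[0])
    q_global = [sum(q[j] for q in queries) / len(queries) for j in range(d)]
    logits = [sum(k[j] * q_global[j] for j in range(d)) / math.sqrt(d) for k in keys]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    agreement = [e / total for e in exps]
    norms = [math.sqrt(sum(x * x for x in h)) for h in hidden]
    return [a * r for a, r in zip(agreement, norms)]

def keep_indices(scores, lam):
    """Keep tokens scoring at least lam times the best score (hypothetical rule)."""
    threshold = lam * max(scores)
    return [i for i, s in enumerate(scores) if s >= threshold]

# Toy data: token 1 has a tiny norm and a key that disagrees with the queries.
hidden = [[3.0, 4.0], [0.1, 0.1], [1.0, 0.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
queries = [[1.0, 0.0], [1.0, 0.0]]
scores = euti_like_scores(hidden, keys, queries)
kept = keep_indices(scores, lam=0.1)
print(kept)
```

The cost is linear in the number of tokens, which is the property that lets this kind of metric coexist with FlashAttention.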

DBDPC (Clustering) (Token Reduction)

Clusters the retained tokens so that redundant ones can be merged, while strictly bounding each token's distance to its cluster center.

Model or implementation: Algorithmic module (Distance Bounded Density Peak Clustering)
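
The distance bound can be illustrated with a simplified density-peak-style procedure. This is a sketch under stated assumptions, not the DBDPC algorithm as published: it uses cosine distance, a neighbor-count density, and a greedy pass in which a token becomes a center only if it is farther than `dc` from every existing center. Every remaining token then has some center within `dc`, so assigning it to its nearest center respects the bound.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

def distance_bounded_clusters(vectors, dc):
    """Return (center indices, per-token center index) with all members within dc."""
    n = len(vectors)
    dist = [[cosine_distance(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]
    # Local density: number of tokens (including itself) within the cutoff dc.
    density = [sum(1 for j in range(n) if dist[i][j] <= dc) for i in range(n)]
    order = sorted(range(n), key=lambda i: -density[i])  # densest first, stable on ties
    centers = []
    for i in order:
        # A token founds a new cluster only if no existing center is within dc.
        if all(dist[i][c] > dc for c in centers):
            centers.append(i)
    assignment = [min(centers, key=lambda c: dist[i][c]) for i in range(n)]
    return centers, assignment

vectors = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.05, 0.99], [1.0, 0.02]]
centers, assignment = distance_bounded_clusters(vectors, dc=0.05)
print(centers, assignment)
```

A smaller `dc` yields more, tighter clusters (less reduction); a larger `dc` merges more aggressively.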

Token Merger (Token Reduction)

Combines hidden states of clustered tokens into single representatives.

Model or implementation: Weighted Average
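
As a rough illustration of the weighted-average merge, the sketch below collapses each cluster into one representative vector and records how many original tokens it stands for. The choice of weights is not specified in this summary, so uniform weights are assumed by default; `merge_clusters` and its arguments are hypothetical names.

```python
def merge_clusters(hidden, assignment, weights=None):
    """Merge hidden states that share a cluster id into one weighted average.

    Returns (merged, counts): merged maps cluster id -> averaged vector,
    counts maps cluster id -> number of original tokens in that cluster.
    """
    if weights is None:
        weights = [1.0] * len(hidden)  # assumption: uniform weights
    sums, totals, counts = {}, {}, {}
    for h, c, w in zip(hidden, assignment, weights):
        acc = sums.setdefault(c, [0.0] * len(h))
        for j, x in enumerate(h):
            acc[j] += w * x
        totals[c] = totals.get(c, 0.0) + w
        counts[c] = counts.get(c, 0) + 1
    merged = {c: [x / totals[c] for x in acc] for c, acc in sums.items()}
    return merged, counts

hidden = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]]
assignment = [0, 0, 2]  # tokens 0 and 1 share a cluster; token 2 is alone
merged, counts = merge_clusters(hidden, assignment)
print(merged, counts)
```

The per-cluster counts are what a size-aware attention step would consume afterwards.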

Novel Architectural Elements

Integration of a pruning metric (EUTI) that relies on Hidden State Norms + Global Query agreement rather than attention maps.
A 'recovery' step where tokens pruned by EUTI are re-checked against DBDPC cluster centers and reintegrated if they are essentially duplicates of important features.
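
The recovery step described above can be sketched as a nearest-center check on the pruned tokens. This is a simplified illustration: the use of cosine distance and the `max_dist` threshold are assumptions, and `recover_pruned` is a hypothetical name rather than a function from the PACT codebase.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

def recover_pruned(pruned_vectors, center_vectors, max_dist):
    """Map pruned-token index -> center index for tokens close enough to a center."""
    recovered = {}
    for i, v in enumerate(pruned_vectors):
        dists = [cosine_distance(v, c) for c in center_vectors]
        nearest = min(range(len(dists)), key=dists.__getitem__)
        if dists[nearest] <= max_dist:  # near-duplicate of an important feature
            recovered[i] = nearest
    return recovered

centers = [[1.0, 0.0], [0.0, 1.0]]
pruned = [[0.98, 0.1], [1.0, 1.0]]  # first is near center 0, second is near neither
recovered = recover_pruned(pruned, centers, max_dist=0.05)
print(recovered)
```

Recovered tokens join the cluster of their nearest center and take part in the merge; the rest stay pruned.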

Modeling

Base Model: LLaVA-OneVision-7B (used for main experiments)

Training Method: Inference-time optimization only (training-free)

Compute: Requires no additional training compute. Designed to reduce inference memory and latency.

Comparison to Prior Work

vs. FastV: PACT is compatible with FlashAttention because it doesn't need explicit attention maps.
vs. ToMe: PACT uses distance-bounded clustering (DBDPC) to avoid merging distinct features, whereas ToMe iteratively merges tokens based on similarity.
vs. LLaVA-PruMerge: PACT works on architectures without a [CLS] token and addresses redundancy (merging) not just irrelevance (pruning).
vs. TRIM: PACT is text-agnostic, preserving visual information needed for future turns in a conversation.
vs. VTW [not cited in paper]: PACT operates at earlier layers for greater savings, whereas VTW removes tokens only in deeper layers.

Limitations

Effectiveness depends on selecting an appropriate layer L: if L is too early, the keys are not yet distinct enough; if it is too late, most of the compute savings are lost.
Relies on the assumption that hidden state norm correlates with token importance.
Requires tuning of hyperparameters (lambda, dc) to balance reduction vs. performance.

Reproducibility

Code: https://github.com/orailix/PACT/tree/main

The codebase is publicly available. The method is training-free; its hyperparameters (lambda for pruning, dc for clustering) are detailed in the algorithm descriptions.

📊 Experiments & Results

Evaluation Setup

Evaluation of visual token reduction on multimodal tasks.

Benchmarks:

Benchmarks implicitly referenced via results (Visual Question Answering / Multimodal Understanding)

Metrics:

Performance drop (%)
Visual token reduction ratio (%)
Statistical methodology: Not explicitly reported in the paper

Key Results

Performance retention at high reduction ratios compared to state-of-the-art.
Benchmark	Metric	Baseline	This Paper	Δ
LLaVA-OneVision-7B Evaluation	Performance Drop (%) at 71.3% reduction	4.4	1.4	-3.0
LLaVA-OneVision-7B Evaluation	Performance Loss (%) at 50% reduction	0	~0	0

Experiment Figures

Comparison of pruning biases: Attention-based pruning vs. Position.

Histogram of Hidden State Norms at Layer 4.

Main Takeaways

PACT achieves a 50% reduction in visual tokens with negligible impact on model performance.
At aggressive reduction rates (71.3%), PACT preserves significantly more accuracy (1.4% drop) than competitors (4.4% drop).
Hidden state norms in early layers exhibit high variance, which supports their use as a signal for token importance.
Combining pruning (removing useless tokens) with clustering (merging redundant tokens) yields better results than either approach alone.

📚 Prerequisite Knowledge

Prerequisites

Transformer architecture (Attention, Keys, Queries, Values)
Vision-Language Models (e.g., LLaVA)
Clustering algorithms (Density Peaks)
FlashAttention mechanism

Key Terms

Visual Tokens: Vector representations of image patches processed by the vision encoder.

FlashAttention: An IO-aware exact attention algorithm that speeds up training/inference but does not output intermediate attention matrices.

EUTI: Efficient Unimportant Tokens Identification—PACT's pruning module that uses hidden state norms and global queries instead of full attention maps.

DBDPC: Distance Bounded Density Peak Clustering—PACT's clustering algorithm that ensures all points in a cluster are within a fixed distance from the center.

Hidden State Norm: The magnitude of the vector representation of a token; PACT uses this as a proxy for information content.

LLaVA-OneVision: A state-of-the-art VLM capable of handling high-resolution images by splitting them into crops.

Proportional Attention: A mechanism to weight merged tokens in attention calculations based on how many original tokens they represent.
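
One common way to realize proportional attention, introduced with ToMe, is to add the logarithm of each merged token's size to its attention logit before the softmax. The sketch below assumes that form for a single query; whether PACT uses exactly this formulation is not stated in this summary, and the function name is hypothetical.

```python
import math

def proportional_attention_weights(query, keys, sizes):
    """Softmax over q.k / sqrt(d) + log(size) for one query vector.

    sizes[i] is the number of original tokens that key i represents.
    """
    d = len(query)
    logits = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d) + math.log(s)
        for key, s in zip(keys, sizes)
    ]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

query = [1.0, 0.5]
# Unmerged: two identical keys plus one distinct key, each of size 1.
full = proportional_attention_weights(query, [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]], [1, 1, 1])
# Merged: the two identical keys collapsed into one key of size 2.
merged = proportional_attention_weights(query, [[1.0, 0.0], [0.0, 1.0]], [2, 1])
print(full[0] + full[1], merged[0])
```

With the log-size bias, a merged token receives the same total attention mass that its identical constituents would have received separately, so merging exact duplicates leaves the attention output unchanged.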