METEOR: Multi-Encoder Collaborative Token Pruning for Efficient Vision Language Models

📝 Paper Summary

Efficient Multimodal LLMs, Visual Token Pruning

METEOR improves the efficiency of multi-encoder MLLMs by progressively pruning redundant visual tokens across encoding, fusion, and decoding stages using rank-based allocation and text-guided attention.

Core Problem

Multi-encoder MLLMs (like EAGLE) achieve high performance but incur prohibitive computational cost: attention cost grows quadratically with the number of visual tokens, and high-resolution inputs sharply increase that number.

Why it matters:

Processing high-resolution images with multiple encoders (e.g., 672x672 with dual encoders) generates thousands of tokens, causing extreme latency.
Existing pruning methods are designed for single encoders and do not handle the redundancy shared across multiple encoders.
Fixed pruning ratios perform poorly on fine-grained tasks like OCR, which require more visual details than general comprehension tasks.

Concrete Example: In Mini-Gemini, a 672x672 image processed by dual vision encoders generates 2880 visual tokens. Standard pruning might aggressively cut tokens needed for OCR, or keep redundant background tokens shared by both encoders, failing to balance speed and accuracy.

Key Novelty

Progressive Multi-Stage Pruning for Multi-Encoder MLLMs

Stage 1 (Encoding): Uses feature map rank to measure information richness, allocating fewer tokens to encoders with lower rank (less information).
Stage 2 (Fusion): Introduces 'Post-projection Fusion' where each encoder has a dedicated projector, allowing the removal of mutually redundant tokens that overlap across encoders.
Stage 3 (Decoding): Dynamically adjusts the number of retained tokens based on the 'Visual Attention Value' of the top-k most relevant attention heads, keeping more tokens for complex tasks like OCR.
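The Stage 1 idea of giving lower-rank encoders fewer tokens can be sketched as below. This is a minimal illustration, not the paper's allocation rule: the proportional split, the function name `allocate_tokens_by_rank`, and the toy encoders `rich` and `poor` are all assumptions made for the example.

```python
import numpy as np

def allocate_tokens_by_rank(feature_maps, total_budget):
    """Split total_budget across encoders in proportion to feature-map rank.

    feature_maps: dict mapping encoder name -> array of shape (num_tokens, dim).
    Proportional allocation is an illustrative assumption, not the paper's rule.
    """
    ranks = {name: int(np.linalg.matrix_rank(f)) for name, f in feature_maps.items()}
    total_rank = sum(ranks.values())
    budgets = {}
    for name, f in feature_maps.items():
        share = int(total_budget * ranks[name] / total_rank)
        budgets[name] = min(share, f.shape[0])  # never exceed the tokens available
    return ranks, budgets

rng = np.random.default_rng(0)
# Hypothetical encoders: one full-rank feature map and one deliberately low-rank map.
rich = rng.standard_normal((64, 32))                                # rank 32
poor = rng.standard_normal((64, 8)) @ rng.standard_normal((8, 32))  # rank 8
ranks, budgets = allocate_tokens_by_rank({"rich": rich, "poor": poor}, total_budget=40)
print(ranks, budgets)
```

In this toy setup the low-rank encoder receives a quarter of the tokens given to the full-rank one, which mirrors the "fewer tokens for less information" principle described above.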

Architecture

The 3-stage METEOR pipeline: (1) Encoding with rank-guided pruning, (2) Fusion with cross-encoder redundancy reduction, (3) Decoding with adaptive text-aware pruning.

Evaluation Highlights

Reduces visual tokens by 76% compared to EAGLE while maintaining comparable performance (only 0.3% average drop across 11 benchmarks).
Increases throughput by 46% and reduces TFLOPS by 49% compared to the EAGLE baseline.
Outperforms state-of-the-art pruning method FastV by 4.1% on average, with significant gains (+12.3%) on OCRBench due to adaptive token retention.

Breakthrough Assessment

8/10

First framework specifically addressing token redundancy in multi-encoder MLLMs. The rank-based allocation and adaptive pruning strategy largely resolve the 'OCR vs. General' trade-off that plagues fixed-ratio pruning.

⚙️ Technical Details

Problem Definition

Setting: Efficient inference for Multi-modal Large Language Models (MLLMs) that utilize multiple vision encoders.

Inputs: Input image I and text prompt P.

Outputs: Generated text response Y.

Pipeline Flow

Multi-vision Encoding (Independent Pruning within Encoders)
Multi-vision Fusion (Cooperative Pruning across Encoders)
LLM Decoding (Text-aware Instance-adaptive Pruning)

System Modules

Vision Encoders

Extract visual features using multiple backbones (e.g., CLIP, ConvNeXt, Pix2Struct).

Model or implementation: CLIP-L/14, ConvNeXt-L, Pix2Struct, EVA-02-L (varies by config)

Projectors (Fusion)

Map visual features to LLM input space independently for each encoder.

Model or implementation: Two-layer MLP (dedicated per encoder)
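A rough sketch of the per-encoder projector arrangement follows. It only shows the data flow (project each encoder separately, then concatenate in the shared space). The GELU activation, the weight initialisation, the dimensions, and the names `MLPProjector` and `post_projection_fuse` are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU; the activation choice is an assumption
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

class MLPProjector:
    """Two-layer MLP mapping one encoder's features to the LLM embedding size."""

    def __init__(self, in_dim, llm_dim, rng):
        self.w1 = rng.standard_normal((in_dim, llm_dim)) * 0.02
        self.b1 = np.zeros(llm_dim)
        self.w2 = rng.standard_normal((llm_dim, llm_dim)) * 0.02
        self.b2 = np.zeros(llm_dim)

    def __call__(self, x):
        return gelu(x @ self.w1 + self.b1) @ self.w2 + self.b2

def post_projection_fuse(features, projectors):
    """Project each encoder's tokens with its own MLP, then concatenate along tokens."""
    projected = [projectors[name](feat) for name, feat in features.items()]
    return np.concatenate(projected, axis=0)

rng = np.random.default_rng(0)
# Hypothetical feature widths; real encoders have different sizes.
features = {
    "clip": rng.standard_normal((16, 24)),
    "convnext": rng.standard_normal((16, 40)),
}
projectors = {name: MLPProjector(f.shape[1], 48, rng) for name, f in features.items()}
fused = post_projection_fuse(features, projectors)
print(fused.shape)  # (32, 48)
```

Because every token ends up in the same embedding space before concatenation, tokens from different encoders become directly comparable, which is what makes cross-encoder pruning possible.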

Fusion Pruner (Fusion)

Remove mutually redundant tokens across different encoders.

Model or implementation: Similarity-based selection
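One plausible form of similarity-based selection is sketched below: tokens from a second encoder are dropped when they are near-duplicates of tokens from a first encoder. The cosine-similarity criterion, the `threshold` value, and the function name `drop_cross_encoder_duplicates` are assumptions; the summary does not specify the exact selection rule.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def drop_cross_encoder_duplicates(anchor_tokens, other_tokens, threshold=0.9):
    """Keep tokens of other_tokens that are not near-duplicates of any anchor token.

    Both inputs are assumed to be already projected into the same space.
    """
    sim = l2_normalize(other_tokens) @ l2_normalize(anchor_tokens).T  # (n_other, n_anchor)
    max_sim = sim.max(axis=1)      # closest anchor token for each candidate
    keep = max_sim < threshold     # True where the token adds new information
    return other_tokens[keep], keep

basis = np.eye(8)
anchor = basis[:4]                                                    # tokens e0..e3
other = np.stack([basis[0], 2.0 * basis[1], basis[5], basis[6]])      # two duplicates, two new
kept, keep_mask = drop_cross_encoder_duplicates(anchor, other)
print(keep_mask.tolist())  # [False, False, True, True]
```

The scaled copy `2.0 * basis[1]` is removed as well, since cosine similarity ignores magnitude.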

LLM Decoder

Generate text response while adaptively pruning visual tokens based on text prompts.

Model or implementation: Vicuna-v1.5-7B or Llama-3-8B

Novel Architectural Elements

Rank-guided collaborative token assignment strategy within multi-vision encoding.
Post-projection fusion architecture allowing independent adaptation and cross-encoder pruning.
Dynamic token budget mechanism in LLM decoding driven by 'Visual Contribution Level' (sum of VAV).
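The dynamic budget can be illustrated with a small sketch. It assumes that a head's VAV is the attention mass text queries place on visual keys, that the contribution level is the sum over the top-k heads, and that the budget scales linearly between `min_tokens` and `max_tokens`. All function names, the linear mapping, and the toy attention maps are hypothetical.

```python
import numpy as np

def visual_contribution_level(attn, visual_slice, text_slice, top_k):
    """attn: (heads, seq, seq) attention weights from one decoder layer."""
    text_to_visual = attn[:, text_slice, visual_slice]          # (H, T, V)
    vav = text_to_visual.sum(axis=-1).mean(axis=-1)             # per-head VAV, shape (H,)
    top_heads = np.argsort(vav)[-top_k:]                        # heads most focused on the image
    level = float(vav[top_heads].sum())                         # lies in [0, top_k]
    token_scores = text_to_visual[top_heads].mean(axis=(0, 1))  # per-token importance, (V,)
    return level, token_scores

def adaptive_keep(level, token_scores, top_k, min_tokens, max_tokens):
    """Map the contribution level to a token budget (linear mapping is an assumption)."""
    frac = float(np.clip(level / top_k, 0.0, 1.0))
    budget = int(round(min_tokens + frac * (max_tokens - min_tokens)))
    budget = min(budget, token_scores.shape[0])
    if budget <= 0:
        return 0, np.array([], dtype=int)
    keep = np.sort(np.argsort(token_scores)[-budget:])          # indices of retained tokens
    return budget, keep

def toy_attention(p_visual, heads=4, n_visual=8, n_text=2):
    """Hypothetical attention: each text query puts p_visual of its mass on visual keys."""
    seq = n_visual + n_text
    attn = np.zeros((heads, seq, seq))
    attn[:, n_visual:, :n_visual] = p_visual / n_visual
    attn[:, n_visual:, n_visual:] = (1.0 - p_visual) / n_text
    return attn

vis, txt = slice(0, 8), slice(8, 10)
budgets = {}
for name, p in [("ocr_like", 0.9), ("general", 0.2)]:
    level, scores = visual_contribution_level(toy_attention(p), vis, txt, top_k=2)
    budgets[name], kept = adaptive_keep(level, scores, top_k=2, min_tokens=2, max_tokens=8)
print(budgets)  # {'ocr_like': 7, 'general': 3}
```

A prompt whose text tokens attend heavily to the image (the OCR-like case) ends up with a larger budget than one that mostly attends to text, which is the behaviour the mechanism is meant to produce.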

Modeling

Base Model: EAGLE (Vicuna-v1.5-7B or Llama-3-8B based)

Training Method: Supervised Fine-Tuning (SFT) with token pruning enabled

Adaptation: Projectors and LLM are fully fine-tuned; vision encoders stay frozen

Training Data:

Pre-training: 558k image-text pairs (LLaVA-1.5 recipe)
SFT: 1.8M image-text pairs (EAGLE recipe)

Key Hyperparameters:

sft_visual_tokens: 576 (fixed during SFT)
inference_visual_tokens: Adaptive (avg ~242 or ~126)
projector_type: MLP

Compute: Training/Inference on Ascend 910B. Inference saves 49% TFLOPS vs EAGLE.

Comparison to Prior Work

vs. FastV/Pdrop: METEOR prunes across *all* stages (encoding, fusion, decoding), not just the LLM, and uses adaptive ratios instead of fixed ones.
vs. EAGLE: METEOR adds the progressive pruning framework, significantly reducing token count with negligible accuracy loss.
vs. Cambrian-1: METEOR demonstrates better performance on OCR with fewer tokens by using adaptive pruning rather than a fixed spatial aggregator.

Limitations

Rank calculation for sparsity allocation is done offline on a small batch to save compute, which assumes the rank estimate is robust across inputs.
Complex three-stage pipeline adds implementation complexity compared to simple drop-based methods.
Performance on general tasks with extremely aggressive pruning (126 tokens) shows slight degradation compared to full models.

Reproducibility

Code: https://github.com/YuchenLiu98/METEOR

Code is publicly available. Training data follows open LLaVA/EAGLE recipes. No custom closed-source dependencies.

📊 Experiments & Results

Evaluation Setup

Evaluated on 11 multimodal benchmarks covering General VQA, OCR, and Hallucination.

Benchmarks:

SEEDBench (General VQA)
POPE (Hallucination Evaluation)
TextVQA (OCR VQA)
DocVQA (Document Understanding)
OCRBench (OCR)
MMBench (General VQA)
AI2D (Diagram Understanding)
ScienceQA (Science VQA)
ChartQA (Chart Understanding)
GQA (Visual Reasoning)
OKVQA (Knowledge VQA)

Metrics:

Accuracy
TFLOPS
Throughput (samples/s)
Statistical methodology: Not explicitly reported in the paper

Key Results

Main comparison against the EAGLE baseline and other efficient methods, showing massive efficiency gains with minimal accuracy loss.

Benchmark	Metric	Baseline	This Paper	Δ
Average (11 datasets)	Accuracy	69.3	69.0	-0.3
Average	TFLOPS	26.21	13.42	-12.79
Average	Throughput (samples/s)	0.81	1.18	+0.37

Comparison against state-of-the-art pruning methods applied to the same base model (EAGLE).

Benchmark	Metric	Baseline	This Paper	Δ
Average (11 datasets)	Accuracy	64.9	69.0	+4.1
OCRBench	Score (out of 1000)	431	533	+102
DocVQA	Accuracy	54.2	71.1	+16.9
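The relative efficiency figures quoted in the highlights (49% compute saving, 46% throughput gain) can be recovered from the absolute values in the table. The helper name `relative_change` is introduced only for this check.

```python
def relative_change(baseline, value):
    """Percentage change of value relative to baseline."""
    return (value - baseline) / baseline * 100.0

tflops_saving = -relative_change(26.21, 13.42)  # compute saved vs. EAGLE
throughput_gain = relative_change(0.81, 1.18)   # throughput gained vs. EAGLE
accuracy_delta = round(69.0 - 69.3, 1)          # absolute change in average accuracy

print(f"TFLOPS saving: {tflops_saving:.1f}%")      # 48.8%
print(f"Throughput gain: {throughput_gain:.1f}%")  # 45.7%
print(f"Accuracy delta: {accuracy_delta}")         # -0.3
```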

Experiment Figures

Analysis of feature map ranks and token sequence diversity.

Visual Attention Values (VAV) across different datasets.

Main Takeaways

Rank-based allocation is superior to equal allocation: allocating tokens based on feature map rank preserves critical information from the more information-rich encoders.
Adaptive pruning is essential for OCR: Fixed pruning ratios severely degrade OCR performance; dynamic budgets based on visual attention values recover this loss.
Post-projection fusion enables cross-encoder redundancy reduction, which is more effective than pruning encoders in isolation.
Average token similarity is a better pruning metric for shallow layers (where attention is noisy), while attention is better for deep layers.
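A shallow-layer pruning step based on average token similarity might look like the sketch below. It assumes that a token with high mean cosine similarity to all other tokens is redundant and is pruned first; that direction, and the names `mean_similarity_scores` and `prune_by_mean_similarity`, are assumptions rather than details confirmed by the summary.

```python
import numpy as np

def mean_similarity_scores(tokens, eps=1e-8):
    """Mean cosine similarity of each token to every other token."""
    unit = tokens / (np.linalg.norm(tokens, axis=-1, keepdims=True) + eps)
    sim = unit @ unit.T
    n = tokens.shape[0]
    # Subtract the diagonal so self-similarity does not inflate the average.
    return (sim.sum(axis=1) - np.diag(sim)) / (n - 1)

def prune_by_mean_similarity(tokens, num_keep):
    """Keep the num_keep tokens with the lowest average similarity (most distinct)."""
    scores = mean_similarity_scores(tokens)
    keep = np.sort(np.argsort(scores)[:num_keep])
    return tokens[keep], keep

basis = np.eye(3)
# Three copies of the same token plus two distinct ones.
tokens = np.stack([basis[0], basis[0], basis[0], basis[1], basis[2]])
kept, keep_idx = prune_by_mean_similarity(tokens, num_keep=2)
print(keep_idx.tolist())  # [3, 4]
```

Unlike attention-based scoring, this criterion needs no attention map, so it can be applied in layers where attention is still noisy.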

📚 Prerequisite Knowledge

Prerequisites

Vision Transformers (ViT) and self-attention mechanisms
Multi-modal LLM architectures (e.g., LLaVA) and contrastive vision-language encoders (e.g., CLIP)
Token pruning/merging techniques
Singular Value Decomposition (SVD) and Rank

Key Terms

visual token: The vector representation of an image patch processed by a vision encoder.

multi-encoder MLLM: An MLLM that uses multiple distinct vision backbones (e.g., CLIP + ConvNeXt) to capture different aspects of an image.

rank of feature map: The number of linearly independent directions in a feature map, obtained from its singular values via SVD and used as a proxy for information richness; higher rank implies more unique information.
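For reference, the standard numerical definition can be written as follows, for a feature map F with N tokens of dimension D. The tolerance ε is an assumption; the summary does not state which threshold the paper uses.

```latex
F \in \mathbb{R}^{N \times D}, \qquad
F = U \Sigma V^{\top}, \qquad
\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_{\min(N, D)} \ge 0

\operatorname{rank}_{\varepsilon}(F) = \bigl|\{\, i : \sigma_i > \varepsilon \,\}\bigr|
```

Here the σ_i are the singular values on the diagonal of Σ, so the rank counts how many of them are meaningfully larger than zero.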

Visual Attention Value (VAV): The magnitude of attention weights between text tokens and visual tokens in the LLM, used to gauge the importance of visual information.

post-projection fusion: A strategy where visual tokens from different encoders are projected into the shared semantic space separately *before* being concatenated, allowing for cross-encoder pruning.

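TFLOPS: As used in this summary, the total compute of an inference pass measured in trillions of floating-point operations (often written TFLOPs), so lower is better. Elsewhere the same acronym denotes trillions of floating-point operations per second, a hardware throughput measure.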