AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning

📝 Paper Summary

Efficient Multi-Modal LLMs | Video Understanding | Inference Acceleration

AIM reduces computational costs in multi-modal LLMs by merging similar visual tokens before the model and progressively pruning unimportant ones within the layers, without requiring training.

Core Problem

Multi-modal LLMs, especially for video, process thousands of redundant visual tokens, leading to excessive computational costs and latency that limit deployment and context length.

Why it matters:

High computational demand restricts the use of powerful MLLMs on resource-constrained edge devices.
To fit memory constraints, models often sample only a few frames (e.g., 32) from long videos, which loses much of the temporal information and weakens reasoning.
Existing pruning methods are often static (pruning at a fixed layer) or require expensive fine-tuning/retraining.

Concrete Example: A standard video LLM (LLaVA-OV) processing a long video is limited to sampling 32 frames to stay within compute budgets, missing key details. AIM reduces tokens per frame, allowing the model to sample 192 frames within the same FLOPs budget, capturing far more temporal context.

Key Novelty

Adaptive Inference via Token Merging and Pruning (AIM)

Merges highly similar visual tokens *before* they enter the LLM to immediately reduce sequence length based on cosine similarity.
Progressively prunes the remaining visual tokens *within* LLM layers using PageRank on attention weights to identify and discard unimportant tokens.
Decouples text and visual pruning: text tokens are always preserved to maintain reasoning capability, while visual tokens are aggressively reduced.

Architecture

The AIM pipeline illustrating the two-stage token reduction process.

Evaluation Highlights

Reduces FLOPs by 6.8x and prefill time by 8.0x compared to the LLaVA-OV-7B base model with minimal performance degradation.
Surpasses the state-of-the-art LLaVA-OV-7B on the MLVU long video benchmark by +4.6 points when utilizing the efficiency gains to process 192 frames instead of 32.
Outperforms baseline methods like FastV and PDrop while using significantly fewer FLOPs (e.g., requiring only ~69.5% of their compute for comparable accuracy).

Breakthrough Assessment

8/10

Provides a highly effective, training-free solution to a critical bottleneck (visual token redundancy). The ability to improve performance on long videos by trading token density for frame count is a significant practical insight.

⚙️ Technical Details

Problem Definition

Setting: Training-free efficiency optimization for pre-trained Multi-Modal LLMs (specifically Image and Video LLMs) during inference.

Inputs: Visual data (Image/Video) and Text Prompt

Outputs: Text Response

Pipeline Flow

Visual Encoder (produces visual tokens)
Token Merger (merges similar tokens)
LLM Layers 1 to L (performs reasoning and progressively prunes visual tokens)

System Modules

Visual Encoder (Input Processing)

Converts input images or video frames into a sequence of visual embeddings.

Model or implementation: SigLIP (inherited from LLaVA-OV base)

Token Merger (Input Processing)

Merges adjacent tokens with high cosine similarity to reduce initial token count before the LLM.

Model or implementation: Non-parametric algorithm
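
A minimal sketch of similarity-based merging of adjacent tokens, under assumptions that do not come from the paper: the function names, the greedy left-to-right grouping, and the fixed `threshold` are illustrative only. The summary specifies a retention ratio (25% for video, 12.5% for image) rather than a threshold, and AIM's actual matching procedure may differ. Each merged group is represented here by the mean of its members.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def merge_adjacent(tokens, threshold=0.9):
    """Greedy pass: fold a token into the previous group if it is similar
    to that group's mean; otherwise start a new group.

    `threshold` is a hypothetical knob, not a value from the paper.
    """
    groups = []  # each entry: (sum_vector, member_count)
    for tok in tokens:
        if groups:
            total, count = groups[-1]
            mean = [v / count for v in total]
            if cosine(mean, tok) >= threshold:
                groups[-1] = ([t + v for t, v in zip(total, tok)], count + 1)
                continue
        groups.append((list(tok), 1))
    # Replace every group by the mean of its members.
    return [[v / count for v in total] for total, count in groups]


frame_tokens = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0], [0.0, 1.0], [1.0, 1.0]]
merged = merge_adjacent(frame_tokens)
print(len(frame_tokens), "->", len(merged))
```

Because merging across frames disrupts temporal order, a caller would apply such a routine to each frame's tokens separately.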

LLM with Internal Pruner

Processes multimodal tokens. At specific layers, calculates token importance via PageRank on attention weights and prunes low-ranking visual tokens.

Model or implementation: Qwen2 (for LLaVA-OV) or Vicuna (for LLaVA-1.5)
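
The importance scoring described above can be sketched as power iteration over an attention matrix. This is an illustration under stated assumptions, not the authors' implementation: it takes a single row-stochastic matrix (for example, attention averaged over heads, which is an assumption), and the names `pagerank_scores`, `prune_visual`, and `keep_ratio` are hypothetical. A token scores highly when highly scored tokens attend to it; text tokens are always kept.

```python
import math


def pagerank_scores(attn, iters=20):
    """Power iteration: score[j] = sum_i score[i] * attn[i][j].

    attn[i][j] is how much token i (query) attends to token j (key);
    each row is assumed to sum to 1.
    """
    n = len(attn)
    scores = [1.0 / n] * n
    for _ in range(iters):
        new = [0.0] * n
        for i in range(n):
            for j in range(n):
                new[j] += scores[i] * attn[i][j]
        total = sum(new)
        scores = [s / total for s in new]
    return scores


def prune_visual(attn, is_visual, keep_ratio):
    """Return sorted indices to keep: all text tokens plus the
    top-ranked fraction of visual tokens."""
    scores = pagerank_scores(attn)
    visual = [i for i, flag in enumerate(is_visual) if flag]
    text = [i for i, flag in enumerate(is_visual) if not flag]
    n_keep = math.ceil(keep_ratio * len(visual))
    ranked = sorted(visual, key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep] + text)


# Toy example: tokens 0-2 are visual, token 3 is text.
attn = [
    [0.7, 0.1, 0.1, 0.1],
    [0.6, 0.2, 0.1, 0.1],
    [0.6, 0.1, 0.2, 0.1],
    [0.5, 0.1, 0.3, 0.1],
]
kept = prune_visual(attn, [True, True, True, False], keep_ratio=0.5)
print(kept)
```

In this toy matrix, token 1 receives the least attention among the visual tokens, so it is the one dropped.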

Novel Architectural Elements

Two-stage redundancy reduction: Pre-LLM merging + Intra-LLM pruning.
Adaptive scheduler defined by parameters l1 (start layer) and l2 (end layer) to control the pruning aggressiveness dynamically.
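
One way to picture a scheduler controlled by l1 and l2 is a per-layer retention ratio for visual tokens. The sketch below is hypothetical: the linear decay between l1 and l2, the `final_ratio` parameter, and the function name are assumptions made for illustration, since the summary does not state the exact schedule. Only the roles of l1 (start layer) and l2 (end layer) and the video values 14 and 22 come from the text.

```python
def visual_keep_ratio(layer, l1, l2, final_ratio):
    """Fraction of the post-merge visual tokens kept at a given layer.

    Hypothetical linear schedule: nothing is pruned before l1, the ratio
    falls linearly from l1 through l2, and stays at final_ratio afterwards.
    """
    if l1 > l2:
        raise ValueError("l1 must not exceed l2")
    if layer < l1:
        return 1.0
    if layer >= l2:
        return final_ratio
    progress = (layer - l1 + 1) / (l2 - l1 + 1)
    return 1.0 + progress * (final_ratio - 1.0)


# Video setting from the hyperparameter list: l1 = 14, l2 = 22.
# final_ratio = 0.0 and the 28-layer depth are illustrative values.
schedule = [visual_keep_ratio(layer, 14, 22, 0.0) for layer in range(28)]
for layer in (13, 14, 18, 22):
    print(layer, round(schedule[layer], 3))
```

Moving l1 earlier or l2 closer to l1 makes the schedule more aggressive, which matches the reported sensitivity to pruning too early.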

Modeling

Base Model: LLaVA-OneVision-7B (Video) and LLaVA-1.5-7B (Image)

Training Method: Inference-time optimization only (Training-free)

Key Hyperparameters:

merging_retention_ratio: 25% (Video), 12.5% (Image)
pruning_start_layer_l1: 14 (Video), 13 (Image)
pruning_end_layer_l2: 22 (Video), 21 (Image)
frame_count_base: 32 frames
frame_count_aim_long: 192 frames

Compute: Hardware is not fully specified. A100 GPUs are implied by standard usage of the LLM-Viewer library, but the paper reports efficiency mainly in FLOPs and gives no exact hardware specs for the inference-speed measurements.

Comparison to Prior Work

vs. FastV/VTW: AIM prunes progressively across layers rather than all-at-once, preserving information flow better.
vs. PDrop: AIM combines pre-LLM merging with intra-LLM pruning; PDrop only prunes at stage ends.
vs. LLaVA-PruMerge: AIM merges tokens *after* the encoder (agnostic to encoder architecture) and continues pruning inside the LLM.

Limitations

Pruning text tokens was found to degrade performance significantly, limiting the method to visual token reduction only.
Merging tokens across video frames disrupts temporal order, restricting merging to spatial tokens within individual frames.
Performance drops if pruning is applied too early in the LLM (before layer 14 for Video LLM), suggesting early layers are critical for cross-modal fusion.

Reproducibility

Code: https://github.com/LaVi-Lab/AIM

Code is publicly available on GitHub. The paper explicitly states the hyperparameters needed for reproduction (layers, ratios), and the base models (LLaVA-OV, LLaVA-1.5) have open weights.

📊 Experiments & Results

Evaluation Setup

Zero-shot inference on video and image benchmarks using pre-trained MLLMs.

Benchmarks:

VideoMME (Comprehensive video understanding)
MLVU (Long video understanding)
EgoSchema (Egocentric video reasoning)
MVBench (Temporal action understanding)
GQA (Image visual reasoning)

Metrics:

Accuracy/Score
FLOPs (TFLOPs)
Prefill Time (seconds)
Statistical methodology: Not explicitly reported in the paper

Key Results

Efficiency comparisons demonstrate massive FLOPs reduction with minimal accuracy loss on the VideoMME benchmark.

Benchmark	Metric	Baseline	This Paper	Δ
VideoMME	FLOPs (T)	99.63	14.76	-84.87
VideoMME	Score	52.8	51.6	-1.2

Long video understanding results showing that AIM allows processing more frames within the same compute budget, leading to performance gains.

Benchmark	Metric	Baseline	This Paper	Δ
MLVU	Score	53.6	58.2	+4.6
VideoMME	Score	52.8	53.9	+1.1
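
As a quick check, the VideoMME FLOPs figures above can be turned into the reduction factor quoted in the highlights. This is plain arithmetic on the two reported numbers; the variable names are illustrative.

```python
baseline_tflops = 99.63  # LLaVA-OV-7B baseline on VideoMME
aim_tflops = 14.76       # same model with AIM

reduction_factor = baseline_tflops / aim_tflops
saved_fraction = 1 - aim_tflops / baseline_tflops

print(f"{reduction_factor:.2f}x fewer FLOPs, {saved_fraction:.1%} saved")
# prints "6.75x fewer FLOPs, 85.2% saved"
```

The ratio of 6.75 is consistent with the 6.8x figure reported after rounding to one decimal place.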

Experiment Figures

Trade-off curves between Performance (y-axis) and FLOPs (x-axis) for AIM and baselines.

Main Takeaways

Visual data contains massive redundancy; 75% of visual tokens can often be removed with minimal impact on accuracy.
For long videos, the bottleneck is often the number of frames; reducing tokens per frame to allow more frames (higher temporal resolution) yields better results than keeping high-fidelity tokens for few frames.
Multi-modal LLMs require full visual information in early layers for cross-modal fusion, but are robust to aggressive visual pruning in later layers where text reasoning dominates.

📚 Prerequisite Knowledge

Prerequisites

Transformer architecture (Self-Attention, Layers)
Multi-Modal LLM basics (Visual Encoder + Adapter + LLM)
Basic graph theory (PageRank algorithm)

Key Terms

LLaVA-OneVision: A state-of-the-art open-source multi-modal LLM designed for both image and video understanding, used as the base model.

FLOPs: Floating-point operations, a measure of the computational cost of model inference.

Token Merging: Combining multiple token embeddings into a single token based on similarity (e.g., cosine similarity) to reduce sequence length.

Token Pruning: Completely removing tokens from the sequence at specific layers to save computation in subsequent layers.

PageRank: An algorithm originally for web search, used here to calculate the 'importance' of each visual token based on the attention matrix (how much other tokens attend to it).

Prefill Time: The time taken to process the initial input prompt (images + text) before generating the first new token.

MLVU: Multi-task Long Video Understanding, a benchmark focused on reasoning over and comprehension of long videos.