Enhancing Instruction-Following Capability of Visual-Language Models by Reducing Image Redundancy

📝 Paper Summary

Multimodal Large Language Models (MLLMs), Instruction Following, Visual Token Compression

The paper proposes compressing redundant visual tokens and inhibiting irrelevant text-to-image attention to significantly improve MLLM instruction-following capabilities without sacrificing multimodal understanding.

Core Problem

Multimodal Large Language Models (MLLMs) lag significantly behind their base LLMs in instruction-following capability, largely because high information redundancy in the visual modality interferes with precise text generation.

Why it matters:

Replacing multimodal inputs with text-only inputs significantly increases instruction-following capabilities, indicating visual tokens actively degrade performance on complex instructions.
Simple down-sampling of images improves instruction following but severely hurts multimodal understanding (e.g., VQA performance drops).
Alignment with human intentions requires precise instruction following, which current MLLMs fail at compared to text-only LLMs.

Concrete Example: In a pilot study, simply spatially down-sampling visual tokens improved instruction following (e.g., outputting JSON) by removing redundancy, but it destroyed the model's ability to answer detailed questions about the image content (multimodal understanding).
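To make the pilot-study operation concrete, here is a minimal sketch of spatial down-sampling on a ViT token grid. It assumes average pooling over square windows; the summary does not say which pooling the pilot study used, and `downsample_tokens` is a hypothetical name chosen for illustration.

```python
import numpy as np

def downsample_tokens(tokens, grid, factor=2):
    """Average-pool a (grid*grid, D) token sequence over factor x factor windows.

    Assumption: tokens are laid out row-major on a square grid.
    """
    tokens = np.asarray(tokens, dtype=float)
    d = tokens.shape[1]
    g = grid // factor
    x = tokens.reshape(grid, grid, d)
    x = x[: g * factor, : g * factor]            # drop rows/cols that do not fill a window
    x = x.reshape(g, factor, g, factor, d).mean(axis=(1, 3))
    return x.reshape(g * g, d)
```

With a 24×24 grid (576 tokens) and `factor=2`, this returns 144 tokens. Every window is averaged regardless of its content, which is consistent with the loss of image detail the pilot study observed.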

Key Novelty

Visual-Modality Token Compression (VMTC) & Cross-Modality Attention Inhibition (CMAI)

VMTC identifies 'redundant' background image tokens using attention scores, clusters them to preserve semantic information, and merges them, keeping only essential foreground tokens.
CMAI prevents the LLM from attending to irrelevant image tokens during text generation by calculating a 'focus score' (derived from text-to-text and text-to-image attention) and masking out low-score pairs.
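The VMTC idea described above (keep high-attention tokens, cluster the rest, merge each cluster) might be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the per-token importance score is taken as given (for example, [CLS]-to-patch attention), the merge is a plain cluster mean, and `compress_tokens`, `keep_ratio`, and `num_clusters` are hypothetical names.

```python
import numpy as np

def compress_tokens(tokens, attn_scores, keep_ratio=0.5, num_clusters=4, iters=10):
    """Keep high-attention tokens; merge the remaining ones into cluster means.

    tokens:      (N, D) visual token embeddings
    attn_scores: (N,) importance per token (assumed input, e.g. [CLS] attention)
    """
    tokens = np.asarray(tokens, dtype=float)
    attn_scores = np.asarray(attn_scores, dtype=float)
    n = tokens.shape[0]
    n_keep = max(1, int(round(n * keep_ratio)))
    order = np.argsort(-attn_scores)             # highest score first
    keep_idx = np.sort(order[:n_keep])           # preserve original token order
    rest = tokens[order[n_keep:]]                # tokens treated as redundant
    kept = tokens[keep_idx]
    if rest.shape[0] == 0:
        return kept
    k = min(num_clusters, rest.shape[0])
    # Plain k-means, initialised deterministically from the first k redundant tokens.
    centroids = rest[:k].copy()
    for _ in range(iters):
        dists = ((rest[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        assign = dists.argmin(axis=1)
        for c in range(k):
            members = rest[assign == c]
            if len(members) > 0:
                centroids[c] = members.mean(axis=0)
    # Output: kept foreground tokens followed by one merged token per cluster.
    return np.concatenate([kept, centroids], axis=0)
```

Unlike pure pruning, the redundant tokens still contribute to the output through their cluster means, which is the property the summary credits for preserving semantic information.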

Architecture

The overall architecture of the proposed method, detailing the VMTC module within the Vision Transformer and the CMAI module within the LLM.

Evaluation Highlights

+9.5 percentage-point improvement in instruction-following success rate (46.1 → 55.6) compared to the LLaVA-1.5 baseline.
Achieves state-of-the-art instruction following while maintaining multimodal understanding (e.g., only a 0.4-point drop on GQA vs. a 2.1-point drop for simple down-sampling).
+7.8 score improvement on the MME benchmark (1510.7 → 1518.5) compared to LLaVA-1.5.

Breakthrough Assessment

7/10

Identifies a novel correlation between visual redundancy and poor instruction following. The proposed solution effectively balances the trade-off between following complex instructions and retaining visual detail.

⚙️ Technical Details

Problem Definition

Setting: Multimodal instruction following and visual question answering

Inputs: Image I and Instruction/Text X_q

Outputs: Textual response X_a

Pipeline Flow

Visual Encoder (ViT) → Visual-Modality Token Compression (VMTC)
Projection Layer
Large Language Model (LLM) with Cross-Modality Attention Inhibition (CMAI)

System Modules

Visual Encoder (Visual Processing)

Encodes input image into a sequence of visual tokens

Model or implementation: CLIP-ViT-L/14

Visual-Modality Token Compression (VMTC) (Visual Processing)

Compresses redundant visual tokens by keeping high-attention tokens and clustering/merging low-attention ones

Model or implementation: Custom module inserted into ViT layers

Projection Layer

Aligns visual token dimensions with LLM input dimensions

Model or implementation: Two-layer MLP with GELU activation
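A two-layer MLP projector of this kind can be sketched in a few lines. The weights are passed in explicitly, the GELU uses the common tanh approximation, and the function names are hypothetical; in a LLaVA-1.5-style setup the input width would be the CLIP-ViT-L/14 hidden size (1024) and the output width the Vicuna-7B hidden size (4096).

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def project(visual_tokens, w1, b1, w2, b2):
    """Map (N, d_vision) visual tokens to (N, d_llm): Linear -> GELU -> Linear."""
    hidden = gelu(visual_tokens @ w1 + b1)
    return hidden @ w2 + b2
```

Because the projector acts on each token independently, compressing tokens before it reduces its cost proportionally.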

LLM with CMAI

Generates text response while inhibiting attention to irrelevant visual tokens

Model or implementation: Vicuna-v1.5 (7B/13B)

Novel Architectural Elements

VMTC: Multi-stage token compression inside the ViT that clusters redundant tokens rather than just pruning them
CMAI: Dynamic attention inhibition mask injected into the LLM attention layers based on calculated text-token-to-image-token relevance scores
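The selection step of CMAI could look roughly like the sketch below. The summary does not give the focus-score formula, so the product of text-to-text and text-to-image attention used here is a placeholder assumption, as is the rule of inhibiting a fixed fraction of the lowest-scoring image tokens per text token. `inhibition_mask` and `max_ratio` are hypothetical names.

```python
import numpy as np

def inhibition_mask(attn_tt, attn_ti, max_ratio=0.6):
    """Return a boolean (T, V) mask, True where text-to-image attention is inhibited.

    attn_tt: (T, T) text-to-text attention weights
    attn_ti: (T, V) text-to-image attention weights
    """
    attn_tt = np.asarray(attn_tt, dtype=float)
    attn_ti = np.asarray(attn_ti, dtype=float)
    # Placeholder focus score (assumption): text-to-image attention propagated
    # through text-to-text attention.
    focus = attn_tt @ attn_ti
    n_inhibit = int(np.floor(max_ratio * focus.shape[1]))
    mask = np.zeros(focus.shape, dtype=bool)
    if n_inhibit == 0:
        return mask
    # For each text token, mark its n_inhibit lowest-scoring image tokens.
    lowest = np.argsort(focus, axis=1)[:, :n_inhibit]
    np.put_along_axis(mask, lowest, True, axis=1)
    return mask
```

The mask is computed per text token, so different text tokens can ignore different parts of the image.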

Modeling

Base Model: LLaVA-1.5 (Vicuna-v1.5 + CLIP-ViT-L/14)

Training Method: Instruction Tuning

Adaptation: Fine-tuning of projection layer and LLM (following LLaVA-1.5 settings)

Training Data:

Follows LLaVA-1.5 datasets and configurations

Key Hyperparameters:

token_compression_ratio: 50%
max_attention_inhibition_ratio: 60%
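As a rough illustration of what these ratios mean for sequence length, the arithmetic below assumes the LLaVA-1.5 default input of 336×336 pixels with 14×14 patches (the summary does not state the resolution) and reads the compression ratio as the fraction of visual tokens removed, which is also an assumption. Both function names are hypothetical.

```python
def num_patch_tokens(image_size, patch_size):
    """Number of patch tokens for a square image split into square patches."""
    side = image_size // patch_size
    return side * side

def tokens_after_compression(n_tokens, compression_ratio):
    """Tokens left if compression_ratio is the fraction removed (assumption)."""
    return n_tokens - int(n_tokens * compression_ratio)

n = num_patch_tokens(336, 14)                    # 24 * 24 patches
remaining = tokens_after_compression(n, 0.5)
```

Under these assumptions the LLM would see 288 visual tokens instead of 576.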

Compute: Not reported in the paper

Comparison to Prior Work

vs. LLaVA-PruMerge: VMTC prunes across multiple layers and uses clustering to preserve semantic info, whereas PruMerge focuses on efficiency in the last layer
vs. Spatially Down-sampled LLaVA: Preserves multimodal understanding (VQA accuracy) while achieving similar instruction-following gains
vs. InstructBLIP: Achieves higher instruction-following rates by directly addressing visual redundancy in the token sequence

Limitations

VMTC adds computational overhead due to K-Means clustering during inference
Performance on OCR-heavy tasks (TextVQA) can drop if background tokens containing text are aggressively compressed
Requires tuning of inhibition ratios (hyperparameters) for optimal balance

Reproducibility

The paper does not explicitly provide a code URL. It mentions using LLaVA-1.5 datasets and training configurations. VMTC and CMAI logic is described mathematically.

📊 Experiments & Results

Evaluation Setup

Evaluation on standard VQA benchmarks and a specific set of verifiable instruction-following tasks.

Benchmarks:

VQA-V2 (Visual Question Answering)
GQA (Visual Reasoning/QA)
TextVQA (OCR-based QA)
MME (Comprehensive MLLM Benchmark)
MMBench (Comprehensive MLLM Benchmark)
IFEval-based Tasks (Instruction Following, 16 verifiable tasks) [New]

Metrics:

Success Rate (for instruction following)
Accuracy/Score (for VQA benchmarks)
Statistical methodology: Not explicitly reported in the paper
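A success rate over verifiable instructions can be computed with simple programmatic checks. The two checks below (valid JSON, word limit) are illustrative stand-ins rather than the paper's 16 tasks, and all function names are hypothetical.

```python
import json

def is_valid_json(response):
    """True if the response parses as JSON."""
    try:
        json.loads(response)
        return True
    except json.JSONDecodeError:
        return False

def within_word_limit(response, limit):
    """True if the response has at most `limit` whitespace-separated words."""
    return len(response.split()) <= limit

def success_rate(results):
    """Percentage of passed checks in a non-empty list of booleans."""
    return 100.0 * sum(results) / len(results)
```

Each response is scored pass/fail against its instruction, and the reported metric is the percentage of passes.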

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Comparison of Instruction Following capabilities showing significant gains over the baseline LLaVA-1.5.
Instruction Following Tasks	Success Rate	46.1	55.6	+9.5
Instruction Following Tasks	Success Rate	45.7	55.1	+9.4
Multimodal Understanding results show the method preserves performance unlike naive down-sampling.
GQA	Accuracy	63.3	63.3	0.0
GQA	Accuracy	61.7	63.3	+1.6
MME	Score	1510.7	1518.5	+7.8
Instruction Following Tasks	Success Rate	46.1	51.3	+5.2
Instruction Following Tasks	Success Rate	46.1	50.0	+3.9

Experiment Figures

Comparison of instruction-following capabilities between GPT-4V (multimodal) and GPT-4 (text-only) inputs.

Pilot experiment results showing the trade-off between instruction following and multimodal understanding under different spatial down-sampling ratios.

Main Takeaways

Visual redundancy in MLLMs negatively impacts instruction-following; removing it helps, but naive removal hurts understanding.
Clustering and merging redundant tokens (VMTC) is superior to simple pruning or spatial down-sampling for preserving semantic information.
The combination of token compression (VMTC) and attention inhibition (CMAI) yields the best balance, improving instruction-following success rate by about 9.5 points while maintaining VQA performance.

📚 Prerequisite Knowledge

Prerequisites

Transformer architecture (Self-attention, Cross-attention)
Vision Transformers (ViT)
Multimodal Large Language Models (LLaVA architecture)

Key Terms

MLLM: Multimodal Large Language Model, an AI system that takes both images and text as input; the models studied here generate text responses

ViT: Vision Transformer—a model architecture that processes images as sequences of patches (tokens) using self-attention

Instruction Following: The ability of a model to precisely adhere to constraints in a prompt (e.g., 'respond in JSON', 'limit to 10 words')

Token Compression: Reducing the number of tokens representing an image to decrease computational cost and redundancy

Spatial Down-sampling: A naive method of reducing image tokens by simply pooling or skipping spatial patches, often leading to information loss

Attention Inhibition: Selectively suppressing (masking) attention weights between specific token pairs to prevent the model from focusing on irrelevant information

K-Means: A clustering algorithm used here to group semantically similar redundant visual tokens before merging them

Causal Mask: A mask used in autoregressive language models to ensure predictions only depend on previous tokens; modified here to inhibit attention to specific image tokens
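To illustrate the last term, the sketch below folds a set of inhibited text-to-image pairs into an ordinary causal mask. It assumes a simplified sequence layout of image tokens followed by text tokens and an additive mask convention (0 where attention is allowed, negative infinity where it is blocked); `build_attention_mask` is a hypothetical name.

```python
import numpy as np

def build_attention_mask(n_image, n_text, inhibit):
    """Additive attention mask for a sequence laid out as [image tokens, text tokens].

    inhibit: boolean (n_text, n_image), True where a text token must not
             attend to an image token.
    Returns a (L, L) array: 0.0 where attention is allowed, -inf elsewhere.
    """
    inhibit = np.asarray(inhibit, dtype=bool)
    total = n_image + n_text
    allowed = np.tril(np.ones((total, total), dtype=bool))   # standard causal mask
    allowed[n_image:, :n_image] &= ~inhibit                  # block inhibited pairs
    return np.where(allowed, 0.0, -np.inf)
```

The mask would be added to the attention logits before the softmax. Each token can still attend to itself, so no row is fully blocked.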