Multimodal Language Models See Better When They Look Shallower

📝 Paper Summary

Multimodal Large Language Models (MLLMs), Vision Transformer (ViT) Analysis

Systematic analysis reveals that shallow and middle ViT layers often outperform deep layers on fine-grained visual tasks, motivating a simple multi-layer fusion strategy that improves MLLM performance.

Core Problem

Most MLLMs use only the final or penultimate layer of the Vision Transformer (ViT) based on heuristics, ignoring potentially richer fine-grained visual information present in shallower layers.

Why it matters:

Current deep-layer bias leads to suboptimal performance on tasks requiring fine-grained perception like counting and positioning.
Scaling up the LLM size does not compensate for the loss of visual detail in the deep layers of the vision encoder.
Existing fusion methods are often ad-hoc or heuristic rather than grounded in a systematic analysis of layer-wise efficacy.

Concrete Example: In position-related tasks within the MME benchmark, using layer 18 (middle) outperforms the commonly used penultimate layer by 20%, showing that deep layers lose critical localization information.

Key Novelty

Layer-wise Visual Probing and Simple Fusion

Trains a separate MLLM on each individual ViT layer and measures the downstream performance of every layer across diverse benchmarks.
Identifies three distinct representation spaces (shallow, middle, deep) based on cosine similarity and performance patterns.
Proposes a lightweight fusion method that linearly projects and sums features from one representative layer of each group (shallow, middle, deep) to capture both semantics and fine details.
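The cosine-similarity grouping described above can be illustrated with a small sketch. It assumes per-layer features are already available as arrays of shape (num_tokens, dim). The synthetic features, the function name `layer_similarity_matrix`, and the mean-pooling over tokens are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def layer_similarity_matrix(layer_feats):
    """Pairwise cosine similarity between mean-pooled layer features.

    layer_feats: list of arrays, each (num_tokens, dim), one per ViT layer.
    Returns an (L, L) symmetric matrix with ones on the diagonal.
    """
    pooled = np.stack([f.mean(axis=0) for f in layer_feats])       # (L, dim)
    unit = pooled / np.linalg.norm(pooled, axis=1, keepdims=True)  # unit-length rows
    return unit @ unit.T

# Synthetic stand-in for real ViT features: three "regimes" of four layers each.
rng = np.random.default_rng(0)
bases = [rng.normal(size=64) for _ in range(3)]
layer_feats = [
    bases[g] + 0.1 * rng.normal(size=(16, 64))
    for g in range(3)
    for _ in range(4)
]
sim = layer_similarity_matrix(layer_feats)
```

With real features, blocks of high similarity along the diagonal of `sim` would suggest candidate layer groups; the paper additionally uses performance patterns to define its three groups.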

Architecture

Illustration of the layer-wise representation groups and the fusion strategy.

Evaluation Highlights

Layer 18 (middle) outperforms the penultimate layer by 20 points on MME position tasks (100.0 to 120.0, a 20% relative gain) and by 3.0 points on CVBench (57.3 to 60.3) using a 1.4B model.
Proposed simple fusion method consistently outperforms single-layer baselines and complex fusion methods like DenseConnector and MMFuser across 10 benchmarks.
On POPE (hallucination), half of the middle layers outperform the penultimate layer, suggesting reduced hallucination when using features that aren't over-optimized for text alignment.

Breakthrough Assessment

7/10

Provides the first comprehensive systematic analysis of layer-wise utility in MLLMs, challenging the standard practice of using only deep layers. The proposed solution is simple yet effective.

⚙️ Technical Details

Problem Definition

Setting: Multimodal Large Language Model (MLLM) construction where a pre-trained Vision Transformer encodes images into features for an LLM.

Inputs: Image I and text instructions/queries.

Outputs: Textual response generated by the MLLM.

Pipeline Flow

Vision Encoder (CLIP-ViT) extracts features from specific layers
Feature Selection/Fusion (Single layer or Multi-layer fusion)
Connector (MLP) projects visual features to LLM space
LLM (MobileLLaMA/Vicuna) generates response
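As a rough shape walk-through of the first pipeline step, the sketch below computes how many patch tokens a 336px ViT-L/14 encoder produces and shows one way to pick a single layer's output. The helper names and the indexing convention (index 0 holds the embedding output, index k the output of layer k) are assumptions made for illustration.

```python
IMAGE_SIZE = 336   # input resolution of CLIP ViT-L/14 (336px)
PATCH_SIZE = 14    # the "/14" in ViT-L/14
NUM_LAYERS = 24    # layer 24 is the final layer, layer 23 the penultimate

def num_patch_tokens(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches covering the image."""
    side = image_size // patch_size
    return side * side

def select_layer(hidden_states, layer: int):
    """Pick one layer's output.

    Assumed convention: hidden_states[0] is the embedding output and
    hidden_states[k] is the output of transformer layer k (1-based).
    """
    return hidden_states[layer]

tokens = num_patch_tokens(IMAGE_SIZE, PATCH_SIZE)  # 24 * 24 patches
```

Under this convention, `select_layer(hidden_states, NUM_LAYERS - 1)` would return the penultimate-layer features used by the single-layer baseline.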

System Modules

Vision Encoder

Extract visual features from images using a pre-trained transformer

Model or implementation: CLIP ViT-L/14 (336px) [frozen]

Connector

Project visual features into the language embedding space

Model or implementation: Two-layer MLP (GELU activation)
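A minimal NumPy sketch of such a two-layer MLP connector follows. The class name `MLPConnector`, the random initialization, the choice of the LLM width as the hidden width, and the value 2048 for that width are assumptions for illustration; 1024 is the hidden size of ViT-L.

```python
import numpy as np

def gelu(x):
    """GELU activation (tanh approximation)."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

class MLPConnector:
    """Linear -> GELU -> Linear, mapping vision features to the LLM width."""

    def __init__(self, vision_dim, llm_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(scale=0.02, size=(vision_dim, llm_dim))
        self.b1 = np.zeros(llm_dim)
        self.w2 = rng.normal(scale=0.02, size=(llm_dim, llm_dim))
        self.b2 = np.zeros(llm_dim)

    def __call__(self, feats):
        # feats: (num_tokens, vision_dim) -> (num_tokens, llm_dim)
        return gelu(feats @ self.w1 + self.b1) @ self.w2 + self.b2

connector = MLPConnector(vision_dim=1024, llm_dim=2048)  # 2048 is a placeholder width
visual_tokens = np.random.default_rng(1).normal(size=(576, 1024))
llm_inputs = connector(visual_tokens)
```

Each of the 576 patch tokens is mapped independently, so the sequence length is unchanged and only the feature width changes.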

Language Model

Generate text response based on visual and textual inputs

Model or implementation: MobileLLaMA-1.4B/2.7B or Vicuna-v1.5-7B

Novel Architectural Elements

Lightweight Feature Fusion: Integrates features from shallow (layer 8), middle (layer 18), and deep (layer 23) layers using simple addition of projected features, grounded in empirical layer grouping.
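The fusion step can be sketched as one linear projection per selected layer followed by an element-wise sum. This is a minimal illustration under assumptions: the function name `fuse_layers`, the square projection matrices, and the absence of bias terms are hypothetical choices, and the paper's exact parameterization may differ.

```python
import numpy as np

FUSION_LAYERS = (8, 18, 23)  # shallow, middle, deep representatives

def fuse_layers(hidden_states, projections, layers=FUSION_LAYERS):
    """Sum linearly projected features from the chosen layers.

    hidden_states: sequence where hidden_states[k] is the (tokens, dim)
        output of layer k (assumed indexing convention).
    projections: dict mapping layer index -> (dim, out_dim) weight matrix
        (hypothetical parameterization).
    """
    return sum(hidden_states[k] @ projections[k] for k in layers)

# Toy example with small shapes and identity projections.
rng = np.random.default_rng(0)
hidden_states = [rng.normal(size=(4, 8)) for _ in range(25)]
projections = {k: np.eye(8) for k in FUSION_LAYERS}
fused = fuse_layers(hidden_states, projections)
```

Because the projected features are summed rather than concatenated, the fused output keeps the same number of tokens and the same width as a single layer, so the downstream connector and LLM input length are unaffected.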

Modeling

Base Model: CLIP ViT-L/14 (vision) + MobileLLaMA-1.4B/2.7B or Vicuna-7B (language)

Training Method: Supervised Fine-Tuning (SFT) in two stages (Feature Alignment and Visual Instruction Tuning)

Objective Functions:

Purpose: Standard autoregressive language modeling.

Formally: Next-token prediction loss.
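In standard notation, a next-token prediction loss of this kind can be written as below. The symbols are introduced here for illustration and are not taken from the paper: y_t is the t-th response token, H_v the projected visual tokens, x the text instruction, and θ the trainable parameters.

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_{\theta}\left(y_t \mid y_{<t},\, \mathbf{H}_v,\, \mathbf{x}\right)
```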

Adaptation: Full fine-tuning of the LLM and Connector; Vision Encoder is frozen.

Trainable Parameters: Connector (MLP) and LLM parameters

Training Data:

Stage 1: LLaVA 558K image-caption pairs (filtered from CC3M)
Stage 2: LLaVA 665K conversational data, or Cambrian-1 737K, or Custom 1M dataset

Key Hyperparameters:

learning_rate_stage1: 1e-3
learning_rate_stage2: 2e-5
batch_size_stage1: 256
batch_size_stage2: 128
optimizer: AdamW
scheduler: Cosine annealing
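A cosine annealing schedule with the peak learning rates listed above could look like the following sketch. The function name, the absence of warmup, and a minimum learning rate of zero are assumptions; the summary does not state those details.

```python
import math

# Peak learning rates per stage, as listed in the hyperparameters.
STAGE_LR = {"stage1": 1e-3, "stage2": 2e-5}

def cosine_annealing_lr(step, total_steps, peak_lr, min_lr=0.0):
    """Learning rate at `step`, decaying from peak_lr to min_lr along a half cosine."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Example: stage 2 learning rate halfway through a hypothetical 1000-step run.
halfway_lr = cosine_annealing_lr(500, 1000, STAGE_LR["stage2"])
```

The schedule starts at the peak value, reaches half of it at the midpoint, and decays to `min_lr` at the last step.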

Compute: 4x NVIDIA A100 80GB GPUs; 2 hours for stage 1, 8 hours for stage 2 (for 1.4B model)

Comparison to Prior Work

vs. LLaVA-1.5: Incorporates shallow and middle layers explicitly rather than relying solely on the penultimate layer.
vs. DenseConnector/MMFuser: Uses a simpler, lightweight fusion (summation of projected features from 3 specific layers) based on empirical grouping, rather than complex dense connections or all-layer fusion.
vs. Qwen-VL/InternVL [not cited in paper]: Qwen-VL uses the final layer; this paper shows the final layer degrades performance compared to the penultimate and proposes multi-layer fusion.

Limitations

Analysis is primarily based on CLIP-ViT; findings may vary for other vision encoders like SigLIP or DINOv2.
Fusion strategy is simple addition; more complex attention-based fusion might yield better results (though at higher cost).
Gains from shallow layers diminish as the LLM scale increases (7B vs 1.4B), suggesting larger LLMs can better extract information from deep layers.

Reproducibility

Code: https://github.com/EIT-NLP/VisualProbing-for-MLLM

Datasets used (LLaVA-1.5, Cambrian-1) are public. Model weights are standard open-source models (CLIP, MobileLLaMA, Vicuna).

📊 Experiments & Results

Evaluation Setup

Evaluation across 10 benchmarks covering General, OCR, Vision-centric, and Hallucination tasks.

Benchmarks:

MME (General perception/cognition, Yes/No format)
MMBench (Multiple choice general QA)
TextVQA (OCR / Text reading)
POPE (Hallucination evaluation)
CVBench (2D/3D perception)

Metrics:

Accuracy
Score (Benchmark specific)
F1 score (for POPE)
Statistical methodology: Not explicitly reported in the paper
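For yes/no benchmarks such as POPE, accuracy and F1 can be computed from paired predictions and labels as in the sketch below. The function name `binary_metrics` and the toy answers are illustrative assumptions, not the benchmark's official scoring script.

```python
def binary_metrics(preds, labels, positive="yes"):
    """Accuracy, precision, recall, and F1 for yes/no answers."""
    pairs = list(zip(preds, labels))
    tp = sum(p == positive and l == positive for p, l in pairs)
    fp = sum(p == positive and l != positive for p, l in pairs)
    fn = sum(p != positive and l == positive for p, l in pairs)
    correct = sum(p == l for p, l in pairs)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy example: five questions of the form "Is there a <object> in the image?"
preds = ["yes", "yes", "no", "no", "yes"]
labels = ["yes", "no", "no", "yes", "yes"]
metrics = binary_metrics(preds, labels)
```

A false "yes" on an absent object counts as a false positive, which is why precision and F1 are sensitive to object hallucination.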

Key Results

Layer-wise performance analysis (1.4B model) comparing specific single layers.

Benchmark	Metric	Baseline	This Paper	Δ
CVBench	Accuracy	57.3	60.3	+3.0
RefCOCO	Score	78.4	79.8	+1.4
MME-Cognitive	Score	301.8	321.8	+20.0
MME-Position	Score	100.0	120.0	+20.0

Comparison of the proposed fusion method against baselines and other fusion strategies (1.4B model).

Benchmark	Metric	Baseline	This Paper	Δ
TextVQA	Accuracy	45.7	48.2	+2.5
MME	Score	1290	1342	+52
POPE	Accuracy	85.7	86.6	+0.9
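The reported deltas are absolute differences in benchmark points. The short sketch below recomputes them from the table and also derives the relative gain, which helps distinguish "points" from "percent". The helper name `gains` is illustrative.

```python
# (benchmark, baseline, this paper), copied from the table above.
rows = [
    ("CVBench", 57.3, 60.3),
    ("RefCOCO", 78.4, 79.8),
    ("MME-Cognitive", 301.8, 321.8),
    ("MME-Position", 100.0, 120.0),
    ("TextVQA", 45.7, 48.2),
    ("MME", 1290.0, 1342.0),
    ("POPE", 85.7, 86.6),
]

def gains(baseline, ours):
    """Return (absolute gain in points, relative gain in percent), rounded to 0.1."""
    absolute = ours - baseline
    relative_pct = 100.0 * absolute / baseline
    return round(absolute, 1), round(relative_pct, 1)

summary = {name: gains(b, o) for name, b, o in rows}
for name, (absolute, relative_pct) in summary.items():
    print(f"{name}: +{absolute} points ({relative_pct}% relative)")
```

Only for MME-Position, where the baseline is exactly 100.0, do the absolute and relative gains coincide numerically.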

Experiment Figures

Radar charts comparing the performance of Shallow (L3), Middle (L18), Deep (L23), and Final (L24) layers across multiple benchmarks.

Detailed breakdown of MME subtask performance by layer depth.

Main Takeaways

The final layer of CLIP-ViT is consistently suboptimal across tasks due to over-alignment with text (contrastive loss) at the expense of local visual details.
The penultimate layer is the best single deep layer, balancing visual detail and text alignment, but middle layers (e.g., L18) significantly outperform it on counting, positioning, and existence tasks.
Increasing LLM size (1.4B -> 7B) or data scale (558k -> 1M) does not fully compensate for the visual information loss in deep layers; middle layers retain an advantage in vision-centric subtasks.
A simple linear fusion of one layer from each group (Shallow, Middle, Deep) outperforms more complex fusion modules such as DenseConnector and MMFuser, supporting the layer grouping hypothesis.

📚 Prerequisite Knowledge

Prerequisites

Architecture of Vision Transformers (ViT) and CLIP
Multimodal LLM components (connector, LLM, vision encoder)
Layer-wise representation analysis (cosine similarity)

Key Terms

ViT: Vision Transformer—a neural network that processes images by dividing them into patches and processing them with transformer blocks.

CLIP: Contrastive Language-Image Pre-training—a model trained to align image and text representations, commonly used as the vision encoder in MLLMs.

MLLM: Multimodal Large Language Model, an AI system that takes both images and text as input and generates text responses (e.g., GPT-4V, LLaVA).

Penultimate layer: The second-to-last layer of a neural network; often used in ViT feature extraction to avoid over-fitting to the specific pre-training objective of the final layer.

Linear probing: A technique to analyze representations by training a simple linear classifier on top of frozen features.

POPE: A benchmark for evaluating object hallucination (seeing things that aren't there) in MLLMs.

MME: A comprehensive evaluation benchmark for MLLMs covering perception and cognition tasks.

OCR: Optical Character Recognition—the task of recognizing and reading text embedded within images.

Visual grounding: The ability of a model to locate and refer to specific objects within an image.