What matters when building vision-language models?

📝 Paper Summary

Vision-Language Model Architecture, Multimodal Pre-training

The authors identify that fully autoregressive architectures with unfrozen backbones outperform cross-attention approaches, leading to Idefics2, an 8B parameter model achieving state-of-the-art performance in its size class.

Core Problem

Critical design decisions for Vision-Language Models (VLMs)—such as architecture type, connector design, and training stability—are often adopted without experimental justification, hindering the community's understanding of what truly drives performance.

Why it matters:

Disparate design choices (e.g., cross-attention vs. concatenation) make it difficult to attribute performance gains to specific components
Standard practices like resizing images to fixed squares distort aspect ratios, hurting performance on tasks involving text reading or fine details
Inefficient architectures result in excessively long visual token sequences, increasing compute costs and limiting context windows

Concrete Example: Standard VLMs often resize document images to low-resolution squares, making text unreadable. Idefics2 preserves the original aspect ratio and splits the image into sub-crops (e.g., 4 crops + original), allowing the model to read dense text in scanned PDFs where prior models failed.

Key Novelty

Idefics2 (Optimized Fully Autoregressive VLM)

Systematic ablation revealing that fully autoregressive architectures (concatenating visual tokens to text) outperform cross-attention architectures only when the pre-trained backbones are unfrozen (via LoRA)
Adoption of a 'split-and-crop' strategy where images are decomposed into sub-images to boost resolution for OCR tasks without altering the model signature
Use of learned pooling (Perceiver Resampler) to drastically reduce visual token count (729 to 64) while improving downstream performance

Architecture

Illustration of the fully-autoregressive architecture used in Idefics2.

Evaluation Highlights

+12.9 points average improvement across 4 benchmarks (VQAv2, TextVQA, OKVQA, COCO) when unfreezing backbones in a fully autoregressive architecture compared to freezing them
Perceiver Resampler pooling improves performance by +8.5 points while reducing visual tokens per image from 729 to 64 compared to no pooling
Replacing LLaMA-1-7B with Mistral-7B yields a +5.1 point boost; replacing CLIP-ViT-H with SigLIP-SO400M yields a +3.3 point boost

Breakthrough Assessment

8/10

Provides much-needed experimental clarity on VLM design choices (architecture trade-offs, freezing vs. unfreezing) and releases a strong 8B open model (Idefics2) that rivals much larger closed models.

⚙️ Technical Details

Problem Definition

Setting: Multimodal generative modeling where the model takes a sequence of text and images (potentially interleaved) and generates text output

Inputs: Sequence containing text tokens and images (processed into visual tokens)

Outputs: Generated text tokens (e.g., answers to questions, captions, extracted text)

Pipeline Flow

Input Processing (Image Splitting & Preserving Aspect Ratio)
Vision Encoder (SigLIP)
Modality Projection (Perceiver Resampler)
Language Model (Mistral) with LoRA

System Modules

Input Processor

Prepares images by preserving aspect ratio and optionally splitting them into sub-crops (4 crops + original) for high-resolution tasks

Model or implementation: N/A (Image processing logic)
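
The input processing step described above can be sketched as follows. This is a minimal illustration, not the released preprocessing code: the summary only states "4 crops + original", so the quadrant layout, the function names `split_boxes` and `fit_within`, and the rounding behaviour are all assumptions. The 980-pixel cap is taken from the Stage 2 maximum resolution reported below.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def fit_within(width: int, height: int, max_side: int) -> Tuple[int, int]:
    """Downscale so the longer side is at most max_side, keeping the aspect ratio.

    Images that already fit are returned unchanged (no upscaling).
    """
    scale = min(1.0, max_side / max(width, height))
    return max(1, round(width * scale)), max(1, round(height * scale))


def split_boxes(width: int, height: int) -> List[Box]:
    """Return four quadrant boxes followed by the full-image box.

    Assumption: the four crops are the image quadrants; the source does not
    specify the exact crop layout.
    """
    mid_x, mid_y = width // 2, height // 2
    quadrants = [
        (0, 0, mid_x, mid_y),           # top-left
        (mid_x, 0, width, mid_y),       # top-right
        (0, mid_y, mid_x, height),      # bottom-left
        (mid_x, mid_y, width, height),  # bottom-right
    ]
    return quadrants + [(0, 0, width, height)]


# A wide scanned page: resize first, then enumerate the five views.
resized = fit_within(1960, 1000, max_side=980)
views = split_boxes(*resized)
```

Each of the five boxes would then be encoded as its own image, so a text-heavy page is seen both as a whole and at roughly twice the effective resolution per crop.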

Vision Encoder

Encodes image patches into visual features

Model or implementation: SigLIP-SO400M

Modality Projector

Pools the variable-length sequence of image features into a fixed, shorter sequence of visual tokens

Model or implementation: Perceiver Resampler
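
To make the pooling idea concrete, here is a deliberately simplified sketch of latent-query cross-attention in plain Python. It is not the Idefics2 implementation: a real Perceiver Resampler uses learned query/key/value projections, multiple heads, and several layers, all of which are omitted here. The function name `pool_with_latents`, the toy width of 8, and the random values are assumptions made for illustration. The point it shows is that the output length is set by the number of latents (64), not by the number of input features (729).

```python
import math
import random
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pool_with_latents(features: List[Vector], latents: List[Vector]) -> List[Vector]:
    """Cross-attend each latent query over all image features.

    Returns one pooled vector per latent, so the output length equals
    len(latents) regardless of len(features).
    """
    dim = len(latents[0])
    pooled = []
    for query in latents:
        # Scaled dot-product score between this latent and every feature.
        scores = [
            sum(q * k for q, k in zip(query, feat)) / math.sqrt(dim)
            for feat in features
        ]
        weights = softmax(scores)
        # Attention-weighted average of the features.
        pooled.append([
            sum(w * feat[i] for w, feat in zip(weights, features))
            for i in range(dim)
        ])
    return pooled


rng = random.Random(0)
dim = 8  # toy width, far smaller than a real encoder
latents = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(64)]
features = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(729)]
visual_tokens = pool_with_latents(features, latents)
print(len(features), "->", len(visual_tokens))  # prints: 729 -> 64
```

In a trained model the latents are learned parameters, which is what lets the module decide which parts of the image each of the 64 output tokens should summarize.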

Language Model

Generates text response based on concatenated visual and text tokens

Model or implementation: Mistral-7B-v0.1
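
The concatenation step can be illustrated with a small sketch that flattens interleaved text and images into one sequence of slots. The `IMAGE` marker, the `build_sequence` helper, the placeholder naming, and the whitespace "tokenizer" are all hypothetical stand-ins; only the figure of 64 visual tokens per image comes from the summary.

```python
from typing import List

IMAGE = object()        # hypothetical marker meaning "an image goes here"
TOKENS_PER_IMAGE = 64   # pooled visual tokens per image, as stated in the summary


def build_sequence(items: List[object]) -> List[str]:
    """Flatten interleaved text and images into one slot sequence.

    Text is split on whitespace as a stand-in for a real tokenizer. Each
    image expands to TOKENS_PER_IMAGE placeholder slots, which a real model
    would fill with the pooled visual embeddings before running the
    language model over the whole sequence.
    """
    sequence: List[str] = []
    image_index = 0
    for item in items:
        if item is IMAGE:
            sequence.extend(
                f"<img{image_index}:{slot}>" for slot in range(TOKENS_PER_IMAGE)
            )
            image_index += 1
        elif isinstance(item, str):
            sequence.extend(item.split())
        else:
            raise TypeError(f"unsupported item: {item!r}")
    return sequence


sequence = build_sequence(["What is in", IMAGE, "?"])
print(len(sequence))  # prints: 68
```

The language model then treats the visual slots like any other positions in its context, which is the sense in which the design is "fully autoregressive" rather than cross-attention based.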

Novel Architectural Elements

Combination of a fully autoregressive architecture with LoRA-unfrozen backbones (shown in the ablations to outperform cross-attention in this setting)
Integration of aspect-ratio preserving inputs with learned Perceiver pooling to decouple image resolution from token count
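
A short worked calculation shows what decoupling resolution from token count buys. The constants 729, 64, and "4 crops + original" come from this summary; the helper names are illustrative, and text tokens are ignored for simplicity.

```python
UNPOOLED_TOKENS = 729  # visual tokens per image without pooling
POOLED_TOKENS = 64     # visual tokens per image after Perceiver pooling
SUB_IMAGES = 5         # 4 crops + the original image


def visual_tokens(num_images: int, per_image: int, split: bool = False) -> int:
    """Total visual tokens for num_images, optionally with image splitting."""
    views = SUB_IMAGES if split else 1
    return num_images * views * per_image


def images_that_fit(context_len: int, per_image: int, split: bool = False) -> int:
    """How many images fit in a context window if it held only visual tokens."""
    views = SUB_IMAGES if split else 1
    return context_len // (views * per_image)


print(visual_tokens(1, POOLED_TOKENS, split=True))    # prints: 320
print(visual_tokens(1, UNPOOLED_TOKENS, split=True))  # prints: 3645
print(images_that_fit(2048, POOLED_TOKENS))           # prints: 32
print(images_that_fit(2048, UNPOOLED_TOKENS))         # prints: 2
```

With pooling, a split image costs 320 tokens instead of 3645, and a 2048-token sequence (the Stage 1 length for OBELICS listed below) has room for 32 unsplit images rather than 2.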

Modeling

Base Model: SigLIP-SO400M (vision) + Mistral-7B-v0.1 (text)

Training Method: Multi-stage training (pre-training followed by instruction fine-tuning)

Objective Functions:

Purpose: Predict the next text token in the sequence.

Formally: Standard autoregressive language modeling loss (Cross-Entropy)
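
As a sketch of this objective, the function below computes mean cross-entropy over the positions whose target is a text token. Skipping positions that hold visual tokens is an assumption consistent with "predict the next text token"; the summary does not describe the masking scheme, and `next_token_loss` and its arguments are illustrative names.

```python
import math
from typing import Sequence


def next_token_loss(
    logits: Sequence[Sequence[float]],
    targets: Sequence[int],
    is_text: Sequence[bool],
) -> float:
    """Mean cross-entropy over positions whose target is a text token.

    logits[t] holds the vocabulary scores predicted at position t,
    targets[t] is the index of the true next token, and is_text[t] is
    False where the target is a visual slot (assumed to be excluded).
    """
    total, count = 0.0, 0
    for row, target, keep in zip(logits, targets, is_text):
        if not keep:
            continue
        # log-sum-exp for a numerically stable log-softmax.
        top = max(row)
        log_norm = top + math.log(sum(math.exp(v - top) for v in row))
        total += log_norm - row[target]
        count += 1
    if count == 0:
        raise ValueError("no text positions to score")
    return total / count


# Uniform scores over a 4-token vocabulary give a loss of log(4).
loss = next_token_loss([[0.0, 0.0, 0.0, 0.0]], targets=[2], is_text=[True])
```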

Adaptation: LoRA (Low-Rank Adaptation) applied to pre-trained backbones; full fine-tuning for connector/pooling layers
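
The LoRA idea can be sketched in a few lines: the pre-trained weight stays frozen and only a low-rank update is trained. This is a generic illustration of the technique, not the Idefics2 training code. The rank, the scaling factor `alpha`, and the 4096-wide layer are hypothetical values chosen for the example; the summary does not report them.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (m x k) and b (k x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(frozen: Matrix, down: Matrix, up: Matrix, alpha: float) -> Matrix:
    """Effective weight W + (alpha / r) * up @ down, leaving W itself untouched.

    frozen is d_out x d_in, down is r x d_in, up is d_out x r.
    """
    rank = len(down)
    scale = alpha / rank
    update = matmul(up, down)
    return [
        [w + scale * u for w, u in zip(w_row, u_row)]
        for w_row, u_row in zip(frozen, update)
    ]


def lora_param_count(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters added by one LoRA adapter."""
    return rank * (d_in + d_out)


# With the "up" matrix initialised to zero, the adapted layer starts out
# identical to the frozen one.
frozen = [[1.0, 2.0], [3.0, 4.0]]
start = lora_weight(frozen, down=[[1.0, 1.0]], up=[[0.0], [0.0]], alpha=2.0)

# Hypothetical sizes: one 4096 x 4096 projection adapted at rank 64.
added = lora_param_count(4096, 4096, rank=64)
print(added, 4096 * 4096)  # prints: 524288 16777216
```

Note that the roughly 15% trainable-parameter figure quoted below also includes the fully trained connector and pooling layers, so it cannot be derived from the adapter sizes alone.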

Trainable Parameters: Approximately 15% of total parameters (when using LoRA)

Training Data:

Pre-training Stage 1: OBELICS (interleaved docs), PMD/LAION-COCO (pairs). Max resolution 384px.
Pre-training Stage 2: Adds PDF datasets (OCR-IDL, PDFA, Rendered Text). Max resolution 980px.
Instruction Fine-tuning: Follows the two pre-training stages. The paper describes this stage mainly in terms of its data composition (visual instruction tuning data).

Key Hyperparameters:

pre_training_stage_1_batch_size: 2048
pre_training_stage_1_max_seq_len_obelics: 2048
pre_training_stage_1_max_seq_len_pairs: 1536
pre_training_stage_2_max_image_res: 980 pixels
pooling_visual_tokens: 64

Compute: Not reported in the paper

Comparison to Prior Work

vs. Flamingo: Idefics2 uses a fully autoregressive architecture with LoRA-unfrozen backbones, whereas Flamingo uses cross-attention with frozen backbones. The paper's ablations find the former superior when unfreezing is possible.
vs. LLaVA: Idefics2 uses a Perceiver Resampler for pooling (reducing token count) and handles aspect ratios via splitting/interpolation, whereas standard LLaVA often uses MLP projection (more tokens) and square resizing.
vs. DeepSeek-VL [not cited in paper]: DeepSeek-VL also uses SigLIP and autoregressive design but uses 576 visual tokens per image; Idefics2 uses only 64 tokens (via Perceiver), emphasizing efficiency.

Limitations

The authors acknowledge that EVA-CLIP-5B (a much larger vision encoder) performed similarly to SigLIP-SO400M, suggesting the larger encoder might be under-trained.
Training the fully autoregressive architecture with all parameters unfrozen was unstable and diverged; LoRA was required to stabilize it.
No gains observed when using more than 64 visual tokens per image (with Perceiver), potentially limited by training duration or data.

Reproducibility

Code: Publicly available at https://huggingface.co/collections/HuggingFaceM4/idefics2-661d1971b7c50831dd3ce0fe. The authors release the base, instructed, and chat versions of Idefics2, along with the OBELICS dataset and fine-tuning data. Training details for the final model (learning rates, GPU hours) are reported in less detail than the ablation setup.

📊 Experiments & Results

Evaluation Setup

Evaluation on downstream vision-language benchmarks, reported as 4-shot performance unless otherwise specified

Benchmarks:

VQAv2 (Visual Question Answering)
TextVQA (OCR / Text reading in images)
OKVQA (External Knowledge VQA)
COCO (Image Captioning)

Metrics:

Accuracy (VQAv2, TextVQA, OKVQA)
CIDEr (COCO)
ANLS (DocVQA - in specific sub-experiments)

Statistical methodology: Not explicitly reported in the paper
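
For readers unfamiliar with ANLS (Average Normalized Levenshtein Similarity), the metric listed above for DocVQA, the following is a minimal reference-style sketch. It follows the commonly used definition with a 0.5 threshold and case-insensitive comparison; the exact normalization used in the paper's evaluation is not stated in this summary, so those details are assumptions.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def anls(
    predictions: Sequence[str],
    references: Sequence[Sequence[str]],
    threshold: float = 0.5,
) -> float:
    """Average over questions of the best thresholded similarity to any answer."""
    scores = []
    for prediction, answers in zip(predictions, references):
        pred = prediction.strip().lower()
        best = 0.0
        for answer in answers:
            gold = answer.strip().lower()
            longest = max(len(pred), len(gold))
            distance = levenshtein(pred, gold) / longest if longest else 0.0
            # Answers that are too far off score zero instead of partial credit.
            if distance < threshold:
                best = max(best, 1.0 - distance)
        scores.append(best)
    return sum(scores) / len(scores)


score = anls(["Paris", "pariz", "london"], [["paris"], ["paris"], ["paris"]])
```

Unlike exact-match accuracy, this gives partial credit for small OCR slips ("pariz" scores 0.8 here) while still scoring unrelated answers as zero.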

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Backbone Ablations: Switching to stronger unimodal backbones significantly improves VLM performance.
Average (VQAv2, TextVQA, OKVQA, COCO)	Average Score	57.4	62.5	+5.1
Average (VQAv2, TextVQA, OKVQA, COCO)	Average Score	59.2	62.5	+3.3
Architecture Ablations: Comparison of Cross-Attention vs. Fully Autoregressive architectures under frozen and unfrozen (LoRA) conditions.
Average (VQAv2, TextVQA, OKVQA, COCO)	Average Score	46.6	53.6	+7.0
Average (VQAv2, TextVQA, OKVQA, COCO)	Average Score	54.2	59.5	+5.3
Average (VQAv2, TextVQA, OKVQA, COCO)	Average Score	46.6	59.5	+12.9
Pooling & Efficiency: Effect of using Learned Pooling (Perceiver) vs No Pooling.
Average (VQAv2, TextVQA, OKVQA, COCO)	Average Score	51.0	59.5	+8.5
Resolution Strategies: Impact of image splitting ('The Better Way') on text-heavy tasks.
TextVQA	Accuracy	66.5	73.0	+6.5

Main Takeaways

Fully autoregressive architectures are superior to cross-attention architectures, but only if the pre-trained backbones are allowed to adapt (e.g., via LoRA); otherwise, cross-attention is better.
Learned pooling (Perceiver Resampler) is highly effective, improving performance significantly while reducing the number of visual tokens by over 10x compared to unpooled sequences.
Preserving aspect ratio and splitting images into crops (sub-images) is critical for OCR and document understanding tasks, providing large gains on TextVQA without requiring a larger model.
Progress in VLMs is heavily driven by improvements in the unimodal base models (e.g., shifting from LLaMA to Mistral).

📚 Prerequisite Knowledge

Prerequisites

Transformer architecture (Attention mechanisms)
Vision Transformers (ViT)
Large Language Models (LLM)
Parameter-Efficient Fine-Tuning (PEFT/LoRA)

Key Terms

VLM: Vision-Language Model—a model capable of processing and generating text based on both visual and textual inputs

fully autoregressive architecture: A VLM design where visual tokens are concatenated directly to the text embedding sequence, and a single language model predicts the next token based on the entire history

cross-attention architecture: A VLM design where visual information is injected into the language model via interleaved cross-attention layers (text attends to image), rather than concatenation

LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes pre-trained weights and trains small rank-decomposition matrices

SigLIP: Sigmoid Loss for Language Image Pre-training—a variant of CLIP training that uses a sigmoid loss instead of softmax, often yielding better performance

Perceiver Resampler: A module that uses cross-attention with a fixed number of latent queries to pool a variable number of visual features into a fixed-length sequence

visual tokens: The vector representations of image patches or pooled image features that are processed by the language model

OCR: Optical Character Recognition—the conversion of images of typed, handwritten, or printed text into machine-encoded text