PaliGemma 2: A Family of Versatile VLMs for Transfer

📝 Paper Summary

Vision-Language Models (VLMs) · Transfer Learning · Model Scaling Analysis

PaliGemma 2 integrates the SigLIP vision encoder with Gemma 2 language models across varying sizes and resolutions to provide a versatile, open family of VLMs optimized for transfer learning.

Core Problem

Prior VLM research often studies scaling factors like image resolution and model size in isolation or uses disparate architectures, making it difficult to optimize compute for specific downstream tasks.

Why it matters:

Different tasks require different resources; document understanding needs high resolution, while reasoning needs larger language models
Existing open models often lack the versatility to handle niche domains (e.g., molecular graphs, music scores) without specialized architectural changes
Practitioners need a controlled 'family' of models to trade off latency, compute, and performance effectively

Concrete Example: In optical music recognition, a standard 224px resolution fails to capture fine staff lines, requiring 896px for low error rates. Conversely, spatial reasoning tasks benefit significantly from a larger 10B model but show little gain from increased resolution.

Key Novelty

PaliGemma 2 (Broad Scaling Analysis & Architecture Upgrade)

Upgrades the language decoder to the Gemma 2 family (2B, 9B, 27B) while retaining the efficient SigLIP vision encoder, creating a consistent architecture across scales
Employs a multi-stage training recipe that progressively increases resolution (up to 896px) and data complexity to prepare models for fine-grained transfer tasks
Demonstrates that a general-purpose VLM interface can replace specialized architectures in niche domains like molecular recognition and radiography

Architecture

System architecture flow: Vision Encoder (SigLIP) -> Projection -> Concatenation with Text -> Language Model (Gemma 2) -> Autoregressive Output.

Evaluation Highlights

+0.65 average score improvement on 30+ academic benchmarks for PaliGemma 2 3B (224px) compared to the original PaliGemma 3B
State-of-the-art results on OCR benchmarks (ICDAR'15, Total-Text) and table recognition (PubTabNet, FinTabNet) by scaling resolution to 896px
State-of-the-art RadGraph F1 score on MIMIC-CXR radiology report generation, outperforming baselines like R2GenGPT

Breakthrough Assessment

8/10

Provides a highly controlled, comprehensive study of VLM scaling while delivering SOTA performance on diverse, difficult specialized tasks (OCR, Chem, Med) using a unified generalist architecture.

⚙️ Technical Details

Problem Definition

Setting: Vision-Language Transfer Learning via Fine-tuning

Inputs: Image(s) and text prompt

Outputs: Text sequence (caption, answer, bounding boxes, or structured text)
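Because every output is a text sequence, bounding boxes have to be decoded from tokens. The sketch below assumes the location-token convention used by the PaliGemma family (four `<locNNNN>` tokens per box in the order y_min, x_min, y_max, x_max, each a bin index in 0..1023, followed by a label); this summary does not spell that convention out, and `parse_boxes` is a hypothetical helper rather than part of any released API.

```python
import re

# Assumed convention: four <locNNNN> tokens per box, ordered
# y_min, x_min, y_max, x_max, each an integer bin in [0, 1023].
LOC_BOX = re.compile(
    r"<loc(\d{4})><loc(\d{4})><loc(\d{4})><loc(\d{4})>\s*([^;<]*)"
)


def parse_boxes(text, width, height, bins=1024):
    """Convert location tokens in `text` into pixel-space boxes."""
    boxes = []
    for y0, x0, y1, x1, label in LOC_BOX.findall(text):
        boxes.append({
            "label": label.strip(),
            "x_min": int(x0) / bins * width,
            "y_min": int(y0) / bins * height,
            "x_max": int(x1) / bins * width,
            "y_max": int(y1) / bins * height,
        })
    return boxes
```

Multiple detections in one string (separated here by ` ; `) are picked up one after another, since each match consumes exactly four location tokens plus the label that follows.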

Pipeline Flow

Vision Encoder (Processes image)
Projector (Maps vision features to LM space)
Language Model (Generates text response)
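The three steps above can be sketched with plain arrays to show how the shapes line up. This is a minimal illustration, not the released implementation: the dimensions are tiny placeholder values, random arrays stand in for the SigLIP output and the embedded prompt, and `project` and `build_lm_input` are hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only (assumptions, kept tiny so the sketch runs fast).
num_image_tokens, vision_dim = 16, 8   # e.g. a 4x4 patch grid
num_text_tokens, lm_dim = 5, 12


def project(vision_feats, w, b):
    """Linear projector: map vision features into the LM embedding space."""
    return vision_feats @ w + b


def build_lm_input(vision_feats, text_embeds, w, b):
    """Place projected image tokens in front of the text embeddings."""
    image_embeds = project(vision_feats, w, b)
    return np.concatenate([image_embeds, text_embeds], axis=0)


vision_feats = rng.normal(size=(num_image_tokens, vision_dim))  # stand-in for encoder output
text_embeds = rng.normal(size=(num_text_tokens, lm_dim))        # stand-in for embedded prompt
w = rng.normal(size=(vision_dim, lm_dim))
b = np.zeros(lm_dim)

lm_input = build_lm_input(vision_feats, text_embeds, w, b)
print(lm_input.shape)  # (21, 12): 16 image tokens followed by 5 text tokens
```

The language model then consumes `lm_input` as one sequence and generates the response autoregressively.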

System Modules

Vision Encoder (Input Processing)

Extract dense visual features from input images

Model or implementation: SigLIP-So400m (pretrained)

Projector (Input Processing)

Linearly map visual embeddings to the dimension of the language model

Model or implementation: Linear Projection

Language Model

Generate text response conditioned on visual and text inputs

Model or implementation: Gemma 2 (2B, 9B, or 27B)

Novel Architectural Elements

Integration of SigLIP encoder with Gemma 2 decoder family across a wide range of sizes (2B to 27B) and resolutions (224 to 896)
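Resolution matters for compute because it sets how many image tokens reach the language model. The sketch below assumes the encoder splits the image into non-overlapping 14x14 patches (the patch size is not stated in this summary) and that each patch becomes one token; `num_image_tokens` is a hypothetical helper.

```python
PATCH_SIZE = 14  # assumption: SigLIP-So400m with 14x14 patches


def num_image_tokens(resolution, patch_size=PATCH_SIZE):
    """Number of image tokens for a square image of the given side length."""
    side, remainder = divmod(resolution, patch_size)
    if remainder:
        raise ValueError("resolution must be a multiple of the patch size")
    return side * side


for res in (224, 448, 896):
    print(res, num_image_tokens(res))
# 224 256
# 448 1024
# 896 4096
```

Under this assumption each doubling of resolution quadruples the image-token count, which is why the high-resolution variants are markedly more expensive.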

Modeling

Base Model: Gemma 2 (2B, 9B, 27B) coupled with SigLIP-So400m

Training Method: 3-Stage Training: (1) Multimodal Pretraining, (2) High-Res Pretraining, (3) Transfer Fine-tuning

Objective Functions:

Purpose: Maximize the likelihood of the text token sequence conditioned on the image.

Formally: Autoregressive language modeling loss (cross-entropy).
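A minimal sketch of this loss, assuming next-token logits are already available. The `loss_mask` argument reflects an assumption borrowed from the original PaliGemma setup, where only the output text contributes to the loss and image and prompt positions are masked out; this summary does not state the masking scheme, and `suffix_cross_entropy` is a hypothetical name.

```python
import numpy as np


def suffix_cross_entropy(logits, targets, loss_mask):
    """Mean negative log-likelihood over positions where loss_mask == 1.

    logits:    (seq_len, vocab) unnormalized next-token scores
    targets:   (seq_len,) integer token ids to predict
    loss_mask: (seq_len,) 1 for output-text positions, 0 elsewhere
    """
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    token_nll = -log_probs[np.arange(len(targets)), targets]
    return float((token_nll * loss_mask).sum() / loss_mask.sum())
```

With uniform logits over a vocabulary of size V, this returns log(V) regardless of the targets, which is a convenient sanity check.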

Adaptation: Full fine-tuning (Stages 1 & 2); Task-specific fine-tuning (Stage 3)

Trainable Parameters: All parameters (Vision Encoder + Projector + LLM) are trainable in Stage 1

Training Data:

Stage 1: 1 billion examples (multimodal mixture, 224px)
Stage 2: 50M examples (448px) then 10M examples (896px)

Key Hyperparameters:

optimizer: Adam
learning_rate_base: 2e-5 (PaliGemma 1 baseline)
learning_rate_scaling: 0.5x for 3B, 0.25x for 10B/28B
batch_size: Not explicitly reported in the paper
resolutions: 224x224, 448x448, 896x896
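The learning-rate entries above combine into a per-size pretraining rate. A small sketch of that arithmetic follows; the dictionary layout and the name `pretraining_lr` are illustrative choices, not taken from the paper's code.

```python
BASE_LR = 2e-5  # PaliGemma 1 baseline learning rate

# Multipliers listed above for each PaliGemma 2 size.
LR_MULTIPLIER = {"3B": 0.5, "10B": 0.25, "28B": 0.25}


def pretraining_lr(model_size):
    """Scaled learning rate for a given PaliGemma 2 size."""
    return BASE_LR * LR_MULTIPLIER[model_size]


for size in ("3B", "10B", "28B"):
    print(size, pretraining_lr(size))
```

This gives 1e-5 for the 3B model and 5e-6 for the 10B and 28B models.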

Compute: Train on TPUv5e Pod slices (256-1024 chips). Stage 1 for 3B model takes ~3 days on 256 chips.

Comparison to Prior Work

vs. PaliGemma (v1): Incorporates Gemma 2, scales up to 28B parameters (using the Gemma 2 27B decoder), introduces intermediate resolutions
vs. LLaVA: Uses SigLIP vision encoder instead of CLIP; training data does not rely on large commercial VLMs such as GPT-4 (inferred from the paper's statement that none of its data sources uses a large commercial VLM; the paper does not make this comparison explicitly)
vs. HTS (OCR SOTA): Achieves better performance using a general-purpose architecture without task-specific OCR modules

Limitations

PaliGemma 2 28B shows diminishing returns compared to 10B, possibly due to Gemma 2 27B being trained from scratch rather than distilled.
Requires high resolution (896px) for best performance on document tasks, which increases compute cost significantly.
Increasing model size requires careful tuning of learning rates (smaller LR for larger models).

Reproducibility

Code: https://huggingface.co/spaces/big-vision/paligemma

Publicly available: Open weights for PaliGemma 2 models (3B, 10B, 28B) and code via HuggingFace/Big Vision. Missing: Exact batch sizes and specific dataset details for internal mixtures (though components like WebLI are known from prior work).

📊 Experiments & Results

Evaluation Setup

Transfer learning via full fine-tuning on downstream tasks.

Benchmarks:

30+ Academic Benchmarks (captioning, VQA, segmentation, and others)
ICDAR'15 / Total-Text (OCR / Text Detection and Recognition)
PubTabNet / FinTabNet (Table Structure Recognition)
MIMIC-CXR (Radiography Report Generation)

Metrics:

Average Score (across 30+ tasks)
F1 Score (OCR, RadGraph)
TEDS (Table Recognition)
Exact Match (Molecules)
Statistical methodology: Not explicitly reported in the paper

Key Results

General transfer performance comparisons show PaliGemma 2 improvements over its predecessor across a broad suite of tasks.
Benchmark	Metric	Baseline (PaliGemma 3B)	This Paper (PaliGemma 2 3B)	Δ
30+ Academic Benchmarks (Avg, 224px)	Average Score	Not reported in the paper	Not reported in the paper	+0.65
30+ Academic Benchmarks (Avg, 448px)	Average Score	Not reported in the paper	Not reported in the paper	+0.85

Experiment Figures

Relative improvement in transfer metrics when scaling either model size (y-axis) or resolution (x-axis) starting from a 3B 224px baseline.

Normalized task performance heatmaps as a function of transfer learning rate for different model sizes.

Main Takeaways

Scaling resolution (224px → 448px) costs roughly as much compute (4.6x FLOPs) as scaling model size (3B → 10B, 3.7x FLOPs), but the two benefit different tasks.
Document and text-heavy tasks (OCR, Tables, Music) benefit primarily from increased resolution (up to 896px).
Reasoning and VQA tasks benefit primarily from increased language model size (up to 10B/28B).
Larger models generally require lower learning rates for optimal transfer performance.

📚 Prerequisite Knowledge

Prerequisites

Transformer architecture (Encoders/Decoders)
Vision-Language Pretraining concepts
Basic understanding of OCR and document processing tasks

Key Terms

SigLIP: Sigmoid Loss for Language Image Pre-training—a contrastive vision encoder used to extract image features

VLM: Vision-Language Model—a model that processes both images and text to generate text outputs

FSDP: Fully Sharded Data Parallel—a memory-efficient training strategy that shards model parameters across devices

OCR: Optical Character Recognition—converting images of text into machine-encoded text

RadGraph F1: A metric for evaluating radiology reports by comparing the overlap of clinical entities and relations in the generated vs. reference text

TEDS: Tree Edit Distance Similarity—a metric for evaluating table recognition by comparing the tree structure of HTML outputs
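For reference, TEDS is usually written as follows, with $T_a$ and $T_b$ the HTML trees of the predicted and reference tables, $\mathrm{EditDist}$ the tree edit distance, and $|T|$ the number of nodes in $T$. This is the standard definition from the table-recognition literature, stated here as background rather than quoted from the paper.

```latex
\mathrm{TEDS}(T_a, T_b) = 1 - \frac{\mathrm{EditDist}(T_a, T_b)}{\max\left(|T_a|, |T_b|\right)}
```

A score of 1 means the two trees are identical; lower scores indicate more structural or content edits.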

Logits soft-capping: A technique that smoothly bounds the magnitude of logits (in Gemma 2, both the attention logits and the final output logits) to improve training stability
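A minimal sketch of soft-capping, assuming the common tanh form `cap * tanh(logits / cap)`. The cap value used in the usage note is illustrative only, and `soft_cap` is a hypothetical helper name.

```python
import numpy as np


def soft_cap(logits, cap):
    """Smoothly squash logits into the open interval (-cap, cap)."""
    return cap * np.tanh(np.asarray(logits, dtype=float) / cap)
```

For example, `soft_cap([-1000.0, 1.0, 1000.0], cap=50.0)` leaves the small value almost unchanged while pulling the extreme values to roughly ±50, so no single logit can dominate the softmax.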

SMILES: Simplified Molecular Input Line Entry System—a string notation for representing chemical structures

IoU: Intersection over Union—a metric measuring the overlap between a predicted bounding box and the ground truth box
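A short sketch of the IoU computation for axis-aligned boxes, assuming each box is given as `(x_min, y_min, x_max, y_max)`; the function name and box layout are illustrative choices.

```python
def iou(box_a, box_b):
    """Intersection over Union for boxes given as (x_min, y_min, x_max, y_max)."""
    # Corners of the overlapping rectangle (may be empty).
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

Two 2x2 boxes offset by one unit in each direction overlap in a 1x1 square, so their IoU is 1 / (4 + 4 - 1) = 1/7.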