OmniCaptioner: One Captioner to Rule Them All

📝 Paper Summary

Image Captioning Multimodal Large Language Models (MLLMs) Visual Reasoning

OmniCaptioner is a unified framework trained on 21 million diverse image-text pairs that converts visual details into dense text, enabling standard LLMs to solve visual reasoning tasks without multimodal training.

Core Problem

Current MLLMs struggle with domain gaps (e.g., between natural images and charts/UIs) and often lag behind text-only LLMs in complex reasoning capabilities.

Why it matters:

Specialized captioners lack versatility, failing to handle the variety of visual inputs (charts, posters, geometry) needed for general-purpose assistants
State-of-the-art reasoning models (like DeepSeek-R1) are text-only; bridging the visual gap without expensive multimodal retraining allows leveraging their superior reasoning power
Inaccurate or sparse captions limit the performance of downstream tasks like text-to-image generation and supervised fine-tuning

Concrete Example: When asking a standard MLLM (like LLaVA-OneVision) to convert a chart to Markdown, it may hallucinate values due to poor domain alignment. Similarly, asking a geometry question requires reasoning that visual encoders often fail to capture, whereas a precise text description allows a math-optimized LLM to solve it.

Key Novelty

Unified Multi-Domain Captioning as a Universal Connector

Constructs a massive, diverse dataset (21M) covering natural images, visual text (posters, UIs), and structured visuals (charts, equations) using a two-stage generation pipeline (Seed + Extension)
Treats visual reasoning as a text-only task by converting images into dense, fine-grained textual descriptions (Pixel-to-Text Mapping), decoupling perception from reasoning
Demonstrates that feeding these detailed captions into strong reasoning LLMs (like DeepSeek-R1) achieves state-of-the-art visual reasoning without updating the LLM weights

Architecture

The dataset construction and training pipeline for OmniCaptioner

Evaluation Highlights

Achieves 40.5% on MathVerse with OmniCaptioner + DeepSeek-R1-Distill-Qwen-7B, significantly outperforming the multimodal Qwen2-VL-7B (31.9%) without visual encoder training
Surpasses LLaVA-OneVision-7B on captioning metrics (BLEU: 22.35 vs 14.18) and human preference (56.7% win rate on non-natural images)
Improves Text-to-Image generation on GenEval (Overall score 67.58) compared to using standard Qwen2-VL captions (65.27) or the base SANA model (64.61)

Breakthrough Assessment

8/10

Strong evidence that dense captioning can effectively bridge the gap between vision and reasoning LLMs, outperforming native MLLMs in specific reasoning tasks. The unified dataset approach is highly practical.

⚙️ Technical Details

Problem Definition

Setting: Generating fine-grained textual descriptions $T$ for a visual input $I$ across diverse domains $D$ (natural, structured, visual-text).

Inputs: Image $I$ (natural, chart, document, or UI screenshot)

Outputs: Detailed caption $T$ describing visual elements, spatial relationships, and semantic content (or code like LaTeX/Markdown)

Pipeline Flow

Data Construction: Seed Caption Generation (GPT-4o/Rule-based) -> Caption Extension (Qwen2.5/Qwen2-VL)
Unified Pretraining: Fine-tune Qwen2-VL-Instruct on diverse 21M dataset

System Modules

Seed Caption Generator (Data Construction)

Generate initial accurate descriptions or code representations

Model or implementation: GPT-4o (for natural/visual-text) or Rule-based code (for structured)

Caption Extender (Data Construction)

Refine and diversify captions (style, length, language, reasoning)

Model or implementation: Qwen2.5-32B (for text) and Qwen2-VL-76B (for CoT-style reasoning)

OmniCaptioner Model

Generate domain-adaptive captions for arbitrary visual inputs

Model or implementation: Qwen2-VL-Instruct (initialized weights)

Novel Architectural Elements

Unified pretraining paradigm utilizing distinct system prompts to differentiate between captioning styles (CoT, brief, detailed) and domains within a single model weights set

Modeling

Base Model: Qwen2-VL-Instruct (initialized from these weights)

Training Method: Supervised Fine-Tuning (SFT) on the constructed caption dataset

Training Data:

21M total image-caption pairs
Categories: Natural Images, Structured Images (Charts, Tables), Visual Text (Posters, UI), Video

Compute: Not reported in the paper

Comparison to Prior Work

vs. ShareGPT4V: OmniCaptioner covers non-natural domains (charts, UI, geometry) unlike ShareGPT4V's focus on natural images
vs. LLaVA-OneVision: OmniCaptioner provides a standalone captioning capability that empowers external LLMs, whereas LLaVA-OV is an end-to-end MLLM

Limitations

Dependency on closed-source models (GPT-4o) for high-quality seed caption generation
Inference latency increases when using the two-step approach (Caption Generation + LLM Reasoning) compared to end-to-end MLLMs
The paper does not report specific training compute resources (GPU hours)

Reproducibility

Code: https://github.com/Alpha-Innovator/OmniCaptioner

Code is publicly available on GitHub. Pretrained models are available on HuggingFace. The dataset construction pipeline uses closed-source GPT-4o, which may limit exact dataset reproduction without API access.

📊 Experiments & Results

Evaluation Setup

Evaluated on Image Captioning quality, Visual Reasoning (via caption-prompted LLMs), Text-to-Image Generation, and SFT efficiency.

Benchmarks:

MME (Multimodal evaluation)
MMMU (Multi-discipline multimodal reasoning)
MathVerse (Visual math problem solving)
MathVision (Visual math problem solving)
GenEval (Text-to-Image generation evaluation)

Metrics:

BLEU
CLIPScore
CAPTURE
Accuracy (for reasoning benchmarks)
GenEval Score
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Caption quality evaluation showing OmniCaptioner outperforms baselines on automated metrics.
Standard Caption Metrics	BLEU	14.18	22.35	+8.17
Standard Caption Metrics	CLIPScore	30.12	34.05	+3.93
Visual reasoning experiments where OmniCaptioner captions are fed to LLMs (DeepSeek-R1 Distilled variants), compared against native MLLMs.
MathVerse	Accuracy	31.9	40.5	+8.6
MMMU	Accuracy	64.5	64.6	+0.1
Text-to-Image generation improvements using OmniCaptioner captions to train SANA-1.0.
GenEval	Overall Score	64.61	67.58	+2.97
SFT Efficiency experiments comparing OmniCaptioner pretraining vs standard Qwen2-VL base on downstream SFT tasks.
MathVista	Score	56.1	57.4	+1.3

Experiment Figures

Qualitative comparison of captions and downstream applications

Main Takeaways

Detailed, pixel-grounded captions allow text-only LLMs (especially reasoning models like DeepSeek-R1) to perform visual reasoning at or above the level of specialized MLLMs without parameter updates
Unified pretraining on diverse domains (charts, UI, natural images) creates a more robust foundation for downstream SFT, achieving competitive results with significantly less SFT data (1.6M vs 3.2M)
Accurate captioning of non-natural images (structured/visual-text) significantly boosts text-to-image generation fidelity, reducing hallucinations in generated outputs

📚 Prerequisite Knowledge

Prerequisites

Understanding of Multimodal Large Language Models (MLLMs)
Familiarity with Image Captioning metrics (BLEU, CLIPScore)
Knowledge of Supervised Fine-Tuning (SFT) for LLMs

Key Terms

SFT: Supervised Fine-Tuning—training a pre-trained model on a labeled dataset to adapt it to specific tasks

CoT: Chain-of-Thought—a reasoning method where the model generates intermediate steps before the final answer

T2I: Text-to-Image—generation tasks where a model creates an image based on a textual description

BLEU: Bilingual Evaluation Understudy—a metric for evaluating the quality of text which counts matching n-grams between candidate and reference

CLIPScore: A metric that measures the semantic similarity between an image and a text caption using the CLIP model embeddings

DeepSeek-R1: A series of reasoning-oriented Large Language Models known for strong performance in logic and mathematics

Visual Text Images: Images containing significant textual information, such as posters, user interfaces (UI), and textbook pages

Structured Images: Images representing structured data, such as geometric diagrams, mathematical equations, tables, and charts