ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

📝 Paper Summary

Vision-Language Alignment Data Engineering for Multi-Modal Models

ShareGPT4V introduces a large-scale dataset of highly descriptive captions generated by GPT4-Vision and a specialized captioner, demonstrating that high-quality textual descriptions significantly improve multi-modal model performance compared to standard brief captions.

Core Problem

Existing Large Multi-Modal Models (LMMs) suffer from sub-optimal modality alignment because mainstream image-text datasets use simplistic, brief captions that lack fine-grained semantics and world knowledge.

Why it matters:

Vision is inherently rich in information, but standard short captions (like COCO) reduce this richness to simple object lists, losing spatial, aesthetic, and attribute details.
Brief captions constrain the model's ability to align visual features with complex language understanding, limiting performance on tasks requiring detailed reasoning or world knowledge.
Prior data enhancement methods (like LaCLIP) rely on text-only LLMs to hallucinate details from short captions rather than 'seeing' the image, leading to inaccuracies.

Concrete Example: A standard dataset might describe the Eiffel Tower simply as 'a tall iron tower' or a picture of Einstein as 'an old man.' ShareGPT4V, by contrast, includes the specific name, location, historical context, and aesthetic qualities, enabling the model to learn specific world knowledge.

Key Novelty

ShareGPT4V Dataset and Captioner

Uses GPT4-Vision with data-specific prompts to generate 100K highly descriptive captions covering world knowledge, spatial relations, and aesthetics.
Trains a specialized 'Share-Captioner' on these 100K high-quality captions to efficiently scale up caption generation to 1.2 million images.
Demonstrates that substituting even a small fraction of SFT data with these detailed captions yields significant performance gains across benchmarks.

Architecture

ShareGPT4V-7B uses the same architecture as LLaVA-1.5: a CLIP-Large vision encoder, a two-layer MLP projector, and a Vicuna-v1.5-7B language model.

Evaluation Highlights

Improves the MME perception score by 36.1 points over LLaVA-1.5-13B (1531.3 to 1567.4), despite using a smaller 7B model.
Achieves 68.8% accuracy on MMBench, surpassing the second-best model by 1.1 percentage points.
Surpasses Qwen-VL-Chat-7B (trained on 1.4 billion samples) by 95.6 points on the MME benchmark total score.

Breakthrough Assessment

9/10

Significantly advances LMM performance purely through data quality rather than architectural changes. The release of 1.2M high-quality captions addresses a major bottleneck in vision-language alignment.

⚙️ Technical Details

Problem Definition

Setting: Pre-training and Supervised Fine-Tuning (SFT) of Large Multi-Modal Models

Inputs: Images and corresponding text prompts/instructions

Outputs: Textual responses (captions, answers to questions)

Pipeline Flow

Data Generation Phase (GPT4-Vision & Share-Captioner)
Pre-training Phase (ShareGPT4V-PT)
SFT Phase (ShareGPT4V-SFT)

System Modules

Vision Encoder (Input Processing)

Extract visual features from input images

Model or implementation: CLIP-Large (336x336 resolution, patch size 14)
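
As a quick illustration of what the 336x336 resolution and patch size 14 imply, the sketch below counts the patch tokens a ViT-style encoder produces per image. The function name is hypothetical, and the count assumes non-overlapping square patches and excludes any class token.

```python
def num_patches(image_size: int, patch_size: int) -> int:
    """Count non-overlapping square patches cut from a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


# 336 / 14 = 24 patches per side, so 24 * 24 = 576 patch tokens per image.
print(num_patches(336, 14))
```

Each of these patch tokens is what the projector later maps into the language model's embedding space.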

Projector (Input Processing)

Map visual tokens to language embedding space

Model or implementation: Two-layer MLP
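
To give a sense of how small the projector is relative to the 7B language model, here is a rough parameter count for a two-layer MLP. The widths are assumptions not stated in this summary: 1024 for the CLIP-Large feature width, 4096 for the language model's hidden size, and a hidden layer as wide as the language model. The helper names are hypothetical.

```python
def linear_params(d_in: int, d_out: int, bias: bool = True) -> int:
    """Parameters in one fully connected layer (weights plus optional bias)."""
    return d_in * d_out + (d_out if bias else 0)


def mlp_projector_params(d_vision: int, d_hidden: int, d_llm: int) -> int:
    """Parameters in a two-layer MLP: d_vision -> d_hidden -> d_llm."""
    return linear_params(d_vision, d_hidden) + linear_params(d_hidden, d_llm)


VISION_DIM = 1024  # assumed CLIP-Large feature width
LLM_DIM = 4096     # assumed hidden size of a 7B LLaMA-family model

total = mlp_projector_params(VISION_DIM, LLM_DIM, LLM_DIM)
print(f"{total:,} projector parameters")  # 20,979,712 under these assumptions
```

Under these assumed widths the projector has roughly 21M parameters, a tiny fraction of the language model.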

Large Language Model

Generate text response based on visual and text inputs

Model or implementation: Vicuna-v1.5-7B (based on LLaMA2)

Modeling

Base Model: ShareGPT4V-7B (based on LLaVA-1.5 architecture using Vicuna-v1.5-7B)

Training Method: Supervised Fine-Tuning (SFT) and Pre-training with specialized data

Trainable Parameters: Pre-training: Vision Encoder, Projector, LLM. SFT: Projector, LLM (Vision Encoder frozen).
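
The per-stage freezing scheme described above can be written down as a small configuration. This is a minimal sketch with hypothetical names (`StageConfig`, `MODULES`), not the authors' training code.

```python
from dataclasses import dataclass

# The three modules named in this summary.
MODULES = ("vision_encoder", "projector", "llm")


@dataclass(frozen=True)
class StageConfig:
    name: str
    trainable: frozenset

    def frozen_modules(self) -> tuple:
        """Modules that receive no gradient updates in this stage."""
        return tuple(m for m in MODULES if m not in self.trainable)


# Pre-training updates all three modules; SFT keeps the vision encoder frozen.
PRETRAIN = StageConfig("pre-training", frozenset(MODULES))
SFT = StageConfig("sft", frozenset({"projector", "llm"}))

for stage in (PRETRAIN, SFT):
    print(stage.name, "| trainable:", sorted(stage.trainable),
          "| frozen:", stage.frozen_modules())
```

In a real training script, the frozen set would typically be applied by disabling gradients on the corresponding parameters.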

Training Data:

Pre-training: 1.2M captions (ShareGPT4V-PT) generated by Share-Captioner.
SFT: 665K mixture, where 23K detailed descriptions are replaced by ShareGPT4V captions.
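
The SFT substitution can be illustrated with a short sketch: it computes the replaced share of the mixture from the figures above and shows one way to swap captions by record id. The record layout, the `substitute_captions` helper, and the toy records are hypothetical and only stand in for the real data format.

```python
def substitute_captions(mixture, new_captions):
    """Return a copy of `mixture` with captions swapped for ids in `new_captions`."""
    replaced = 0
    out = []
    for rec in mixture:
        if rec["id"] in new_captions:
            rec = {**rec, "text": new_captions[rec["id"]]}  # copy, do not mutate
            replaced += 1
        out.append(rec)
    return out, replaced


# Share of the SFT mixture that is replaced, using the summary's figures.
fraction = 23_000 / 665_000
print(f"{fraction:.1%} of SFT samples replaced")  # 3.5%

# Toy records (placeholders, not real dataset entries).
toy_mixture = [
    {"id": "img-1", "task": "detail", "text": "<brief description>"},
    {"id": "img-2", "task": "vqa", "text": "<question and answer>"},
]
toy_new = {"img-1": "<detailed ShareGPT4V-style caption>"}

updated, n = substitute_captions(toy_mixture, toy_new)
print(n, updated[0]["text"])
```

Only the detailed-description records change; the rest of the mixture is passed through untouched.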

Key Hyperparameters:

pre_training_learning_rate: 2e-5
pre_training_batch_size: 256
pre_training_steps: 4700
sft_learning_rate: 2e-5
sft_batch_size: 128
sft_steps: 5200
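
As a sanity check on the hyperparameters above, multiplying steps by batch size gives the number of samples seen in each stage. The sketch assumes the listed batch sizes are global batch sizes and that the dataset sizes are exactly 1.2M and 665K; the function name is hypothetical.

```python
def samples_seen(steps: int, batch_size: int) -> int:
    """Total training samples consumed over `steps` optimizer steps."""
    return steps * batch_size


stages = {
    # stage: (steps, batch size, dataset size from this summary)
    "pre-training": (4700, 256, 1_200_000),
    "sft": (5200, 128, 665_000),
}

for name, (steps, batch, dataset) in stages.items():
    seen = samples_seen(steps, batch)
    print(f"{name}: {seen:,} samples seen, about {seen / dataset:.2f} epochs")
```

Both stages come out at roughly one pass over their respective datasets, which is consistent with the reported data sizes.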

Compute: Caption generation required ~44 A100 GPU days.
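
For a rough sense of per-caption cost, the sketch below converts the GPU-day figure into GPU-seconds per caption. It assumes the ~44 A100 GPU days cover the 1.2M Share-Captioner captions, which this summary does not state explicitly.

```python
GPU_DAYS = 44            # approximate figure from the summary
NUM_CAPTIONS = 1_200_000  # assumed to be the captions this compute produced

gpu_seconds = GPU_DAYS * 24 * 3600
seconds_per_caption = gpu_seconds / NUM_CAPTIONS
print(f"about {seconds_per_caption:.2f} GPU-seconds per caption")
```

Under that assumption, each caption costs on the order of a few GPU-seconds.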

Comparison to Prior Work

vs. LLaVA-1.5: ShareGPT4V uses identical architecture but significantly better data (highly descriptive captions) and fine-tunes the vision encoder during pre-training.
vs. Qwen-VL-Chat: ShareGPT4V achieves better performance with far less data (1.2M vs 1.4B samples) at a comparable 7B model scale.
vs. LLaVA (original): LLaVA uses text-only GPT-4 to hallucinate details from bounding boxes; ShareGPT4V uses GPT4-Vision to see the actual image.

Limitations

The method relies on GPT4-Vision for the initial seed data, which is a closed-source commercial model.
The Share-Captioner's performance is upper-bounded by the quality of the 100K GPT4-Vision captions it was trained on.
Experiments focus primarily on the 7B model scale; scaling behavior for larger models is not explored in depth.

Reproducibility

Code: https://ShareGPT4V.github.io

Publicly available via https://ShareGPT4V.github.io: the ShareGPT4V dataset (100K GPT4-Vision captions and 1.2M Share-Captioner captions), the Share-Captioner model, the ShareGPT4V-7B model weights, and the code.

📊 Experiments & Results

Evaluation Setup

Evaluation across 11 multi-modal benchmarks covering VQA, reasoning, and detailed description.

Benchmarks:

MME (Perception and Cognition evaluation)
MMBench (Vision-related reasoning and perception)
SEED (Image) (Generative evaluation across 9 dimensions)
LLaVA (In-the-wild) (Conversation, reasoning, and description)

Metrics:

Accuracy (%)
Score (Total/Perception/Cognition)
Statistical methodology: Not explicitly reported in the paper

Key Results

ShareGPT4V-7B consistently outperforms baselines on comprehensive LMM benchmarks like MME and MMBench.

Benchmark	Metric	Baseline	This Paper	Δ
MME	Total Score	1848.3	1943.8	+95.5
MMBench	Accuracy	67.7	68.8	+1.1
SEED (Image)	Score	68.2	69.7	+1.5

Ablation studies confirm the impact of substituting SFT data with high-quality captions.

Benchmark	Metric	Baseline	This Paper	Δ
MME (Perception)	Score	1531.3	1567.4	+36.1
MME	Score gain	Not reported in the paper	Not reported in the paper	+222.8
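
The Δ column can be recomputed from the baseline and reported scores for the rows that list both. The sketch below uses `Decimal` to avoid floating-point rounding noise; the row labels and the `delta` helper are illustrative.

```python
from decimal import Decimal

# (label, baseline, this paper, reported delta), copied from the results rows.
ROWS = [
    ("MME total", "1848.3", "1943.8", "95.5"),
    ("MMBench accuracy", "67.7", "68.8", "1.1"),
    ("SEED (Image)", "68.2", "69.7", "1.5"),
    ("MME perception", "1531.3", "1567.4", "36.1"),
]


def delta(baseline: str, ours: str) -> Decimal:
    """Exact decimal difference between the reported score and the baseline."""
    return Decimal(ours) - Decimal(baseline)


for label, base, ours, reported in ROWS:
    d = delta(base, ours)
    print(f"{label}: {d:+} (matches reported: {d == Decimal(reported)})")
```

The +222.8 row has no baseline or final score listed, so it cannot be checked this way.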

Experiment Figures

Performance gains from replacing SFT data with ShareGPT4V captions across multiple models (LLaVA-7B, LLaVA-1.5-13B, Qwen-VL-Chat)

Qualitative comparison of captions generated by COCO, BLIP, LLaVA-1.5, Share-Captioner, and GPT4-Vision

Main Takeaways

High-quality captions are critical: Replacing just a small fraction (3.5%) of SFT data with detailed captions leads to significant performance gains.
Modality alignment requires density: Brief captions in standard datasets lead to sub-optimal alignment; dense captions allow the model to learn fine-grained visual features.
Efficiency of data scale: ShareGPT4V-7B outperforms models trained on orders of magnitude more data (1.4B samples vs 1.2M samples) by focusing on data quality.
Vision encoder fine-tuning: Unfreezing the vision encoder during pre-training is beneficial when using high-quality captions, in contrast to the standard practice of keeping it frozen with lower-quality data.
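
To make the "orders of magnitude" data-scale claim concrete, this short calculation compares the two sample counts quoted above. It treats the counts as directly comparable, which is a simplification, since they come from different training recipes.

```python
import math

QWEN_VL_SAMPLES = 1_400_000_000  # 1.4B samples, as quoted in this summary
SHAREGPT4V_SAMPLES = 1_200_000   # 1.2M captions, as quoted in this summary

ratio = QWEN_VL_SAMPLES / SHAREGPT4V_SAMPLES
orders = math.log10(ratio)
print(f"about {ratio:,.0f}x more samples, roughly {orders:.1f} orders of magnitude")
```

The gap is a little over a thousandfold, that is, about three orders of magnitude.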

📚 Prerequisite Knowledge

Prerequisites

Transformer architecture
Vision-Language Pre-training (CLIP)
Large Language Models (LLMs) and instruction tuning
Supervised Fine-Tuning (SFT)

Key Terms

LMM: Large Multi-Modal Model—an AI model capable of processing and generating content across multiple modalities, typically text and images.

SFT: Supervised Fine-Tuning—the phase where a pre-trained model is trained on labeled instruction-following data to improve its ability to perform specific tasks.

GPT4-Vision: A proprietary multimodal model from OpenAI capable of understanding and describing images with high detail.

Modality Alignment: The process of training a model so that representations from different modalities (e.g., image and text) correspond correctly to each other.

CLIP: Contrastive Language-Image Pre-training—a model trained to predict which caption goes with which image, used here as the vision encoder.

Projector: A neural network component (often an MLP) that maps visual features from the vision encoder into the embedding space of the language model.

Hallucination: When a model generates plausible-sounding but factually incorrect information not present in the source input.

Share-Captioner: The specific captioning model developed in this paper, trained on GPT4-Vision outputs to generate detailed captions at scale.