Enhancing Large Vision-Language Models with Ultra-Detailed Image Caption Generation

📝 Paper Summary

Data-centric AI, Vision-Language Pre-training, Synthetic Data Generation

A scalable pipeline generates ultra-detailed image captions by combining visual expert tools with LLM expansion, then refining them via human-corrected preference optimization and a critic-rewrite mechanism.

Core Problem

Existing image caption datasets for training Large Vision-Language Models (LVLMs) lack sufficient fine-grained details, limiting model capabilities in object attributes, relationships, and intricate visual reasoning.

Why it matters:

Captions in current datasets (COCO, LAION) are too brief, causing LVLMs to miss visual nuances and fail at complex reasoning tasks
High-quality, detailed captions are essential for modality alignment but are scarce and expensive to annotate manually at scale
Models trained on coarse data suffer from hallucinations and poor generalization in dense visual scenes

Concrete Example: For an image of a sculpture, standard models might output 'A blue metal sculpture in a plaza.' The proposed method generates a paragraph detailing the 'interwoven blue metal rings,' 'high gloss finish reflecting light,' 'square base with black railings,' and 'visitors in dark and light green coats.'

Key Novelty

UltraCaption Pipeline

Multi-stage generation: Uses visual expert tools (OCR, detection) to seed GPT-4o, then expands captions via LLM-driven prompts covering 8 descriptive dimensions
Post-processing refinement: Trains a captioner using DPO (Direct Preference Optimization) on human-corrected data to fix hallucinations
Fine-grained Critic: Decomposes captions into atomic sentences, critiques each using a learned model, and rewrites the final caption to maximize factual accuracy

Architecture

The complete two-stage pipeline: Pre-processing (Visual Tools -> GPT-4o -> LLM Expansion) and Post-processing (Human Correction -> DPO -> Critic-Rewrite).

Evaluation Highlights

+3.4% accuracy improvement on MMBench-CN when training LLaVA-1.5 with the proposed data compared to standard training
+13.1% accuracy gain on TextVQA (OCR-heavy task) for LLaVA-1.5, showing the benefit of incorporating specific OCR tools in the pipeline
Outperforms ShareGPT4V-trained models on 7 out of 9 benchmarks, including a +1.7-point improvement on the POPE object hallucination benchmark

Breakthrough Assessment

8/10

Offers a comprehensive, scalable solution to the data bottleneck in LVLMs. The combination of expert tools, LLM expansion, and DPO-based refinement is robust and yields significant empirical gains.

⚙️ Technical Details

Problem Definition

Setting: Generating high-quality, ultra-detailed text descriptions $y$ for images $x$ to serve as pre-training data for LVLMs

Inputs: Raw images from various datasets (COCO, SAM, LAION, etc.)

Outputs: Ultra-detailed captions covering objects, attributes, spatial relations, and OCR text

Pipeline Flow

Group: Pre-processing (Data Construction) -> Visual Extraction -> LLM Expansion -> Seed Generation
Group: Captioner Training -> Training proprietary captioner on seed data
Group: Post-processing (Refinement) -> Active Learning Filter -> Human Correction -> DPO -> Critic-Rewrite

System Modules

Visual Expert Tools (Pre-processing)

Extract raw visual facts to ground the generation

Model or implementation: RAM++ (tags), GroundingDINO (boxes), PaddleOCR (text)
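
One way to picture how these tool outputs ground the caption model is to serialize them into a text block that accompanies the image. The sketch below is an assumption about the prompt layout, not the paper's actual format. `Box`, `VisualFacts`, and `build_grounding_prompt` are hypothetical names, and the tool outputs are hard-coded rather than produced by RAM++, GroundingDINO, or PaddleOCR.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Box:
    label: str
    xyxy: Tuple[int, int, int, int]  # pixel coordinates: left, top, right, bottom


@dataclass
class VisualFacts:
    tags: List[str] = field(default_factory=list)       # e.g. from an image tagger
    boxes: List[Box] = field(default_factory=list)      # e.g. from an open-set detector
    ocr_lines: List[str] = field(default_factory=list)  # e.g. from an OCR engine


def build_grounding_prompt(facts: VisualFacts) -> str:
    """Serialize tool outputs into a text block for the caption model."""
    lines = ["Describe the image in detail. Use only facts consistent with:"]
    if facts.tags:
        lines.append("Tags: " + ", ".join(facts.tags))
    for box in facts.boxes:
        lines.append(f"Object: {box.label} at {list(box.xyxy)}")
    for text in facts.ocr_lines:
        lines.append(f'Visible text: "{text}"')
    return "\n".join(lines)


# Hard-coded example values (hypothetical, for illustration only).
facts = VisualFacts(
    tags=["sculpture", "plaza"],
    boxes=[Box("sculpture", (120, 40, 520, 460))],
    ocr_lines=["EXIT"],
)
prompt = build_grounding_prompt(facts)
print(prompt)
```

Keeping the facts in a typed structure before serializing them makes it easy to drop an unreliable tool (for example, OCR on images without text) without touching the prompt template.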

LLM Expander (Pre-processing)

Identify missing details across 8 dimensions (e.g., spatial layout, lighting) to prompt comprehensive descriptions

Model or implementation: Qwen2-7B-Instruct

Proprietary Captioner

Scale up caption generation without API costs

Model or implementation: Qwen2VL-7B

Sentence-Level Critic (Post-processing)

Verify individual facts in generated captions

Model or implementation: Qwen2VL-7B (Fine-tuned)

Rewriter (Post-processing)

Regenerate caption based on critiques to remove hallucinations

Model or implementation: LLM (implied Qwen2-7B or similar)
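
The critic and rewriter modules can be illustrated with a toy loop: split the caption into sentences, ask a critic about each one, and rebuild the caption from what survives. This is a simplified sketch under stated assumptions. The summary describes an LLM rewriter that regenerates the caption from the critiques, whereas this example only drops rejected sentences. `split_into_sentences`, `critic_rewrite`, and `toy_critic` are hypothetical names, and the keyword critic stands in for the fine-tuned vision-language critic.

```python
import re
from typing import Callable, List, Tuple


def split_into_sentences(caption: str) -> List[str]:
    """Naive split after '.', '!' or '?'; a stand-in for true atomic decomposition."""
    parts = re.split(r"(?<=[.!?])\s+", caption.strip())
    return [p for p in parts if p]


def critic_rewrite(
    caption: str,
    critic: Callable[[str], bool],
) -> Tuple[str, List[str]]:
    """Keep sentences the critic accepts; return the new caption and the rejected sentences."""
    kept: List[str] = []
    rejected: List[str] = []
    for sentence in split_into_sentences(caption):
        if critic(sentence):
            kept.append(sentence)
        else:
            rejected.append(sentence)
    return " ".join(kept), rejected


# Toy critic: accepts a sentence only if it mentions an object known to be visible.
# A real critic would be a model that looks at the image itself.
visible_objects = {"sculpture", "railings"}


def toy_critic(sentence: str) -> bool:
    return any(obj in sentence.lower() for obj in visible_objects)


caption = (
    "A blue sculpture stands in the plaza. "
    "Black railings surround it. "
    "A dog sleeps nearby."
)
rewritten, dropped = critic_rewrite(caption, toy_critic)
print(rewritten)
print(dropped)
```

Judging one sentence at a time keeps each verdict tied to a single claim, which is the motivation given for decomposing the caption instead of critiquing the whole paragraph.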

Novel Architectural Elements

Critic-Rewrite Pipeline: Decomposes captions into atomic sentences for individual verification before rewriting, rather than critiquing the whole paragraph at once
BCO-augmented DPO: Adds specific loss terms (BCO and normalized SFT) to standard DPO to prevent reward collapse and repetition in caption generation

Modeling

Base Model: Qwen2VL-7B (for Captioner and Critic)

Training Method: Supervised Fine-Tuning (SFT) followed by Direct Preference Optimization (DPO)

Objective Functions:

Purpose: Standard preference optimization.

Formally: $\mathcal{L}_{\text{DPO}} = -\log \sigma\left(\beta \log \frac{\pi_\theta}{\pi_{\text{ref}}} - \dots\right)$

Purpose: Prevent mode collapse/repetition during DPO.

Formally: $\mathcal{L}_{\text{SFT}} = -\log \pi_\theta(y_c \mid x)$

Purpose: Stabilize rewards by anchoring positive samples.

Formally: $\mathcal{L}_{\text{BCO}}$ involves log-sigmoid terms with a shift parameter $\delta$

Purpose: Combined loss.

Formally: $\mathcal{L} = \alpha_1 \mathcal{L}_{\text{DPO}} + \alpha_2 \mathcal{L}_{\text{SFT}} + \alpha_3 \mathcal{L}_{\text{BCO}}$
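
To make the weighted sum concrete, the sketch below evaluates the three terms for a single preference pair from sequence log-probabilities. It rests on several assumptions that the summary does not confirm: the DPO term uses the standard chosen-versus-rejected form, the BCO term uses the usual binary-classifier form with the shift `delta` subtracted from each implicit reward, and `num_tokens_c` is a hypothetical knob for length-normalizing the SFT term. The default `beta` and `alphas` are the values listed under Key Hyperparameters.

```python
import math
from typing import Dict, Tuple


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def combined_loss(
    logp_c: float,      # log pi_theta(y_c | x), chosen (human-corrected) caption
    logp_r: float,      # log pi_theta(y_r | x), rejected caption
    ref_logp_c: float,  # log pi_ref(y_c | x)
    ref_logp_r: float,  # log pi_ref(y_r | x)
    beta: float = 0.1,
    delta: float = 0.0,  # BCO shift; assumed here to be supplied by the caller
    alphas: Tuple[float, float, float] = (0.8, 1.0, 0.2),
    num_tokens_c: int = 1,  # hypothetical: set to len(y_c) for a normalized SFT term
) -> Dict[str, float]:
    # Implicit rewards: scaled log-ratio of policy to reference.
    reward_c = beta * (logp_c - ref_logp_c)
    reward_r = beta * (logp_r - ref_logp_r)

    l_dpo = -log_sigmoid(reward_c - reward_r)
    l_sft = -logp_c / num_tokens_c
    # Assumed BCO form: push chosen rewards above delta and rejected rewards below it.
    l_bco = -log_sigmoid(reward_c - delta) - log_sigmoid(-(reward_r - delta))

    a1, a2, a3 = alphas
    total = a1 * l_dpo + a2 * l_sft + a3 * l_bco
    return {"dpo": l_dpo, "sft": l_sft, "bco": l_bco, "total": total}


# When the policy equals the reference, both implicit rewards are zero.
losses = combined_loss(logp_c=-10.0, logp_r=-12.0, ref_logp_c=-10.0, ref_logp_r=-12.0)
print(losses)
```

With identical policy and reference, the DPO term equals log 2 and the BCO term equals 2 log 2, so only the SFT term carries signal about the chosen caption; this is one way to see why the SFT anchor matters early in training.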

Training Data:

320K high-quality seed captions for SFT
70K human-corrected preference pairs for DPO
120K atomic sentence critiques for Critic training

Key Hyperparameters:

learning_rate: 1e-5 (SFT), 5e-6 (DPO)
batch_size: 128 (SFT), 64 (DPO)
epochs: 1
image_resolution: 1024
kl_penalty_beta: 0.1
loss_weights: alpha1=0.8, alpha2=1.0, alpha3=0.2

Compute: 64 × Ascend 910B NPUs

Comparison to Prior Work

vs. ShareGPT4V: Incorporates specific visual experts (OCR, detection) and active learning human correction, resulting in lower hallucination rates
vs. RLAIF-V: Uses human correction rather than AI feedback for DPO to avoid inheriting AI biases in complex counting/spatial tasks
vs. InternVL2-8B-MPO [not cited in paper]: Both modify DPO loss to prevent degradation, but this paper specifically adds BCO and SFT loss to fix repetition issues in captioning

Limitations

Pipeline is complex and multi-stage, making it harder to deploy than a single end-to-end model
Currently limited to static images; video modality not yet supported
Relies on GPT-4o for initial seed data generation, which has associated costs
Relation scoring in benchmarks lags slightly behind GPT-4o, possibly due to perspective mismatches (subject vs. viewer)

Reproducibility

Code: https://github.com/yuzeng0-0/UltraCaption

Code and dataset are available at https://github.com/yuzeng0-0/UltraCaption. Seed data generation relies on the proprietary GPT-4o; the Qwen2VL-7B and LLaVA models used are open source.

📊 Experiments & Results

Evaluation Setup

Pre-training LVLMs (LLaVA-1.5, LLaVA-NeXT) on the generated dataset and evaluating them on downstream VQA and reasoning benchmarks

Benchmarks:

MME (Perception and Cognition Evaluation)
MMBench (MMB) (Multi-modal reasoning)
TextVQA (OCR-based Question Answering)
POPE (Object Hallucination Evaluation)
CompreCap (Image Captioning Benchmark)

Metrics:

Accuracy
Score (MME)
F1 Score (POPE)
Object/Pixel Coverage (CompreCap)
Statistical methodology: Not explicitly reported in the paper
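
For the accuracy and F1 metrics above, POPE-style evaluation reduces to scoring yes/no answers about whether an object is present. A minimal sketch, assuming answers are already normalized to the strings "yes" and "no" and that "yes" is the positive class; `yes_no_metrics` is a hypothetical helper, and the answers are made up for illustration.

```python
from typing import Dict, List


def yes_no_metrics(predictions: List[str], labels: List[str]) -> Dict[str, float]:
    """Accuracy, precision, recall, and F1 with 'yes' as the positive class."""
    if not predictions or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and equal in length")
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, l in pairs if p == "yes" and l == "yes")
    fp = sum(1 for p, l in pairs if p == "yes" and l == "no")
    fn = sum(1 for p, l in pairs if p == "no" and l == "yes")
    tn = sum(1 for p, l in pairs if p == "no" and l == "no")

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up answers: a "yes" for an absent object counts as a hallucination (false positive).
preds = ["yes", "yes", "no", "no", "yes"]
golds = ["yes", "no", "no", "yes", "yes"]
metrics = yes_no_metrics(preds, golds)
print(metrics)
```

Reporting precision alongside F1 is useful here because a model that answers "yes" to everything keeps recall high while hallucinating freely.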

Key Results

LLaVA-1.5 trained with UltraCaption data consistently outperforms the baseline LLaVA-1.5 and the competitive ShareGPT4V across multiple benchmarks.
Benchmark	Metric	Baseline	This Paper	Δ
MMBench-CN	Accuracy	31.1	37.9	+6.8
TextVQA	Accuracy	58.2	61.7	+3.5
POPE	Accuracy (Hallucination)	68.4	70.1	+1.7

Ablation studies on the caption generation pipeline show that post-processing (DPO + Critic) improves caption quality metrics.
Benchmark	Metric	Baseline	This Paper	Δ
CompreCap	Object Coverage (%)	71.97	75.96	+3.99
Manual Quality Analysis	Hallucination Rate (%)	52.5	73.0	+20.5
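
The Δ column is the "This Paper" value minus the "Baseline" value. A short check over the rows as listed in the table, using rounding to sidestep floating-point noise; the `rows` list simply restates the table and adds no new results.

```python
# (benchmark, baseline, this_paper, reported_delta) copied from the table rows.
rows = [
    ("MMBench-CN", 31.1, 37.9, 6.8),
    ("TextVQA", 58.2, 61.7, 3.5),
    ("POPE", 68.4, 70.1, 1.7),
    ("CompreCap", 71.97, 75.96, 3.99),
    ("Manual Quality Analysis", 52.5, 73.0, 20.5),
]

deltas = {}
for name, baseline, this_paper, reported in rows:
    delta = round(this_paper - baseline, 2)
    deltas[name] = delta
    status = "ok" if abs(delta - reported) < 1e-6 else "mismatch"
    print(f"{name}: {delta:+.2f} ({status})")
```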

Experiment Figures

Qualitative comparison of captions generated by GPT-4o, Qwen2-VL, and the proposed method (Ours), with a bar chart counting key details.

Main Takeaways

Integrating specific vision tools (OCR, Detection) before caption generation significantly boosts performance on fine-grained tasks like TextVQA.
Human-corrected DPO is superior to purely AI-based feedback for caption refinement, particularly for avoiding hallucinations in complex scenes.
The 'Atomic Sentence' critique strategy effectively isolates and fixes factual errors that might be missed when critiquing a full paragraph.
Scalability is achieved by training a smaller proprietary captioner (Qwen2VL-7B) on the high-quality seed data, removing dependency on GPT-4o for mass generation.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Vision-Language Models (e.g., LLaVA, CLIP)
Familiarity with Instruction Tuning and RLHF/DPO concepts
Knowledge of object detection and OCR tools

Key Terms

LVLM: Large Vision-Language Model—AI models that can process and reason about both images and text

DPO: Direct Preference Optimization—a method to align models to preferences (like human corrections) without training a separate reward model

Atomic Sentence: A short, independent sentence describing a single specific fact or object in an image, used for precise verification

Hallucination: When a model generates text describing objects or details that are not actually present in the image

OCR: Optical Character Recognition—technology to detect and convert text within images into machine-readable text

SFT: Supervised Fine-Tuning—training a model on labeled examples (image-caption pairs)

Grounding DINO: An open-set object detection model that can find arbitrary objects specified by text prompts

RAM++: Recognize Anything Model—a strong image tagging model used to extract object labels

KL penalty: A regularizer used in RL/DPO to prevent the trained model from deviating too drastically from the reference model