CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement Learning

📝 Paper Summary

Image Captioning
Reinforcement Learning with Verifiable Rewards (RLVR)
Vision-Language Pretraining

CapRL trains an image captioner with reinforcement learning, where the reward is how well a blind LLM can answer questions about the image based solely on the generated caption.

Core Problem

Supervised Fine-Tuning (SFT) for image captioning relies on expensive, non-scalable human data and causes models to memorize specific ground-truth phrasing rather than learning to generate diverse, dense descriptions.

Why it matters:

SFT models struggle to generate the wide variety of valid descriptions possible for a single image, limiting their generality.
Existing RL approaches use subjective rewards (like LLM-as-a-judge) that are prone to reward hacking (e.g., maximizing verbosity) or require expensive reference-based metrics that fail on long captions.
Dense, accurate captions are critical for pre-training Large Vision-Language Models (LVLMs) to align visual and linguistic domains effectively.

Concrete Example: A captioning model trained with SFT might output a short, memorized phrase like 'a dog on grass' for a complex scene. In contrast, CapRL forces the model to include details like 'red frisbee' because a verifier asks 'What color is the frisbee?', and the caption must contain the answer for the blind LLM to get it right.

Key Novelty

Captioning Reinforcement Learning (CapRL) with Perception-Reasoning Decoupled Reward

Defines caption quality by utility: a good caption contains enough information for a text-only LLM (blind to the image) to correctly answer visual questions about that image.
Uses a two-stage pipeline: (1) LVLM generates a caption, (2) Blind LLM answers multiple-choice questions using only that caption. The answer accuracy serves as the objective, verifiable reward for RL training.

Architecture

The CapRL training pipeline, illustrating the two-stage decoupled reward mechanism.

Evaluation Highlights

+6.8 points of accuracy on InfoVQA and +3.6 points on ChartVQA when pretraining with CapRL-1M compared to DenseFusion-1M.
Outperforms the ShareGPT4V-1M baseline by 1.6 points on MMStar and 1.8 points on MMBench, showing benefits for natural images.
Achieves caption quality comparable to the much larger Qwen2.5-VL-72B model within the Prism evaluation framework, exceeding the baseline by an average margin of 8.4%.

Breakthrough Assessment

8/10

Successfully applies RLVR to a subjective task by converting it into an objective proxy task (VQA utility). The resulting dataset (CapRL-5M) yields significant gains in downstream LVLM pretraining.

⚙️ Technical Details

Problem Definition

Setting: Open-ended image caption generation optimized via Reinforcement Learning

Inputs: Image I and instruction

Outputs: Natural language caption c

Pipeline Flow

Policy Model (LVLM) generates captions
Reward Computation (Blind LLM answers VQA questions based on caption)
Optimization (GRPO updates Policy Model)
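
The three steps above can be sketched as a rollout-and-score loop. This is a minimal illustration under stated assumptions, not the authors' code: `generate_caption`, `answer_from_caption`, the QA record layout, and the toy stand-ins are hypothetical placeholders for the LVLM policy and the blind LLM, and the GRPO update itself is omitted. The summary does not specify how many times the verifier is sampled per question, so this sketch asks each question once.

```python
from typing import Callable, List, Tuple

# (question, options, correct option letter): a hypothetical record layout
QA = Tuple[str, List[str], str]

def score_caption(caption: str, qa_pairs: List[QA],
                  answer_from_caption: Callable[[str, str, List[str]], str]) -> float:
    """Fraction of questions the blind verifier answers correctly from the caption alone."""
    if not qa_pairs:
        return 0.0
    correct = sum(
        answer_from_caption(caption, question, options) == gt
        for question, options, gt in qa_pairs
    )
    return correct / len(qa_pairs)

def rollout_rewards(image, qa_pairs, generate_caption, answer_from_caption, group_size):
    """Sample `group_size` captions for one image and score each one."""
    captions = [generate_caption(image) for _ in range(group_size)]
    rewards = [score_caption(c, qa_pairs, answer_from_caption) for c in captions]
    return captions, rewards  # these would feed the GRPO update (not shown)

# Toy stand-ins so the sketch runs without any model.
def toy_policy_factory():
    outputs = iter(["a dog on grass", "a dog chasing a red frisbee on grass"])
    return lambda image: next(outputs)

def toy_verifier(caption, question, options):
    # Picks the first option whose text appears in the caption, else "A".
    for letter, option in zip("ABCD", options):
        if option in caption:
            return letter
    return "A"

qa = [("What color is the frisbee?", ["blue", "red", "green", "yellow"], "B")]
captions, rewards = rollout_rewards(None, qa, toy_policy_factory(), toy_verifier, group_size=2)
print(rewards)  # [0.0, 1.0]
```

The verifier never receives the image, so the only way for the policy to raise its reward is to put the answer-bearing detail into the caption.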

System Modules

Policy Model

Generate dense captions for the input image

Model or implementation: Qwen2.5-VL-3B (fine-tuned)

Verifier (Blind LLM)

Answer multiple-choice questions about the image using ONLY the generated caption as context

Model or implementation: Qwen2.5-3B-Instruct (text-only)

Novel Architectural Elements

Decoupled two-stage reward pipeline: evaluating visual caption quality via non-visual question answering utility

Modeling

Base Model: Qwen2.5-VL-3B

Training Method: Group Relative Policy Optimization (GRPO)

Objective Functions:

Purpose: Calculate reward based on answer accuracy.

Formally: R(c) = (1/N) * Sum_{m=1..N} Ind(a_m == GT_m), where a_m is the blind LLM's m-th answer given caption c, GT_m is the corresponding ground-truth answer, and Ind is the indicator function for exact match.

Purpose: Optimize policy to maximize reward with stability.

Formally: Policy gradient update with KL-divergence penalty.
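
For the second objective, GRPO replaces a learned value baseline with statistics of the rewards inside one group of captions for the same image. The sketch below shows only that advantage normalization, following the commonly used GRPO formulation; whether CapRL uses exactly this variant is an assumption, and the function name, `eps`, and the choice of population standard deviation are illustrative. The clipped policy-ratio term and the KL penalty are omitted.

```python
import statistics
from typing import List

def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each caption's reward against the other captions in its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption, implementations vary
    # eps keeps the division defined when every caption in the group scores the same
    return [(r - mean) / (std + eps) for r in rewards]

# Four captions for one image, each scored by QA accuracy R(c)
group_rewards = [0.25, 0.5, 0.75, 0.5]
advantages = group_relative_advantages(group_rewards)
print(advantages)
```

Captions that beat the group mean get a positive advantage and are reinforced; a group where every caption scores the same yields all-zero advantages and therefore no gradient signal.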

Adaptation: Full fine-tuning (implied by context of pretraining captioner)

Training Data:

Curated VQA dataset for reward calculation: ~75k images with QA pairs
Images sourced from natural scenes, charts, and documents
Questions filtered to ensure they are strictly visually grounded (answerable only via image content)
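
One plausible way to implement the visual-grounding filter is to keep a question only if it is answered correctly with the image and incorrectly without it. The summary does not describe the actual filtering procedure, so this rule, the `answer` callable, and the record layout below are all assumptions made for illustration.

```python
from typing import Callable, List, Optional, Tuple

# (question, options, correct option letter): a hypothetical record layout
QA = Tuple[str, List[str], str]

def keep_visually_grounded(qa_pairs: List[QA],
                           answer: Callable[[str, List[str], Optional[object]], str],
                           image: object) -> List[QA]:
    """Keep questions answered correctly with the image but not without it."""
    kept = []
    for question, options, gt in qa_pairs:
        with_image = answer(question, options, image)
        without_image = answer(question, options, None)
        if with_image == gt and without_image != gt:
            kept.append((question, options, gt))
    return kept

# Toy answerer standing in for a real model (hypothetical behavior).
def toy_answer(question, options, image):
    if "capital of France" in question:
        return "A"  # world knowledge: answerable without looking at the image
    return "B" if image is not None else "A"

qa = [
    ("What is the capital of France?", ["Paris", "Rome"], "A"),
    ("What color is the frisbee?", ["blue", "red"], "B"),
]
kept = keep_visually_grounded(qa, toy_answer, image="img")
print([q for q, _, _ in kept])  # ['What color is the frisbee?']
```

A single image-free attempt can still be right by chance on a multiple-choice question, so a real filter would likely repeat the image-free check several times.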

Key Hyperparameters:

reward_sampling_N: Not explicitly reported in the paper
group_size_G: Not explicitly reported in the paper

Compute: Not reported in the paper

Comparison to Prior Work

vs. ShareGPT4V: CapRL uses RLVR to optimize for utility rather than imitating GPT-4V outputs via SFT.
vs. LLM-as-a-judge: CapRL avoids subjective scoring biases (verbosity) by using objective QA accuracy as the reward signal.

Limitations

Relies on the quality of the VQA question bank; poor questions lead to poor reward signals.
The reward proxy (QA accuracy) might not capture all aspects of caption quality, such as style or fluency, only informational utility.
Generating the QA pairs used for reward computation is computationally non-trivial (requires Qwen2.5-VL-72B).

Reproducibility

Code: https://github.com/InternLM/CapRL

The CapRL-5M dataset construction involves filtering 3M web images and combining them with ShareGPT4V-1M and DenseFusion-1M. The blind verifier used to compute the reward is Qwen2.5-3B-Instruct.

📊 Experiments & Results

Evaluation Setup

Multimodal Pretraining and Downstream Evaluation. Models are pretrained on CapRL datasets and then fine-tuned on Open-LLaVA-NeXT-1M.

Benchmarks:

InfoVQA (Document Visual Question Answering)
ChartVQA (Chart Visual Question Answering)
DocVQA (Document Understanding)
MMBench (General Multimodal Benchmark)
MMStar (General Multimodal Benchmark)

Metrics:

Accuracy
Prism Framework Score (Informativeness)

Statistical methodology: Not explicitly reported in the paper

Key Results

Comparison of Qwen2.5-3B + Qwen2.5-ViT pretrained on different 1M datasets shows CapRL-1M superiority.

Benchmark	Metric	Baseline	This Paper	Δ
InfoVQA	Accuracy	44.6	51.4	+6.8
ChartVQA	Accuracy	58.1	61.7	+3.6
DocVQA	Accuracy	66.5	69.2	+2.7
MMBench	Accuracy	65.3	67.1	+1.8
MMStar	Accuracy	39.8	41.4	+1.6

Prism evaluation demonstrates that CapRL produces more informative captions than baselines.

Benchmark	Metric	Baseline	This Paper	Δ
Prism Score	Average Score	Not reported in the paper	Not reported in the paper	+8.4
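
The Δ column for the accuracy rows is the absolute difference between the two scores, in percentage points. A short check, using only the numbers in the table (the variable names are illustrative):

```python
# (baseline, this paper, reported delta) copied from the accuracy rows above
rows = {
    "InfoVQA": (44.6, 51.4, 6.8),
    "ChartVQA": (58.1, 61.7, 3.6),
    "DocVQA": (66.5, 69.2, 2.7),
    "MMBench": (65.3, 67.1, 1.8),
    "MMStar": (39.8, 41.4, 1.6),
}

# Recompute each delta and compare it with the reported value
deltas = {name: round(ours - base, 1) for name, (base, ours, _) in rows.items()}
consistent = all(deltas[name] == reported for name, (_, _, reported) in rows.items())

print(deltas)
print(consistent)  # True
```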

Experiment Figures

Training reward curves comparing CapRL against Reward Model and LLM-as-a-judge approaches.

Qualitative comparison of captions generated by Qwen2.5-VL-3B with and without CapRL training.

Main Takeaways

CapRL significantly improves performance on document and chart domains (InfoVQA, ChartVQA), suggesting the RL objective forces the model to capture dense, structured text information.
Scaling the dataset from 1M to 5M (CapRL-5M) consistently improves performance across all 12 benchmarks evaluated.
Controlled ablation with fixed images confirms that the quality of captions generated by CapRL-3B is superior to ShareGPT4V and DenseFusion captions for downstream pretraining.
The method avoids reward hacking common in subjective RL tasks by anchoring the reward to a verifiable QA task.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (specifically PPO or GRPO)
Vision-Language Models (LVLMs)
Visual Question Answering (VQA)

Key Terms

RLVR: Reinforcement Learning with Verifiable Rewards—a paradigm where models are trained using objective, binary success signals (like math correctness) rather than subjective human preference.

GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing multiple outputs from the same input within a group, often removing the need for a separate value network.

LVLM: Large Vision-Language Model—a model that takes images and text as input and generates text.

SFT: Supervised Fine-Tuning—training a model to mimic specific ground-truth outputs provided in a dataset.

LLM-as-a-judge: Using a Large Language Model to evaluate the quality of text outputs, often used as a reward signal in RL but prone to biases.

VQA: Visual Question Answering—the task of answering questions about an image.

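Prism Framework: An evaluation framework that decouples perception from reasoning. A vision-language model describes the image, and a text-only LLM answers benchmark questions from that description, so the score reflects how informative the caption is.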

KL-divergence penalty: A regularizer used in RL to prevent the trained policy from deviating too far from the reference model's behavior.