Syn-GRPO: Self-Evolving Data Synthesis for MLLM Perception Reasoning

📝 Paper Summary

Reinforcement Learning for MLLMs
Synthetic Data Generation

Syn-GRPO overcomes data quality bottlenecks in MLLM perception training by coupling an online image synthesis server with a GRPO workflow that rewards response diversity.

Core Problem

Existing RL methods for MLLM perception suffer from low data quality, where static training samples fail to elicit diverse responses, leading to entropy and diversity collapse during training.

Why it matters:

Low data diversity restricts the exploration scope of reinforcement learning, causing the model to converge prematurely to narrow solutions
Visual perception tasks have inherent verifiable labels but often lack the complexity needed to drive deep reasoning chains in MLLMs
Standard entropy regularization techniques (like clipping) mitigate symptoms but do not address the root cause of insufficient data variety

Concrete Example: In visual reasoning, a standard image might always yield simple, uniform descriptions from the model. Without variation, the RL agent quickly learns a fixed pattern (collapsing entropy) rather than exploring better reasoning paths, limiting performance gains.

Key Novelty

Self-Evolving Data Synthesis (Syn-GRPO)

Integrates an asynchronous data server that generates new training images on-the-fly by modifying backgrounds while preserving foreground objects (labels)
Introduces a diversity reward that encourages the MLLM to predict image descriptions that will yield diverse future responses, rather than just accurate ones
Uses a diversity smoothing mechanism to calibrate rewards against the model's evolving diversity baseline, preventing distribution drift during training

Architecture

The overall framework of Syn-GRPO, illustrating the interaction between the Data Server and the GRPO Workflow.

Evaluation Highlights

Outperforms Visual-RFT by 3.4 points of Acc@0.5 on RefCOCOg (REC task, 83.2 → 86.6) using Qwen2-VL-2B
Improves mAP by 3.9 points over Visual-RFT on the OVD task (COCO2017, 50.1 → 54.0) with Qwen2-VL-2B
Demonstrates sustained performance gains as dataset size increases, unlike baselines that plateau

Breakthrough Assessment

8/10

Significantly addresses the 'data wall' in RL for MLLMs by closing the loop between reasoning and data generation. The decoupling of synthesis and training is a strong engineering contribution.

⚙️ Technical Details

Problem Definition

Setting: Reinforcement learning for multimodal large language models (MLLMs) on visual perception tasks with verifiable labels

Inputs: Image-text pairs (q, image) where q is a query (e.g., 'find the dog')

Outputs: Reasoning chain, final answer (e.g., bounding box), new image description d, and predicted diversity score v

Pipeline Flow

Data Server: Synthesizes new images asynchronously
GRPO Workflow: MLLM predicts descriptions and diversity scores, optimizing via rewards
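
The decoupling between the two stages can be pictured as a producer-consumer pattern. The sketch below is only an illustration under stated assumptions: the queue-based design, the function names (`data_server`, `fake_synthesize`), and the string stand-ins for images are all hypothetical and are not taken from the paper or its code.

```python
import queue
import threading

def data_server(request_q, sample_q, synthesize):
    """Worker loop: turn (image, label, description) requests into new samples."""
    while True:
        item = request_q.get()
        if item is None:  # sentinel: shut the server down
            break
        image, label, description = item
        # The label is passed through unchanged because the foreground is preserved.
        sample_q.put((synthesize(image, description), label))

def fake_synthesize(image, description):
    # Hypothetical stand-in for the real outpainting call.
    return f"{image}+bg[{description}]"

request_q, sample_q = queue.Queue(), queue.Queue()
worker = threading.Thread(
    target=data_server, args=(request_q, sample_q, fake_synthesize)
)
worker.start()

# The training loop only enqueues a description predicted by the policy
# and continues; synthesis happens on the worker thread.
request_q.put(("img0", (10, 20, 50, 60), "a rainy street at night"))
request_q.put(None)
worker.join()

new_image, new_label = sample_q.get()
```

In this toy setup the training side never blocks on image generation; it picks up finished samples from `sample_q` whenever they are ready.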

System Modules

Data Server

Synthesize new training images while preserving labels

Model or implementation: Flux-fill-dev (Outpainting model) + RMA (Foreground segmentation)

MLLM Policy

Generate reasoning, answer, new image description, and diversity prediction

Model or implementation: Qwen2-VL-2B / Qwen2-VL-7B

Novel Architectural Elements

Asynchronous Data Server integration decoupled from the main RL loop via unified API
Augmented MLLM output head that predicts 'new image description' and 'diversity score' alongside standard reasoning

Modeling

Base Model: Qwen2-VL-2B and Qwen2-VL-7B

Training Method: Group Relative Policy Optimization (GRPO) with online data synthesis

Objective Functions:

Purpose: Optimize policy to maximize rewards.
Formally: GRPO objective maximizing E[min(ratio * A, clip(ratio) * A)] - beta * KL

Purpose: Reward accurate perception.
Formally: R_acc (IoU for REC, mAP for OVD)

Purpose: Reward diverse data generation capabilities.
Formally: R_diversity = 1 - |v_pred - V_tilde(q)|, where V_tilde is smoothed ground-truth diversity

Purpose: Enforce output format.
Formally: R_format (binary 1/0)
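
To make the GRPO objective more concrete, here is a minimal sketch of its two core pieces: group-relative advantages and the clipped surrogate term. It makes several assumptions not stated in this summary: the clip range `clip_eps=0.2`, the use of population standard deviation, and the small `eps` for numerical stability are illustrative defaults. The KL penalty is omitted for brevity.

```python
import statistics

def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group (no value function needed)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population std
    return [(r - mean) / (std + eps) for r in rewards]

def clipped_term(ratio, advantage, clip_eps=0.2):
    """Per-sample surrogate: min(ratio * A, clip(ratio) * A)."""
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)

# One group of G = 8 sampled responses with scalar rewards.
rewards = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0]
advantages = group_advantages(rewards)
```

Responses that score above the group mean receive positive advantages and the rest negative ones, so the policy is pushed toward the better responses in each group.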

Adaptation: Full fine-tuning

Training Data:

RefCOCOg (REC)
COCO2017 (OVD)
3D-FRONT (ISR)

Key Hyperparameters:

learning_rate: 1e-6 (2B), 5e-7 (7B)
batch_size: Not explicitly reported in the paper
group_size_G: 8
beta_smooth_weight: 0.9
kl_coefficient: 0.01
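
The diversity reward R_diversity = 1 - |v_pred - V_tilde(q)| and the smoothing weight of 0.9 can be illustrated with a short sketch. The summary does not give the exact smoothing formula, so the exponential moving average below is an assumption, as are the `DiversityTracker` class, the per-query dictionary, and the premise that diversity scores lie in [0, 1].

```python
class DiversityTracker:
    """Keeps a smoothed diversity estimate per query (assumed EMA form)."""

    def __init__(self, beta=0.9):
        self.beta = beta
        self.smoothed = {}

    def update(self, query_id, observed):
        # First observation initializes the estimate; later ones are blended in.
        prev = self.smoothed.get(query_id, observed)
        self.smoothed[query_id] = self.beta * prev + (1 - self.beta) * observed
        return self.smoothed[query_id]

def diversity_reward(v_pred, v_smoothed):
    """R_diversity = 1 - |v_pred - V_tilde(q)|."""
    return 1.0 - abs(v_pred - v_smoothed)

tracker = DiversityTracker(beta=0.9)
tracker.update("q1", 0.5)             # initial measured diversity
v_tilde = tracker.update("q1", 1.0)   # a later, higher measurement
reward = diversity_reward(0.5, v_tilde)
```

With a weight of 0.9 the target moves slowly, so one unusually diverse or uniform batch shifts the reward target only slightly.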

Compute: Experiments run on H800 GPUs

Comparison to Prior Work

vs. Visual-RFT: Syn-GRPO adds online data synthesis and diversity rewards, whereas Visual-RFT uses static data.
vs. DAPO: Syn-GRPO addresses entropy collapse via data quality (synthesis) rather than just loss constraints.
vs. R-Zero [not cited in paper]: Syn-GRPO generates images online for perception, whereas R-Zero focuses on text/math reasoning data synthesis.

Limitations

Relies on the quality of the external image generation model (Flux-fill-dev); artifacts could hurt training.
Computational overhead of online image generation, though mitigated by asynchronous server design.
Limited to visual perception tasks with verifiable spatial labels (bounding boxes); extension to abstract VQA is less direct.

Reproducibility

Code: https://github.com/hqhQAQ/Syn-GRPO

The paper uses open-source models (Qwen2-VL, Flux-fill-dev, RMA). Task-specific hyperparameters are listed in the appendix.

📊 Experiments & Results

Evaluation Setup

Visual perception tasks (REC, OVD, ISR) evaluated on standard benchmarks

Benchmarks:

RefCOCOg (Referring Expression Comprehension (REC))
COCO2017 (Open-Vocabulary Object Detection (OVD))
3D-FRONT (Indoor Scene Refinement (ISR))

Metrics:

Accuracy (Acc@0.5)
Mean Average Precision (mAP)
Aesthetic Score (for ISR)
Statistical methodology: Not explicitly reported in the paper
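
For readers unfamiliar with Acc@0.5, the sketch below shows one common way to compute it from bounding boxes. The `(x1, y1, x2, y2)` box format and the `>=` comparison at the threshold are assumptions; the paper's evaluation script may differ in such details.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def acc_at_threshold(preds, gts, threshold=0.5):
    """Fraction of predictions whose IoU with the ground truth meets the threshold."""
    hits = sum(iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)

preds = [(0, 0, 10, 10), (0, 0, 10, 10)]
gts = [(0, 0, 10, 10), (5, 0, 15, 10)]
score = acc_at_threshold(preds, gts)
```

Here the first prediction matches exactly (IoU 1.0) and the second overlaps by one third, so only one of the two counts as correct.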

Key Results

Main results comparing Syn-GRPO against SFT and Visual-RFT baselines across three tasks.
Benchmark	Metric	Baseline	This Paper	Δ
RefCOCOg (REC)	Acc@0.5	83.2	86.6	+3.4
COCO2017 (OVD)	mAP	50.1	54.0	+3.9
3D-FRONT (ISR)	Aesthetic Score	5.62	5.71	+0.09

Experiment Figures

Line charts showing the rapid decline (collapse) of Entropy and Diversity during standard GRPO training on visual tasks.

Bar chart comparing performance of Syn-GRPO vs Visual-RFT at different data scales (10%, 50%, 100%).

Main Takeaways

Syn-GRPO outperforms standard GRPO (Visual-RFT) on all three tested visual perception tasks (REC, OVD, ISR); no statistical significance testing is reported.
The method prevents diversity collapse: diversity metrics remain stable or improve, unlike baselines where they plummet.
Generated data becomes increasingly complex and diverse over training iterations, suggesting true self-evolution.
Scalability: The performance gap between Syn-GRPO and baselines widens as the amount of initial training data increases.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (PPO, GRPO)
Multimodal LLMs (Visual Perception)
Generative Image Models (Inpainting/Outpainting)

Key Terms

GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes rewards within a group of sampled outputs to estimate advantages without a value function

REC: Referring Expression Comprehension—a task to localize an object in an image described by a text expression

OVD: Open-Vocabulary Object Detection—detecting objects in images where categories are not limited to a fixed training set

ISR: Indoor Scene Refinement—a task to refine indoor scene aesthetics based on top-view renderings

Entropy Collapse: A phenomenon where the model's output distribution becomes overly deterministic (low entropy) too quickly during RL training, hindering exploration

Diversity Drift: The shift in the ground-truth diversity distribution as the model updates, causing predicted diversity scores to become inaccurate

Foreground Consistency: Preserving the visual appearance and position of key objects (foreground) during image synthesis so that original labels (bounding boxes) remain valid
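
A toy sketch may help show why foreground consistency keeps labels valid: if foreground pixels are copied from the original image and only background pixels come from the generator, a bounding box around the object still fits. This is an illustration only, using nested lists as images and a hypothetical `composite` helper; it does not describe how Flux-fill-dev or the segmentation model work internally.

```python
def composite(original, generated, fg_mask):
    """Keep original pixels where fg_mask is truthy, generated pixels elsewhere."""
    return [
        [o if m else g for o, g, m in zip(o_row, g_row, m_row)]
        for o_row, g_row, m_row in zip(original, generated, fg_mask)
    ]

original = [[1, 2], [3, 4]]    # toy 2x2 "image"
generated = [[9, 9], [9, 9]]   # new background from a generator
fg_mask = [[1, 0], [0, 1]]     # 1 marks foreground pixels

result = composite(original, generated, fg_mask)
```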