MindGYM: Enhancing Vision-Language Models via Synthetic Self-Challenging Questions

📝 Paper Summary

Synthetic Data Generation, Reasoning in Large Language Models, Vision-Language Models (VLMs)

MindGYM is a framework that enables models to synthesize their own high-quality training data by explicitly injecting structured cognitive patterns into the generation process.

Core Problem

Existing instruction datasets are labor-intensive to scale, while current synthetic methods often produce shallow or logically inconsistent data because they lack structured cognitive guidance.

Why it matters:

Manual curation of datasets like OK-VQA is expensive and hard to scale up
Self-supervised methods (e.g., MMInstruct) suffer from limited generalization and fail to produce cognitively diverse data
Reinforcement learning methods for reasoning (e.g., RL4F) incur prohibitive computational costs

Concrete Example: When asked a complex question requiring multi-step deduction, a standard model might provide a superficial answer. Standard synthetic methods often generate simple single-hop QAs that fail to teach the model how to break down the problem, whereas MindGYM forces the synthesis of an explicit 'thinking trace' alongside the answer.

Key Novelty

MindGYM (Thinking-Centric Data Synthesis)

Injects specific 'thinking priors' (breadth, depth, progression) into the prompt design to guide data generation toward cognitively rich samples
Uses a multi-stage synthesis process: generating background context → seed single-hop questions → challenging multi-hop questions via composition operators
Employs a structured learning pathway: training evolves from guided answering (with thinking traces) to autonomous solving (internalized reasoning)

Architecture

The MindGYM framework pipeline showing the transition from cognitive topics to final multi-hop QA data.

Evaluation Highlights

+16% improvement on MathVision-Mini for Qwen2.5-VL-7B using only 400 synthetic samples
Synthetic data achieves 16.7% higher average quality and 67.91% lower quality variance compared to baseline sources on Qwen2.5-VL-32B
Outperforms Chain-of-Thought (CoT) and Tree-of-Thoughts (ToT) baselines on six reasoning benchmarks

Breakthrough Assessment

8/10

Strong empirical results with very little data (400 samples) and a principled approach to reducing data quality variance, addressing a key bottleneck in synthetic data scaling.

⚙️ Technical Details

Problem Definition

Setting: Self-synthesis of instruction tuning data to improve reasoning capabilities in foundation models

Inputs: Meta-topics and cognitive prompts (no external labeled data)

Outputs: Synthesized dataset {(Question, Answer, Thinking_Trace)}
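A minimal sketch of how one synthesized record might be represented, assuming a flat three-field schema. The class name `SynthSample`, the field names, and the JSON-lines serialization are illustrative assumptions, not taken from the paper or its released code.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class SynthSample:
    """One synthesized training record (hypothetical schema)."""
    question: str
    answer: str
    thinking_trace: str

    def to_json(self) -> str:
        # Serialize to a single JSON line, e.g. for a JSONL training file.
        return json.dumps(asdict(self), ensure_ascii=False)


# Toy record, written by hand purely for illustration.
sample = SynthSample(
    question="Which is taller, A or B?",
    answer="A",
    thinking_trace="A is 3 m tall; B is 2 m tall; 3 > 2, so A is taller.",
)
line = sample.to_json()
```

Keeping the trace as a separate field, rather than concatenating it into the answer, lets the same record be formatted with or without the trace at training time.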

Pipeline Flow

Stage 1: Background Context Generation
Stage 2: Seed Single-Hop Question Synthesis
Stage 3: Challenging Multi-Hop QA Synthesis
Stage 4: Structured Extraction
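The four stages can be sketched as a chain of calls to a single generator. This is a rough illustration under stated assumptions: the `Generator` interface, the bracketed prompt tags, the `key=value;...` output format, and `fake_model` are all hypothetical stand-ins, not the paper's prompts or code.

```python
from typing import Callable, Dict, List

Generator = Callable[[str], str]  # prompt in, text out (hypothetical interface)


def run_pipeline(generate: Generator, meta_topic: str, cognitive_prompt: str,
                 n_seeds: int = 2) -> Dict[str, object]:
    # Stage 1: background context from a meta-topic and a cognitive prompt.
    context = generate(f"[context] topic={meta_topic}; style={cognitive_prompt}")
    # Stage 2: atomic single-hop seed questions grounded in the context.
    seeds: List[str] = [
        generate(f"[seed {i}] context={context}") for i in range(n_seeds)
    ]
    # Stage 3: compose the seeds into one multi-hop question with a trace.
    raw = generate("[multihop] seeds=" + " | ".join(seeds))
    # Stage 4: extract schema-aligned fields from the raw generation.
    fields = dict(part.split("=", 1) for part in raw.split(";"))
    return {"context": context, "seeds": seeds, **fields}


def fake_model(prompt: str) -> str:
    # Stand-in for the target model so the sketch runs offline.
    if prompt.startswith("[multihop]"):
        return "question=Q;answer=A;thinking_trace=T"
    return "text for " + prompt.split("]")[0] + "]"


result = run_pipeline(fake_model, "geometry", "depth")
```

In a real setup, `generate` would wrap the target model (e.g., Qwen2.5-VL), since the summary states that the same model acts as generator at every stage.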

System Modules

Context Generator (Data Synthesis)

Generate a background passage based on a meta-topic and cognitive prompt

Model or implementation: Target Model (e.g., Qwen2.5-VL) acts as the generator

Seed Question Generator (Data Synthesis)

Generate atomic, single-hop questions based on the background passage

Model or implementation: Target Model (e.g., Qwen2.5-VL)

Multi-Hop Composer (Data Synthesis)

Compose complex questions from seed questions using operators like Bridging or Comparison

Model or implementation: Target Model (e.g., Qwen2.5-VL)
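A small sketch of how composition operators could be expressed as prompt builders. Only the operator names (Bridging, Comparison) come from the summary; the prompt wording, the `Seed` structure, and the `compose` helper are assumptions made for illustration.

```python
from typing import Callable, Dict

Seed = Dict[str, str]  # {"question": ..., "answer": ...} (hypothetical structure)


def bridging_prompt(first: Seed, second: Seed) -> str:
    # Bridging: the result of the first seed is needed to answer the second.
    return (
        "Write one question that can only be answered by first resolving "
        f"'{first['question']}' and then using that result to answer "
        f"'{second['question']}'. Include a step-by-step thinking trace."
    )


def comparison_prompt(first: Seed, second: Seed) -> str:
    # Comparison: both seeds must be answered, then their answers contrasted.
    return (
        "Write one question that requires answering both "
        f"'{first['question']}' and '{second['question']}' and then comparing "
        "the two answers. Include a step-by-step thinking trace."
    )


OPERATORS: Dict[str, Callable[[Seed, Seed], str]] = {
    "bridging": bridging_prompt,
    "comparison": comparison_prompt,
}


def compose(operator: str, first: Seed, second: Seed) -> str:
    if operator not in OPERATORS:
        raise ValueError(f"unknown operator: {operator}")
    return OPERATORS[operator](first, second)


a = {"question": "Who designed the bridge?", "answer": "Engineer X"}
b = {"question": "Where was the designer born?", "answer": "City Y"}
prompt = compose("bridging", a, b)
```

The returned string would be sent to the target model, which writes the actual multi-hop question and trace.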

Structured Extractor (Data Synthesis)

Format the output into schema-aligned data (Question, Answer, Trace)

Model or implementation: Target Model (e.g., Qwen2.5-VL)

Novel Architectural Elements

Cognitive Thinking Process Injection: A prompt-based architectural choice to explicitly decouple reasoning generation into specific cognitive operations (breadth, depth, progression)
Compositional Synthesis Pipeline: Explicitly generating 'atomic' single-hop seeds first, then programmatically prompting the model to compose them into multi-hop queries

Modeling

Base Model: Qwen2.5-VL-7B and Qwen2.5-VL-32B (also tested on InternVL series)

Training Method: Supervised Fine-Tuning (SFT) on synthesized data

Training Data:

400 synthesized samples per model
Generated using the model itself (self-synthesis)

Key Hyperparameters:

data_samples: 400

Compute: Not reported in the paper

Comparison to Prior Work

vs. MMInstruct/MMEvol: MindGYM focuses on 'thinking-centric' priors rather than just task diversity, reducing quality variance.
vs. RL4F: MindGYM is a data-centric approach (synthesis) rather than an optimization-centric approach (RL), reducing compute cost.
vs. CoT/ToT: MindGYM embeds the reasoning process into the *training data* itself via synthesis, rather than just using it as an inference trick.

Limitations

Multimodal synthesis relies on static image datasets (OK-VQA, ScienceQA) as anchors, limiting visual diversity
Current multimodal synthesis is not fully generative (doesn't generate images)
Experiments in the main text focus heavily on text-only data synthesis, with updates applied to the LLM layers

Reproducibility

Code: https://github.com/modelscope/data-juicer/tree/MindGYM/

Code and data are released at the repository above. The paper details the prompt structure and synthesis stages; 400 samples were generated per model for the experiments.

📊 Experiments & Results

Evaluation Setup

Fine-tuning VLMs on small sets of synthesized data and evaluating on reasoning benchmarks

Benchmarks:

MathVision-Mini (Multimodal Mathematical Reasoning)
MathVista-Mini (Multimodal Mathematical Reasoning)
MMStar (Multimodal Reasoning)
GSM8K (Text Math Reasoning)
MATH (Text Math Reasoning)
GPQA (Graduate-Level Google-Proof Q&A)

Metrics:

Accuracy
Data Quality Score (via Data-Juicer)
Data Quality Variance (via Data-Juicer)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
MathVision-Mini	Accuracy	Not reported in the paper	Not reported in the paper	+16%
Data-Juicer Quality Metrics	Average Quality Improvement	Not reported in the paper	Not reported in the paper	+16.7%
Data-Juicer Quality Metrics	Quality Variance Reduction	Not reported in the paper	Not reported in the paper	-67.91%
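The two quality rows are relative changes. As a rough sketch, figures of this kind could be computed from per-sample quality scores as below, assuming the mean and population variance are the statistics being compared. The score lists are made-up toy numbers, not the paper's data, and the exact formula used by the authors is not stated in this summary.

```python
from statistics import mean, pvariance


def relative_change(new: float, old: float) -> float:
    # Signed relative change in percent: positive means `new` is larger.
    return (new - old) / old * 100.0


baseline_scores = [0.2, 0.4, 0.6, 0.8]   # toy values, not from the paper
synthetic_scores = [0.5, 0.6, 0.6, 0.7]  # toy values, not from the paper

quality_gain = relative_change(mean(synthetic_scores), mean(baseline_scores))
variance_change = relative_change(
    pvariance(synthetic_scores), pvariance(baseline_scores)
)
```

With these toy values the mean rises by about 20% and the variance falls by about 90%, which shows how a modest gain in average quality can coexist with a much larger drop in variance.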

Main Takeaways

Cognitively guided synthesis yields higher average quality data and significantly lower quality variance compared to baselines.
Low variance in data quality is critical for stable fine-tuning.
Improvements are consistent across model scales (7B, 32B) and architectures (Qwen, InternVL).
In the authors' observations, Chinese-synthesized data outperformed English and mixed-language variants.
Small amounts of high-quality synthetic data (400 samples) can yield double-digit percentage gains.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Instruction Tuning
Familiarity with Chain-of-Thought (CoT) prompting
Basics of Vision-Language Models (VLMs)

Key Terms

Cognitive Thinking Injection: The process of embedding structured reasoning objectives (like hypothesis testing or counterfactual reasoning) into prompts to guide model generation

Single-Hop QA: A question-answer pair that requires only one step of reasoning or evidence retrieval to solve

Multi-Hop QA: A complex question requiring multiple steps of reasoning, often composed by combining single-hop facts

Rejection Sampling: A technique used here to discard generated questions that are semantically too similar to existing ones, ensuring diversity
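A minimal sketch of similarity-based rejection, assuming a token-set Jaccard overlap as a crude stand-in for semantic similarity. The similarity measure, the `0.7` threshold, and the function names are assumptions for illustration; the summary does not say which measure or threshold the paper uses.

```python
from typing import List


def jaccard(a: str, b: str) -> float:
    # Token-set overlap in [0, 1]; a crude stand-in for semantic similarity.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def reject_near_duplicates(candidates: List[str],
                           threshold: float = 0.7) -> List[str]:
    kept: List[str] = []
    for cand in candidates:
        # Keep a candidate only if it is not too similar to anything kept so far.
        if all(jaccard(cand, prev) < threshold for prev in kept):
            kept.append(cand)
    return kept


questions = [
    "What is the area of the red triangle?",
    "What is the area of the big red triangle?",
    "Which train arrives first at the station?",
]
kept = reject_near_duplicates(questions)
```

Here the second question is dropped as a near-duplicate of the first, while the third is kept. An embedding-based similarity would be a natural replacement for `jaccard` when paraphrases share few tokens.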

Data-Juicer: A data processing toolkit used in this paper to analyze the quality and variance of the synthesized datasets

VLMs: Vision-Language Models, i.e., AI models capable of processing and reasoning over both image and text inputs