MM-Zero: Self-Evolving Multi-Model Vision Language Models From Zero Data

📝 Paper Summary

Vision Language Models (VLMs), Self-Evolving Systems

MM-Zero enables Vision Language Models to self-evolve reasoning capabilities without external data by using a tri-role framework where agents propose tasks, generate executable code to render images, and solve them.

Core Problem

Self-evolving VLMs typically require seed image data, making them dependent on the quality and diversity of collected datasets and limiting scalability compared to text-only LLMs.

Why it matters:

Collecting and filtering image data is costly, labor-intensive, and limits the diversity of scenarios a model can learn from
Existing proposer-solver pipelines for VLMs remain bounded by the static distribution of pre-collected images
Synthetic generation via code allows for virtually unlimited variations and complex scenarios (e.g., charts, geometry) that are hard to mine from the web

Concrete Example: A standard VLM self-training loop might retrieve a static image of a chart and ask questions about it. If the dataset lacks complex 3D function plots, the model never improves on them. MM-Zero's Coder can write Python/SVG code to render a new, specific 3D plot requested by the Proposer, creating the training data on the fly.
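To make the "image from code" idea concrete, here is a minimal sketch of the kind of program a Coder might emit. It is an illustration only: the function name `render_bar_chart_svg`, the bar-chart layout, and the sizes are assumptions, not taken from the paper. It builds an SVG string in plain Python and checks that the markup parses.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

SVG_NS = "http://www.w3.org/2000/svg"

def render_bar_chart_svg(values, labels, width=320, height=200):
    """Return an SVG document string with one bar per non-negative value."""
    if not values or len(values) != len(labels):
        raise ValueError("values and labels must be non-empty and of equal length")
    margin = 20
    slot = (width - 2 * margin) / len(values)   # horizontal space per bar
    peak = max(values) or 1                     # avoid dividing by zero
    parts = [f'<svg xmlns="{SVG_NS}" width="{width}" height="{height}">']
    for i, (value, label) in enumerate(zip(values, labels)):
        bar_h = (height - 2 * margin) * value / peak
        x = margin + i * slot
        y = height - margin - bar_h
        parts.append(
            f'<rect x="{x:.1f}" y="{y:.1f}" width="{slot * 0.8:.1f}" '
            f'height="{bar_h:.1f}" fill="steelblue"/>'
        )
        parts.append(
            f'<text x="{x:.1f}" y="{height - 5}" font-size="10">{escape(label)}</text>'
        )
    parts.append("</svg>")
    return "\n".join(parts)

# Hypothetical request from a Proposer: "a bar chart with three categories".
svg = render_bar_chart_svg([3, 7, 5], ["A", "B", "C"])
root = ET.fromstring(svg)  # raises ParseError if the markup is malformed
bars = root.findall(f"{{{SVG_NS}}}rect")
```

Because the chart is produced by a program, changing the values, the number of bars, or the chart type yields a new training image without touching any dataset.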

Key Novelty

Tri-Role Zero-Data Self-Evolution

Expands the standard two-role Proposer-Solver setup into a three-role system (Proposer, Coder, Solver) that bridges abstract concepts and visual data via code
Uses executable code (SVG/Python) as an intermediate representation to programmatically generate visual training data rather than retrieving existing images
Implements a 'Goldilocks' reward mechanism where the Proposer is incentivized to generate tasks that are challenging but solvable for the Solver

Architecture

The tri-role self-evolving training framework comprising Proposer, Coder, and Solver.

Breakthrough Assessment

9/10

Proposes a theoretically significant shift from data-driven to code-driven visual self-evolution, potentially removing the data bottleneck for VLM reasoning entirely.

⚙️ Technical Details

Problem Definition

Setting: Self-supervised reinforcement learning for multimodal reasoning without external data

Inputs: None (starts from zero data/random seed prompts)

Outputs: Improved VLM policy capable of visual reasoning

Pipeline Flow

Proposer generates concept + easy/hard questions
Coder generates code (SVG/Python) for concept
Code execution renders Image
Solver answers questions based on Image
Rewards computed (Execution, Consistency, Difficulty) and propagated
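The five steps above can be sketched as a single episode function. This is a hedged outline, not the authors' implementation: the `Task` and `Episode` containers and the four callables (`propose`, `write_code`, `render`, `solve`) are hypothetical stand-ins for the three model roles and the renderer, and reward computation is left out.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Task:
    caption: str
    easy_question: str
    hard_question: str

@dataclass
class Episode:
    task: Task
    code: str
    image: Optional[bytes]        # None when rendering failed
    easy_answer: Optional[str]
    hard_answer: Optional[str]

def run_episode(
    propose: Callable[[], Task],
    write_code: Callable[[str], str],
    render: Callable[[str], Optional[bytes]],
    solve: Callable[[bytes, str], str],
) -> Episode:
    task = propose()                       # 1. Proposer: concept + questions
    code = write_code(task.caption)        # 2. Coder: caption -> code
    image = render(code)                   # 3. Execute code to get an image
    if image is None:                      # Failed render: Solver is skipped
        return Episode(task, code, None, None, None)
    return Episode(                        # 4. Solver answers both questions
        task, code, image,
        solve(image, task.easy_question),
        solve(image, task.hard_question),
    )

# Stub roles, for illustration only.
task = Task("a red circle left of a blue square",
            "What color is the circle?",
            "Which shape is further right?")
ok = run_episode(lambda: task, lambda c: "<svg/>",
                 lambda code: b"fake-image", lambda img, q: "red")
failed = run_episode(lambda: task, lambda c: "<svg",
                     lambda code: None, lambda img, q: "red")
```

In a real loop each callable would wrap a model rollout, and step 5 would score the resulting `Episode` objects before the policy update.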

System Modules

Proposer

Generate abstract visual captions, easy verification questions, and hard reasoning questions

Model or implementation: Base VLM (e.g., Qwen3-VL-Instruct)

Coder

Translate textual captions into executable code (SVG/Python) to render images

Model or implementation: Base VLM (initialized from Proposer checkpoint)

Solver

Perform multimodal reasoning over the synthesized image

Model or implementation: Base VLM

Novel Architectural Elements

Tri-role closed loop where visual data is synthesized internally via code execution rather than retrieved
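Integration of code generation (Coder) as an executable bridge between abstract language (Proposer) and visual reasoning (Solver); since code execution and rendering are not differentiable, the roles are connected through reinforcement learning rewards rather than gradients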

Modeling

Base Model: Qwen3-VL-Instruct (4B and 8B), Mimo-VL-7B-Instruct

Training Method: Group Relative Policy Optimization (GRPO)

Objective Functions:

Purpose: Optimize policy without a critic model by normalizing rewards within a group of samples.

Formally: L_GRPO = -1/N * sum( min( ratio * A, clip(ratio, 1-eps, 1+eps) * A ) ) + beta * KL (where ratio is the new-to-old policy probability ratio, A is the group-normalized advantage, eps is the clipping range, and beta weights the KL penalty)

Purpose: Reward Proposer for generating valid, solvable, yet difficult tasks.

Formally: R(x) = R_format + R_solv + R_diff + penalties (diversity, repetition)

Purpose: Reward Coder for executable and semantically accurate code.

Formally: R_D = w1*R_render + w2*R_solv + w3*R_diff - penalties

Purpose: Reward Solver for consistency and format adherence.

Formally: R_S = alpha*R_acc + beta*R_format (where R_acc is majority vote consistency; this beta is the format weight, not the KL coefficient in L_GRPO)
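The GRPO objective can be illustrated with a small, dependency-free sketch. It makes several simplifying assumptions that are not from the paper: one log-probability per sampled response (sequence level), a clipping range of 0.2, population standard deviation for normalization, and no KL term. The function names are illustrative.

```python
import math
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def clipped_surrogate_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Negative mean of min(ratio * A, clip(ratio) * A); KL term omitted."""
    terms = []
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        terms.append(min(ratio * adv, clipped * adv))
    return -sum(terms) / len(terms)

# Four rollouts for one prompt; the group itself supplies the baseline.
adv = group_advantages([1.0, 0.0, 0.0, 1.0])

# A response whose probability doubled has its gain capped at 1 + clip_eps.
loss = clipped_surrogate_loss([math.log(2.0)], [0.0], [1.0])
```

Normalizing inside the group is what removes the need for a critic: a rollout is judged only against its siblings for the same prompt. When every rollout in a group receives the same reward, all advantages are zero and the group contributes no gradient.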

Training Data:

Approximately 4,000 generated caption/QA pairs used to train Coder
Solver training data: filtered successfully rendered images + questions

Key Hyperparameters:

coder_rollout_size: 4
solver_rollout_size: 5
total_reward_range: [-1.0, 1.5]
delta_eh (difficulty threshold): 0.15
lambda_eh (easy-hard penalty weight): 0.3
phi (content type threshold): 0.5
lambda_ct (content type penalty): 0.15
lambda_div (diversity weight): 0.5
lambda_err (coder error penalty): -0.1 (render fail), -0.05 (syntax error)
alpha (solver accuracy weight): 0.9
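For experimentation, the listed values can be collected in one immutable config object. The field names below are illustrative, and treating `total_reward_range` as a hard clamp on the summed reward is an assumption; the summary gives only the range itself.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RewardConfig:
    coder_rollout_size: int = 4
    solver_rollout_size: int = 5
    reward_min: float = -1.0             # total_reward_range lower bound
    reward_max: float = 1.5              # total_reward_range upper bound
    delta_eh: float = 0.15               # difficulty threshold
    lambda_eh: float = 0.3               # easy-hard penalty weight
    phi: float = 0.5                     # content type threshold
    lambda_ct: float = 0.15              # content type penalty
    lambda_div: float = 0.5              # diversity weight
    render_fail_penalty: float = -0.1    # lambda_err, render failure
    syntax_error_penalty: float = -0.05  # lambda_err, syntax error
    alpha: float = 0.9                   # solver accuracy weight

def clamp_total_reward(raw: float, cfg: RewardConfig) -> float:
    """Assumed behavior: clip a summed reward into [reward_min, reward_max]."""
    return max(cfg.reward_min, min(cfg.reward_max, raw))

cfg = RewardConfig()
clamped_high = clamp_total_reward(2.0, cfg)
clamped_low = clamp_total_reward(-3.0, cfg)
```

Freezing the dataclass keeps the values from being changed by accident partway through a training run.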

Comparison to Prior Work

vs. VisPlay/EvolMM: MM-Zero generates its own visual data via code, whereas others require external image datasets
vs. LLM Self-Evolution (e.g., R-Zero): MM-Zero extends the paradigm to multimodal by synthesizing the visual modality, rather than text-only evolution

Limitations

Dependency on the base model's ability to generate valid code; if the Coder cannot write SVG/Python, the loop breaks
Computational cost of rendering images and performing multiple rollouts for rewards
Hard questions have no ground truth, so rewards rely on self-consistency, which can reinforce hallucinations when the model is confidently wrong
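The self-consistency risk in the last limitation is easy to see in a small sketch of a majority-vote Solver reward of the form R_S = alpha*R_acc + beta*R_format. The value alpha = 0.9 comes from the listed hyperparameters; beta = 0.1, the function names, and exact string matching of answers are assumptions made for illustration.

```python
from collections import Counter

def majority_label(answers):
    """Most frequent answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]

def solver_rewards(answers, well_formatted, alpha=0.9, beta=0.1):
    """Score each rollout against the majority answer, not a ground truth."""
    label = majority_label(answers)
    return [
        alpha * float(ans == label) + beta * float(fmt)
        for ans, fmt in zip(answers, well_formatted)
    ]

# Suppose the true answer is "12" but four of five rollouts say "10".
answers = ["10", "10", "12", "10", "10"]
rewards = solver_rewards(answers, [True] * 5)
```

Here the one rollout that answered "12" receives only the format reward, while the four agreeing rollouts receive the full reward. If "12" was in fact correct, the update pushes the Solver further toward the wrong answer.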

Reproducibility

Code: https://github.com/zli12321/MM-Zero

The code is publicly available at the repository above. The paper specifies exact reward weights and filtering thresholds (e.g., keeping examples with a render success rate between 0.25 and 0.75). The base models are Qwen3-VL-Instruct and Mimo-VL-7B-Instruct.
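The render success rate filter mentioned above can be sketched as follows. The inclusive bounds and the function names are assumptions; the summary states only the 0.25-0.75 range.

```python
def render_success_rate(outcomes):
    """Fraction of Coder rollouts whose code rendered successfully."""
    if not outcomes:
        raise ValueError("need at least one rollout outcome")
    return sum(bool(o) for o in outcomes) / len(outcomes)

def keep_example(outcomes, low=0.25, high=0.75):
    """Keep captions that render sometimes, but not always."""
    return low <= render_success_rate(outcomes) <= high

# With 4 Coder rollouts per caption, possible rates are 0, 0.25, 0.5, 0.75, 1.
kept = keep_example([True, False, False, False])     # rate 0.25
always_ok = keep_example([True, True, True, True])   # rate 1.0
never_ok = keep_example([False, False, False, False])
```

Under this reading, captions that always render are too easy to teach the Coder anything, and captions that never render produce no usable image.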

📊 Experiments & Results

Evaluation Setup

Zero-shot evaluation on standard multimodal benchmarks after self-evolution

Benchmarks:

Not listed in the source excerpt (described only as multimodal reasoning benchmarks)

Metrics:

Accuracy
Pass@k

Statistical methodology: Not explicitly reported in the paper

Main Takeaways

The provided text describes the methodology and reward mechanisms in detail but ends before the experimental results section.
The framework incentivizes 'Goldilocks' tasks: questions that are challenging (high variance in Solver answers) but essentially solvable (verifiable via Easy questions).
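One way to picture a 'Goldilocks' reward is a score that peaks when Solver rollouts are split on the hard question and drops to zero when they all agree, gated by agreement on the easy question. The sketch below is an assumed shape chosen for illustration: the formula `1 - 2*|p - 0.5|`, the 0.8 gate, and the function names are not taken from the paper.

```python
from collections import Counter

def agreement(answers):
    """Fraction of rollouts that gave the most common answer."""
    top_count = Counter(answers).most_common(1)[0][1]
    return top_count / len(answers)

def goldilocks_reward(easy_answers, hard_answers, easy_min_agreement=0.8):
    """Reward intermediate difficulty; zero if the image seems unreadable."""
    if agreement(easy_answers) < easy_min_agreement:
        return 0.0                       # easy check failed: task not solvable
    p = agreement(hard_answers)          # proxy for Solver confidence
    return 1.0 - 2.0 * abs(p - 0.5)      # peaks at p = 0.5, zero at p = 1

easy = ["red"] * 5
too_easy = goldilocks_reward(easy, ["7"] * 5)                    # all agree
in_range = goldilocks_reward(easy, ["7", "7", "7", "8", "9"])    # p = 0.6
unreadable = goldilocks_reward(["red", "blue", "green", "red", "pink"],
                               ["7", "7", "7", "8", "9"])
```

The easy-question gate separates "hard but answerable" from "the rendered image does not show what the caption claims", which a disagreement score alone cannot distinguish.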

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning with Verifiable Rewards (RLVR)
Vision Language Models (VLMs)
Proximal Policy Optimization (PPO) concepts

Key Terms

GRPO: Group Relative Policy Optimization—an RL algorithm that optimizes policies based on the relative performance of a group of outputs rather than a value function

RLVR: Reinforcement Learning with Verifiable Rewards—a training paradigm where model outputs are scored based on objective verification (e.g., correct answer, successful code execution)

SVG: Scalable Vector Graphics—an XML-based vector image format that can be generated via code

TTRL: Test-Time Reinforcement Learning—using the model's own consistency (majority vote) during inference/generation as a proxy for correctness when ground truth is missing

Goldilocks principle: A reward strategy that incentivizes generating tasks of intermediate difficulty (not too hard, not too easy) to maximize learning signal

Proposer: The agent role responsible for formulating visual concepts and questions

Coder: The agent role responsible for translating concepts into executable code to render images

Solver: The agent role responsible for reasoning over the rendered images to answer questions