Advancing Multimodal Reasoning via Reinforcement Learning with Cold Start

📝 Paper Summary

Multimodal Large Language Models (MLLMs)
Reinforcement Learning (RL) for Reasoning

A two-stage training framework for multimodal models that first establishes strong reasoning patterns via supervised fine-tuning (cold start) before refining them with reinforcement learning, outperforming RL-only approaches.

Core Problem

Directly applying reinforcement learning (Zero RL) to multimodal models often fails to induce genuine reasoning capabilities, as the observed 'aha moments' are frequently hallucinations rather than effective self-correction.

Why it matters:

The common belief that 'aha moments' (reflective patterns) emerge autonomously and signal improved reasoning in MLLMs may be a misconception
Reinforcement learning alone struggles to discover effective reasoning strategies from scratch in the multimodal domain without a strong initial foundation
Existing methods either rely solely on SFT (Supervised Fine-Tuning) or jump straight to RL, missing the synergy of combining both for scalable reasoning

Concrete Example: When solving a parallelogram geometry problem, a base model might generate reflective text like 'Wait, let's re-evaluate,' but then immediately proceed to use the same incorrect logic (e.g., claiming angles sum to 100° instead of 180°), showing that the pattern exists but is functionally useless.

Key Novelty

SFT-Cold-Start followed by GRPO (Group Relative Policy Optimization)

Demonstrates that 'aha moment' patterns exist in base models prior to RL and do not inherently correlate with correctness, challenging the 'emergent' view
Proposes explicitly initializing the model with high-quality Chain-of-Thought data (Cold Start) distilled from larger models before applying RL
Shows that reasoning *format* (structure) learned during cold start is crucial for subsequent RL success, even if the cold start data contains errors

Architecture

Overview of the two-stage training methodology: Cold Start (SFT) followed by Reinforcement Learning.

Evaluation Highlights

+6.19 average score improvement on 4 multimodal benchmarks for the 7B model compared to the Qwen2.5-VL-7B base model
+10.84 average score improvement for the 3B model, allowing it to outperform several 7B baselines like Qwen2.5-VL-7B and VLAA-Thinker-7B
Achieves 73.4% on MathVista (7B model), surpassing GPT-4o (59.5%) and Skywork R1V (67.5%)

Breakthrough Assessment

8/10

Provides critical empirical evidence debunking the 'emergent aha moment' in current MLLM RL work and establishes a SOTA pipeline (SFT+RL) that allows 3B models to beat 7B baselines.

⚙️ Technical Details

Problem Definition

Setting: Multimodal reasoning tasks involving visual inputs (geometry, charts) and text queries, optimized via a two-stage post-training process

Inputs: Image I and Question q

Outputs: Reasoning chain and final answer o

Pipeline Flow

Data Construction (Distillation from Teacher MLLM)
Cold Start (Supervised Fine-Tuning)
Reinforcement Learning (GRPO)

System Modules

Data Construction

Generate high-quality reasoning data for Cold Start

Model or implementation: Qwen2.5-VL-32B (Teacher) or Qwen2.5-VL-7B

Supervised Fine-Tuning (Cold Start)

Initialize the model with structured reasoning patterns

Model or implementation: Qwen2.5-VL (3B and 7B variants)

Reinforcement Learning

Optimize the policy to maximize correct reasoning and answer generation

Model or implementation: SFT-tuned MLLM

Novel Architectural Elements

Unified two-stage post-training framework (SFT+RL) explicitly analyzing the correlation between Cold Start strategies and RL outcomes in the multimodal domain

Modeling

Base Model: Qwen2.5-VL (3B and 7B)

Training Method: GRPO (Group Relative Policy Optimization)

Objective Functions:

Purpose: Maximize expected reward while staying close to reference model.

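Formally: $J(\theta) = \mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^{G} \left( A_i \, \rho_i(\theta) - \beta \, D_{\mathrm{KL}}(\pi_\theta \,\|\, \pi_{\mathrm{ref}}) \right) \right]$, where $G$ is the group size, $A_i$ is the advantage of the $i$-th sampled output, $\rho_i(\theta)$ is the probability ratio between the current policy and the policy that sampled the outputs, $\pi_{\mathrm{ref}}$ is the reference model, and $\beta$ is the KL coefficient.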
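As a rough illustration of the objective above, the following sketch evaluates it for one group of sampled outputs using sequence-level log-probabilities. The function name `grpo_objective` and its arguments are hypothetical, and the KL term uses the non-negative estimator exp(x) - x - 1 with x = log π_ref - log π_θ, which is a common choice in GRPO implementations; the summary does not state which estimator, clipping rule, or token-level averaging the paper uses, so those details are assumptions or omitted here.

```python
import numpy as np

def grpo_objective(logp_new, logp_old, logp_ref, advantages, beta):
    """Mean over the group of (ratio * advantage - beta * KL estimate).

    All inputs are per-output (sequence-level) values for one group of size G.
    Clipping of the ratio is intentionally omitted to mirror the summary formula.
    """
    logp_new = np.asarray(logp_new, dtype=float)
    logp_old = np.asarray(logp_old, dtype=float)
    logp_ref = np.asarray(logp_ref, dtype=float)
    advantages = np.asarray(advantages, dtype=float)

    # Probability ratio between the current policy and the sampling policy.
    ratio = np.exp(logp_new - logp_old)

    # Assumed KL estimator: exp(x) - x - 1 with x = log(pi_ref / pi_theta).
    # It is zero when the two policies agree and positive otherwise.
    x = logp_ref - logp_new
    kl = np.exp(x) - x - 1.0

    return float(np.mean(ratio * advantages - beta * kl))

# Toy group of G = 3 outputs (made-up numbers, not from the paper).
logp = np.array([-1.0, -2.0, -0.5])
adv = np.array([1.0, 0.0, -0.5])
value = grpo_objective(logp, logp, logp, adv, beta=0.1)
```

When the current, sampling, and reference policies coincide, the ratio is 1 and the KL estimate is 0, so the objective reduces to the mean advantage of the group.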

Training Data:

50k examples from 12 open-source datasets (Geometry3K, GeoQA, AI2D, ChartQA, etc.)
Data synthesized via rejection sampling from Qwen2.5-VL-32B (Distilled-CoT)

Key Hyperparameters:

group_size_G: Not reported in the paper
beta_kl: Not reported in the paper
learning_rate: Not reported in the paper

Compute: Not reported in the paper

Comparison to Prior Work

vs. DeepSeek-R1: Extends the SFT+RL paradigm to the *multimodal* domain
vs. MM-Eureka: Challenges their finding that 'aha moments' are emergent and effective; shows they pre-exist and can be hallucinations
vs. Zero RL approaches (e.g. R1-V): Empirically demonstrates that SFT+RL consistently outperforms RL-only (Zero RL) in multimodal tasks
vs. QvQ-72B: Achieves competitive performance with a much smaller (7B) model via targeted SFT+RL

Limitations

The 'aha moment' analysis is primarily observational and qualitative
Specific hyperparameters for GRPO (e.g., KL coefficient, learning rates) are missing from the text
Analysis is limited to Qwen2.5-VL architectures; generalizability to other MLLM families is not tested
Does not explore iterative RL or self-play beyond the single GRPO stage

Reproducibility

Code: https://github.com/waltonfuture/RL-with-Cold-Start

The code is publicly available at the repository above, and the paper lists all seed datasets used. Hyperparameters such as learning rate and batch size are not detailed in the text.

📊 Experiments & Results

Evaluation Setup

Multimodal mathematical and visual reasoning

Benchmarks:

MathVision (Multimodal mathematical reasoning)
MathVerse (Geometric and visual math problems)
MathVista (Visual math QA)
We-Math (Human-like mathematical reasoning)

Metrics:

Accuracy (Score)
Effective Rank (for model representation analysis)
Statistical methodology: Not explicitly reported in the paper

Key Results

Comparison of the proposed 7B model against state-of-the-art closed and open-source models across four benchmarks.
Benchmark	Metric	Baseline	This Paper	Δ
MathVista	Score	66.30	73.40	+7.10
We-Math	Score	62.87	70.40	+7.53
Average (4 benchmarks)	Score	49.47	55.66	+6.19

Comparison of the proposed 3B model against other 3B scale models.
Benchmark	Metric	Baseline	This Paper	Δ
Average (4 benchmarks)	Score	40.00	50.84	+10.84

Ablation study demonstrating the necessity of the Cold Start (SFT) phase before RL.
Benchmark	Metric	Baseline	This Paper	Δ
Average (4 benchmarks, 3B)	Score	48.79	50.84	+2.05
Average (4 benchmarks, 7B)	Score	53.62	55.66	+2.04
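
The Δ column is simply the reported score minus the baseline. The short check below recomputes it from the figures listed in the table; the row labels and the helper name `delta` are illustrative only.

```python
# (label, baseline, this_paper) taken from the Key Results rows above.
rows = [
    ("7B MathVista", 66.30, 73.40),
    ("7B We-Math", 62.87, 70.40),
    ("7B average", 49.47, 55.66),
    ("3B average", 40.00, 50.84),
    ("Ablation average, row 1", 48.79, 50.84),
    ("Ablation average, row 2", 53.62, 55.66),
]

def delta(baseline, score):
    """Absolute improvement in score points, rounded to two decimals."""
    return round(score - baseline, 2)

deltas = [delta(baseline, score) for _, baseline, score in rows]

for (label, _, _), d in zip(rows, deltas):
    print(f"{label}: {d:+.2f}")
```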

Experiment Figures

Radar charts comparing the proposed 3B and 7B models against baselines on four benchmarks.

Bar charts contrasting Frequency vs. Accuracy of 'aha moments' across three models (Qwen2.5-VL, VLAA-Thinker, MM-Eureka).

Performance comparison between models trained on 'aha moment' data vs. randomly selected data.

Main Takeaways

SFT acts as a crucial 'Cold Start' that consistently improves downstream RL performance compared to starting RL from the base model (Zero RL).
Using Distilled-CoT data from a larger teacher (32B) yields the best Cold Start performance compared to other strategies like Caption-CoT or Self-Critic-CoT.
The 'aha moment' (reflective pattern) increases in frequency with RL but often correlates with *lower* accuracy if not grounded in correct reasoning during Cold Start.
Even training on 'Unjudged' or 'Wrong' CoT data during Cold Start improves performance over the base model, suggesting the model benefits from learning the *structure* of reasoning.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (specifically GRPO)
Supervised Fine-Tuning (SFT)
Chain-of-Thought (CoT) prompting
Multimodal Large Language Models (MLLMs)

Key Terms

GRPO: Group Relative Policy Optimization—a reinforcement learning algorithm that normalizes rewards within a group of sampled outputs to estimate advantages without a value function
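
A minimal sketch of the group normalization described above: each sampled output's reward is centered by the group mean and scaled by the group standard deviation, so no learned value function is needed. The function name and the small `eps` stabilizer are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled outputs.

    eps (hypothetical) avoids division by zero when all rewards are equal.
    """
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Toy group: two correct (reward 1) and two incorrect (reward 0) outputs.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct outputs in the toy group receive positive advantages and incorrect ones negative advantages; if every output in a group gets the same reward, all advantages are zero and that group contributes no learning signal.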

Cold Start: The initial phase of training where a model is Supervised Fine-Tuned (SFT) on high-quality data to establish a baseline capability before Reinforcement Learning

SFT: Supervised Fine-Tuning—training a model on labeled examples to teach it specific behaviors or formats

Chain-of-Thought: A reasoning technique where the model generates intermediate steps before producing the final answer

Distilled-CoT: Training data generated by a larger, more capable 'teacher' model (e.g., 32B) to teach a smaller 'student' model (e.g., 3B)

Aha Moment: A reflective pattern where a model seemingly pauses to re-evaluate its reasoning (e.g., 'Wait, let me check'), often associated with self-correction
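
One way to separate how often reflective patterns appear from how often they accompany a correct answer is sketched below. The marker phrases in `AHA_PATTERN` and the function `aha_stats` are hypothetical; the paper's actual detection procedure is not described in this summary.

```python
import re

# Hypothetical marker phrases for reflective patterns.
AHA_PATTERN = re.compile(r"\b(wait|re-evaluate|let me check)\b", re.IGNORECASE)

def aha_stats(responses, correct):
    """Return (frequency of aha responses, accuracy among aha responses).

    Accuracy is None when no response contains a marker phrase.
    """
    flags = [bool(AHA_PATTERN.search(text)) for text in responses]
    n_aha = sum(flags)
    frequency = n_aha / len(responses) if responses else 0.0
    if n_aha == 0:
        return frequency, None
    n_aha_correct = sum(1 for f, c in zip(flags, correct) if f and c)
    return frequency, n_aha_correct / n_aha

# Made-up responses for illustration only.
responses = [
    "Wait, let's re-evaluate. The answer is 100.",
    "The angles sum to 180, so x = 60.",
    "Let me check the diagram again. x = 60.",
]
correct = [False, True, True]
frequency, accuracy = aha_stats(responses, correct)
```

Reporting the two numbers separately makes it visible when a model produces reflective phrases frequently but is no more accurate when it does.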

Rejection Sampling: A data filtering method where multiple responses are generated, and only those that match the correct ground truth answer are kept for training
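
The filtering step can be sketched as follows, assuming each generated response ends with a line of the form "Answer: ...". That answer format, and the helper names `extract_answer` and `rejection_sample`, are assumptions for illustration; the paper's actual answer-matching rule is not given here.

```python
def extract_answer(response, marker="Answer:"):
    """Return the text after the last marker, or None if the marker is absent."""
    if marker not in response:
        return None
    return response.rsplit(marker, 1)[1].strip()

def rejection_sample(candidates, ground_truth):
    """Keep only the candidate responses whose final answer matches the ground truth."""
    target = ground_truth.strip()
    return [c for c in candidates if extract_answer(c) == target]

# Made-up teacher responses for a single question.
candidates = [
    "Opposite angles are equal, so x = 60. Answer: 60",
    "The angles sum to 100, so x = 100. Answer: 100",
    "I am not sure about this one.",
]
kept = rejection_sample(candidates, ground_truth="60")
```

Only the reasoning chain is kept or discarded as a whole, so a kept chain can still contain flawed intermediate steps as long as its final answer matches.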

KL divergence: A measure of how much a probability distribution differs from a reference distribution, used here to prevent the RL model from drifting too far from the original model
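
For intuition, the discrete form D_KL(p || q) = Σ p_i log(p_i / q_i) can be computed directly. The sketch below uses toy distributions and assumes q is non-zero wherever p is non-zero.

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(p || q) for two discrete distributions over the same support.

    Terms with p_i = 0 contribute zero; q_i must be positive where p_i > 0.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

policy = [0.5, 0.5]
reference = [0.9, 0.1]
forward = kl_divergence(policy, reference)
reverse = kl_divergence(reference, policy)
```

The divergence is zero only when the two distributions are identical, and it is not symmetric: `forward` and `reverse` generally differ.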

Effective Rank: A metric measuring the effective dimensionality of the matrix formed by the hidden states of the model, often correlated with the amount of knowledge encoded
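
One common definition of effective rank is the exponential of the Shannon entropy of the normalized singular values. The sketch below uses that definition as an assumption, since this summary does not state the exact formula used in the paper; the function name and the `eps` stabilizer are also illustrative.

```python
import numpy as np

def effective_rank(hidden_states, eps=1e-12):
    """exp(entropy of normalized singular values) of a 2-D matrix.

    hidden_states: array of shape (num_tokens, hidden_dim).
    """
    matrix = np.asarray(hidden_states, dtype=float)
    singular_values = np.linalg.svd(matrix, compute_uv=False)
    p = singular_values / (singular_values.sum() + eps)
    p = p[p > 0]  # zero singular values contribute nothing to the entropy
    entropy = -np.sum(p * np.log(p))
    return float(np.exp(entropy))

# Toy matrices: full-rank identity vs. a rank-1 outer product.
full = effective_rank(np.eye(4))
rank_one = effective_rank(np.outer([1.0, 2.0, 3.0], [1.0, 0.0, -1.0, 2.0]))
```

Under this definition a matrix whose singular values are all equal has effective rank equal to its dimension, while a rank-1 matrix has effective rank close to 1.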

Qwen2.5-VL: The base Vision-Language Model family used in this paper, capable of processing both text and images