MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and Open Resources

📝 Paper Summary

Multimodal Reasoning, Reinforcement Learning from Human Feedback (RLHF)

MMR1 stabilizes multimodal reinforcement learning by dynamically sampling prompts that maximize reward variance—balancing correct and incorrect outcomes and diverse reasoning paths—to prevent gradient vanishing.

Core Problem

Group Relative Policy Optimization (GRPO) suffers from gradient vanishing when sampled rewards have low variance (e.g., all correct or all incorrect), weakening optimization signals.

Why it matters:

Standard RL fine-tuning often collapses because relative advantages approach zero without variance, wasting computation and stalling learning.
Existing multimodal datasets lack the scale and quality of long Chain-of-Thought (CoT) data needed for effective reasoning, constraining reproducibility.
Current solutions like filtering by pass rate are heuristic and lack theoretical guarantees regarding gradient magnitude.

Concrete Example: If a model answers a hard math problem incorrectly 32 out of 32 times, the reward variance is zero. GRPO computes advantages relative to the group mean (also zero), resulting in zero gradients and no learning update, even though the model failed.
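The zero-gradient case above can be checked numerically. The following is a minimal sketch of group-relative advantage normalization, assuming the common form (reward minus group mean, divided by group standard deviation); the function name and the small `eps` stabilizer are illustrative assumptions, not taken from the paper's code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group: (r - mean) / (std + eps).

    `eps` is an assumed stabilizer that avoids division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# 32 incorrect answers out of 32: every advantage is zero, so no update.
all_wrong = group_relative_advantages([0.0] * 32)

# 16 correct and 16 incorrect: advantages are close to +1 and -1.
mixed = group_relative_advantages([1.0] * 16 + [0.0] * 16)
```

With identical rewards the numerator is zero for every response, so the policy receives no signal regardless of how the group is normalized.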

Key Novelty

Variance-Aware Sampling (VAS)

Selects training data based on a Variance Promotion Score (VPS) that prioritizes prompts likely to yield mixed outcomes (some right, some wrong) and diverse reasoning paths.
Combines Outcome Variance Score (OVS), which targets a 50% pass rate, with Trajectory Diversity Score (TDS), which ensures gradient signal even when correctness feedback is sparse.
Mixes this targeted sampling with uniform random sampling to ensure broad distribution coverage while boosting optimization stability.
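The 50% pass-rate target follows from the variance of a binary reward. As a short worked derivation, assuming a correctness reward r that is 1 with probability p (the pass rate) and 0 otherwise:

```latex
r \sim \mathrm{Bernoulli}(p), \qquad
\mathrm{Var}[r] = p(1-p) = \tfrac{1}{4} - \left(p - \tfrac{1}{2}\right)^{2} \le \tfrac{1}{4}
```

Equality holds only at p = 1/2, and the variance is zero at p = 0 and p = 1, which are exactly the all-incorrect and all-correct groups.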

Architecture

Overview of the MMR1 framework illustrating the Variance-Aware Sampling (VAS) mechanism within the RL loop.

Evaluation Highlights

MMR1-7B achieves a state-of-the-art average score of 58.4 across 5 multimodal reasoning benchmarks, surpassing comparable reasoning models such as R1-VL-7B (47.7).
On MathVerse, MMR1-7B scores 55.4, outperforming Qwen2.5-VL-7B (50.4) and InternVL2.5-8B (40.0).
MMR1-3B (small scale) achieves a 52.7 average, matching or exceeding several 7B baselines such as OpenVLThinker-7B (52.5).

Breakthrough Assessment

8/10

Provides a theoretically grounded solution to RL gradient vanishing via data sampling and releases significant high-quality open resources (1.6M CoT data), addressing both algorithmic and data bottlenecks.

⚙️ Technical Details

Problem Definition

Setting: Multimodal reasoning task where a policy generates a response y given a visual-text prompt x, optimized via Reinforcement Learning (GRPO).

Inputs: Prompt x containing image and text.

Outputs: Reasoning chain and final answer y.

Pipeline Flow

Variance-Aware Sampling (Selects batch based on VPS)
Policy Rollout (Generates N responses per prompt)
Reward & VPS Update (Evaluates correctness, updates OVS/TDS)
GRPO Update (Optimizes policy using relative rewards)
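The order of the four stages can be sketched as a plain control loop. This is only an illustration of the flow: the function name and all four callables (`select_batch`, `rollout`, `score`, `grpo_update`) are hypothetical stand-ins supplied by the caller, and the sketch refreshes a prompt's VPS every time it is rolled out, which is a simplification.

```python
def run_vas_loop(prompts, select_batch, rollout, score, grpo_update, steps, n=32):
    """Run the four pipeline stages in order for `steps` iterations.

    All four callables are hypothetical placeholders, not the paper's API.
    """
    vps = {p: 1.0 for p in prompts}  # assumed: start with equal scores
    for _ in range(steps):
        batch = select_batch(prompts, vps)  # 1. variance-aware sampling
        for prompt in batch:
            responses = rollout(prompt, n)  # 2. N responses per prompt
            rewards, new_vps = score(prompt, responses)  # 3. reward and VPS update
            vps[prompt] = new_vps
            grpo_update(prompt, responses, rewards)  # 4. policy update
    return vps
```

Passing the stages in as callables keeps the loop independent of any particular model, verifier, or sampler.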

System Modules

Dynamic Sampler

Selects training batch using a mix of Variance Promotion Score (VPS) weighted sampling and uniform sampling.

Model or implementation: Non-parametric selector
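A minimal sketch of such a selector, assuming that `mix_ratio` is the share of the batch drawn by VPS weight, that the remainder is drawn uniformly, and that sampling is with replacement. The function name and these choices are assumptions for illustration; the summary does not specify them.

```python
import random


def sample_batch(vps, batch_size, mix_ratio=0.5, rng=None):
    """Draw a batch: a `mix_ratio` share by VPS weight, the rest uniformly.

    `vps` maps each prompt id to its current Variance Promotion Score.
    """
    rng = rng or random.Random()
    prompts = list(vps)
    n_weighted = round(mix_ratio * batch_size)
    weights = [vps[p] for p in prompts]
    if sum(weights) <= 0:
        # Assumed fallback: with no positive score, sample uniformly.
        weights = [1.0] * len(prompts)
    weighted = rng.choices(prompts, weights=weights, k=n_weighted)
    uniform = rng.choices(prompts, k=batch_size - n_weighted)
    return weighted + uniform
```

The uniform share keeps prompts with a currently low score in circulation, so their scores can be re-estimated as the policy changes.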

Policy Model

Generates N reasoning trajectories and answers for each prompt.

Model or implementation: Qwen2.5-VL (fine-tuned)

Verifier & Scorer

Computes rewards (correctness) and updates VPS components (OVS and TDS).

Model or implementation: Rule-based / Exact Match

Novel Architectural Elements

Variance-Aware Sampling loop: A dynamic data selection mechanism integrated into the RL training loop that prioritizes samples with high estimated gradient norms based on outcome variance and trajectory diversity.

Modeling

Base Model: Qwen2.5-VL-Instruct (3B, 7B, 72B variants used)

Training Method: Group Relative Policy Optimization (GRPO) with Variance-Aware Sampling (VAS)

Objective Functions:

Purpose: Maximize expected reward using relative advantages within groups.

Formally: GRPO objective (standard policy gradient with group normalization and KL penalty).

Purpose: Select data to maximize gradient magnitude.

Formally: Sampling probability proportional to VPS = alpha * OVS + beta * TDS.
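The score can be sketched as follows, assuming OVS is the unnormalized Bernoulli variance p(1 - p) of binary correctness rewards and that TDS is supplied as a value in [0, 1]. The function names and the absence of any rescaling are assumptions; the default weights follow the alpha and beta values reported under Key Hyperparameters.

```python
def outcome_variance_score(rewards):
    """Bernoulli variance p * (1 - p) of binary correctness rewards."""
    p = sum(rewards) / len(rewards)
    return p * (1.0 - p)


def variance_promotion_score(rewards, tds, alpha=0.8, beta=0.2):
    """VPS = alpha * OVS + beta * TDS for one prompt's group of rollouts."""
    return alpha * outcome_variance_score(rewards) + beta * tds


# A prompt solved in half of 32 rollouts scores higher than one never solved,
# but the never-solved prompt keeps a nonzero score through its TDS term.
balanced = variance_promotion_score([1] * 16 + [0] * 16, tds=0.5)
unsolved = variance_promotion_score([0] * 32, tds=0.5)
```

Because the TDS term does not depend on correctness, a prompt with a pass rate of 0 or 1 still has a nonzero sampling probability whenever its trajectories differ.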

Training Data:

Cold-start SFT: ~1.6M items (Math, General, Chart, Table, Science)
RL: ~15k items (8k Hard Math + 7k Logical Reasoning from Raven, MM-IQ, EasyArc)

Key Hyperparameters:

learning_rate: 1e-5 (SFT); not explicitly reported for RL (see the linked codebase)
batch_size: Global batch size not explicitly reported
group_size_N: 32
mix_ratio_lambda: 0.5
alpha: 0.8
beta: 0.2
T_update: 35 steps

Compute: Not reported in the paper

Comparison to Prior Work

vs. R1-VL: MMR1 uses Variance-Aware Sampling to stabilize GRPO, whereas R1-VL uses standard training.
vs. Standard GRPO: MMR1 theoretically proves reward variance lower-bounds gradient magnitude and actively samples to maximize this variance.
vs. Heuristic Filtering (e.g., removing easy/hard samples): MMR1 uses a dynamic, continuous score (VPS) combining outcome and trajectory diversity rather than static thresholds [not cited in paper].

Limitations

Computational cost of updating VPS (requires re-computing diversity metrics periodically).
Reliance on verifiable rewards (math/logic) limits applicability to open-ended creative tasks.
Ablation studies used smaller group size (N=8) than main experiments (N=32) due to resource constraints.

Reproducibility

Code: https://github.com/LengSicong/MMR1

The code is publicly available. The codebase, 1.6M cold-start data, 15k RL data, and model checkpoints (3B, 7B) are released. Training hyperparameters for ablation studies (N=8) differ from main results (N=32).

📊 Experiments & Results

Evaluation Setup

Zero-shot evaluation on mathematical and logical reasoning benchmarks.

Benchmarks:

MathVerse (Multimodal Mathematical Reasoning)
MathVista (Visual Mathematical Reasoning)
MathVision (Visual Mathematical Reasoning)
LogicVista (Logical Reasoning)
ChartQA (Chart Understanding)

Metrics:

Accuracy (Exact Match or equivalent)
Statistical methodology: Not explicitly reported in the paper

Key Results

MMR1-7B outperforms both general-purpose and reasoning-specific baselines across most benchmarks.

Benchmark	Metric	Baseline	This Paper	Δ
MathVerse	Accuracy	50.4	55.4	+5.0
ChartQA	Accuracy	76.3	83.7	+7.4
LogicVista	Accuracy	37.7	48.9	+11.2

Ablation studies confirm the contribution of both Outcome Variance Score (OVS) and Trajectory Diversity Score (TDS) to performance.

Benchmark	Metric	Baseline	This Paper	Δ
MathVista	Accuracy	63.2	66.5	+3.3
MathVista	Accuracy	65.6	66.5	+0.9
MathVista	Accuracy	65.1	66.5	+1.4

Experiment Figures

Figure 2 (implied from context, ablation/convergence plots)

Convergence curves comparing VAS against Random Sampling and other baselines.

Main Takeaways

Variance-Aware Sampling (VAS) effectively mitigates gradient vanishing, leading to faster convergence and higher final performance compared to random sampling.
Both outcome variance (correct/incorrect balance) and trajectory diversity (reasoning path variety) are complementary; removing either degrades performance.
The release of high-quality, large-scale (1.6M) CoT data and curated RL data (15k) enables robust training of multimodal reasoning models.
Theoretical analysis confirms that reward variance lower-bounds the expected policy gradient magnitude, justifying the strategy of maximizing variance.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (REINFORCE, GRPO)
Chain-of-Thought (CoT) reasoning
Gradient Vanishing in Policy Gradients

Key Terms

GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes rewards within a group of outputs for the same prompt to reduce variance, but risks gradient vanishing if intra-group variance is low.

VAS: Variance-Aware Sampling—the proposed strategy to sample prompts that maximize expected reward variance.

VPS: Variance Promotion Score—a metric combining outcome variance and trajectory diversity to guide sampling.

OVS: Outcome Variance Score—component of VPS measuring the variance of correctness (Bernoulli variance), maximized when pass rate is 0.5.

TDS: Trajectory Diversity Score—component of VPS measuring diversity of reasoning paths (e.g., inverse self-BLEU), providing a lower bound on variance.

CoT: Chain-of-Thought—a reasoning method where the model generates intermediate steps before the final answer.

Self-BLEU: A metric measuring diversity by calculating BLEU scores between generated sequences; lower Self-BLEU implies higher diversity.
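As a rough illustration of a diversity score in this spirit, the sketch below uses a simplified stand-in for Self-BLEU: a single clipped n-gram precision of each response against the others, with no brevity penalty and no averaging over n-gram orders. It is not full BLEU, and the function names, whitespace tokenization, and bigram default are all assumptions.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def overlap_precision(candidate, references, n=2):
    """Clipped n-gram precision of `candidate` against `references`."""
    cand = ngrams(candidate, n)
    if not cand:
        return 0.0
    # Highest count of each n-gram in any single reference (BLEU-style clipping).
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    matched = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return matched / sum(cand.values())


def trajectory_diversity(responses, n=2):
    """One minus the mean self-overlap; higher means more varied responses."""
    tokenized = [r.split() for r in responses]
    if len(tokenized) < 2:
        return 0.0
    scores = [
        overlap_precision(tok, tokenized[:i] + tokenized[i + 1:], n)
        for i, tok in enumerate(tokenized)
    ]
    return 1.0 - sum(scores) / len(scores)
```

Identical responses give a diversity of 0, and responses that share no bigrams give a diversity of 1.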

Pass rate: The fraction of generated responses for a given prompt that are correct.

Gradient vanishing: In this context, the phenomenon where policy gradient updates approach zero because the advantage function (reward minus baseline) becomes zero when all rewards in a group are identical.