M2-Reasoning: Empowering MLLMs with Unified General and Spatial Reasoning

📝 Paper Summary

Multimodal Large Language Models (MLLMs)
Reinforcement Learning with Verifiable Rewards (RLVR)
Spatial Reasoning

M2-Reasoning improves multimodal models by combining a high-quality spatial data synthesis pipeline with a dynamic RLVR training strategy that uses curriculum learning and continuous rewards for spatial tasks.

Core Problem

While recent MLLMs perform well on general reasoning via RLVR, they struggle with dynamic spatial interactions (motion, orientation, distance) and lack high-quality, verifiable training data for these domains.

Why it matters:

Current models fail to reason about the dynamic interplay of space and motion, which is essential for real-world robotics and navigation tasks
Existing RLVR approaches typically use binary rewards (correct/incorrect), which fail to provide meaningful gradients for continuous spatial values like distance or size estimation
High-quality reasoning trajectories for visual spatial tasks are scarce compared to text-based math or logic data

Concrete Example: When asked 'What is the distance between A and B?', a standard MLLM might guess a number that is incorrect but close. A binary reward scores this the same as a wildly wrong guess, so the model gets no signal that it was nearly right. M2-Reasoning uses a continuous reward that grows as the prediction approaches the true value.
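The contrast in this example can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the continuous reward follows the EDNM formula listed under Objective Functions with the reported gamma = 1 and lambda = 2, while the epsilon value, the binary tolerance, and the sample distances are assumptions made for the demonstration.

```python
import math


def binary_reward(pred: float, truth: float, tol: float = 1e-6) -> float:
    """All-or-nothing reward: 1.0 only on a (near-)exact match. `tol` is an assumed tolerance."""
    return 1.0 if abs(pred - truth) <= tol else 0.0


def ednm_reward(pred: float, truth: float, gamma: float = 1.0,
                lam: float = 2.0, eps: float = 1e-8) -> float:
    """Continuous reward: gamma * exp(-lam * |pred - truth| / (|truth| + eps)).

    gamma and lam use the values reported in the summary; eps is an assumed
    small constant that guards against division by zero.
    """
    rel_err = abs(pred - truth) / (abs(truth) + eps)
    return gamma * math.exp(-lam * rel_err)


truth = 2.0  # hypothetical ground-truth distance
for pred in (2.0, 2.2, 3.0, 6.0):
    print(pred, binary_reward(pred, truth), round(ednm_reward(pred, truth), 3))
```

Under these settings a guess that is 10% off still earns roughly 0.82, a guess that is 50% off earns roughly 0.37, and the binary reward is 0 for both.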

Key Novelty

Unified General and Spatial Reasoning via Dynamic RLVR

Establishes a dual-domain data pipeline: generates rigorous CoT paths for general logic and synthesizes 3D spatial data (images/videos) with verifiable physical attributes (depth, size)
Employs a 'Step-wise Dynamic Optimization' strategy that sequences tasks by difficulty (curriculum) and dynamically weights samples during training based on their current learning value
Introduces Exponential Decay Numeric Matching (EDNM), a continuous reward function for spatial tasks that provides granular feedback for numerical estimations (e.g., distance) rather than binary success/failure

Evaluation Highlights

Achieves SOTA average score of 45.0 on 6 general reasoning benchmarks, outperforming InternVL3-8B (41.4) and WeThink-VL-7B (44.3)
Sets new SOTA on CV-Bench (spatial reasoning) with 82.3 average, surpassing InternVL3-8B (82.0) and Qwen2.5-VL-7B (75.0)
Leads on fine-grained spatial tasks in VSI-Bench, scoring 55.4 on Room Size estimation versus 33.6 for InternVL3-8B

Breakthrough Assessment

8/10

Strong engineering contribution combining specialized data synthesis with tailored RLVR rewards. Effectively bridges the gap between abstract logical reasoning and concrete spatial perception in MLLMs.

⚙️ Technical Details

Problem Definition

Setting: Multimodal reasoning covering both general abstract tasks (math, logic) and spatial perception tasks (distance, size, relative position)

Inputs: Image or Video + Text Question q

Outputs: Reasoning chain (thought block) + Final Answer (answer block)
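A verifiable reward needs the final answer separated from the reasoning. The sketch below shows one way to do that split. The tag names <think> and <answer> are hypothetical, since the summary only mentions a "thought block" and an "answer block", and the function name is illustrative.

```python
import re
from typing import Optional, Tuple

# Hypothetical tag names: the summary does not state the exact delimiters.
_RESPONSE = re.compile(r"<think>(.*?)</think>\s*<answer>(.*?)</answer>", re.DOTALL)


def split_response(text: str) -> Optional[Tuple[str, str]]:
    """Return (reasoning, answer), or None if the expected format is missing."""
    match = _RESPONSE.search(text)
    if match is None:
        return None
    return match.group(1).strip(), match.group(2).strip()


sample = "<think>The sofa spans about two floor tiles.</think> <answer>1.2</answer>"
print(split_response(sample))
```

Returning None for malformed output lets a caller assign zero reward without raising an error.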

Pipeline Flow

Input Processing (Vision Encoder + Text Tokenizer)
Multimodal Fusion (LLM)
Generation (Reasoning Trajectory + Final Answer)

System Modules

Vision Encoder

Process images/videos at native resolution

Model or implementation: Based on M2-Omni vision encoder (SigLIP-like)

Reasoning LLM

Generate reasoning steps and final answer

Model or implementation: Initialized from Qwen2.5-7B-Instruct

Modeling

Base Model: Qwen2.5-7B-Instruct (Language) + Native Resolution Vision Encoder (from M2-Omni)

Training Method: Two-stage training: Cold-start SFT followed by GRPO (Group Relative Policy Optimization)

Objective Functions:

Purpose: Optimize policy to maximize reward using group-relative advantages.

Formally: J_GRPO maximization with dynamic advantage weighting and KL penalty.

Purpose: Dynamically weight updates based on sample difficulty.

Formally: alpha = sigma * mean(R) * (1 - mean(R))

Purpose: Reward spatial numerical predictions continuously.

Formally: R_EDNM(x) = gamma * exp(-lambda * |x - x_gt| / (|x_gt| + epsilon))
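The first two objectives can be illustrated together. The sketch below computes standard GRPO group-relative advantages (reward minus group mean, divided by group standard deviation) and the weight alpha from the formula above with the reported sigma = 7.2. Two points are assumptions rather than details from the summary: that mean(R) is taken over the sampled responses for one prompt with rewards in [0, 1], and that alpha simply multiplies each advantage.

```python
from statistics import mean, pstdev
from typing import List


def dynamic_weight(rewards: List[float], sigma: float = 7.2) -> float:
    """alpha = sigma * mean(R) * (1 - mean(R)); assumes rewards lie in [0, 1]."""
    r_bar = mean(rewards)
    return sigma * r_bar * (1.0 - r_bar)


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantages: normalize each reward within its group."""
    r_bar = mean(rewards)
    std = pstdev(rewards)
    return [(r - r_bar) / (std + eps) for r in rewards]


def weighted_advantages(rewards: List[float], sigma: float = 7.2) -> List[float]:
    """Assumed combination: scale every advantage in the group by alpha."""
    alpha = dynamic_weight(rewards, sigma)
    return [alpha * a for a in group_relative_advantages(rewards)]


print(dynamic_weight([0.0, 0.0, 1.0, 1.0]))       # half the group succeeds
print(weighted_advantages([1.0, 1.0, 1.0, 1.0]))  # prompt already solved
```

The weight peaks when about half of the group succeeds and falls to zero when every response succeeds or every response fails, so prompts that are already mastered or currently out of reach contribute little to the update.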

Training Data:

Cold-start: 168K curated CoT samples (General) + 3.3M Image-Text + 2.9M Text-only
RLVR: 100K General + 18.7K Spatial Image + 7.5K Spatial Video
Spatial data synthesized via depth estimation/3D point clouds from real images
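To show how depth estimates can yield a verifiable numeric label, the sketch below back-projects two pixels into 3D with a pinhole camera model and measures the distance between them. It illustrates the general idea only: the camera intrinsics, pixel coordinates, and depth values are hypothetical, and the paper's actual synthesis pipeline may differ.

```python
import math
from typing import Tuple

Point3D = Tuple[float, float, float]


def backproject(u: float, v: float, depth: float,
                fx: float, fy: float, cx: float, cy: float) -> Point3D:
    """Map pixel (u, v) with metric depth to camera-frame 3D coordinates."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def distance_3d(a: Point3D, b: Point3D) -> float:
    """Euclidean distance between two 3D points."""
    return math.dist(a, b)


# Hypothetical intrinsics and object centers, chosen only for illustration.
fx = fy = 500.0
cx, cy = 320.0, 240.0
object_a = backproject(320.0, 240.0, 2.0, fx, fy, cx, cy)
object_b = backproject(820.0, 240.0, 2.0, fx, fy, cx, cy)
print(distance_3d(object_a, object_b))
```

A value computed this way can serve as the ground truth x_gt that a continuous reward compares against.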

Key Hyperparameters:

sigma (advantage scaling): 7.2
gamma (EDNM scaling): 1
lambda (EDNM decay): 2
learning_rate: Not reported in the paper
batch_size: Not reported in the paper

Compute: Not reported in the paper

Comparison to Prior Work

vs. InternVL3: Adds dedicated RLVR stage with continuous rewards for spatial tasks; superior on specific spatial metrics (Room Size, Relation)
vs. Qwen2.5-VL: Significantly improved reasoning performance via CoT data curation and reinforcement learning
vs. DeepSeek-R1: Extends the RLVR paradigm to multimodal inputs (image/video) and specifically targets spatial perception, which text-only models lack

Limitations

Constrained reasoning depth compared to specialized text-only models (shorter reasoning chains).
Pathological repetition observed in some instances (redundant phrases/loops).
Suboptimal visual perception leading to occasional hallucinations of non-existent objects.
Requires verifiable ground truth for rewards, limiting application to open-ended tasks.

Reproducibility

Code: https://github.com/inclusionAI/M2-Reasoning

Model weights are available on HuggingFace. The data synthesis pipeline is described in detail (using WeThink-VL-7B for CoT and Qwen2.5-VL for filtering), and the specific hyperparameters for RLVR dynamic weighting are provided.

📊 Experiments & Results

Evaluation Setup

Evaluated on 8 benchmarks covering general multimodal reasoning (math, logic) and spatial reasoning (perception, physics).

Benchmarks:

MathVista (Mathematical reasoning in visual contexts)
CV-Bench (2D spatial reasoning: Relation, Depth, Distance)
VSI-Bench (Visual-spatial intelligence from video: Room Size, Appearance Order)

Metrics:

Accuracy (Top-1)
Average Score across subsets
Statistical methodology: Not explicitly reported in the paper

Key Results

General reasoning: M2-Reasoning-7B achieves the best average among comparably sized models, though it trails the baseline on LogicVista.

Benchmark	Metric	Baseline	This Paper	Δ
MathVista	Accuracy	70.5	75.0	+4.5
MathVision	Accuracy	30.0	31.5	+1.5
LogicVista	Accuracy	51.2	50.0	-1.2

Spatial reasoning: average scores improve slightly, while the complex estimation task Room Size shows a large gain.

Benchmark	Metric	Baseline	This Paper	Δ
CV-Bench	Average Score	82.0	82.3	+0.3
VSI-Bench	Room Size (RS)	33.6	55.4	+21.8
VSI-Bench	Average Score	42.1	42.3	+0.2

Main Takeaways

The proposed data pipeline and RLVR strategy successfully generalize to both abstract math tasks and concrete spatial tasks.
Continuous rewards (EDNM) appear important for training MLLMs on spatial estimation tasks such as Room Size, where the model gains 21.8 points over InternVL3-8B (55.4 vs. 33.6) and binary rewards would likely provide little signal.
Curriculum learning combined with dynamic advantage weighting stabilizes training, allowing the model to effectively absorb diverse multi-task data.
Despite using a smaller 7B base, the model outperforms or matches larger/stronger baselines (like InternVL3-8B) on several key benchmarks.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning with Verifiable Rewards (RLVR)
Proximal Policy Optimization (PPO) or GRPO variants
Chain-of-Thought (CoT) reasoning
Visual spatial primitives (depth, segmentation, normals)

Key Terms

RLVR: Reinforcement Learning with Verifiable Rewards—training models using RL where the final answer can be programmatically checked for correctness

GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes rewards within a group of outputs for the same prompt to estimate advantages without a critic model

EDNM: Exponential Decay Numeric Matching—a reward function introduced in this paper that assigns partial credit based on how close a numerical prediction is to the ground truth

Cold-start: The initial supervised fine-tuning (SFT) phase used to bootstrap the model's reasoning capabilities before applying reinforcement learning

CoT: Chain-of-Thought—a prompting technique where the model generates intermediate reasoning steps before the final answer