
M2-Reasoning: Empowering MLLMs with Unified General and Spatial Reasoning

Fudong Wang, Jiajia Liu, Jingdong Chen, Jun Zhou, Kaixiang Ji, Lixiang Ru, Qingpei Guo, Ruobing Zheng, Tianqi Li, Yi Yuan, Yifan Mao, Yuting Xiao, Ziping Ma
Inclusion AI, Ant Group
arXiv.org (2025)
MM Reasoning RL Benchmark

📝 Paper Summary

Multimodal Large Language Models (MLLMs) · Reinforcement Learning with Verifiable Rewards (RLVR) · Spatial Reasoning
M2-Reasoning improves multimodal models by combining a high-quality spatial data synthesis pipeline with a dynamic RLVR training strategy that uses curriculum learning and continuous rewards for spatial tasks.
Core Problem
While recent MLLMs perform well on general reasoning via RLVR, they struggle with dynamic spatial interactions (motion, orientation, distance) and lack high-quality, verifiable training data for these domains.
Why it matters:
  • Current models fail to reason about the dynamic interplay of space and motion, which is essential for real-world robotics and navigation tasks
  • Existing RLVR approaches typically use binary rewards (correct/incorrect), which fail to provide meaningful gradients for continuous spatial values like distance or size estimation
  • High-quality reasoning trajectories for visual spatial tasks are scarce compared to text-based math or logic data
Concrete Example: When asked 'What is the distance between A and B?', a standard MLLM might give a number that is wrong but close to the true value. A binary reward scores this the same as a wildly wrong answer, so the model gets no signal that it was nearly right. M2-Reasoning uses a continuous reward that grows as the estimate approaches the true value.
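The contrast in this example can be sketched in a few lines of Python. This is an illustration only: `binary_reward`, `decay_reward`, and the `rate` parameter are hypothetical names, and the exponential decay on relative error is an assumed form, not the exact reward defined in the paper.

```python
import math


def binary_reward(pred: float, target: float, tol: float = 1e-6) -> float:
    """Return 1.0 only for an (almost) exact match, otherwise 0.0."""
    return 1.0 if abs(pred - target) <= tol else 0.0


def decay_reward(pred: float, target: float, rate: float = 5.0) -> float:
    """Assumed continuous reward: exp(-rate * relative_error).

    The functional form and the default rate are illustrative assumptions.
    """
    if target == 0:
        rel_err = abs(pred)  # fall back to absolute error when target is zero
    else:
        rel_err = abs(pred - target) / abs(target)
    return math.exp(-rate * rel_err)


target = 2.0  # true distance in metres (made-up value)
for pred in (2.0, 2.1, 3.0):
    print(pred, binary_reward(pred, target), round(decay_reward(pred, target), 3))
```

Under the binary reward, the guesses 2.1 and 3.0 are indistinguishable; under the decaying reward, the closer guess scores strictly higher, which is the property the example relies on.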
Key Novelty
Unified General and Spatial Reasoning via Dynamic RLVR
  • Establishes a dual-domain data pipeline: generates rigorous CoT paths for general logic and synthesizes 3D spatial data (images/videos) with verifiable physical attributes (depth, size)
  • Employs a 'Step-wise Dynamic Optimization' strategy that sequences tasks by difficulty (curriculum) and dynamically weights samples during training based on their current learning value
  • Introduces Exponential Decay Numeric Matching (EDNM), a continuous reward function for spatial tasks that provides granular feedback for numerical estimations (e.g., distance) rather than binary success/failure
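The curriculum and sample-weighting ideas in the list above can be sketched as follows. Everything here is a hedged illustration: the `Sample` fields, the difficulty scores, and the weighting rule `4 * p * (1 - p)` are assumptions chosen for clarity, since this summary does not state how the paper measures difficulty or "learning value".

```python
from dataclasses import dataclass


@dataclass
class Sample:
    name: str
    difficulty: float  # hypothetical difficulty score, higher = harder
    pass_rate: float   # fraction of sampled rollouts currently answered correctly


def curriculum_order(samples):
    """Order samples from easy to hard by the assumed difficulty score."""
    return sorted(samples, key=lambda s: s.difficulty)


def learning_value_weight(pass_rate: float) -> float:
    """Assumed weighting rule: 4 * p * (1 - p).

    Peaks at p = 0.5 and drops to 0 when a sample is always solved or
    never solved, i.e. when it currently carries little training signal.
    """
    return 4.0 * pass_rate * (1.0 - pass_rate)


samples = [
    Sample("hard_geometry", difficulty=0.9, pass_rate=0.0),
    Sample("easy_counting", difficulty=0.1, pass_rate=1.0),
    Sample("mid_distance", difficulty=0.5, pass_rate=0.5),
]

ordered = curriculum_order(samples)
weights = {s.name: learning_value_weight(s.pass_rate) for s in ordered}

for s in ordered:
    print(s.name, weights[s.name])
```

In this toy setup the weights would be recomputed as pass rates change during training, so a sample that is too hard early on can regain weight once the model starts solving it some of the time.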
Evaluation Highlights
  • Achieves a state-of-the-art (SOTA) average score of 45.0 across 6 general reasoning benchmarks, outperforming InternVL3-8B (41.4) and WeThink-VL-7B (44.3)
  • Sets a new SOTA on CV-Bench (spatial reasoning) with an average of 82.3, narrowly ahead of InternVL3-8B (82.0) and well ahead of Qwen2.5-VL-7B (75.0)
  • Leads on fine-grained spatial tasks in VSI-Bench, scoring 55.4 on Room Size estimation compared to InternVL3-8B's 33.6
Breakthrough Assessment
8/10
Strong engineering contribution combining specialized data synthesis with tailored RLVR rewards. Effectively bridges the gap between abstract logical reasoning and concrete spatial perception in MLLMs.