Evaluation Setup
M2-Reasoning-7B is evaluated on 8 benchmarks covering general multimodal reasoning (math, logic) and spatial reasoning (perception, physics).
Benchmarks:
- MathVista (Mathematical reasoning in visual contexts)
- CV-Bench (2D spatial reasoning: Relation, Depth, Distance)
- VSI-Bench (Video spatial imagination: Room Size, Appearance Order)
Metrics:
- Accuracy (Top-1)
- Average Score across subsets
- Statistical methodology: Not explicitly reported in the paper
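As a rough illustration of the two metrics listed above, the sketch below computes top-1 accuracy per subset and then averages across subsets. The subset names, the toy predictions, and the choice of an unweighted mean are assumptions made for illustration only; the aggregation actually used by each benchmark may differ.

```python
from statistics import mean


def top1_accuracy(predictions, labels):
    """Percentage of items whose top prediction equals the reference label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot score an empty subset")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return 100.0 * correct / len(labels)


def average_across_subsets(subsets):
    """Per-subset accuracies plus their unweighted mean (assumed aggregation)."""
    per_subset = {
        name: top1_accuracy(preds, labels)
        for name, (preds, labels) in subsets.items()
    }
    return per_subset, mean(per_subset.values())


# Hypothetical toy data: subset name -> (predictions, reference labels).
toy_subsets = {
    "relation": (["A", "B", "C", "D"], ["A", "B", "C", "A"]),  # 3 of 4 correct
    "depth": (["A", "B"], ["A", "A"]),                         # 1 of 2 correct
}

per_subset, average = average_across_subsets(toy_subsets)
print(per_subset, average)
```

An unweighted mean gives every subset the same influence regardless of its size; weighting by item count would be the other common choice.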
Key Results
General reasoning performance shows M2-Reasoning-7B achieving state-of-the-art results among base-scale models.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| MathVista | Accuracy | 70.5 | 75.0 | +4.5 |
| MathVision | Accuracy | 30.0 | 31.5 | +1.5 |
| LogicVista | Accuracy | 51.2 | 50.0 | -1.2 |

Spatial reasoning results demonstrate significant gains, particularly in complex estimation tasks.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| CV-Bench | Average Score | 82.0 | 82.3 | +0.3 |
| VSI-Bench | Room Size (RS) | 33.6 | 55.4 | +21.8 |
| VSI-Bench | Average Score | 42.1 | 42.3 | +0.2 |
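The Δ column in the results above is the score in this paper minus the baseline score, in absolute points. The short sketch below recomputes it from the reported numbers and also shows the relative change, which is a different quantity. The function and variable names are illustrative.

```python
# (benchmark, metric, baseline, this_paper), copied from the results above.
ROWS = [
    ("MathVista", "Accuracy", 70.5, 75.0),
    ("MathVision", "Accuracy", 30.0, 31.5),
    ("LogicVista", "Accuracy", 51.2, 50.0),
    ("CV-Bench", "Average Score", 82.0, 82.3),
    ("VSI-Bench", "Room Size (RS)", 33.6, 55.4),
    ("VSI-Bench", "Average Score", 42.1, 42.3),
]


def delta_points(baseline, score):
    """Absolute difference in points, rounded to one decimal place."""
    return round(score - baseline, 1)


def relative_change_percent(baseline, score):
    """Change relative to the baseline, as a percentage."""
    return round(100.0 * (score - baseline) / baseline, 1)


deltas = {
    (name, metric): delta_points(base, ours)
    for name, metric, base, ours in ROWS
}

for (name, metric), delta in deltas.items():
    print(f"{name:<11} {metric:<15} {delta:+.1f}")
```

For the Room Size row, `delta_points(33.6, 55.4)` gives 21.8 points, while `relative_change_percent(33.6, 55.4)` gives roughly 64.9 percent, so the two should not be used interchangeably.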
Main Takeaways
- The proposed data pipeline and RLVR strategy successfully generalize to both abstract math tasks and concrete spatial tasks.
- Continuous rewards (EDNM) are critical for training MLLMs on spatial estimation tasks such as Room Size, where the score rises by 21.8 points (33.6 to 55.4) and where binary rewards likely fail.
- Curriculum learning combined with dynamic advantage weighting stabilizes training, allowing the model to effectively absorb diverse multi-task data.
- Despite using a smaller 7B base, the model matches or outperforms larger or stronger baselines such as InternVL3-8B on several key benchmarks.
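To make the point about continuous versus binary rewards concrete, the sketch below contrasts the two on a numeric estimation task. The EDNM formula is not given in this summary, so `exp_decay_reward` is a hypothetical stand-in (a reward that decays exponentially with relative error), not the paper's definition. The `scale` parameter and the room-size values are likewise assumptions.

```python
import math


def binary_reward(pred, target, tol=0.0):
    """1.0 only when the prediction is within `tol` of the target, else 0.0."""
    return 1.0 if abs(pred - target) <= tol else 0.0


def exp_decay_reward(pred, target, scale=1.0):
    """Hypothetical continuous reward: exp(-scale * relative error).

    Illustrative only; this is not the paper's EDNM formula.
    """
    if target == 0:
        raise ValueError("relative error is undefined for a zero target")
    relative_error = abs(pred - target) / abs(target)
    return math.exp(-scale * relative_error)


target_room_size = 20.0  # hypothetical ground truth, in square metres

for guess in (20.0, 22.0, 30.0, 60.0):
    print(
        guess,
        binary_reward(guess, target_room_size),
        round(exp_decay_reward(guess, target_room_size), 3),
    )
```

Under the binary reward every inexact guess scores zero, so a guess of 22 is indistinguishable from a guess of 60. The continuous variant ranks them, which gives a policy-gradient method a signal to improve on near misses.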