Evaluation Setup
Task: Multimodal reasoning across diverse domains (math, geometry, logic)
Benchmarks:
- MathVerse (Visual Math Reasoning)
- MathVista (Visual Math Reasoning)
- We-Math (Visual Math Reasoning)
- MMMU (Multi-discipline Multimodal Reasoning)
- CMMMU (Chinese Multi-discipline Multimodal Reasoning)
- CV-Bench (Computer Vision Perception)
- MMStar (Vision-indispensable Multimodal Evaluation)
- RealWorldQA (Real-world Question Answering)
Metrics:
- Accuracy
- Statistical methodology: Not explicitly reported in the paper
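Since accuracy is the only reported metric, a minimal sketch of how per-benchmark accuracy and an unweighted average could be computed is given here. The record layout, the exact-match rule, and the helper names `accuracy_by_benchmark` and `macro_average` are assumptions made for illustration; each benchmark typically ships its own answer-extraction and matching rules, and the toy records are not results from the paper.

```python
from collections import defaultdict


def accuracy_by_benchmark(records):
    """Compute accuracy per benchmark.

    records: iterable of (benchmark, prediction, reference) string tuples.
    This layout is hypothetical, chosen only for the example.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for benchmark, prediction, reference in records:
        total[benchmark] += 1
        # Assumption: case-insensitive exact match after trimming whitespace.
        if prediction.strip().lower() == reference.strip().lower():
            correct[benchmark] += 1
    return {name: correct[name] / total[name] for name in total}


def macro_average(per_benchmark):
    """Unweighted mean of the per-benchmark accuracies."""
    return sum(per_benchmark.values()) / len(per_benchmark)


# Toy records for illustration only (not data from the paper).
records = [
    ("MathVista", "B", "B"),
    ("MathVista", "12", "13"),
    ("MMMU", "a", "A"),
    ("MMMU", "C", "C"),
]
scores = accuracy_by_benchmark(records)
average = macro_average(scores)
```

A macro average weights every benchmark equally regardless of size; whether the paper averages this way is not stated in this summary.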
Main Takeaways
- Token perception analysis reveals that visual dependency is sparsely distributed: only a small fraction of tokens in a Chain-of-Thought actually rely on the image.
- Trajectory analysis shows that correct reasoning paths diverge in how much they rely on the image: only some are 'perception-driven', while others may reach the answer through shortcuts. Standard RL fails to distinguish the two.
- VPPO achieves substantial gains (+19.2% on 7B, +7.6% on 32B) by explicitly targeting the pivotal, visually dependent tokens and perception-heavy trajectories.
- The method scales effectively from 7B to 32B models, suggesting robustness across model sizes.
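The takeaways above rest on measuring how much each generated token depends on the image. One plausible way to make that concrete is sketched below: score each token by the KL divergence between the model's next-token distribution with the image and without it, keep the top-scoring fraction as pivotal tokens, and average the scores to get a trajectory-level value. The KL-based score, the `top_fraction` parameter, and all function names are assumptions for illustration and are not confirmed by this summary as the paper's exact formulation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one token's logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def token_visual_dependency(logits_with_image, logits_without_image):
    """Per-token score: how far the prediction shifts when the image is removed.

    Assumption: both inputs hold one logit vector per generated token.
    """
    return [
        kl_divergence(softmax(a), softmax(b))
        for a, b in zip(logits_with_image, logits_without_image)
    ]


def pivotal_token_mask(scores, top_fraction=0.25):
    """Mark the top `top_fraction` of tokens by score (hypothetical threshold)."""
    k = max(1, math.ceil(top_fraction * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:k])
    return [i in keep for i in range(len(scores))]


def trajectory_dependency(scores):
    """Mean token score, used here as a trajectory-level perception measure."""
    return sum(scores) / len(scores)


# Toy logits: 4 tokens over a 3-word vocabulary; only token 2 changes
# when the image is removed.
with_image = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [3.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
without_image = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 3.0], [0.0, 0.0, 1.0]]

dependency = token_visual_dependency(with_image, without_image)
mask = pivotal_token_mask(dependency, top_fraction=0.25)
trajectory_score = trajectory_dependency(dependency)
```

In this toy case only the third token is flagged, mirroring the sparse pattern described in the first takeaway. In an RL setup, a mask like this could restrict which tokens receive gradient and a trajectory score could reweight advantages, though how VPPO combines the two is not detailed here.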