Evaluation Setup
Multimodal reasoning across mathematical, logical, and counting tasks
Benchmarks:
- Geometry3K (Geometric Reasoning)
- MathVista (Visual Math Reasoning)
- MathVerse (Visual Math Reasoning; Vision-Centric subset used)
- MMMU-Pro (Multi-discipline Multimodal Reasoning)
- LogicVista (Logical Reasoning)
- SuperClevr Counting (Counting / Visual Perception)
- We-Math (Math Reasoning)
Metrics:
- Accuracy (Exact Match)
- Perception Error Rate (manual analysis)
- Statistical methodology: Not explicitly reported in the paper
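
The exact-match accuracy metric listed above can be sketched as follows. This is a minimal illustration, not the paper's evaluation script: the normalization step (trimming, lowercasing, collapsing whitespace) and the function names are assumptions, since the summary does not say how answers are canonicalized before comparison.

```python
def normalize(answer: str) -> str:
    """Assumed normalization: trim, lowercase, collapse internal whitespace."""
    return " ".join(answer.strip().lower().split())


def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that equal their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    correct = sum(
        normalize(pred) == normalize(ref)
        for pred, ref in zip(predictions, references)
    )
    return correct / len(predictions)


if __name__ == "__main__":
    # Hypothetical answers, for illustration only.
    preds = ["A", " 12 ", "x"]
    refs = ["a", "12", "y"]
    print(exact_match_accuracy(preds, refs))
```

Benchmarks with free-form numeric answers would likely need a stricter or more tolerant comparison than this string match.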
Key Results
PAPO consistently improves over GRPO and DAPO baselines across a suite of 8 multimodal benchmarks, with particularly strong gains on tasks requiring heavy visual interpretation.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| 8 Multimodal Benchmarks (Avg) | Relative Improvement (%) | 0.0 | 17.5 | +17.5 |
| Vision-Dependent Tasks | Relative Improvement (%) | 0.0 | 19.1 | +19.1 |
| Manual Error Analysis (200 cases) | Perception Error Reduction (%) | 0.0 | 30.5 | -30.5 |
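
The table reports relative figures, with the baseline as the 0.0 reference point. A minimal sketch of how such relative changes are usually computed is given below, assuming the standard definition `100 * (new - baseline) / baseline`; the summary does not state the exact formula, and the input scores in the example are hypothetical, not values from the paper.

```python
def relative_change(baseline: float, new: float) -> float:
    """Relative change of `new` over `baseline`, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero for a relative change")
    return 100.0 * (new - baseline) / baseline


if __name__ == "__main__":
    # Hypothetical accuracy scores: a higher value is an improvement.
    print(relative_change(40.0, 47.0))
    # Hypothetical error counts: a negative value is a reduction.
    print(relative_change(200.0, 139.0))
```

Under this definition an improvement in accuracy is positive and a reduction in error count is negative, which matches the signs in the Δ column.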
Main Takeaways
- PAPO pushes the model to rely on visual inputs: in a manual analysis of 200 cases, perception errors drop by 30.5% relative to GRPO.
- The method is robust and works as a drop-in replacement for both GRPO and DAPO, showing consistent improvements across diverse benchmarks.
- Gains scale with vision dependency: tasks that can be solved via text shortcuts improve less (4.4%) than vision-centric tasks (up to 19.1%).
- Double Entropy Loss is critical for training stability, preventing the unbounded KL maximization from collapsing the model.
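
The last takeaway can be made concrete with a small numeric sketch. It assumes that the perception term is a KL divergence between the model's output distribution given the original image and given a masked image, and that the double entropy loss penalizes the entropy of both distributions; this summary does not spell out either form. The function names and the coefficients `gamma`, `eta_full`, and `eta_masked` are hypothetical.

```python
import math


def kl_divergence(p: list[float], q: list[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def entropy(p: list[float], eps: float = 1e-12) -> float:
    """Shannon entropy of a discrete distribution, in nats."""
    return -sum(pi * math.log(pi + eps) for pi in p if pi > 0)


def perception_bonus(
    p_full: list[float],
    p_masked: list[float],
    gamma: float = 0.01,       # hypothetical weight on the KL term
    eta_full: float = 0.03,    # hypothetical entropy penalty, original image
    eta_masked: float = 0.03,  # hypothetical entropy penalty, masked image
) -> float:
    """Assumed extra term added to the policy objective (to be maximized).

    The KL term rewards outputs that change when the image is masked;
    the two entropy penalties keep either distribution from drifting
    toward high-entropy output just to inflate the KL.
    """
    return (
        gamma * kl_divergence(p_full, p_masked)
        - eta_full * entropy(p_full)
        - eta_masked * entropy(p_masked)
    )


if __name__ == "__main__":
    p_full = [0.7, 0.2, 0.1]         # token distribution with the original image
    p_masked_same = [0.7, 0.2, 0.1]  # model ignores the image
    p_masked_diff = [0.1, 0.2, 0.7]  # model depends on the image
    print(perception_bonus(p_full, p_masked_same))
    print(perception_bonus(p_full, p_masked_diff))
```

Without the entropy penalties, the KL term alone has no upper bound as the masked distribution assigns vanishing probability to tokens favored under the original image, which is one way to read the instability the takeaway describes.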