Unlocking Multimodal Mathematical Reasoning via Process Reward Model

📝 Paper Summary

Multimodal Large Language Models (MLLMs), Mathematical Reasoning, Reinforcement Learning

URSA is a framework that improves multimodal math reasoning by synthesizing large-scale process supervision data and using a novel Process-Supervised Group Relative Policy Optimization (PS-GRPO) to mitigate reward hacking and length bias.

Core Problem

Multimodal LLMs struggle with complex math due to data scarcity and the difficulty of applying Process Reward Models (PRMs) without causing reward hacking or length bias during Reinforcement Learning.

Why it matters:

Existing Test-Time Scaling and RL methods rely on strong foundation models and high-quality process labels, which are scarce for multimodal tasks
Naive application of scalar process rewards in RL leads to 'reward hacking' (optimizing for the reward signal rather than for correctness) and 'length bias' (the model shortens its reasoning chains to avoid step-level penalties)
Automated process labeling for vision-language tasks is unexplored compared to text-only math

Concrete Example: When using standard scalar process rewards in RL, the model learns that later steps in a reasoning chain are often penalized by the PRM (due to conservative labeling). Consequently, the model collapses to generating extremely short, heuristic-based answers to minimize the risk of negative rewards, harming reasoning depth.

Key Novelty

Unfolding multimodal pRocess-Supervision Aided (URSA) framework

Constructs MMathCoT-1M (reasoning data) and DualMath-1.1M (process data) using automated expansion, rewriting, and error injection strategies
Introduces PS-GRPO, an RL algorithm that discards unreliable scalar process rewards and instead uses 'drop-moments' (sudden drops in process quality) to penalize correct-outcome rollouts that contain reasoning flaws
Utilizes a dual-view data synthesis engine: Binary Error Locating for logical errors and Misinterpretation Insertion Engine for visual hallucinations
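
The Binary Error Locating idea can be illustrated with a small sketch. This is not the authors' implementation: it assumes a hypothetical oracle prefix_is_recoverable(k) that reports whether some sampled continuation from the first k steps still reaches the correct answer, and it assumes the property is monotone (once a prefix is unrecoverable, every longer prefix is too). Under those assumptions, a binary search finds the first erroneous step with a logarithmic number of oracle calls.

```python
from typing import Callable, List


def locate_first_error(num_steps: int,
                       prefix_is_recoverable: Callable[[int], bool]) -> int:
    """Return the 1-based index of the first erroneous step.

    prefix_is_recoverable is a hypothetical oracle (an assumption of this
    sketch). The empty prefix (k = 0) is assumed recoverable and the full
    chain (k = num_steps) is assumed unrecoverable, i.e. the final answer
    of the chain is wrong.
    """
    lo, hi = 0, num_steps  # invariant: prefix lo is recoverable, prefix hi is not
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if prefix_is_recoverable(mid):
            lo = mid
        else:
            hi = mid
    return hi


def step_labels(num_steps: int, first_error: int) -> List[int]:
    """Label steps before the first error as 1 (correct) and the rest as 0."""
    return [1 if i < first_error else 0 for i in range(1, num_steps + 1)]


# Toy oracle: the chain has 6 steps and goes wrong at step 4.
first_bad = locate_first_error(6, lambda k: k < 4)
labels = step_labels(6, first_bad)
```

In this toy run first_bad is 4 and labels is [1, 1, 1, 0, 0, 0]. A real oracle would have to sample continuations from the model, so its answers would be noisy and the monotonicity assumption would only hold approximately.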

Architecture

The three-stage URSA framework pipeline: (I) Data Curation & SFT, (II) Dual-View Process Data Synthesis & PRM Training, and (III) Process-Supervised GRPO (PS-GRPO).

Evaluation Highlights

URSA-8B-PS-GRPO outperforms GPT-4o by 2.7 percentage points on average across 6 multimodal math benchmarks (e.g., +20.6 points on MathVista-GPS)
Surpasses the open-source baseline Gemma3-12B by 8.4 percentage points on average despite being smaller (8B parameters)
URSA-8B-RM (the Process Reward Model) improves Test-Time Scaling, achieving a 16.6% relative improvement on MathVerse with only Best-of-4 sampling

Breakthrough Assessment

8/10

A significant contribution to applying PRMs in multimodal settings. The release of 1M+ multimodal CoT/process datasets, together with a robust RL method (PS-GRPO) that addresses length bias, makes this work highly impactful for the open-source community.

⚙️ Technical Details

Problem Definition

Setting: Multimodal mathematical reasoning where a model accepts image-text pairs and generates step-by-step solutions

Inputs: Image I and Question Q

Outputs: Reasoning chain S = {s_1, ..., s_N} and final answer A

Pipeline Flow

Input Processing: Image + Question
Generation: URSA-8B generates candidate solution steps
Verification (during TTS): URSA-8B-RM scores each step
Selection: Choose solution with highest aggregated process reward
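
The selection step above can be sketched as a Best-of-N routine. The summary does not say how step scores are aggregated into one solution score, so the aggregation function here is an assumption: the sketch defaults to the minimum step score and also shows the mean for comparison. The function name best_of_n and the toy scores are illustrative only.

```python
from statistics import fmean
from typing import Callable, Sequence


def best_of_n(candidate_step_scores: Sequence[Sequence[float]],
              aggregate: Callable[[Sequence[float]], float] = min) -> int:
    """Return the index of the candidate with the highest aggregated score.

    candidate_step_scores[i] holds the per-step PRM scores of candidate i.
    The aggregation rule is an assumption of this sketch, not taken from
    the paper.
    """
    return max(range(len(candidate_step_scores)),
               key=lambda i: aggregate(candidate_step_scores[i]))


# Toy per-step scores for three sampled solutions.
scores = [
    [0.90, 0.80, 0.20],  # strong start, one very weak step
    [0.70, 0.70, 0.60],  # uniformly moderate
    [0.95, 0.40, 0.90],  # high on average, one weak step
]
pick_by_min = best_of_n(scores, aggregate=min)
pick_by_mean = best_of_n(scores, aggregate=fmean)
```

With these toy scores, the minimum rule selects candidate 1 and the mean rule selects candidate 2, which shows that the choice of aggregation can change which solution is returned.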

System Modules

Vision Encoder

Extract visual features from mathematical images

Model or implementation: Hybrid: SAM-B + SigLIP-L

Foundation Model (URSA-8B)

Generate step-by-step reasoning and final answers

Model or implementation: Qwen2.5-Math-Instruct (LLM backbone)

Process Reward Model (URSA-8B-RM)

Score the validity of each reasoning step

Model or implementation: URSA-8B with classification head

Novel Architectural Elements

PS-GRPO Integration: The pipeline integrates PRM signals not as additive rewards but as conditional penalties (gamma) triggered by relative reward drops (rho) within the Group Relative Policy Optimization loop

Modeling

Base Model: Qwen2.5-Math-Instruct (LLM) + SigLIP-L/SAM-B (Vision)

Training Method: Process-Supervised Group Relative Policy Optimization (PS-GRPO)

Objective Functions:

Purpose: Optimize policy to maximize expected reward while staying close to reference model.

Formally: standard GRPO objective using advantage A_i.

Purpose: Calculate advantage by penalizing correct answers that have bad process steps.

Formally: Reward r_i = 1 if correct and no 'drop-moment', else 1 - gamma if correct but has drop-moment, else 0.
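
The reward rule above, followed by a group-relative advantage, can be sketched as follows. Several details are assumptions of this sketch rather than facts from the summary: a drop-moment is detected as a relative drop (prev - cur) / prev greater than rho between consecutive step scores, and advantages are normalized with the population standard deviation plus a small epsilon. The values gamma = 0.5 and rho = 0.3 are the ones listed under Key Hyperparameters.

```python
from statistics import fmean, pstdev
from typing import List, Sequence

GAMMA = 0.5  # penalty coefficient (from the summary)
RHO = 0.3    # drop threshold (from the summary)


def has_drop_moment(step_scores: Sequence[float], rho: float = RHO) -> bool:
    """Assumed rule: any consecutive relative drop larger than rho."""
    for prev, cur in zip(step_scores, step_scores[1:]):
        if prev > 0 and (prev - cur) / prev > rho:
            return True
    return False


def ps_grpo_reward(is_correct: bool, step_scores: Sequence[float],
                   gamma: float = GAMMA, rho: float = RHO) -> float:
    """1 if correct with no drop-moment, 1 - gamma if correct with one, else 0."""
    if not is_correct:
        return 0.0
    return 1.0 - gamma if has_drop_moment(step_scores, rho) else 1.0


def group_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of rollouts for the same prompt."""
    mean = fmean(rewards)
    std = pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Toy group of three rollouts: (final answer correct?, per-step PRM scores).
group = [
    (True, [0.90, 0.85, 0.80]),   # correct, smooth process
    (True, [0.90, 0.50, 0.60]),   # correct, but score drops sharply at step 2
    (False, [0.80, 0.30]),        # wrong final answer
]
rewards = [ps_grpo_reward(ok, s) for ok, s in group]
advantages = group_advantages(rewards)
```

Here rewards is [1.0, 0.5, 0.0], so the correct rollout with a flawed process receives a lower advantage than the clean one while still ranking above the incorrect rollout.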

Adaptation: Full parameter tuning (LLM backbone + Projector)

Training Data:

Stage I: MMathCoT-1M (1M samples) from open-source datasets via expansion/rewriting
Stage II: DualMath-1.1M (1.1M process samples) via Binary Error Locating and Misinterpretation Insertion

Key Hyperparameters:

gamma (penalty coefficient): 0.5
rho (drop threshold): 0.3
batch_size: Not explicitly reported in the paper
learning_rate: Not explicitly reported in the paper

Compute: Not reported in the paper

Comparison to Prior Work

vs. DeepSeek-R1: URSA integrates multimodal *process* supervision into GRPO via drop-penalties to fix length bias, whereas R1 focuses on pure outcome/formatting rewards
vs. AtomThink: URSA uses explicit PRM-guided RL and larger-scale synthesized process data (1.1M vs smaller scale)
vs. Math-Shepherd [not cited in paper]: Math-Shepherd uses step-by-step verification for text math; URSA extends this to multimodal with specific 'hallucination insertion' data synthesis

Limitations

A significant performance gap remains on the DynaMath benchmark compared to larger models
Reliance on a stronger model (Gemini-1.5-Flash) for data synthesis introduces a dependency on a proprietary API and its costs
Evaluation is limited to zero-shot settings; few-shot performance not explored
The method focuses on English-heavy math benchmarks

Reproducibility

Code: https://github.com/URSA-MATH

Code, data (MMathCoT-1M, DualMath-1.1M), and checkpoints (URSA-8B, URSA-8B-RM) are publicly available at https://github.com/URSA-MATH. Detailed prompts for data synthesis are in Appendix G. Training time and specific GPU resources are not explicitly reported.

📊 Experiments & Results

Evaluation Setup

Zero-shot inference on multimodal mathematical reasoning tasks

Benchmarks:

MathVerse (Visual Math Reasoning)
DYNAMATH (Dynamic Math Problems)
MathVista (General Visual Math)
GeoQA (Geometry QA)
MathVision (Visual Math)
WE-MATH (World Knowledge Math)

Metrics:

Accuracy
Statistical methodology: Not explicitly reported in the paper

Key Results

Main comparison against SOTA MLLMs, showing URSA-8B-PS-GRPO's superiority over comparably sized and even larger proprietary models:

Benchmark	Metric	Baseline	This Paper	Δ
Average (6 benchmarks)	Accuracy	51.1 (Gemma3-12B)	59.5	+8.4
Average (6 benchmarks)	Accuracy	56.8 (GPT-4o)	59.5	+2.7
MathVista-GPS	Accuracy	62.6 (GPT-4o)	83.2	+20.6

Ablation study demonstrating the effectiveness of the PS-GRPO algorithm compared to vanilla GRPO:

Benchmark	Metric	Baseline	This Paper	Δ
Average (6 benchmarks)	Accuracy improvement	3.1 (vanilla GRPO)	6.8 (PS-GRPO)	+3.7

Test-Time Scaling (TTS) performance using URSA-8B-RM as a verifier:

Benchmark	Metric	Baseline	This Paper	Δ
MathVerse	Accuracy (Best-of-4)	47.2	55.0	+7.8
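
As a quick arithmetic check, the Δ column in the rows above is an absolute difference in accuracy points, which differs from a relative improvement. The short sketch below recomputes both from the tabulated values; the dictionary keys are labels chosen for this example.

```python
# (baseline, this_paper) pairs copied from the Key Results rows.
rows = {
    "average_vs_51.1": (51.1, 59.5),
    "average_vs_56.8": (56.8, 59.5),
    "mathvista_gps": (62.6, 83.2),
    "mathverse_best_of_4": (47.2, 55.0),
}


def absolute_gain(baseline: float, ours: float) -> float:
    """Difference in accuracy points, rounded to one decimal place."""
    return round(ours - baseline, 1)


def relative_gain_pct(baseline: float, ours: float) -> float:
    """Improvement expressed as a percentage of the baseline accuracy."""
    return 100.0 * (ours - baseline) / baseline


gains = {name: absolute_gain(b, o) for name, (b, o) in rows.items()}
mathverse_relative = relative_gain_pct(*rows["mathverse_best_of_4"])
```

The absolute gains reproduce the Δ column (8.4, 2.7, 20.6, 7.8). The MathVerse relative gain computed from the rounded table values is roughly 16.5%, close to the 16.6% figure quoted in the Evaluation Highlights; the small difference is presumably due to rounding in the tabulated accuracies.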

Experiment Figures

Comparison of different Reward Modeling strategies in RL: Vanilla GRPO vs. Scalar Process Rewards (Variants 1 & 2).

Stability of PRM internal signals during online RL training.

Main Takeaways

PS-GRPO effectively mitigates reward hacking and length bias observed in scalar reward RL by using relative quality drops (drop-moments) instead of absolute scores
Constructing high-quality multimodal CoT data (Stage I) and process supervision data (Stage II) is critical for unlocking the potential of RL and TTS
Automated insertion of visual misinterpretations (MIE) allows the PRM to learn to detect grounding errors, a unique challenge in multimodal reasoning
URSA-8B-PS-GRPO achieves state-of-the-art results among 8B models and competes with proprietary models like GPT-4o on math tasks

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (PPO, GRPO)
Multimodal Large Language Models (architecture and training)
Process Reward Models (PRMs) and Monte Carlo Tree Search (MCTS)

Key Terms

PRM: Process Reward Model—a model trained to score the correctness of each intermediate step in a reasoning chain, rather than just the final answer

GRPO: Group Relative Policy Optimization—an RL algorithm that optimizes a policy by comparing a group of outputs generated for the same input, avoiding the need for a separate value function

PS-GRPO: Process-Supervised Group Relative Policy Optimization—the authors' proposed RL method that uses PRM 'drop-moments' to penalize inconsistent reasoning paths even if the final answer is correct

CoT: Chain-of-Thought—a prompting strategy where the model generates intermediate reasoning steps before the final answer

TTS: Test-Time Scaling—improving performance during inference by generating multiple solutions and selecting the best one (e.g., using a PRM)

Drop-moment: A specific step in a reasoning chain where the PRM's predicted correctness score drops significantly compared to the previous step, indicating a potential error

MCTS: Monte Carlo Tree Search—a search algorithm used here to estimate the correctness probability of reasoning steps by simulating multiple future outcomes

SFT: Supervised Fine-Tuning—training the model on labeled input-output pairs