MM-PRM: Enhancing Multimodal Mathematical Reasoning with Scalable Step-Level Supervision

📝 Paper Summary

Multimodal Large Language Models (MLLMs) Mathematical Reasoning

MM-PRM is an automated framework that uses Monte Carlo Tree Search to generate scalable step-level supervision for multimodal math problems, then trains a process reward model on that supervision to improve reasoning accuracy.

Core Problem

Multimodal Large Language Models (MLLMs) struggle with complex multi-step mathematical reasoning, often producing logically inconsistent steps or false positives where incorrect reasoning leads to the correct answer.

Why it matters:

Current Outcome Reward Models (ORMs) only evaluate final answers, failing to detect flawed intermediate logic or guide models through long reasoning chains
Existing Process Reward Models (PRMs) rely on expensive manual annotation or inefficient sampling methods, making them hard to scale for multimodal tasks
High false-positive rates in current models undermine interpretability and trustworthiness in educational and scientific applications

Concrete Example: In a geometry problem (Figure 2), a policy model correctly identifies parallel lines but then misapplies the Angle Bisector Theorem in Step 3. An outcome-based model would miss this error if the final number happened to be correct by chance, whereas MM-PRM assigns a low score (0.02) to that specific flawed step.

Key Novelty

Scalable Automated Process Supervision for Multimodal Math

Constructs a high-quality seed dataset (MM-K12) of 10,000 verifiable multimodal math problems to initialize the supervision pipeline
Adapts Monte Carlo Tree Search (MCTS) to multimodal contexts, efficiently exploring reasoning paths to generate over 700,000 step-level correctness labels without human annotation
Trains a dense Process Reward Model (PRM) using soft labels derived from MCTS value estimates, preserving uncertainty better than hard binary labels

Architecture

The three-stage framework: Policy Construction, Data Generation (MCTS), and PRM Training

Evaluation Highlights

+8.88 percentage-point accuracy improvement on the MM-K12 test set when applying MM-PRM to the base MM-Policy model
+10.10 percentage-point accuracy boost on OlympiadBench (OOD) when applied to InternVL2.5-8B, demonstrating strong generalization
Soft-label training outperforms hard-label training by 5.8 percentage points (42.8% vs 37.0% on MM-K12), since soft labels capture step-wise uncertainty

Breakthrough Assessment

8/10

Advances multimodal reasoning by addressing the data bottleneck for process supervision. The automated MCTS pipeline enables scalable PRM training without human step-level labels and shows strong generalization.

⚙️ Technical Details

Problem Definition

Setting: Multimodal mathematical reasoning with step-by-step verification

Inputs: Multimodal input q (image + text) and a candidate reasoning path x = [x1, x2, ..., xT]

Outputs: Scalar reward scores for each intermediate step indicating the probability of correctness

Pipeline Flow

Policy Model Construction (Fine-tuning MLLM on math data)
Process Supervision Generation (MCTS-based data annotation)
Process Reward Model Training (Training classifier on step-level data)

System Modules

MM-Policy

Generate high-quality, structured Chain-of-Thought reasoning traces

Model or implementation: InternVL2.5-8B fine-tuned on ~4M math samples

OmegaPRM Engine (Adapted)

Automatically annotate reasoning steps by performing MCTS rollouts to verify if steps lead to correct answers

Model or implementation: MCTS algorithm using MM-Policy
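
The annotation idea can be illustrated with a minimal sketch: the value of a partial solution is estimated as the fraction of sampled completions that reach the reference answer. The names `estimate_step_value` and `toy_rollout` are hypothetical, and the toy rollout only stands in for sampling from MM-Policy; the actual pipeline uses OmegaPRM's tree search rather than this flat loop.

```python
import random
from typing import Callable, List


def estimate_step_value(
    prefix: List[str],
    rollout: Callable[[List[str], random.Random], str],
    reference_answer: str,
    num_rollouts: int = 16,
    seed: int = 0,
) -> float:
    """Fraction of sampled completions of `prefix` that reach the reference answer."""
    rng = random.Random(seed)
    hits = sum(rollout(prefix, rng) == reference_answer for _ in range(num_rollouts))
    return hits / num_rollouts


def toy_rollout(prefix: List[str], rng: random.Random) -> str:
    # Hypothetical stand-in for the policy model: a prefix containing a
    # flawed step never recovers; a clean prefix succeeds 75% of the time.
    if any("WRONG" in step for step in prefix):
        return "0"
    return "42" if rng.random() < 0.75 else "0"


clean = estimate_step_value(["AB is parallel to CD"], toy_rollout, "42")
flawed = estimate_step_value(
    ["AB is parallel to CD", "WRONG: misapplied theorem"], toy_rollout, "42"
)
```

Here `clean` is a value between 0 and 1 and `flawed` is exactly 0.0, which is the kind of per-step estimate that can later serve as a soft label.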

MM-PRM

Predict the correctness probability of each reasoning step given the multimodal context and history

Model or implementation: InternVL2.5-8B (initialized from MM-Policy)

Novel Architectural Elements

Integration of visual context into the OmegaPRM MCTS-based annotation pipeline
Closed-loop framework where the policy model generates its own supervision data via search, which is then used to train a verifier (PRM)

Modeling

Base Model: InternVL2.5-8B

Training Method: Supervised Fine-Tuning (Policy) followed by Reward Modeling (PRM)

Objective Functions:

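Purpose: Minimize the binary cross-entropy between the predicted step probability and the MCTS soft label.

Formally: L_PRM = - sum_{i=1..T} [ y_hat_i * log(p_i) + (1 - y_hat_i) * log(1 - p_i) ], where p_i is the predicted correctness probability of step x_i and y_hat_i is its MCTS-derived soft label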
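
As a worked illustration of the objective above, the following sketch evaluates the soft-label cross-entropy in plain Python. The function name `soft_label_bce` and the clamping constant `eps` are assumptions made for this example; a real training loop would compute the same quantity on model logits in a deep learning framework.

```python
import math
from typing import Sequence


def soft_label_bce(
    probs: Sequence[float], soft_labels: Sequence[float], eps: float = 1e-12
) -> float:
    """Summed binary cross-entropy between predicted step probabilities and soft labels."""
    if len(probs) != len(soft_labels):
        raise ValueError("probs and soft_labels must have the same length")
    total = 0.0
    for p, y in zip(probs, soft_labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total


# A prediction that matches the soft label gives a lower loss than an
# overconfident one, which is how soft labels preserve uncertainty.
calibrated = soft_label_bce([0.83], [0.83])
overconfident = soft_label_bce([0.99], [0.83])
```

With a soft target of 0.83, `calibrated` is smaller than `overconfident`, so the model is not rewarded for pushing its prediction toward 1.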

Trainable Parameters: Language module updated; Vision encoder frozen

Training Data:

Policy Training: 5.1M samples from various datasets (R-CoT, NuminaMath, etc.)
PRM Training: 747,000 step-level annotations generated from 10k MM-K12 seed problems via MCTS

Key Hyperparameters:

learning_rate: 4e-6 (PRM training)
batch_size: 512 (PRM training)
epochs: 1 (PRM training)
policy_learning_rate: 4e-5
policy_batch_size: 128
mcts_temperature: 1.0
mcts_topk: 50
mcts_topp: 0.9
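
Assuming the three `mcts_*` values are ordinary sampling settings for the policy during rollouts, the sketch below shows how temperature, top-k, and top-p jointly shape a next-token distribution. The function `sampling_distribution` is hypothetical, and the order of operations (temperature, then top-k, then top-p on the renormalized remainder) is one common convention, not something the summary specifies.

```python
import math
from typing import Dict, Sequence


def sampling_distribution(
    logits: Sequence[float],
    temperature: float = 1.0,
    top_k: int = 50,
    top_p: float = 0.9,
) -> Dict[int, float]:
    """Return {token_index: probability} after temperature, top-k, then top-p filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]  # stable softmax numerators

    # Top-k: keep only the k highest-weight tokens.
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)[:top_k]
    mass = sum(weights[i] for i in ranked)

    # Top-p: keep the smallest prefix whose renormalized mass reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += weights[i] / mass
        if cumulative >= top_p:
            break

    kept_mass = sum(weights[i] for i in kept)
    return {i: weights[i] / kept_mass for i in kept}


logits = [math.log(p) for p in (0.5, 0.3, 0.15, 0.05)]
dist = sampling_distribution(logits, temperature=1.0, top_k=3, top_p=0.7)
```

In this toy case only the two most likely tokens survive, and their probabilities are renormalized to sum to 1.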

Comparison to Prior Work

vs. OmegaPRM: Extends the MCTS supervision framework to multimodal contexts (handling images)
vs. MathShepherd: Uses efficient hierarchical search/binary search to find errors rather than simple rollout averaging
vs. PRM800k: Fully automated data generation requiring no human annotators beyond the seed dataset collection
vs. URSA: Focuses on scalable pipeline and training dynamics (soft labels/learning rate) rather than RL integration [not cited in paper]
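
The binary-search idea mentioned in the MathShepherd comparison can be sketched as follows. This is a simplified illustration, not the OmegaPRM implementation: `first_error_step` and `toy_value` are hypothetical names, the solution is assumed to be wrong overall, and the prefix value is assumed to be monotone (once a prefix can no longer reach the answer, no longer prefix can).

```python
from typing import Callable, List


def first_error_step(
    steps: List[str], prefix_value: Callable[[List[str]], float]
) -> int:
    """Return the 0-based index of the first step after which the prefix value is 0."""
    if not steps:
        raise ValueError("steps must be non-empty")
    lo, hi = 0, len(steps) - 1  # the first bad step lies in [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix_value(steps[: mid + 1]) > 0:
            lo = mid + 1  # prefix through `mid` can still reach the answer
        else:
            hi = mid  # the error occurs at or before `mid`
    return lo


calls = []


def toy_value(prefix: List[str]) -> float:
    # Hypothetical stand-in for a rollout-based estimate of the prefix.
    calls.append(len(prefix))
    return 0.0 if "BAD" in prefix else 0.5


error_index = first_error_step(["s1", "s2", "BAD", "s4", "s5"], toy_value)
```

For this five-step solution the search evaluates only two prefixes before locating the flawed step, whereas checking every prefix in order would take up to five evaluations.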

Limitations

PRM performance is bounded by the diversity of candidate paths generated by the policy model (the PRM can only select among generated paths, not repair generation errors)
Depends on the quality of the verifier/solver used to check final answers during MCTS rollouts
Computational cost of MCTS generation is higher than simple sampling
Training data is restricted to K-12-level math (though the PRM generalizes to Olympiad-level problems)

Reproducibility

Code: https://github.com/ModalMinds/MM-PRM

📊 Experiments & Results

Evaluation Setup

Best-of-N (BoN) inference where N=16 candidates are generated and reranked by the PRM
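
A minimal sketch of PRM-based reranking is shown below. The function `best_of_n` is hypothetical, and the way step scores are collapsed into one path score (here `min` by default) is an assumption; the summary does not say which aggregation the authors use.

```python
from typing import Callable, List, Sequence, Tuple


def best_of_n(
    candidates: Sequence[Tuple[str, List[float]]],
    aggregate: Callable[[List[float]], float] = min,
) -> str:
    """Pick the answer whose step scores have the highest aggregated value.

    `candidates` holds (final_answer, per_step_prm_scores) pairs.
    """
    if not candidates:
        raise ValueError("candidates must be non-empty")
    best_answer, _ = max(candidates, key=lambda c: aggregate(c[1]))
    return best_answer


# Two toy candidates: the first contains one step the PRM scores at 0.02.
candidates = [
    ("12", [0.95, 0.90, 0.02, 0.88]),
    ("15", [0.91, 0.85, 0.80, 0.77]),
]
chosen = best_of_n(candidates)
```

With `min` aggregation a single low-scoring step is enough to reject a path, so `chosen` is the second candidate's answer.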

Benchmarks:

MM-K12 (K-12 Math (In-domain)) [New]
OlympiadBench (Competition Math (Out-of-domain))
MathVista (Visual Math QA)
MathVerse (Visual Math (Diagrams))
MathVision (Abstract Visual Reasoning)

Metrics:

Answer Accuracy
Statistical methodology: Not explicitly reported in the paper

Key Results

Application of MM-PRM to the base policy model (MM-Policy) shows consistent improvements across all benchmarks.
Benchmark	Metric	Baseline	This Paper	Δ
MM-K12	Accuracy (%)	33.92	42.80	+8.88
OlympiadBench	Accuracy (%)	15.41	24.00	+8.59
MathVista	Accuracy (%)	62.93	67.60	+4.67

MM-PRM generalizes to other model sizes (InternVL2.5 series) not used in PRM training.
Benchmark	Metric	Baseline	This Paper	Δ
MM-K12	Accuracy (%)	27.01	37.80	+10.79
OlympiadBench	Accuracy (%)	30.98	34.67	+3.69

Ablation study on labeling strategy confirms the superiority of soft labels over hard binary thresholds.
Benchmark	Metric	Hard Labels	Soft Labels	Δ
MM-K12	Accuracy (%)	37.0	42.8	+5.8

Experiment Figures

Impact of candidate path count (N) and Learning Rate on performance

Qualitative example of PRM scoring a geometry problem

Main Takeaways

Process supervision via MM-PRM consistently outperforms outcome-based selection across in-domain and out-of-domain benchmarks
The PRM trained on K-12 data generalizes effectively to much harder problems (OlympiadBench) and larger models (78B), suggesting the learned verification logic is robust
Soft labels are critical for stable PRM training, outperforming hard binary labels by preserving step-wise uncertainty
Lower learning rates (4e-6) are essential for PRM fine-tuning to prevent degradation of pretrained knowledge

📚 Prerequisite Knowledge

Prerequisites

Multimodal Large Language Models (MLLMs)
Reinforcement Learning from Human Feedback (RLHF)
Monte Carlo Tree Search (MCTS)

Key Terms

PRM: Process Reward Model—a model that evaluates the correctness of each intermediate step in a reasoning chain rather than just the final answer

ORM: Outcome Reward Model—a model that provides a single scalar reward based only on the correctness of the final answer

MCTS: Monte Carlo Tree Search—a search algorithm that builds a decision tree by simulating future outcomes (rollouts) to estimate the value of current states

CoT: Chain-of-Thought—a prompting strategy where models generate intermediate reasoning steps before the final answer

BoN: Best-of-N—an inference strategy where the model generates N candidate solutions and a reward model selects the best one

Soft label: Training targets that use continuous probabilities (e.g., 0.83) derived from empirical success rates rather than binary 0/1 values

Policy Model: The generative model (LLM/MLLM) used to produce the reasoning paths that are subsequently scored or trained upon