Athena: Enhancing Multimodal Reasoning with Data-efficient Process Reward Models

📝 Paper Summary

Multimodal Reasoning, Process Reward Models (PRMs), Test-Time Scaling

Athena trains effective Process Reward Models using only ~5,000 samples by filtering automated labels through consistency checks between weak and strong completer models, drastically reducing data requirements.

Core Problem

Training Process Reward Models (PRMs) requires step-level labels. These are expensive to annotate manually, and automatic estimation via Monte Carlo sampling is both noisy and computationally prohibitive.

Why it matters:

High-quality step-level feedback is crucial for complex multi-step reasoning in math and visual tasks, where Outcome Reward Models (ORMs) provide insufficient signal
Existing automated labeling methods (like Math-Shepherd) require hundreds of thousands of samples and massive compute to estimate labels via thousands of rollouts
Noisy labels from automated methods degrade reward model performance, as weak models may fail on correct steps and strong models may recover from incorrect ones

Concrete Example: A weak completer (e.g., 7B model) might fail to solve a problem even starting from a correct intermediate step, falsely labeling that step as 'incorrect'. Conversely, a strong completer (e.g., 72B) might recover from a subtle error in an intermediate step and solve the problem, falsely labeling the error as 'correct'. Standard Monte Carlo methods average these biases, creating noisy training data.

Key Novelty

Consistency-Filtered Process Labeling

Uses two distinct models for Monte Carlo estimation: a 'weak' completer and a 'strong' completer
Retains only those reasoning steps where both completers agree on the outcome (correct vs. incorrect), filtering out ambiguous or bias-prone labels
Initializes the fine-grained Process Reward Model (PRM) from a coarse-grained Outcome Reward Model (ORM) to leverage large-scale solution-level supervision before fine-tuning on steps
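
The agreement filter described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes each completer has already produced one hard 0/1 label per step, and the function name `filter_by_consistency` and the use of `None` for dropped steps are hypothetical choices.

```python
from typing import List, Optional


def filter_by_consistency(
    weak_labels: List[int], strong_labels: List[int]
) -> List[Optional[int]]:
    """Keep a step label only when the weak and strong completers agree.

    Disagreeing steps are marked None so they can be excluded from training.
    """
    if len(weak_labels) != len(strong_labels):
        raise ValueError("both completers must label the same steps")
    return [w if w == s else None for w, s in zip(weak_labels, strong_labels)]


# Toy labels (made up): the weak completer fails from step 2 onward,
# while the strong completer still recovers from step 2.
weak = [1, 0, 0, 0]
strong = [1, 1, 0, 0]
print(filter_by_consistency(weak, strong))  # [1, None, 0, 0]
```

In this toy case only step 2 is discarded, because it is the one step on which the two completers disagree.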

Evaluation Highlights

+10.2 points accuracy improvement on WeMath benchmark using Qwen2.5-VL-7B as the policy model with Athena-PRM verification
Achieves State-of-the-Art on VisualProcessBench with 83.1 F1 score, outperforming the previous best open-source model (VisualPRM-8B) by 3.9 points
Reduces computational costs significantly: requires only 1/45th of the GPU hours for data synthesis and 1/60th for training compared to vanilla Monte Carlo estimation baselines

Breakthrough Assessment

7/10

Significant for its extreme data efficiency (5K vs 300K samples) and practical methodology for training PRMs without human annotation. While the architecture is standard, the data curation strategy addresses a major bottleneck in reasoning research.

⚙️ Technical Details

Problem Definition

Setting: Multimodal reasoning where a solution consists of a sequence of steps a = (a_1, ..., a_K)

Inputs: Multimodal query x (image + text) and a generated reasoning step a_i

Outputs: Scalar reward score representing the probability that step a_i leads to a correct final answer

Pipeline Flow

Policy Model (generates N solutions)
Athena-PRM (scores each step of each solution)
Selection Strategy (selects solution with highest minimum step reward)
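
The selection step can be illustrated as follows. This is a sketch under stated assumptions: the step rewards are taken as given (in the pipeline they would come from Athena-PRM), and the helper names `solution_score` and `select_best` are hypothetical.

```python
from typing import List, Sequence


def solution_score(step_rewards: Sequence[float]) -> float:
    """Score a solution by its weakest step (minimum step reward)."""
    if not step_rewards:
        raise ValueError("a solution needs at least one scored step")
    return min(step_rewards)


def select_best(candidates: List[str], rewards: List[Sequence[float]]) -> str:
    """Return the candidate whose minimum step reward is highest."""
    if len(candidates) != len(rewards):
        raise ValueError("one reward sequence per candidate is required")
    scores = [solution_score(r) for r in rewards]
    best_index = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best_index]


# Toy rewards (made up): the minimum step rewards are 0.2, 0.7 and 0.6.
candidates = ["solution A", "solution B", "solution C"]
rewards = [[0.9, 0.2, 0.95], [0.8, 0.7, 0.75], [0.99, 0.6]]
print(select_best(candidates, rewards))  # solution B
```

Taking the minimum penalizes a solution for a single weak step, so a chain with one likely error loses to a chain that is uniformly plausible.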

System Modules

Policy Model

Generate candidate solutions step-by-step

Model or implementation: Qwen2.5-VL-7B (or similar)

Athena-PRM

Assign a correctness probability to each reasoning step

Model or implementation: Qwen2.5-VL-7B trained as a classifier

Modeling

Base Model: Qwen2.5-VL-7B-Instruct (and other sizes for ablation)

Training Method: Supervised Fine-Tuning (for PRM classification)

Objective Functions:

Purpose: Train the PRM to predict step correctness.

Formally: Cross-entropy loss over binary labels (correct/incorrect) for each step in the reasoning path.
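
Written out, a per-step binary cross-entropy of this kind would take the form below. The notation is an assumption made for illustration: y_i in {0, 1} is the label of step a_i, and p_θ(x, a_{1:i}) is the PRM's predicted probability that step i is correct given the query and the steps so far. Details such as normalization, or the masking of steps removed by the consistency filter, are not specified in this summary.

```latex
\mathcal{L}(\theta) = -\sum_{i=1}^{K} \Big[\, y_i \log p_\theta(x, a_{1:i}) + (1 - y_i) \log\big(1 - p_\theta(x, a_{1:i})\big) \Big]
```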

Training Data:

Athena-600K (ORM data): 600K diverse queries from MathVista, GeoQA, GSM8K, etc., with binary outcome labels.
Athena-5K (PRM data): ~5,000 samples filtered via consistency check between Weak Completer (Qwen2.5-VL-3B) and Strong Completer (Qwen2.5-VL-72B).

Key Hyperparameters:

monte_carlo_samples_T: 8
completer_weak: Qwen2.5-VL-3B
completer_strong: Qwen2.5-VL-72B
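
To make the role of T concrete, the sketch below estimates a step label from T = 8 rollouts. It is an illustration under assumptions: the `Completer` signature and the toy completer are hypothetical, and the hard-label rule (a step counts as correct if at least one rollout reaches the reference answer) follows the Math-Shepherd-style convention rather than anything stated in this summary.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical interface: given the question and the steps so far,
# sample one completion and return its final answer.
Completer = Callable[[str, List[str]], str]


def mc_step_estimate(
    completer: Completer,
    question: str,
    prefix: List[str],
    gold_answer: str,
    num_rollouts: int = 8,
) -> Tuple[float, int]:
    """Return (soft, hard) Monte Carlo estimates for the last step in prefix."""
    hits = sum(
        completer(question, prefix) == gold_answer for _ in range(num_rollouts)
    )
    soft = hits / num_rollouts  # fraction of rollouts reaching the answer
    hard = int(hits > 0)        # assumed rule: correct if any rollout succeeds
    return soft, hard


def toy_completer(success_rate: float, rng: random.Random) -> Completer:
    """Stand-in for a real model: answers correctly with a fixed probability."""
    def complete(question: str, prefix: List[str]) -> str:
        return "42" if rng.random() < success_rate else "wrong"
    return complete


rng = random.Random(0)
soft, hard = mc_step_estimate(toy_completer(0.25, rng), "toy question", ["step 1"], "42")
print(soft, hard)
```

Running the same function once with the weak completer and once with the strong completer yields the two label sets that the consistency check compares.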

Compute: Data synthesis: 1/45th of vanilla MC GPU hours. Training: 1/60th of vanilla MC GPU hours.

Comparison to Prior Work

vs. Math-Shepherd: Uses dual-completer consistency (weak+strong) to filter labels and reduces data size from ~300K to ~5K
vs. VisualPRM: Achieves higher performance (83.1 vs 79.2 F1) with significantly less training data
vs. ORM-only [standard baseline]: Demonstrates that ORM initialization followed by PRM fine-tuning is superior to training PRM from scratch or using ORM alone

Limitations

Heavy reliance on the availability of both a weak and a strong model in the same family for consistency checking
Hard estimation ignores nuance (a step is strictly 0 or 1), unlike soft estimation which uses probabilities
Evaluation focuses primarily on math and visual reasoning; applicability to other domains (e.g., creative writing) is untested

Reproducibility

No code release is indicated. The source datasets are public (MathVista, GSM8K, etc.), but the 'Athena-5K' subset itself is generated via the described method. The MC sampling hyperparameter (T=8) and the completer model sizes are specified.

📊 Experiments & Results

Evaluation Setup

Best-of-N verification (selecting best solution from N=64 candidates) and Direct Step Evaluation

Benchmarks:

WeMath (Multimodal Math Reasoning)
MathVista (Visual Math QA)
VisualProcessBench (Step-level Correctness Judgment)
MATH (Text-only Math Reasoning)

Metrics:

Accuracy (Pass@1, Best-of-N)
F1 Score (for step classification)
Statistical methodology: Not explicitly reported in the paper
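
For reference, a step-level F1 score can be computed as below. This is a generic sketch: it computes binary F1 for one chosen positive class, and whether the benchmark averages over classes or samples is not stated in this summary, so the averaging here is an assumption.

```python
from typing import Sequence


def f1_score(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """Binary F1 for the given positive class; returns 0.0 if there are no true positives."""
    if len(y_true) != len(y_pred):
        raise ValueError("label sequences must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy step judgments (made up): 2 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
print(round(f1_score(y_true, y_pred), 3))  # 0.667
```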

Key Results

Test-time scaling results demonstrate Athena-PRM's ability to select correct solutions from generated candidates, significantly improving over the base policy model.

Benchmark	Metric	Baseline	This Paper	Δ
WeMath	Accuracy (Best-of-64)	40.5	50.7	+10.2
MathVista	Accuracy (Best-of-64)	57.6	64.7	+7.1
MATH	Accuracy (Best-of-64)	39.5	48.4	+8.9

Direct evaluation of step correctness shows Athena-PRM outperforms existing judges and PRMs.

Benchmark	Metric	Baseline	This Paper	Δ
VisualProcessBench	F1 Score	79.2	83.1	+3.9

Data efficiency comparison shows Athena-PRM (5K samples) outperforms vanilla MC methods (300K samples).

Benchmark	Metric	Baseline	This Paper	Δ
MATH	Accuracy	47.2	48.4	+1.2

Experiment Figures

Illustration of why vanilla Monte Carlo estimation is noisy. It shows a 'Strong Completer' finding the correct answer despite an incorrect intermediate step (False Positive label), while a 'Weak Completer' might fail even with correct steps.

Main Takeaways

High-quality data is far more important than quantity for PRMs: 5K filtered samples outperform 300K noisy samples.
Requiring agreement between weak and strong completers filters out the labels most affected by completer bias in Monte Carlo estimation.
Initializing Process Reward Models from Outcome Reward Models (ORM) provides a strong 'pre-training' foundation, treating outcome supervision as coarse-grained process supervision.
Up-sampling negative steps handles the inherent label imbalance in reasoning traces (where correct steps usually outnumber incorrect ones), improving discriminator performance.
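
The negative up-sampling mentioned in the last takeaway could look like the sketch below. The duplication factor, the `(step_text, label)` tuple layout, and the function name are all hypothetical; the summary does not state the ratio or the mechanism the authors used.

```python
import random
from typing import List, Tuple

Example = Tuple[str, int]  # (step text, label) with label 0 = incorrect step


def upsample_negatives(
    examples: List[Example], factor: int = 2, seed: int = 0
) -> List[Example]:
    """Repeat every negative (label 0) example so it appears `factor` times."""
    if factor < 1:
        raise ValueError("factor must be at least 1")
    negatives = [ex for ex in examples if ex[1] == 0]
    augmented = list(examples) + negatives * (factor - 1)
    random.Random(seed).shuffle(augmented)  # avoid clustering the duplicates
    return augmented


# Toy data (made up): three correct steps and one incorrect step.
data = [("s1", 1), ("s2", 1), ("s3", 1), ("s4", 0)]
balanced = upsample_negatives(data, factor=3)
print(sum(label == 0 for _, label in balanced), "negatives of", len(balanced))
```

With `factor=3` the single negative step appears three times, giving a 3:3 split instead of 3:1.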

📚 Prerequisite Knowledge

Prerequisites

Language Model Reasoning (Chain-of-Thought)
Reinforcement Learning from Human Feedback (RLHF)
Monte Carlo Estimation

Key Terms

PRM: Process Reward Model—a model that scores each intermediate step of a solution rather than just the final answer

ORM: Outcome Reward Model—a model that scores the entire solution based only on the final answer correctness

Completer: A language model used during data generation to finish a partial reasoning path to estimate if the current step can lead to a correct solution

Monte Carlo (MC) Estimation: A method to estimate the 'correctness' of a reasoning step by rolling out many possible completions and checking how many reach the correct answer

Test-Time Scaling (TTS): Improving performance during inference by generating multiple solutions and using a reward model to select the best one (Best-of-N)

Best-of-N: An evaluation strategy where N solutions are generated, and the one with the highest reward score is selected as the final answer