DreamPRM: Domain-Reweighted Process Reward Model for Multimodal Reasoning

📝 Paper Summary

Multimodal Reasoning Process Reward Models (PRM)

DreamPRM improves multimodal reasoning by training a Process Reward Model using bi-level optimization to automatically learn domain weights that prioritize high-quality datasets and filter out noise.

Core Problem

Training effective Process Reward Models (PRMs) for multimodal tasks is hampered by severe quality imbalances across datasets, where noisy or trivial samples degrade the model's ability to generalize.

Why it matters:

Multimodal inputs create a massive distribution shift from training to testing, making generalization far harder than in text-only settings.
Existing datasets contain many 'easy' or noisy samples (e.g., unnecessary modalities) that contribute little to learning but consume training budget.
Naive combinations of datasets fail because high-quality signals get drowned out by low-quality ones, leading to unreliable reward models.

Concrete Example: In Figure 1, some datasets contain questions of negligible difficulty or images that are unnecessary for answering and pose no reasoning challenge. A standard PRM weights these the same as complex geometry problems, so it learns trivial correlations instead of robust verification logic.

Key Novelty

Bi-Level Optimization for Domain Reweighting

Treats dataset importance weights as learnable parameters in an upper-level optimization loop, while the PRM parameters are updated in a lower-level loop.
Uses a novel 'aggregation function loss' on a high-quality meta-dataset (validation set) to guide the learning of domain weights, ensuring the PRM prioritizes data that improves final answer selection.
Dynamically down-weights noisy or easy domains during training without requiring manual filtering rules.

Architecture

Overview of the DreamPRM training framework showing the bi-level optimization process.

Evaluation Highlights

Reaches 85.2% accuracy on MathVista using o4-mini, ranking first on the leaderboard and surpassing state-of-the-art models like GPT-4V and Gemini-1.5-Pro.
Consistently improves base model (InternVL-2.5-8B-MPO) performance by ~4% on average across five multimodal reasoning benchmarks compared to vanilla PRM.
Outperforms heuristic data selection strategies (s1-PRM, CaR-PRM) by 1-2%, suggesting that learned domain weights are more effective than manual selection rules.

Breakthrough Assessment

8/10

Strong empirical results on major leaderboards (MathVista) and a theoretically sound application of bi-level optimization to the specific problem of data quality in multimodal PRMs.

⚙️ Technical Details

Problem Definition

Setting: Multimodal reasoning where a model generates a step-by-step solution y given image I and text t.

Inputs: Multimodal pair x = (t, I) consisting of textual instruction t and visual input I.

Outputs: A scalar reward score for each intermediate reasoning step, aggregated to select the best final answer.

Pipeline Flow

Step 1: MLLM generates N candidate Chain-of-Thought solutions for a given input (t, I).
Step 2: DreamPRM scores each step of every candidate solution.
Step 3: Aggregation Function combines step scores into a trajectory score.
Step 4: Select candidate with highest trajectory score as final answer.
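
The four steps above can be sketched as a small Best-of-N selection routine. This is a minimal illustration, not the authors' implementation: `score_steps` is a hypothetical stand-in for the PRM, candidates are assumed to be non-empty lists of step strings, and the mean is used only as a placeholder aggregation function.

```python
from typing import Callable, List, Sequence, Tuple


def mean_score(step_scores: Sequence[float]) -> float:
    """Placeholder aggregation (an assumption): average of the step scores."""
    return sum(step_scores) / len(step_scores)


def select_best(
    candidates: Sequence[Sequence[str]],
    score_steps: Callable[[Sequence[str]], List[float]],
    aggregate: Callable[[Sequence[float]], float] = mean_score,
) -> Tuple[int, List[float]]:
    """Return the index of the highest-scoring candidate and all trajectory scores.

    candidates  : N candidate solutions, each a non-empty list of reasoning steps (Step 1)
    score_steps : hypothetical PRM wrapper returning one score per step (Step 2)
    aggregate   : combines step scores into a trajectory score (Step 3)
    """
    trajectory_scores = [aggregate(score_steps(c)) for c in candidates]
    # Step 4: pick the candidate with the highest trajectory score.
    best_index = max(range(len(candidates)), key=trajectory_scores.__getitem__)
    return best_index, trajectory_scores


# Toy usage with a lookup table in place of a real PRM.
toy_step_scores = {"a": 0.9, "b": 0.2, "c": 0.8, "d": 0.7, "e": 0.9}
toy_candidates = [["a", "b"], ["c", "d", "e"]]
best, scores = select_best(
    toy_candidates, lambda steps: [toy_step_scores[s] for s in steps]
)
```

Passing the aggregation function as a parameter keeps the selection logic independent of how step scores are combined.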

System Modules

Base Generator (MLLM)

Generate candidate reasoning paths.

Model or implementation: InternVL-2.5-8B-MPO (also tested with GPT-4o-mini, o4-mini)

DreamPRM (Verifier)

Assign correctness probabilities to each step of the reasoning chains.

Model or implementation: Qwen2-VL-2B-Instruct (fine-tuned)

Novel Architectural Elements

Bi-level optimization loop where the upper level optimizes domain weights using a meta-set aggregation loss, and the lower level optimizes PRM parameters.
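
To make the two nested loops concrete, here is a toy sketch of bi-level domain reweighting in plain Python. It is deliberately far simpler than the paper's setup and rests on several assumptions: the "model" is a single scalar `theta`, each domain is summarized by one target value, domain weights are parameterized as `sigmoid(alpha)`, and the upper-level gradient is derived by hand through a single unrolled inner step (the paper uses 5 unroll steps and the Betty library).

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_toy(
    domain_targets: Sequence[float],
    meta_target: float,
    theta: float = 3.0,
    inner_lr: float = 0.1,   # illustrative value, not from the paper
    outer_lr: float = 0.5,   # illustrative value, not from the paper
    iterations: int = 200,
) -> Tuple[float, List[float]]:
    """Alternate lower-level (theta) and upper-level (alpha) gradient updates."""
    alpha = [0.0] * len(domain_targets)
    for _ in range(iterations):
        w = [sigmoid(a) for a in alpha]
        # Lower level: one gradient step on sum_k w_k * (theta - m_k)^2.
        grad_theta = sum(2.0 * wk * (theta - mk) for wk, mk in zip(w, domain_targets))
        theta_new = theta - inner_lr * grad_theta
        # Upper level: meta loss (theta_new - meta_target)^2, differentiated
        # through the inner step with respect to each alpha_k (chain rule).
        dmeta_dtheta = 2.0 * (theta_new - meta_target)
        for k, (wk, mk) in enumerate(zip(w, domain_targets)):
            dtheta_dalpha = -inner_lr * 2.0 * (theta - mk) * wk * (1.0 - wk)
            alpha[k] -= outer_lr * dmeta_dtheta * dtheta_dalpha
        theta = theta_new
    return theta, [sigmoid(a) for a in alpha]


# Domain 0 agrees with the meta set; domain 1 is "noisy" and pulls theta away.
final_theta, final_weights = train_toy(domain_targets=[1.0, 5.0], meta_target=1.0)
```

In this toy setting the domain that agrees with the meta target ends up with the larger weight, which is the behavior the bi-level formulation is meant to produce.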

Modeling

Base Model: Qwen2-VL-2B-Instruct (for the PRM)

Training Method: Bi-level Optimization (BLO) with Monte Carlo signal estimation

Objective Functions:

Purpose: Train PRM parameters (lower level).

Formally: Weighted MSE loss between predicted scores and Monte Carlo value estimates, weighted by domain importance alpha.

Purpose: Train domain weights (upper level).

Formally: MSE loss between the aggregated trajectory score (passed through sigmoid) and binary correctness label on a held-out meta-dataset.
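
The two objectives can be written out as follows. This is a reconstruction from the verbal descriptions above, not a copy of the paper's equations: the symbols ($r_\theta$ for the PRM, $\hat{v}_i$ for the Monte Carlo estimate of step $i$, $\mathcal{D}_k$ for the $k$-th training domain, $\mathcal{A}$ for the aggregation function, $c \in \{0, 1\}$ for the correctness label) are assumed notation, and averaging constants are omitted.

```latex
% Lower level: PRM parameters \theta for fixed domain weights \alpha
\theta^{*}(\alpha) = \arg\min_{\theta} \sum_{k=1}^{K} \alpha_k
  \sum_{(x,\, y_{1:i},\, \hat{v}_i) \in \mathcal{D}_k}
  \bigl( r_{\theta}(x, y_{1:i}) - \hat{v}_i \bigr)^{2}

% Upper level: domain weights \alpha, evaluated on the meta-dataset
\min_{\alpha} \sum_{(x,\, y,\, c) \in \mathcal{D}_{\mathrm{meta}}}
  \Bigl( \sigma\bigl( \mathcal{A}\bigl( r_{\theta^{*}(\alpha)}(x, y_{1:1}), \dots,
  r_{\theta^{*}(\alpha)}(x, y_{1:n}) \bigr) \bigr) - c \Bigr)^{2}
```

Here $x = (t, I)$ is the multimodal input, $y_{1:i}$ is the solution prefix up to step $i$, and $K$ is the number of training domains.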

Adaptation: Full fine-tuning of PRM; domain weights alpha are scalar parameters.

Training Data:

15 multimodal datasets for training (Science, Chart, Geometry, Commonsense domains)
MMMU dataset used as the meta (validation) set for upper-level optimization

Key Hyperparameters:

lower_level_optimizer: AdamW (lr=5e-7)
upper_level_optimizer: AdamW (lr=0.01, weight decay=1e-3)
unroll_steps: 5 (inner gradient steps per outer update)
total_iterations: 10000
scheduler: StepLR (step size 5000, gamma 0.5)
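
As a quick check on what the scheduler setting implies, the step-decay rule can be written in a few lines. This sketch assumes the standard step-decay formula (learning rate multiplied by gamma every `step_size` iterations) and, purely for illustration, applies it to the lower-level learning rate; the summary does not state which optimizer the scheduler is attached to.

```python
def step_lr(base_lr: float, step: int, step_size: int = 5000, gamma: float = 0.5) -> float:
    """Learning rate at a given iteration under step decay."""
    return base_lr * gamma ** (step // step_size)


# With 10000 total iterations there is exactly one decay, at iteration 5000.
lr_first_half = step_lr(5e-7, 0)      # iterations 0..4999
lr_second_half = step_lr(5e-7, 5000)  # iterations 5000..9999
```

Under these assumptions the rate is halved once, from 5e-7 to 2.5e-7, for the second half of training.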

Compute: 10 hours on one NVIDIA A100 GPU

Comparison to Prior Work

vs. Vanilla PRM: DreamPRM dynamically reweights datasets based on quality rather than treating all data equally.
vs. s1-PRM / CaR-PRM: DreamPRM learns weights automatically via gradient descent rather than using heuristic rules like clustering or difficulty filtering.
vs. DoReMi: DreamPRM optimizes for a process reward objective (verification) rather than language modeling perplexity (a conceptual distinction; the paper does not draw this comparison explicitly).
vs. DOGE: Adapts bi-level reweighting specifically for the PRM inference objective (aggregation score) rather than general loss minimization.

Limitations

Depends on a high-quality meta-dataset (MMMU), which serves as the ground-truth proxy for upper-level optimization.
Requires bi-level optimization, which is computationally more complex and can be less stable than standard training (though limiting the number of unroll steps mitigates this).
Evaluation focuses on mathematical and reasoning benchmarks; applicability to creative or open-ended multimodal tasks is less explored.

Reproducibility

Code: https://github.com/coder-qicao/DreamPRM

The code is publicly available at the repository above. Base models (InternVL, Qwen2-VL) and datasets (MathVista, MMMU, etc.) are public. The implementation uses the Betty library for bi-level optimization.

📊 Experiments & Results

Evaluation Setup

Multimodal reasoning benchmarks using Best-of-N (N=8) inference.

Benchmarks:

MathVista (Mathematical reasoning in visual contexts)
WeMath (Mathematical reasoning)
MathVision (Mathematical reasoning)
MMVet (General multimodal reasoning)
MMStar (General multimodal reasoning)

Metrics:

Accuracy (Top-1)
Pass@1
Statistical methodology: Not explicitly reported in the paper

Key Results

Main comparison against PRM baselines showing the effectiveness of the domain reweighting strategy.
Benchmark	Metric	Baseline	This Paper	Δ
MathVista	Accuracy	59.3	62.4	+3.1
MathVista	Accuracy	60.5	62.4	+1.9

Scaling results with stronger base models on the MathVista leaderboard.
Benchmark	Metric	Baseline	This Paper	Δ
MathVista	Accuracy	80.6	85.2	+4.6
MathVista	Accuracy	49.9	62.4	+12.5

Experiment Figures

MathVista leaderboard snapshot highlighting DreamPRM's top ranking.

Scaling behavior of DreamPRM with respect to number of candidates (Left) and base model strength (Right).

Main Takeaways

Domain reweighting is crucial: DreamPRM consistently outperforms vanilla training and heuristic selection methods, showing that not all multimodal data is equal.
Scales with compute: Performance improves monotonically as the number of candidate solutions (N) increases from 2 to 8.
Model agnostic: The PRM trained with DreamPRM improves results for multiple base models, including InternVL, GPT-4o-mini, and o4-mini.
Bi-level optimization works: Ablations confirm that both the BLO framework and the specific aggregation loss function are necessary for maximum performance.

📚 Prerequisite Knowledge

Prerequisites

Process Reward Models (PRM)
Bi-level Optimization (BLO)
Monte Carlo Estimation
Multimodal Large Language Models (MLLMs)

Key Terms

PRM: Process Reward Model—a model that scores the correctness of intermediate steps in a reasoning chain, rather than just the final answer.

Bi-level Optimization: An optimization framework with two nested problems: an outer (upper) loop optimizing hyperparameters (here, domain weights) and an inner (lower) loop optimizing model parameters.

Monte Carlo Estimation: A method to estimate the correctness of an intermediate step by rolling out multiple future completions and checking how many lead to the correct final answer.
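
A minimal sketch of this estimate, assuming a hypothetical `rollout` callable that stands in for the MLLM completing a solution from a given prefix and returning its final answer:

```python
from typing import Callable, Sequence


def mc_step_value(
    prefix: Sequence[str],
    rollout: Callable[[Sequence[str]], str],
    is_correct: Callable[[str], bool],
    num_rollouts: int = 8,  # illustrative default, not taken from the paper
) -> float:
    """Fraction of sampled completions from `prefix` that reach a correct answer."""
    hits = 0
    for _ in range(num_rollouts):
        final_answer = rollout(prefix)  # hypothetical: sample one completion
        hits += int(is_correct(final_answer))
    return hits / num_rollouts


# Toy usage: a scripted "model" whose completions are right 3 times out of 4.
scripted_answers = iter(["4", "5", "4", "4"])
value = mc_step_value(
    prefix=["2 + 2 means adding two and two"],
    rollout=lambda _prefix: next(scripted_answers),
    is_correct=lambda answer: answer == "4",
    num_rollouts=4,
)
```

The resulting fraction serves as a soft label for the step that ends the prefix.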

CoT: Chain-of-Thought—a prompting strategy where the model generates intermediate reasoning steps before the final answer.

MathVista: A benchmark dataset for evaluating mathematical reasoning in multimodal models.

Best-of-N: An inference strategy where N candidate solutions are generated, and a reward model selects the best one.

Domain Reweighting: Assigning different importance scalar weights to different datasets (domains) during training to balance their influence.

Aggregation Function: A function (e.g., product, min, or sum) that combines step-level reward scores into a single trajectory-level score.
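
The choice of aggregation function can change which candidate is selected. The sketch below implements the three examples named above (product, min, sum) on made-up step scores; it does not claim which one DreamPRM uses.

```python
import math
from typing import Sequence


def agg_product(scores: Sequence[float]) -> float:
    """Product of step scores; longer chains are penalized multiplicatively."""
    return math.prod(scores)


def agg_min(scores: Sequence[float]) -> float:
    """Weakest step determines the trajectory score."""
    return min(scores)


def agg_sum(scores: Sequence[float]) -> float:
    """Sum of step scores; longer chains accumulate more score."""
    return sum(scores)


# Made-up scores: A is long and uniformly decent, B is short with one weak step.
candidate_a = [0.8, 0.8, 0.8, 0.8]
candidate_b = [0.95, 0.6]

prefers_b_under_product = agg_product(candidate_b) > agg_product(candidate_a)
prefers_a_under_min = agg_min(candidate_a) > agg_min(candidate_b)
prefers_a_under_sum = agg_sum(candidate_a) > agg_sum(candidate_b)
```

On these scores the product favors the shorter candidate B, while min and sum favor A, so the aggregation function is a real design decision rather than a formality.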