AutoRubric-R1V: Rubric-Based Generative Rewards for Faithful Multimodal Reasoning

📝 Paper Summary

Multimodal Reasoning, Reinforcement Learning with Verifiable Rewards (RLVR), Process Supervision

AutoRubric-R1V stabilizes multimodal reasoning training by automatically distilling consistent reasoning steps from the model's own successful trajectories into rubrics that reward correct intermediate processes.

Core Problem

Reinforcement Learning with Verifiable Rewards (RLVR) typically rewards only final-answer correctness, which encourages models to learn shortcuts or 'spurious reasoning', where flawed logic accidentally yields the right result.

Why it matters:

Models trained only on outcomes often fail to generalize because they learn to 'hack' the reward rather than reason correctly
Existing process supervision methods rely on expensive human annotation or proprietary teacher models (e.g., GPT-4), which are costly and limited by the teacher's capability
Spurious reasoning undermines reliability, as models may generate contradictory intermediate steps that confuse users even if the final answer is correct

Concrete Example: In a geometry problem, a model might define a side length incorrectly (e.g., conflating BC with CD) but still arrive at the correct numerical answer due to canceling errors. Standard RLVR rewards this trajectory fully, reinforcing the logical error.

Key Novelty

Self-Aggregated Rubric Generation for Generative Rewards

Instead of external supervision, the method samples multiple trajectories from the model itself and filters for correct answers
An LLM compares these successful trajectories to identify 'reasoning checkpoints' (steps that appear consistently across the majority of correct solutions), filtering out random or spurious steps
These distilled checkpoints form a problem-specific rubric used by a judge model to reward intermediate steps during RL training
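
The self-aggregation steps above can be sketched in a few lines of Python. This is a simplified illustration, not the paper's pipeline: the paper uses an LLM to compare trajectories, whereas this sketch substitutes exact-match counting of steps. The names `build_rubric`, `extract_steps`, and `min_fraction` are hypothetical.

```python
from collections import Counter
from typing import Callable, List, Sequence


def build_rubric(
    trajectories: Sequence[str],
    answers: Sequence[str],
    gold: str,
    extract_steps: Callable[[str], List[str]],
    min_fraction: float = 0.5,
) -> List[str]:
    """Keep steps shared by a majority of the correct trajectories.

    Assumption: steps are compared by exact string match; the paper
    delegates this comparison to an LLM instead.
    """
    # 1. Keep only trajectories whose final answer matches the ground truth.
    correct = [t for t, a in zip(trajectories, answers) if a == gold]
    if not correct:
        return []  # no correct trajectory, so no rubric for this sample

    # 2. Count in how many correct trajectories each step appears
    #    (dict.fromkeys de-duplicates within a trajectory, keeping order).
    counts: Counter = Counter()
    for t in correct:
        for step in dict.fromkeys(extract_steps(t)):
            counts[step] += 1

    # 3. Keep steps present in more than `min_fraction` of correct trajectories.
    threshold = min_fraction * len(correct)
    return [step for step, c in counts.items() if c > threshold]


def split_on_semicolons(trajectory: str) -> List[str]:
    """Toy step extractor (hypothetical): steps are separated by ';'."""
    return [s.strip() for s in trajectory.split(";")]


samples = [
    "find BC; apply Pythagoras; compute area",
    "find BC; apply Pythagoras; guess",
    "find BC; use similar triangles; compute area",
    "wrong step",
]
rubric = build_rubric(samples, ["12", "12", "12", "7"], "12", split_on_semicolons)
print(rubric)  # ['find BC', 'apply Pythagoras', 'compute area']
```

Steps that occur in only one correct trajectory ('guess', 'use similar triangles') are dropped, which mirrors the goal of filtering out random or spurious steps.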

Evaluation Highlights

+7.52-point average accuracy improvement across 6 multimodal reasoning benchmarks over the Qwen-2.5-VL-7B base model (47.29 to 54.81)
Achieves an average score of 54.81 on reasoning benchmarks, comparable to the much larger Qwen-2.5-VL-72B model (55.57)
Substantially reduces reasoning inconsistency (unfaithful reasoning steps) compared to standard RLVR training in MathVerse evaluations

Breakthrough Assessment

8/10

Offers a scalable, self-contained solution to the reward hacking problem in reasoning models without requiring external human annotations or stronger teachers. The performance gains are significant, bringing the 7B model close to a roughly 10x larger one (Qwen-2.5-VL-72B).

⚙️ Technical Details

Problem Definition

Setting: Multimodal Chain-of-Thought Reasoning with Reinforcement Learning

Inputs: Visual input V, Textual query Q

Outputs: Reasoning trace s_{1:T} ending with final answer a

Modeling

Base Model: Qwen2.5-VL-7B-IT

Training Method: Group Relative Policy Optimization (GRPO) with Rubric-based Rewards

Objective Functions:

Purpose: Optimize policy to maximize expected reward.

Formally: GRPO objective maximizing importance ratio * Advantage - KL penalty.

Purpose: Calculate total reward for a trajectory.

Formally: r_i = r_ans + lambda * r_rubric, where r_ans is outcome correctness and r_rubric is the fraction of rubric checkpoints satisfied.
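
As a minimal sketch of how the reward above could feed GRPO, the following Python combines the two reward terms and then normalizes rewards within a group of rollouts for the same prompt. Several details are assumptions not stated in this summary: a binary 0/1 answer reward, the example value of `lam`, and the use of the population standard deviation with a small `eps` for normalization.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def total_reward(answer_correct: bool, checkpoints_met: Sequence[bool], lam: float) -> float:
    """r_i = r_ans + lambda * r_rubric."""
    r_ans = 1.0 if answer_correct else 0.0  # assumed binary outcome reward
    # r_rubric: fraction of rubric checkpoints the judge marked as satisfied.
    r_rubric = sum(checkpoints_met) / len(checkpoints_met) if checkpoints_met else 0.0
    return r_ans + lam * r_rubric


def group_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Group-relative advantages: standardize rewards within one group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population standard deviation
    return [(r - mu) / (sigma + eps) for r in rewards]


# Two rollouts with the same correct answer but different rubric coverage.
lam = 0.5  # hypothetical weight; the summary does not report lambda
faithful = total_reward(True, [True, True, True, True], lam)
spurious = total_reward(True, [True, False, False, False], lam)
print(faithful, spurious)  # 1.5 1.125
print(group_advantages([faithful, spurious]))
```

Under an outcome-only reward both rollouts would receive the same score and therefore zero relative advantage. With the rubric term, the faithful rollout gets a positive advantage and the spurious one a negative advantage.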

Training Data:

ViRL-39K dataset (Wang et al., 2025a)
Rubrics generated for 26,144 samples (67.3% coverage) by aggregating 8 trajectories per sample using an open-source LLM

Key Hyperparameters:

learning_rate: 1e-6
epochs: 3
global_batch_size: 128
rollout_batch_size: 512
rollout_number: 8
sampling_temperature: 1.0
kl_coefficient: 0.01
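
For convenience, the hyperparameters listed above can be collected in a small configuration object. The class name `TrainConfig` is hypothetical and does not correspond to any particular training framework. The derived quantity at the end assumes that `rollout_batch_size` counts prompts rather than individual trajectories, which this summary does not confirm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters as reported in the summary (field names copied from it)."""
    learning_rate: float = 1e-6
    epochs: int = 3
    global_batch_size: int = 128
    rollout_batch_size: int = 512
    rollout_number: int = 8
    sampling_temperature: float = 1.0
    kl_coefficient: float = 0.01


cfg = TrainConfig()

# Assumption: each of the rollout_batch_size prompts is sampled rollout_number times.
trajectories_per_rollout_step = cfg.rollout_batch_size * cfg.rollout_number
print(trajectories_per_rollout_step)  # 4096
```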

Compute: 8 H100 GPUs

Comparison to Prior Work

vs. R1-VL: AutoRubric generates descriptive sentence-level rubrics (avg 23 words/criterion) vs. R1-VL's short keywords (2.9 words/criterion), enabling more semantic checking
vs. PRMs: Avoids training a separate reward model that may suffer from distribution shift; uses in-context learning with an LLM judge instead
vs. Vision-R1: Distills supervision from the model's *own* consistent correct trajectories rather than relying on a stronger teacher model

Limitations

Rubric generation requires the model to be capable of generating at least some correct trajectories initially (67.3% coverage reported)
Relies on the capability of the Judge LLM to correctly interpret the rubric and reasoning
Computational cost of sampling 8 trajectories per sample for rubric construction

Reproducibility

Code: https://github.com/Jill0001/AutoRubric-R1V

Code and rubric dataset to be released at https://github.com/Jill0001/AutoRubric-R1V. The base model, Qwen2.5-VL-7B-IT, is open. The judge ('gpt-oss-20b') and the rubric-construction model ('gpt-oss-120b'), mentioned in footnotes 3 and 4, are open-weight models hosted on HuggingFace.

📊 Experiments & Results

Evaluation Setup

Multimodal reasoning across diverse domains (General, Math, Geometry)

Benchmarks:

MMMU (General multimodal reasoning)
MathVista (Multimodal mathematical reasoning)
MathVerse (Geometric and mathematical reasoning)
MMMU-Pro (Advanced multimodal reasoning)
MATH-Vision (Visual math problems)
WeMATH (Math reasoning)

Metrics:

Accuracy (Average across benchmarks)
Inconsistency Rate (Faithfulness)
Statistical methodology: Not explicitly reported in the paper

Key Results

Main comparison shows AutoRubric-R1V significantly outperforming the base model and reaching near parity with a much larger model.
Benchmark	Metric	Baseline	This Paper	Δ
Average (6 benchmarks)	Accuracy	47.29 (Qwen-2.5-VL-7B base)	54.81	+7.52
Average (6 benchmarks)	Accuracy	55.57 (Qwen-2.5-VL-72B)	54.81	-0.76
Average (6 benchmarks)	Accuracy	52.96	54.81	+1.85

Experiment Figures

Training dynamics comparing AutoRubric-R1V vs Vanilla GRPO across Answer Reward, Rubric Reward, and Response Length.

Case study comparing two trajectories (one with logical errors, one correct) and the rubric scoring.

Main Takeaways

Rubric-based rewards stabilize training: Unlike Vanilla RLVR, whose reward curves oscillate or degrade due to reward hacking, AutoRubric-R1V shows steady improvement in reward curves.
Problem-specific rubrics are essential: The 'w/o Rubric' ablation (judge without specific criteria) performed significantly worse, comparable to Vanilla RLVR, indicating that generic 'is this good?' prompts are insufficient.
Self-aggregation works: Generating rubrics from the model's own correct trajectories, following a test-time scaling intuition, provides high-quality supervision without human labels.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning with Verifiable Rewards (RLVR)
Multimodal Large Language Models (MLLMs)
Chain-of-Thought (CoT) prompting

Key Terms

RLVR: Reinforcement Learning with Verifiable Rewards—optimizing models using ground-truth answers (like math solutions) as the primary reward signal

GRPO: Group Relative Policy Optimization—a PPO variant that estimates advantages by normalizing rewards across a group of outputs for the same input, removing the need for a value function

Process Supervision: Providing feedback on intermediate reasoning steps rather than just the final outcome

Rubric-based Generative Rewards: Using an LLM to evaluate a response against a structured list of criteria (rubrics) to generate a scalar reward score
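
A toy sketch of this idea follows. A judge decides, criterion by criterion, whether a reasoning trace satisfies the rubric, and the reward is the satisfied fraction. The judge here is a hypothetical keyword matcher standing in for the LLM judge. The function names, the rubric, and both example traces are invented for illustration and are not taken from the paper.

```python
from typing import Callable, Sequence

# A judge maps (reasoning trace, criterion) to a pass/fail decision.
Judge = Callable[[str, str], bool]


def rubric_reward(reasoning: str, rubric: Sequence[str], judge: Judge) -> float:
    """Fraction of rubric criteria the judge considers satisfied."""
    if not rubric:
        return 0.0
    met = sum(1 for criterion in rubric if judge(reasoning, criterion))
    return met / len(rubric)


def keyword_judge(reasoning: str, criterion: str) -> bool:
    """Stand-in for an LLM judge (assumption): case-insensitive substring match."""
    return criterion.lower() in reasoning.lower()


rubric = ["BC = 5", "area = 30"]  # illustrative criteria only
faithful_trace = "Since BC = 5 and the height is 12, area = 30."
spurious_trace = "Take CD = 5 as the base, so area = 30."

print(rubric_reward(faithful_trace, rubric, keyword_judge))  # 1.0
print(rubric_reward(spurious_trace, rubric, keyword_judge))  # 0.5
```

Passing the judge in as a callable keeps the scoring logic independent of whichever model or prompt format is actually used for judging.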

Self-Aggregation: The process of collecting multiple outputs from the model itself and synthesizing the common consistent elements into a supervision signal

Spurious Reasoning: Reasoning trajectories that contain logical errors or shortcuts but still accidentally arrive at the correct final answer