GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning

📝 Paper Summary

Multimodal Large Language Models (MLLMs)
Reinforcement Learning (RL) for Reasoning

GRPO-CARE improves multimodal reasoning by replacing strict KL penalties with a group-relative consistency bonus that rewards reasoning chains that are both accurate and logically aligned with a stable reference model.

Core Problem

Standard outcome-supervised RL (like GRPO) improves final answer accuracy but often degrades reasoning coherence, as models find shortcut solutions that are correct but logically inconsistent.

Why it matters:

Optimization for final answers alone encourages 'Thought Collapse' or shortcut learning, where reasoning does not actually support the conclusion
Strict KL divergence penalties in standard RL overly constrain exploration, preventing the model from finding new, valid reasoning paths that differ from the pre-trained prior
Existing benchmarks for MLLM post-training lack rigorous generalization tiers (in-distribution vs. out-of-distribution) needed to evaluate true reasoning robustness

Concrete Example: In a video task, a standard GRPO model correctly answers 'hit ball with club' but its reasoning chain confusingly suggests 'move the ball to the golf tee', contradicting the final action. GRPO-CARE aligns the reasoning to correctly identify the 'hit' action dynamics.

Key Novelty

Consistency-Aware Reward Enhancement (CARE) without process supervision

Replaces the standard KL divergence penalty with an adaptive consistency bonus derived from a slowly updating reference model (EMA)
Calculates a 'reasoning-to-answer' likelihood score: the reference model checks if the generated reasoning trace logically leads to the correct answer
Applies a sparse reward bonus only to samples that are both accurate and demonstrate higher logical consistency than their group peers
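
The sparse bonus described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `care_rewards`, the 0/1 outcome reward, and the choice of the mean consistency of correct samples as the peer baseline are all assumptions made for the example. The default coefficient of 0.5 mirrors the consistency_bonus_coefficient listed under Key Hyperparameters.

```python
from statistics import mean


def care_rewards(correct, consistency, bonus_coef=0.5):
    """Return per-sample rewards: outcome reward plus a sparse consistency bonus.

    correct:      list of bools, whether each sampled answer is right
    consistency:  list of floats, reference-model score for each reasoning trace
    bonus_coef:   size of the bonus (hypothetical default taken from this summary)
    """
    # Outcome reward (assumed binary): 1 for a correct final answer, 0 otherwise.
    rewards = [1.0 if c else 0.0 for c in correct]

    # Assumption: the peer baseline is the mean consistency of the correct samples.
    peers = [s for c, s in zip(correct, consistency) if c]
    if not peers:
        return rewards  # no correct sample in the group, so nobody earns the bonus
    baseline = mean(peers)

    # Sparse bonus: only correct samples that beat the baseline receive it.
    return [
        r + bonus_coef if c and s > baseline else r
        for r, c, s in zip(rewards, correct, consistency)
    ]


if __name__ == "__main__":
    print(care_rewards([True, True, False, True], [0.9, 0.2, 0.95, 0.4]))
```

Note that the incorrect sample in the usage line gets no bonus even though its consistency score is the highest, which is the point of gating the bonus on accuracy.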

Architecture

The GRPO-CARE framework pipeline showing the dual-reward mechanism (Outcome Reward + Consistency Bonus) and the reference model interaction.

Evaluation Highlights

+6.7 percentage points in accuracy on the hardest out-of-distribution level (Level-3) of SEED-Bench-R1 compared to standard GRPO (59.5 to 66.2)
+24.5 percentage points in reasoning-answer consistency rate compared to standard GRPO (57.9 to 82.4)
Achieves strong transfer performance on general video benchmarks such as MVBench (+3.6 points) and EgoPlan (+3.4 points)

Breakthrough Assessment

8/10

Significant methodological improvement for MLLM post-training: it addresses the 'correct answer, wrong reasoning' problem without expensive process supervision. The work also contributes a substantial, hierarchically structured benchmark.

⚙️ Technical Details

Problem Definition

Setting: Multimodal Reinforcement Learning (Post-training) for Video Understanding

Inputs: Video frames V and a natural language question q

Outputs: A reasoning chain (Chain of Thought) followed by a final answer a

Pipeline Flow

Input Processing (Video + Question)
Policy Sampling (Generate G responses)
Consistency Evaluation (Reference Model)
Reward Calculation (Outcome + Consistency Bonus)
Policy Update (GRPO)
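
The step from rewards to the policy update relies on group normalization: each response's reward is compared with the other responses sampled for the same input. A minimal sketch follows, assuming the population standard deviation and a small `eps` for numerical safety; both are illustrative choices, and `group_advantages` is a hypothetical name.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses to the same input."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the group
    # eps keeps the division defined when every reward in the group is equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Rewards for a group of four responses (outcome reward plus optional bonus).
    print(group_advantages([1.5, 1.0, 0.0, 1.0]))
```

When all responses in a group earn the same reward, every advantage is zero and that group contributes no gradient signal.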

System Modules

Policy Model

Generate G responses (reasoning + answer) for a given input

Model or implementation: Qwen2.5-VL-Instruct-7B

Reference Model

Estimate likelihood of the generated answer given the generated reasoning trace to measure coherence

Model or implementation: Same architecture as Policy Model, weights updated via EMA
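
One plausible way to turn the reference model's output into a single coherence score is a length-normalized likelihood of the answer tokens, conditioned on the generated reasoning. The sketch below assumes the per-token log-probabilities have already been obtained from the reference model; the function name and the choice of the geometric mean are assumptions for illustration, and the paper's exact scoring rule may differ.

```python
import math


def answer_likelihood(answer_token_logprobs):
    """Length-normalized likelihood of the answer tokens given the reasoning.

    answer_token_logprobs: natural-log probabilities that the reference model
    assigns to each answer token, conditioned on the input and reasoning trace.
    """
    if not answer_token_logprobs:
        raise ValueError("need at least one answer token")
    # Averaging log-probs before exponentiating gives the geometric mean of the
    # token probabilities, so long answers are not penalized for their length.
    avg_logprob = sum(answer_token_logprobs) / len(answer_token_logprobs)
    return math.exp(avg_logprob)


if __name__ == "__main__":
    coherent = answer_likelihood([math.log(0.9), math.log(0.8)])
    incoherent = answer_likelihood([math.log(0.2), math.log(0.1)])
    print(coherent > incoherent)
```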

Reward Engine

Compute total reward based on correctness and consistency

Model or implementation: Rule-based + Statistical

Novel Architectural Elements

Dual-reward mechanism combining outcome supervision with a reference-model-based consistency bonus
Removal of the standard per-token KL divergence penalty in the loss function, replacing it with the reward-based consistency bonus
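
For context on what is being removed, a per-token KL penalty compares the policy's and the reference model's next-token distributions at each position. The sketch below computes the exact KL divergence between two small discrete distributions. It is illustrative only: practical RL implementations often use a sampled estimator rather than the full sum, and `kl_divergence` is a hypothetical helper.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same outcomes."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # Terms with p_i == 0 contribute nothing, by the usual convention.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


if __name__ == "__main__":
    policy = [0.5, 0.5]
    reference = [0.9, 0.1]
    print(kl_divergence(policy, reference), kl_divergence(reference, policy))
```

The two printed values differ because KL divergence is not symmetric: the penalty depends on which distribution is treated as the reference.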

Modeling

Base Model: Qwen2.5-VL-Instruct-7B

Training Method: GRPO-CARE (Group Relative Policy Optimization with Consistency-Aware Reward Enhancement)

Objective Functions:

Purpose: Optimize the policy to maximize expected reward using group-normalized advantages; the consistency bonus, rather than a KL term, keeps reasoning aligned with the reference model.

Formally: Standard GRPO objective but with KL term removed and reward R modified to R_outcome + R_consistency.
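
Written out, the description above corresponds roughly to the following. This is a reconstruction in the commonly published GRPO notation, not a formula copied from the paper: the clipping range \epsilon, the importance ratio \rho_{i,t}, and the per-response length normalization are assumptions carried over from standard GRPO.

```latex
\begin{aligned}
R_i &= R_{\text{outcome},i} + R_{\text{consistency},i}, \\
\hat{A}_i &= \frac{R_i - \operatorname{mean}\left(\{R_j\}_{j=1}^{G}\right)}{\operatorname{std}\left(\{R_j\}_{j=1}^{G}\right)}, \\
\rho_{i,t}(\theta) &= \frac{\pi_\theta(o_{i,t} \mid V, q, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid V, q, o_{i,<t})}, \\
J(\theta) &= \mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|}
  \min\left( \rho_{i,t}(\theta)\,\hat{A}_i,\;
  \operatorname{clip}\left(\rho_{i,t}(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_i \right) \right].
\end{aligned}
```

Here o_i is the i-th of the G sampled responses for video V and question q. The only differences from standard GRPO in this form are the missing KL term and the extra R_consistency summand in the reward.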

Training Data:

SEED-Bench-R1 Training Set (50k samples total, 6k used for pilot study)
Data derived from Epic-Kitchens videos with automatically constructed Q&A pairs

Key Hyperparameters:

learning_rate: 1e-6
batch_size: Not reported in the paper
group_size_G: 8
beta: 0.04 (KL coefficient equivalent)
consistency_bonus_coefficient: 0.5
EMA_decay_rate: 0.99
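
The EMA decay rate controls how slowly the reference model tracks the policy. A minimal sketch of one update step follows, using plain Python lists in place of real model tensors; `ema_update` and the dict-of-lists weight layout are hypothetical stand-ins chosen to keep the example dependency-free.

```python
def ema_update(reference, policy, decay=0.99):
    """Return new reference weights blended toward the current policy weights.

    reference, policy: dicts mapping parameter names to lists of floats.
    decay: fraction of the old reference kept at each step (0.99 in this summary).
    """
    return {
        name: [
            decay * r + (1.0 - decay) * p
            for r, p in zip(reference[name], policy[name])
        ]
        for name in reference
    }


if __name__ == "__main__":
    ref = {"w": [1.0, 0.0]}
    pol = {"w": [0.0, 1.0]}
    for _ in range(3):
        ref = ema_update(ref, pol)  # the reference drifts slowly toward the policy
    print(ref)
```

With a decay of 0.99 the reference moves only 1% of the way toward the policy per update, which is what makes it a stable yardstick for the consistency score.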

Compute: Not reported in the paper

Comparison to Prior Work

vs. Standard GRPO: Removes KL penalty; adds consistency bonus from EMA reference model
vs. SFT: Uses RL for self-improvement; achieves better OOD generalization
vs. PRM: Does not require dense step-level supervision labels
vs. Video-R1: Focuses on structured generalization levels (L1/L2/L3) and consistency rather than just general performance

Limitations

Dependency on the quality of the initial SFT model to generate valid reasoning traces
Computational overhead of maintaining and querying the EMA reference model during training
Reward mechanism relies on ground truth answers, limiting applicability to open-ended generation without clear gold standards

Reproducibility

Code: https://github.com/TencentARC/SEED-Bench-R1

Publicly available: Code and SEED-Bench-R1 data (same repository as above). Missing: Exact training compute time/resources (GPU hours).

📊 Experiments & Results

Evaluation Setup

Video Question Answering with a focus on reasoning and next-action prediction

Benchmarks:

SEED-Bench-R1 (Video QA / Action Anticipation) [New]
MVBench (General Video Understanding)
EgoPlan (Egocentric Video Planning)

Metrics:

Accuracy (%)
Consistency Rate (%) (Logical coherence between reasoning and answer)

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Main comparison on SEED-Bench-R1 showing GRPO-CARE's superior performance across all difficulty levels compared to SFT and standard GRPO.
SEED-Bench-R1 (Level-1 / In-Distribution)	Accuracy	69.1	71.4	+2.3
SEED-Bench-R1 (Level-2 / Cross-Environment)	Accuracy	63.7	68.6	+4.9
SEED-Bench-R1 (Level-3 / Cross-Env-Task)	Accuracy	59.5	66.2	+6.7
SEED-Bench-R1	Consistency Rate	57.9	82.4	+24.5
Transfer learning capabilities demonstrating that models trained with GRPO-CARE generalize well to other established benchmarks.
MVBench	Accuracy	52.8	56.4	+3.6
EgoPlan	Accuracy	47.2	50.6	+3.4
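
The Δ column reports absolute differences in percentage points, not relative improvements. The short check below recomputes both from the table values; the helper name `gains` is an illustrative choice.

```python
# (baseline, this paper) pairs copied from the table above.
RESULTS = {
    "SEED-Bench-R1 Level-1 accuracy": (69.1, 71.4),
    "SEED-Bench-R1 Level-2 accuracy": (63.7, 68.6),
    "SEED-Bench-R1 Level-3 accuracy": (59.5, 66.2),
    "SEED-Bench-R1 consistency rate": (57.9, 82.4),
    "MVBench accuracy": (52.8, 56.4),
    "EgoPlan accuracy": (47.2, 50.6),
}


def gains(baseline, ours):
    """Return (absolute gain in points, relative gain in percent), rounded to 0.1."""
    points = round(ours - baseline, 1)
    relative = round(100.0 * (ours - baseline) / baseline, 1)
    return points, relative


if __name__ == "__main__":
    for name, (baseline, ours) in RESULTS.items():
        points, relative = gains(baseline, ours)
        print(f"{name}: +{points} points ({relative}% relative)")
```

For example, the Level-3 gain of 6.7 points corresponds to a relative improvement of roughly 11% over the baseline.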

Experiment Figures

Qualitative comparison of reasoning chains between SFT, GRPO, and GRPO-CARE.

Main Takeaways

GRPO-CARE outperforms standard GRPO and SFT at every level of SEED-Bench-R1, with the largest gain on the out-of-distribution Level-3 split (+6.7 points), indicating better generalization.
The method mitigates the 'Thought Collapse' issue, where models give correct answers for wrong reasons, raising the consistency rate by 24.5 percentage points (57.9% to 82.4%).
Replacing the KL penalty with a consistency bonus allows for more effective exploration of reasoning paths.
The approach shows strong transferability to general video understanding tasks beyond the specific training benchmark.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning with KL divergence constraints (e.g., PPO, GRPO)
Multimodal Large Language Models (MLLMs)
Chain-of-Thought (CoT) reasoning

Key Terms

GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes rewards within a group of outputs generated from the same input to reduce variance

CoT: Chain of Thought—intermediate reasoning steps generated by a model before the final answer

EMA: Exponential Moving Average—a technique to update model weights slowly over time to create a stable reference model

KL divergence: Kullback-Leibler divergence—a statistical distance measuring how much one probability distribution differs from another; often used as a penalty to prevent model drift

SFT: Supervised Fine-Tuning—training a model on labeled examples before applying reinforcement learning

OOD: Out-of-Distribution—data that differs significantly from the training set (e.g., unseen environments or tasks)

Process Supervision: Training signals provided at each step of reasoning, rather than just for the final outcome

Sparse Bonus: A reward given only to a subset of high-performing samples (e.g., those above a threshold), rather than to all samples