Stable and Efficient Single-Rollout RL for Multimodal Reasoning

📝 Paper Summary

Reinforcement Learning with Verifiable Rewards (RLVR)
Multimodal Large Language Models (MLLMs)

MSSR enables stable and compute-efficient reinforcement learning for multimodal models by using a single rollout per input with entropy-based advantage shaping to prevent optimization collapse.

Core Problem

Existing multimodal RLVR methods like GRPO require multiple rollouts per input for stability, which is computationally expensive and inefficient when rollouts yield identical rewards.

Why it matters:

Group-based methods (e.g., GRPO) require repeated forward passes through large vision and language encoders, creating a substantial compute bottleneck.
Naive single-rollout strategies from text-only RL fail in multimodal settings due to high variance from visual inputs, leading to entropy collapse and training instability.
When all rollouts in a group receive the same reward (all correct or all incorrect), every group-relative advantage is zero, so the group contributes no gradient and its computation is wasted.
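
The zero-advantage case can be illustrated with a minimal sketch of group-relative normalization. The function name, the `eps` constant, and the use of the population standard deviation are assumptions made for illustration; GRPO implementations differ in these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward by the mean and std of its own group.

    `eps` (an illustrative choice) avoids division by zero when all
    rewards in the group are identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group where every rollout is correct yields no learning signal.
uniform_group = group_relative_advantages([1, 1, 1, 1])

# A mixed group yields positive and negative advantages.
mixed_group = group_relative_advantages([1, 0, 1, 0])
```

In the uniform group every advantage is exactly zero, so the rollouts were generated but contribute nothing to the policy gradient.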

Concrete Example: In a visual math problem, a single-rollout policy might produce a correct answer (reward=1) followed by an incorrect one (reward=0) for similar inputs due to visual noise. Without the group normalization used in GRPO, this high variance causes the policy entropy to plummet (entropy collapse), as seen in the MVSR baseline where accuracy degrades rapidly during training.

Key Novelty

Multimodal Stabilized Single-Rollout (MSSR)

Generalizes text-only single-rollout RL to multimodal settings by modeling binary rewards as Bernoulli variables and estimating the baseline using a Beta distribution.
Introduces entropy-based advantage shaping that adds a scaled entropy bonus to the advantage, softening penalties for high-uncertainty responses.
Dynamically adapts the discount factor for the baseline estimate based on the KL divergence between consecutive policy updates to balance stability and adaptation.
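
A minimal sketch of how a discounted Beta baseline for binary rewards could work is given here. The class name, the uniform Beta(1, 1) prior, and the exact update rule (decay both pseudo-counts, then add the new outcome) are assumptions for illustration and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class BetaBaseline:
    """Running estimate of P(reward = 1) kept as Beta pseudo-counts."""

    alpha: float = 1.0  # pseudo-count of successes (assumed uniform prior)
    beta: float = 1.0   # pseudo-count of failures (assumed uniform prior)

    def mean(self) -> float:
        # Mean of Beta(alpha, beta): the expected success probability.
        return self.alpha / (self.alpha + self.beta)

    def update(self, reward: int, discount: float) -> None:
        # Decay old evidence, then count the new Bernoulli outcome.
        self.alpha = discount * self.alpha + reward
        self.beta = discount * self.beta + (1 - reward)


def single_rollout_advantage(reward: int, baseline: BetaBaseline) -> float:
    # The baseline is read before it is updated with this reward,
    # so the estimate does not depend on the sample it is scoring.
    return reward - baseline.mean()


tracker = BetaBaseline()
adv = single_rollout_advantage(1, tracker)  # 1 - 0.5
tracker.update(1, discount=0.9)
```

A smaller discount forgets old outcomes faster, which helps when the policy changes quickly; a larger one averages over more history and lowers the variance of the baseline.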

Architecture

Overview of the MSSR framework compared to standard single-rollout (MVSR). It illustrates the flow from multimodal input to single rollout, reward calculation, Beta baseline estimation, and the critical addition of entropy-based advantage shaping.

Evaluation Highlights

Achieves GRPO's final validation accuracy with 50% of the training steps (Figure 1), demonstrating superior compute efficiency.
Outperforms the group-based GRPO baseline by an average of 2.1 percentage points (3B model) and 2.3 percentage points (7B model) across five multimodal reasoning benchmarks.
Prevents the entropy collapse observed in naive single-rollout baselines (MVSR), maintaining exploration and training stability throughout optimization.

Breakthrough Assessment

8/10

Successfully transfers single-rollout efficiency to the challenging multimodal domain, resolving a key stability bottleneck (entropy collapse) that previously mandated expensive group-based methods.

⚙️ Technical Details

Problem Definition

Setting: Multimodal reasoning with binary verifiable rewards

Inputs: Multimodal input x = (text prompt, image)

Outputs: Structured reasoning output o enclosed in <think> tags with final answer in \boxed{}

Pipeline Flow

Policy Model (Qwen2.5-VL) generates single rollout o
Reward Function evaluates correctness r(x,o)
Baseline Estimator updates Beta distribution parameters
Advantage Calculator computes shaped advantage with entropy bonus

System Modules

Multimodal Policy

Generate reasoning chain and answer from multimodal input

Model or implementation: Qwen2.5-VL-3B or 7B

Reward Function

Check if final answer matches ground truth

Model or implementation: Rule-based checker
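
As a rough illustration of a rule-based checker, the sketch below extracts the last `\boxed{...}` answer from an output and compares it with the ground truth. The function names and the exact-string comparison are assumptions; a real checker would likely normalize mathematical expressions before comparing.

```python
def extract_boxed(text):
    """Return the contents of the last \\boxed{...} in `text`, or None."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    for c in text[start + len(marker):]:
        if c == "{":
            depth += 1
        elif c == "}":
            depth -= 1
            if depth == 0:
                # Closing brace of \boxed{...}: answer is complete.
                return "".join(chars).strip()
        chars.append(c)
    return None  # braces never balanced


def binary_reward(output, ground_truth):
    """1 if the boxed answer equals the ground truth exactly, else 0."""
    answer = extract_boxed(output)
    return 1 if answer is not None and answer == ground_truth.strip() else 0


reward = binary_reward("<think>2 + 2 = 4</think> \\boxed{4}", "4")
```

Tracking brace depth, rather than using a simple regular expression, keeps nested LaTeX such as `\frac{1}{2}` intact.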

Baseline Estimator (Advantage Estimation)

Estimate expected reward to compute advantage

Model or implementation: Beta distribution Beta(alpha, beta)

Advantage Shaper (Advantage Estimation)

Modify raw advantage with entropy bonus to prevent collapse

Model or implementation: Closed-form shaping term added to the advantage (A_hat = A + psi_t)

Novel Architectural Elements

Integration of token-level entropy bonus directly into the advantage term for single-rollout multimodal RL
Adaptive discount factor for Beta-distributed baseline controlled by KL divergence stability
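
One plausible way to tie the discount factor to policy drift is sketched below. This summary does not state the exact schedule, so the exponential half-life form and the `kl_half` parameter are hypothetical; only the bounds 0.875 and 0.96 come from the hyperparameter list in this summary.

```python
def adaptive_discount(kl, eta_min=0.875, eta_max=0.96, kl_half=0.05):
    """Map the KL between consecutive policies to a baseline discount.

    Small KL (policy barely moved)  -> discount near eta_max (long memory).
    Large KL (policy moved a lot)   -> discount near eta_min (forget faster).
    `kl_half` is a hypothetical scale, not a value from the paper.
    """
    if kl < 0:
        raise ValueError("KL divergence cannot be negative")
    raw = 2.0 ** (-kl / kl_half)  # decays from 1 toward 0 as KL grows
    return min(max(raw, eta_min), eta_max)


stable = adaptive_discount(0.0)     # clipped to eta_max
drifting = adaptive_discount(10.0)  # clipped to eta_min
```

The clipping keeps the baseline from either freezing (discount of 1) or discarding all history after a single large update.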

Modeling

Base Model: Qwen2.5-VL-3B and Qwen2.5-VL-7B

Training Method: Single-rollout Policy Gradient (MSSR)

Objective Functions:

Purpose: Maximize expected reward with entropy regularization.

Formally: Policy gradient on shaped advantage A_hat = A + psi_t
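
The shaped advantage can be sketched as follows. The clipped form of psi_t used here (a scaled entropy capped at a fraction of the advantage magnitude) and the mapping of gamma and lambda to the scale and the cap are assumptions; this summary only states that a scaled entropy bonus is added to the advantage.

```python
def shaped_advantage(advantage, token_entropy, gamma=0.4, lam=2.0):
    """Return A_hat = A + psi_t for one token.

    Assumed form: psi_t = min(gamma * H_t, |A| / lam).
    With lam > 1 the bonus is smaller than |A|, so the sign of the
    advantage is never flipped. The entropy is treated as a constant
    (no gradient flows through it in this sketch).
    """
    bonus = min(gamma * token_entropy, abs(advantage) / lam)
    return advantage + bonus


# High-entropy token in an incorrect rollout: the penalty is softened.
softened = shaped_advantage(-0.5, token_entropy=1.0)

# Low-entropy token in a correct rollout: only a small bonus is added.
boosted = shaped_advantage(0.5, token_entropy=0.1)
```

Under this assumed form, an uncertain token in a wrong answer is penalized less than a confident one, which slows the drop in entropy.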

Training Data:

Vision-R1-RL dataset (approx. 10K samples)
Includes real-world images (charts, diagrams, visual math)

Key Hyperparameters:

learning_rate: 1e-6
weight_decay: 0.01
batch_size: Not reported in the paper
training_steps: 120
entropy_shaping_gamma: 0.4
entropy_shaping_lambda: 2.0
discount_factor_eta_min: 0.875
discount_factor_eta_max: 0.96
kl_target: 0.01
kl_reg_coefficient: 0.01

Compute: Trained on 8 GPUs

Comparison to Prior Work

vs. GRPO: Uses 1 rollout per input instead of a group of G rollouts, and replaces group normalization with a Beta baseline plus entropy shaping.
vs. MVSR: Adds entropy-based advantage shaping to prevent collapse.
vs. Text-only Single-Rollout: Adapts baseline estimation and stabilization specifically for high-variance multimodal inputs.

Limitations

Requires verifiable binary rewards (e.g., math/logic problems), limiting applicability to open-ended tasks.
Performance depends on the quality of the Vision-R1-RL dataset.
The method introduces additional hyperparameters (gamma, lambda) for the entropy bonus that need tuning.

Reproducibility

Code: https://github.com/RuiLie/MSSR

The code is publicly available and built on the EasyR1 framework. Training uses the public Vision-R1-RL dataset. Hyperparameters for entropy shaping and discount factors are explicitly provided.

📊 Experiments & Results

Evaluation Setup

Multimodal reasoning on mathematical and general-domain tasks

Benchmarks:

MathVerse (Multimodal Math Reasoning)
MathVista (Visual Math Reasoning)
MMK12 (K-12 Education)
R1-Onevision-Bench (General Multimodal Reasoning)
HallusionBench (Visual Hallucination/Reasoning)

Metrics:

Accuracy
Statistical methodology: Not explicitly reported in the paper

Key Results

MSSR consistently outperforms baselines at the 3B model scale across 5 benchmarks.
Benchmark	Metric	Baseline	This Paper	Δ
MathVerse	Accuracy	46.2	47.8	+1.6
MathVista	Accuracy	57.3	58.9	+1.6
MMK12	Accuracy	45.8	48.1	+2.3

MSSR maintains superiority at the 7B model scale.
Benchmark	Metric	Baseline	This Paper	Δ
MathVerse	Accuracy	52.3	53.9	+1.6
MMK12	Accuracy	51.1	55.0	+3.9
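
As a quick arithmetic check, the sketch below recomputes the gains from the rows listed in this table. It covers only the rows shown here (three for 3B, two for 7B), so its averages are not expected to match the five-benchmark averages of 2.1 and 2.3 quoted in Evaluation Highlights. The variable and function names are illustrative.

```python
# (benchmark, baseline accuracy, MSSR accuracy) copied from the table rows.
rows = {
    "3B": [
        ("MathVerse", 46.2, 47.8),
        ("MathVista", 57.3, 58.9),
        ("MMK12", 45.8, 48.1),
    ],
    "7B": [
        ("MathVerse", 52.3, 53.9),
        ("MMK12", 51.1, 55.0),
    ],
}


def mean_gain(entries):
    """Average accuracy gain in points over the given rows."""
    gains = [new - base for _, base, new in entries]
    return round(sum(gains) / len(gains), 2)


listed_gain = {scale: mean_gain(entries) for scale, entries in rows.items()}
```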

Experiment Figures

Analysis of entropy collapse: Plots of policy entropy over training steps for MVSR vs MSSR.

Main Takeaways

MSSR achieves comparable validation accuracy to GRPO with only half the training steps, indicating significantly better sample and compute efficiency.
Naive single-rollout (MVSR) suffers from severe training instability and performance degradation, confirming the necessity of the proposed stabilization techniques.
Ablation studies show that entropy-based advantage shaping is more effective than alternative regularizers like KL regularization or cross-modal text-only branches.
Generalization improvements are consistent across diverse benchmarks (math, general reasoning, hallucination) and model scales (3B, 7B).

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning with Verifiable Rewards (RLVR)
Policy Gradient methods (PPO, GRPO)
Multimodal Large Language Models (MLLMs)
Beta distribution for binary outcomes

Key Terms

RLVR: Reinforcement Learning with Verifiable Rewards—RL using objective correctness signals (e.g., math answers) rather than human feedback

GRPO: Group Relative Policy Optimization—a prevalent RLVR method that samples a group of outputs per input and normalizes rewards within that group to reduce variance

MSSR: Multimodal Stabilized Single-Rollout—the proposed method using one rollout per input plus entropy shaping for stability

MVSR: Multimodal Vanilla Single-Rollout—a baseline single-rollout method without entropy shaping, used to demonstrate instability

entropy collapse: A failure mode where a policy becomes overly confident too quickly, losing diversity (randomness) and getting stuck in suboptimal behaviors
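
To make the term concrete, the sketch below computes the Shannon entropy of a next-token distribution. The two example distributions are made up for illustration; a collapsed policy resembles the peaked one at almost every token.

```python
import math


def shannon_entropy(probs):
    """Entropy in nats of a discrete distribution; 0 * log(0) counts as 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)


exploring = shannon_entropy([0.25, 0.25, 0.25, 0.25])  # uniform over 4 tokens
collapsed = shannon_entropy([0.97, 0.01, 0.01, 0.01])  # nearly deterministic
```

The uniform distribution reaches the maximum entropy for four outcomes, log(4), while the peaked one is far lower, which is the signature tracked when monitoring for collapse.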

advantage shaping: Modifying the calculated advantage (learning signal) by adding auxiliary terms (like entropy) to guide optimization

Beta distribution: A probability distribution defined on the interval [0, 1], used here to estimate the expected probability of getting a correct reward

KL divergence: Kullback-Leibler divergence, a measure of how much one probability distribution differs from another (it is asymmetric, so not a true distance metric), used here to track policy change
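
A minimal sketch of the KL divergence between two discrete distributions follows. The function name and the example distributions are illustrative, and the code assumes `q` is nonzero wherever `p` is nonzero.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions of equal length.

    Assumes q[i] > 0 wherever p[i] > 0; terms with p[i] == 0 contribute 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


old_policy = [0.5, 0.5]
new_policy = [0.9, 0.1]

forward = kl_divergence(old_policy, new_policy)
reverse = kl_divergence(new_policy, old_policy)
```

The two directions give different values, so an implementation has to fix which policy plays the role of `p` when measuring change between consecutive updates.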