Act Only When It Pays: Efficient Reinforcement Learning for LLM Reasoning via Selective Rollouts

📝 Paper Summary

Reinforcement Learning for LLMs · Efficient Training Methods · Mathematical Reasoning

GRESO improves RL training efficiency by predicting and skipping uninformative prompts (those yielding zero reward variance) before the expensive rollout stage, leveraging the temporal consistency of prompt difficulty.

Core Problem

Scaling up rollouts in RL improves model performance but adds massive computational overhead, much of it wasted: many prompts yield 'zero variance' (identical rewards across all responses) and provide no learning signal.

Why it matters:

Rollout is a major bottleneck in RL training (e.g., PPO, GRPO), consuming significant GPU hours
Existing methods like Dynamic Sampling filter uninformative data only *after* generating it, wasting computation on useless samples
Static dataset pruning fails to adapt to the model's evolving capabilities during training

Concrete Example: In GRPO, if all 16 responses sampled for a prompt such as '2+2' receive the same reward (all correct or all incorrect), every advantage is zero and no learning occurs. Standard methods generate all 16 samples before discovering this; GRESO predicts the outcome beforehand and skips generation for that prompt.
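The zero-advantage effect in this example can be checked with a few lines of Python. This is a minimal sketch of group-normalized advantages as commonly described for GRPO; the function name `group_advantages` and the small `eps` stabilizer are illustrative assumptions, not taken from the paper.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group: (r - mean) / (std + eps).

    `eps` is an assumed stabilizer that avoids division by zero
    when every reward in the group is identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Zero-variance group: all 16 responses are correct, so every advantage is 0.
all_correct = group_advantages([1.0] * 16)

# Informative group: 4 correct, 12 incorrect, so the advantages differ in sign.
mixed = group_advantages([1.0] * 4 + [0.0] * 12)
```

In the first group every term is zero, so the policy-gradient contribution vanishes no matter how the 16 responses were generated. In the second, correct responses get a positive advantage and incorrect ones a negative advantage.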

Key Novelty

GRPO with Efficient Selective Rollout (GRESO)

Identify 'zero-variance' prompts (prompts where all responses get identical rewards) as uninformative because they produce zero advantage signal
Leverage 'temporal consistency': prompts that were zero-variance in previous epochs are highly likely to remain so in the current epoch
Implement an online probabilistic filter that skips these prompts before rollout, using an adaptive exploration rate to occasionally re-check them
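One way to picture the three steps above is the following sketch of a pre-rollout filter. It is an illustration under stated assumptions, not the paper's Algorithm 1: the class name, the per-prompt streak counter, the skip rule `1 - p_explore ** streak`, and the initial `p_explore=0.5` are all hypothetical. Only the exploration step (1%) and the target zero-variance ratio (25%) come from the hyperparameters listed in this summary.

```python
import random

class SelectiveRolloutFilter:
    """Hypothetical pre-rollout filter; every name here is an assumption."""

    def __init__(self, p_explore=0.5, delta_p=0.01, target_zero_var_ratio=0.25, seed=0):
        self.p_explore = p_explore            # assumed initial exploration rate
        self.delta_p = delta_p                # exploration step (1% in the summary)
        self.target = target_zero_var_ratio   # target zero-variance ratio (25%)
        self.zero_streak = {}                 # prompt_id -> consecutive zero-variance rollouts
        self.rng = random.Random(seed)

    def skip_probability(self, prompt_id):
        # Assumed rule: the longer the zero-variance streak, the likelier the skip.
        streak = self.zero_streak.get(prompt_id, 0)
        return 1.0 - self.p_explore ** streak

    def should_rollout(self, prompt_id):
        # Prompts with no zero-variance history have skip probability 0 and always pass.
        return self.rng.random() >= self.skip_probability(prompt_id)

    def record(self, prompt_id, rewards):
        # A group is zero-variance when every response received the same reward.
        if len(set(rewards)) <= 1:
            self.zero_streak[prompt_id] = self.zero_streak.get(prompt_id, 0) + 1
        else:
            self.zero_streak[prompt_id] = 0

    def adapt(self, observed_zero_var_ratio):
        # Too many zero-variance prompts still slip through: explore less, filter more.
        if observed_zero_var_ratio > self.target:
            self.p_explore -= self.delta_p
        else:
            self.p_explore += self.delta_p
        self.p_explore = min(1.0, max(self.delta_p, self.p_explore))
```

Because skipping is probabilistic, a prompt with a long zero-variance streak is still re-checked occasionally, which lets it re-enter training if the model later starts solving it some of the time.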

Architecture

The GRESO workflow comparing standard rollout vs. selective rollout.

Evaluation Highlights

Achieves up to 2.4x speedup in rollout time and 2.0x speedup in total training time compared to standard GRPO with Dynamic Sampling
Maintains comparable accuracy on math reasoning benchmarks (e.g., GSM8K, MATH) while processing significantly fewer uninformative prompts
Demonstrated effectiveness across multiple models, including Qwen2.5-Math-1.5B, DeepSeek-R1-Distill-Qwen-1.5B, and Qwen2.5-Math-7B

Breakthrough Assessment

7/10

Simple yet highly effective efficiency improvement for RLVR. While primarily an engineering optimization rather than a fundamental algorithmic shift, the reported speedups (up to 2.4x in rollout time and 2.0x in total training time) are practically significant for scaling LLM reasoning.

⚙️ Technical Details

Problem Definition

Setting: Reinforcement Learning with Verifiable Reward (RLVR) for mathematical reasoning tasks

Inputs: A set of reasoning prompts (e.g., math problems)

Outputs: Generated reasoning chains and final answers

Pipeline Flow

Prompt Selector (Online Filtering)
Generator (Rollout)
Reward Calculator
GRPO Update

System Modules

Prompt Selector

Decides whether to keep or skip a prompt based on historical reward variance

Model or implementation: Probabilistic Filter (Algorithm 1)

Generator

Generates multiple responses for selected prompts

Model or implementation: Policy Model (e.g., Qwen2.5-Math-7B)

Novel Architectural Elements

Pre-rollout filtering mechanism injected before the generation step in the standard RL loop
Adaptive batch sizing logic to dynamically request exactly enough prompts to fill the effective batch size
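The adaptive batch sizing idea can be sketched as a simple estimate: if a known fraction of recently rolled-out prompts turned out to be zero-variance, request proportionally more prompts so that the informative ones fill the batch. The function below is an assumption-laden illustration; its name, the `max_request` cap, and the floor on the keep rate are hypothetical, and the paper's actual rule may differ.

```python
import math

def prompts_to_request(needed, recent_zero_var_ratio, max_request=None):
    """Estimate how many prompts to roll out to obtain `needed` informative ones.

    `recent_zero_var_ratio` is the fraction of recently rolled-out prompts
    that had zero reward variance (assumed to be tracked by the caller).
    """
    # Floor the keep rate so a ratio of 1.0 does not cause division by zero.
    keep_rate = max(1.0 - recent_zero_var_ratio, 1e-3)
    n = math.ceil(needed / keep_rate)
    return n if max_request is None else min(n, max_request)

# With 96 slots left to fill and 25% of recent rollouts being zero-variance,
# the estimate is 96 / 0.75 = 128 prompts.
request = prompts_to_request(96, 0.25)
```

The estimate is only as good as the recent ratio, so a real loop would likely repeat the request until the batch is full rather than rely on a single call.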

Modeling

Base Model: Qwen2.5-Math-1.5B, DeepSeek-R1-Distill-Qwen-1.5B, Qwen2.5-Math-7B

Training Method: GRPO (Group Relative Policy Optimization) with Selective Rollout

Objective Functions:

Purpose: Maximize expected reward while staying close to reference policy.

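Formally: the GRPO objective maximizes the group-averaged clipped surrogate (the policy probability ratio, clipped to [1 - epsilon, 1 + epsilon], multiplied by the group-normalized advantage) minus a KL-divergence penalty to the reference policy, weighted by beta.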
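For reference, the commonly cited form of the GRPO objective is written out below. This is the standard formulation from the GRPO literature, given as an assumption about what the summary describes; the paper's exact variant (for example, how token-level terms are normalized) may differ. Here q is a prompt, o_1..o_G are the G sampled responses, and r_i is the reward of response o_i.

```latex
\begin{aligned}
J(\theta) &= \mathbb{E}_{q,\,\{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}}
\Bigg[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|}
\min\Big( \rho_{i,t}\,\hat{A}_i,\;
\operatorname{clip}\big(\rho_{i,t},\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_i \Big)
\;-\; \beta\, D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\text{ref}}\big) \Bigg], \\[4pt]
\rho_{i,t} &= \frac{\pi_\theta(o_{i,t} \mid q,\, o_{i,<t})}
{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q,\, o_{i,<t})},
\qquad
\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1,\dots,r_G)}{\operatorname{std}(r_1,\dots,r_G)}.
\end{aligned}
```

When all r_i are equal, the numerator of every advantage is zero, so the clipped surrogate term vanishes for that prompt. This is why zero-variance prompts carry no reward-driven learning signal.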

Trainable Parameters: Full model parameters (or LoRA, though paper implies full fine-tuning for main results)

Training Data:

Datasets: GSM8K, MATH, NuminaMath-CoT
Training partition size not explicitly detailed but standard splits implied

Key Hyperparameters:

group_size_G: 16
learning_rate: 1e-6
beta (KL penalty): 0.04
max_prompt_length: 512
max_response_length: 1024
delta_p (exploration step): 1%
target_zero_variance_ratio: 25%

Compute: Experiments run on H100 GPUs (specific count per run varies, e.g., 8xH100 for 7B models)

Comparison to Prior Work

vs. Dynamic Sampling: GRESO filters *before* rollout, saving compute; DS filters *after*, wasting compute
vs. Static Pruning: GRESO adapts online to changing model capabilities; Static Pruning is fixed and may discard prompts that become useful later
vs. RFT (Rejection Sampling Fine-Tuning) [not cited in paper]: RFT is offline selection for SFT; GRESO is online selection for RL
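The difference between filtering after rollout and filtering before it can be illustrated by counting generated responses on a toy prompt stream. This is a simplified sketch, not either method's actual implementation: the function names, the boolean encoding of prompts, and the hand-written predictions are all hypothetical.

```python
def count_rollouts_post_filter(prompts, batch_size, group_size):
    """Dynamic-Sampling-style: roll out every drawn prompt, then drop zero-variance ones.

    `prompts` is a list of booleans: True means the prompt is zero-variance.
    """
    kept, generated = 0, 0
    for is_zero_variance in prompts:
        if kept == batch_size:
            break
        generated += group_size          # cost is paid before the filter runs
        if not is_zero_variance:
            kept += 1
    return generated

def count_rollouts_pre_filter(prompts, predicted_zero_variance, batch_size, group_size):
    """Selective-rollout-style: skip prompts predicted to be zero-variance."""
    kept, generated = 0, 0
    for is_zero_variance, predicted in zip(prompts, predicted_zero_variance):
        if kept == batch_size:
            break
        if predicted:
            continue                     # skipped before any generation
        generated += group_size
        if not is_zero_variance:
            kept += 1
    return generated

# Toy stream: three zero-variance prompts, three informative ones.
stream = [True, False, True, True, False, False]
# Imperfect predictions: the third prompt is wrongly predicted to be informative.
predictions = [True, False, False, True, False, False]

post = count_rollouts_post_filter(stream, batch_size=3, group_size=16)
pre = count_rollouts_pre_filter(stream, predictions, batch_size=3, group_size=16)
```

On this toy stream the post-rollout filter generates 96 responses to fill a batch of three informative prompts, while the pre-rollout filter generates 64 even with one misprediction. The saving depends entirely on how accurate the predictions are.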

Limitations

Relies on the assumption that zero-variance prompts remain uninformative, which might not hold if a model suddenly has a 'breakthrough' on a specific type of problem (though exploration mitigates this)
Currently tailored for GRPO and reasoning tasks with binary/verifiable rewards; applicability to PPO with continuous rewards or subjective tasks (e.g., creative writing) is less clear
Requires maintaining history/state for every prompt, which could add minor memory overhead for massive datasets

Reproducibility

Code: https://github.com/Infini-AI-Lab/GRESO/

Hyperparameters and algorithms are detailed in the paper.

📊 Experiments & Results

Evaluation Setup

Mathematical reasoning tasks using verifiable rewards (correct final answer)

Benchmarks:

GSM8K (Grade school math)
MATH (Challenging math problems)
AIME24 (Math competition)
AMC (Math competition)
Minerva Math (Math reasoning)
OlympiadBench (Math olympiad)

Metrics:

Accuracy (Pass@1)
Rollout Time
Total Training Time
Statistical methodology: Not explicitly reported in the paper

Key Results

Efficiency results showing speedups of GRESO compared to GRPO with Dynamic Sampling (DS) on Qwen2.5-Math-7B.
Benchmark	Metric	Baseline	This Paper	Δ
Training Pipeline	Rollout Speedup	1.0x	2.4x	+1.4x (2.4x total)
Training Pipeline	Total Training Speedup	1.0x	2.0x	+1.0x (2.0x total)

Accuracy results demonstrating that GRESO maintains performance parity with the more expensive Dynamic Sampling baseline.
Benchmark	Metric	Baseline	This Paper	Δ
Average (6 Math Benchmarks)	Accuracy	61.3	61.5	+0.2
Average (6 Math Benchmarks)	Accuracy	59.3	61.5	+2.2
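The two speedup figures can be related with a back-of-envelope Amdahl-style calculation. The sketch below assumes that only rollout time changes between the baseline and GRESO and that both figures come from the same run; neither assumption is confirmed by this summary, so the implied fraction is an illustration rather than a reported result. The function names are hypothetical.

```python
def total_speedup(rollout_fraction, rollout_speedup):
    """Total speedup when only the rollout share of baseline time is accelerated."""
    return 1.0 / ((1.0 - rollout_fraction) + rollout_fraction / rollout_speedup)

def implied_rollout_fraction(total, rollout_speedup):
    """Invert the formula above: share of baseline time that rollout must occupy."""
    return (1.0 - 1.0 / total) / (1.0 - 1.0 / rollout_speedup)

# Under the stated assumptions, a 2.4x rollout speedup yielding a 2.0x total
# speedup implies rollout takes 6/7 (about 86%) of baseline training time.
fraction = implied_rollout_fraction(2.0, 2.4)
```

Read this way, the numbers are consistent with the claim that rollout dominates training time, which is why accelerating rollout alone can roughly halve the total.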

Experiment Figures

Temporal correlation of zero-variance prompts across training epochs.

Comparison of Accuracy vs. Rollout Overhead for Vanilla GRPO vs. Dynamic Sampling (DS).

Main Takeaways

Zero-variance prompts (all sampled responses receive the same reward) provide no learning signal in GRPO and constitute a large portion of training data (up to 80% in late stages).
Prompt difficulty exhibits strong temporal consistency: prompts that are zero-variance in one epoch are >90% likely to remain so in the next.
GRESO successfully exploits this consistency to skip uninformative rollouts, achieving ~2x training speedups without degrading model accuracy.
Adaptive exploration is crucial: a small fraction of zero-variance prompts do become solvable later, so probabilistic rather than deterministic skipping is required.
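The temporal-consistency claim can be measured directly from logged rewards. The sketch below computes, for two consecutive epochs, the share of previously zero-variance prompts that are zero-variance again. The function name, the dictionary layout, and the toy reward logs are hypothetical and are not data from the paper.

```python
def zero_variance_persistence(prev_epoch, curr_epoch):
    """Fraction of prompts that were zero-variance in `prev_epoch` and remain so.

    Both arguments map prompt_id -> list of rewards for that prompt's group.
    Prompts missing from `curr_epoch` are ignored.
    """
    def is_zero_variance(rewards):
        return len(set(rewards)) == 1

    seen_again = [p for p, r in prev_epoch.items()
                  if is_zero_variance(r) and p in curr_epoch]
    if not seen_again:
        return float("nan")
    still_zero = [p for p in seen_again if is_zero_variance(curr_epoch[p])]
    return len(still_zero) / len(seen_again)

# Toy logs (made up): prompts a, b, d are zero-variance in the first epoch;
# b becomes informative in the second, so persistence is 2 out of 3.
epoch_1 = {"a": [1, 1, 1, 1], "b": [0, 0, 0, 0], "c": [0, 1, 0, 1], "d": [0, 0, 0, 0]}
epoch_2 = {"a": [1, 1, 1, 1], "b": [0, 0, 1, 0], "c": [1, 1, 1, 1], "d": [0, 0, 0, 0]}
persistence = zero_variance_persistence(epoch_1, epoch_2)
```

Prompt b in the toy logs is the case the last takeaway describes: it was unsolved in one epoch and partially solved in the next, which deterministic skipping would never discover.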

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (PPO, GRPO)
Language Model Fine-tuning
Variance and Advantage estimation

Key Terms

GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes rewards within a group of outputs for the same prompt to compute advantages, avoiding the need for a separate value function

Rollout: The process of generating model responses (reasoning traces) for a given prompt during RL training

Zero-variance prompt: A prompt for which all sampled responses in a group receive the exact same reward, resulting in zero advantage and zero learning signal

Dynamic Sampling: A baseline method that filters out zero-variance prompts *after* rollout and resamples new prompts to fill the batch, ensuring high-quality training data but at high computational cost

Temporal consistency: The observation that the 'hardness' or 'informativeness' of a prompt (whether it yields variance) tends to persist across training epochs

Advantage: A measure of how much better a specific action is compared to the average action in that state; in GRPO, it's the normalized reward within a group