GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing a group of responses to the same prompt against their group average, removing the need for a critic model
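The group-relative advantage at the core of GRPO can be sketched as follows. This is a minimal illustration, not the full algorithm (it omits the policy-gradient update and clipping); the function name and the zero-std fallback are my own.

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: each response's reward is compared
    against the mean of its group and normalized by the group's
    standard deviation, so no learned critic is needed."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # fallback when all rewards are equal
    return [(r - mean) / std for r in rewards]

# Four sampled responses to one prompt, scored by a verifiable reward:
print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))  # → [1.0, -1.0, 1.0, -1.0]
```

Correct responses get positive advantage and incorrect ones negative, purely from within-group comparison.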
RLVR: Reinforcement Learning from Verifiable Rewards—RL setting where the reward is based on a verifiable outcome (e.g., correct math answer) rather than a human preference model
SFT: Supervised Fine-Tuning—training on ground-truth reasoning traces before applying RL
Excess Length Reduction: A metric quantifying how much a method reduces the RL-induced increase in response length, measured relative to the original SFT model baseline
Token Efficiency: A metric defined as reward divided by response length, prioritizing high-reward answers that use fewer tokens
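As a quick illustration of the ratio (function name assumed for illustration):

```python
def token_efficiency(reward, length):
    """Token efficiency = reward / response length (in tokens):
    at equal reward, the shorter response scores higher."""
    return reward / length

# A 200-token correct answer beats an 800-token correct answer:
print(token_efficiency(1.0, 200) > token_efficiency(1.0, 800))  # → True
```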
t-digest: A probabilistic data structure for estimating quantiles (e.g., median, percentiles) from streaming data with low memory footprint
KL penalty: Kullback-Leibler divergence penalty—a regularization term ensuring the RL policy does not drift too far from the reference model
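A per-token KL penalty is often approximated from the log-probabilities the two models assign to the sampled tokens. A minimal sketch, assuming the simple log-ratio estimate KL ≈ log π(a) − log π_ref(a); the coefficient `beta` and function name are illustrative:

```python
def kl_penalty(policy_logprobs, ref_logprobs, beta=0.1):
    """Per-token KL penalty, subtracted from the reward during RL.
    Uses the log-ratio estimate of KL divergence per sampled token;
    beta scales how strongly drift from the reference is punished."""
    return [beta * (lp - rp) for lp, rp in zip(policy_logprobs, ref_logprobs)]

# Where the policy assigns higher probability than the reference,
# the penalty is positive; where they agree, it is zero:
print(kl_penalty([-0.5, -1.0], [-1.5, -1.0]))  # → [0.1, 0.0]
```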
Pareto-optimal: A state where no metric (e.g., accuracy) can be improved without degrading another (e.g., length)