Shorter but not Worse: Frugal Reasoning via Easy Samples as Length Regularizers in Math RLVR

📝 Paper Summary

Mathematical Reasoning RLHF / RLVR (Reinforcement Learning with Verifiable Rewards)

Training reasoning models on a mix of easy and hard problems prevents runaway verbosity by teaching the model that rewards can be obtained with concise solutions.

Core Problem

Standard RLVR pipelines filter out 'easy' problems to maximize gradient signals, leaving only hard problems that require long reasoning chains. This biases models to believe 'longer is better,' leading to excessive verbosity and increased inference costs.

Why it matters:

Inference latency and compute costs for 'thinking' models (like o1 or R1) are prohibitively high due to extremely long generation sequences.
Existing methods rely on explicit length penalties which can degrade performance, whereas this approach uses data distribution to implicitly regularize length.
Models conflate verbosity with reasoning quality, often generating redundant tokens just to reduce uncertainty via longer prefixes.

Concrete Example: In standard GRPO (Group Relative Policy Optimization), a group of rollouts on an easy problem (e.g., 1+1=2) yields zero advantage when every rollout is correct, so such problems are dropped. The model then trains only on hard problems, such as geometry questions requiring 50 steps. Consequently, when later asked a medium-difficulty algebra question, the model produces an unnecessary 50-step derivation because it has learned that only long answers earn rewards.
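The zero-advantage effect described in the example can be illustrated with a minimal sketch. It assumes binary rewards and the common mean/standard-deviation normalization within a group; the function name and the `eps` term are illustrative, not taken from the paper.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    # Normalize each rollout's reward against its own group (GRPO-style).
    # eps is an assumed stabilizer that avoids division by zero.
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Easy problem: all 8 rollouts are correct, so every advantage is zero
# and the prompt contributes no gradient.
easy_adv = group_relative_advantages([1.0] * 8)

# Moderately easy problem (success rate 0.75): correct rollouts get a
# positive advantage, incorrect ones a negative advantage.
mixed_adv = group_relative_advantages([1, 1, 1, 1, 1, 1, 0, 0])
```

A prompt whose rollouts all succeed (or all fail) carries no learning signal under this normalization, which is why such prompts are commonly filtered out.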

Key Novelty

Implicit Length Regularization via Easy Samples

Retains and upweights moderately easy problems (success rate < 1) during RLVR training instead of filtering them out.
Exposes the policy to short, solvable trajectories that yield positive rewards, counterbalancing the bias toward long chains from hard problems.
Demonstrates that verbosity can be curbed purely through data curation without modifying the reward function or loss objective.
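One way to picture "retain and upweight" is as a sampling-weight rule over per-problem success rates. The sketch below is hypothetical: the threshold `easy_threshold`, the multiplier `easy_boost`, and the choice to give zero weight to problems with success rate 0 or 1 are assumptions made for illustration, not values reported in the paper.

```python
import random

def sampling_weights(success_rates, easy_threshold=0.5, easy_boost=2.0):
    # easy_threshold and easy_boost are hypothetical knobs.
    weights = []
    for p in success_rates:
        if p <= 0.0 or p >= 1.0:
            weights.append(0.0)         # no group-relative signal (assumption: dropped)
        elif p >= easy_threshold:
            weights.append(easy_boost)  # moderately easy: sampled more often
        else:
            weights.append(1.0)         # hard but solvable: default weight
    return weights

rates = [0.0, 0.125, 0.5, 0.875, 1.0]
weights = sampling_weights(rates)

# Draw problem indices in proportion to their weights.
rng = random.Random(0)
batch = rng.choices(range(len(rates)), weights=weights, k=1000)
```

The reward function and loss stay untouched in this picture; only the distribution of prompts the policy sees changes.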

Architecture

Success rate distribution vs. token length. Left: Token length increases with difficulty. Right: Bimodal success rate distribution (p=0 and p=1) with a sparse intermediate region.

Evaluation Highlights

Frugal-Thinking-30B-A3B achieves 70.0% Pass@1 on AIME25, outperforming QwQ-32B-Preview (50.0%) by 20.0 points.
Reduces average solution length by nearly 2x compared to the initial verbose policy while maintaining baseline accuracy on the AIME25 benchmark.
Achieves 92.2% Pass@1 on MATH-500 with the 30B model, surpassing the QwQ-32B-Preview baseline of 90.6%.

Breakthrough Assessment

7/10

Simple yet counter-intuitive finding that contradicts standard RLVR efficiency heuristics. Offers a practical, 'free' method to reduce inference costs for reasoning models, though the core innovation is data-centric rather than architectural.

⚙️ Technical Details

Problem Definition

Setting: Reinforcement Learning with Verifiable Rewards (RLVR) on mathematical queries

Inputs: Mathematical query x

Outputs: Reasoning chain z followed by boxed final answer y

Pipeline Flow

Input Query -> Policy Model (Qwen3-Thinking) -> Auto-regressive Generation (Reasoning + Answer) -> Output Verification
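The final "Output Verification" step can be sketched as extracting the boxed answer and comparing it to a reference. This is a simplified illustration that assumes plain string equality after whitespace stripping; the function names are hypothetical, and a real verifier would likely normalize mathematically equivalent forms.

```python
def extract_boxed(text):
    # Return the contents of the last \boxed{...} in text, or None.
    marker = r"\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    i = start + len(marker)
    depth = 1
    chars = []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars).strip()
        chars.append(ch)
        i += 1
    return None  # braces never closed

def verify(completion, reference):
    # Binary verifiable reward: 1.0 on exact string match, else 0.0.
    answer = extract_boxed(completion)
    return 1.0 if answer is not None and answer == reference.strip() else 0.0
```

The brace counter lets nested expressions such as `\boxed{\frac{1}{2}}` be extracted whole.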

System Modules

Policy Model

Generate step-by-step reasoning and final answer

Model or implementation: Qwen3-4B-Thinking-2507 or Qwen3-30B-A3B-Thinking-2507

Modeling

Base Model: Qwen3-4B-Thinking-2507 (Dense) and Qwen3-30B-A3B-Thinking-2507 (MoE)

Training Method: Group Relative Policy Optimization (GRPO)

Objective Functions:

Purpose: Optimize policy to maximize expected reward.

Formally: GRPO objective maximizing advantage A_i calculated relative to group mean reward.

Purpose: Penalize deviation from reference policy.

Formally: KL divergence term subtracted from reward.
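For reference, the two objectives can be written out in the form GRPO commonly takes. This is a sketch of the standard formulation, not a transcription of the paper's equations: here G is the group size, o_i = (z_i, y_i) is the i-th completion for query x, r_i is its verifiable reward, ε is the clip ratio, and β is the KL coefficient. The KL term is shown as a penalty on the objective; the summary describes it as subtracted from the reward, and the paper's exact placement and normalization may differ.

```latex
% Group-relative advantage of rollout i within its group of G rollouts
A_i = \frac{r_i - \operatorname{mean}(r_1,\dots,r_G)}{\operatorname{std}(r_1,\dots,r_G)}

% Token-level importance ratio between the current and the sampling policy
\rho_{i,t} = \frac{\pi_\theta(o_{i,t} \mid x, o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t} \mid x, o_{i,<t})}

% Clipped surrogate objective with a KL penalty toward the reference policy
\mathcal{J}(\theta) = \mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|}
  \min\!\left( \rho_{i,t} A_i,\ \operatorname{clip}(\rho_{i,t},\, 1-\epsilon,\, 1+\epsilon)\, A_i \right) \right]
  - \beta\, D_{\mathrm{KL}}\!\left( \pi_\theta \,\|\, \pi_{\mathrm{ref}} \right)
```

When every r_i in a group is identical, the numerator of A_i is zero for all rollouts, so the group contributes nothing to the surrogate term.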

Adaptation: Full fine-tuning (implied by RLVR on base models)

Training Data:

Stage 1: Internal math dataset (Ji et al.) retaining easy/medium samples (success rate p < 1).
Stage 2: Subset of DeepMath-103K filtered for difficulty (14.5k samples), excluding unsolvable and trivial cases.
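The difficulty filtering for both stages depends on an empirical success rate per problem. A minimal sketch follows, assuming the rate is estimated from a handful of binary-reward rollouts and that "unsolvable" and "trivial" mean an estimated rate of exactly 0 and 1; the function names and sample data are hypothetical.

```python
def estimate_success_rate(rollout_rewards):
    # Fraction of rollouts that the verifier marked correct.
    return sum(rollout_rewards) / len(rollout_rewards)

def keep_for_training(rollout_rewards):
    # Keep problems that are neither never solved (p = 0) nor always solved (p = 1).
    p = estimate_success_rate(rollout_rewards)
    return 0.0 < p < 1.0

# Hypothetical rollouts, 8 per problem.
rollouts = {
    "trivial": [1, 1, 1, 1, 1, 1, 1, 1],
    "moderately_easy": [1, 1, 1, 1, 1, 1, 0, 1],
    "hard": [0, 0, 0, 1, 0, 0, 0, 0],
    "unsolved": [0, 0, 0, 0, 0, 0, 0, 0],
}
kept = [name for name, rewards in rollouts.items() if keep_for_training(rewards)]
```

With few rollouts per problem the estimate is coarse, so a problem near either extreme may be misclassified.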

Key Hyperparameters:

learning_rate: 5e-7 (30B model) / 1e-6 (4B model)
batch_size: 256 (global)
group_size: 8 (number of outputs per prompt)
beta (KL coeff): 0.01 (30B) / 0.04 (4B)
clip_ratio: 0.2
max_completion_length: 16384
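The listed values can be collected into a small configuration object for side-by-side comparison of the two model sizes. The class and field names below are illustrative only and are not tied to any particular training library.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FrugalGRPOSettings:
    # Hypothetical container; the values are those listed above.
    learning_rate: float
    kl_beta: float
    global_batch_size: int = 256
    group_size: int = 8               # outputs sampled per prompt
    clip_ratio: float = 0.2
    max_completion_length: int = 16384

SETTINGS_30B = FrugalGRPOSettings(learning_rate=5e-7, kl_beta=0.01)
SETTINGS_4B = FrugalGRPOSettings(learning_rate=1e-6, kl_beta=0.04)
```

Only the learning rate and KL coefficient differ between the two models; the remaining settings are shared.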

Compute: Training was done on H800 GPUs; the number of GPUs is not explicitly specified.

Comparison to Prior Work

vs. Standard RLVR: Retains 'easy' samples (success rate p close to, but below, 1) to teach conciseness naturally rather than filtering them out for efficiency.
vs. Explicit Length Penalty: Achieves brevity via data distribution rather than reward shaping, avoiding the trade-off where models cut reasoning short prematurely.

Limitations

Requires a two-stage training process (Emergent Brevity then Curriculum) which may be more complex to tune.
Implicit regularization relies on the specific mix of easy/hard data; the optimal ratio is not theoretically derived.
Evaluation is limited to math benchmarks; generalization to coding or general reasoning is not extensively tested.

Reproducibility

Code: https://hf.co/collections/MBZUAI-Paris/frugal-thinking

Models and code are publicly available at the collection linked above. Dataset filtering logic is detailed. Exact compute hours are not reported.

📊 Experiments & Results

Evaluation Setup

Evaluated on mathematical reasoning benchmarks using exact match verification.

Benchmarks:

AIME25 (Competition Math (Integer Answer))
MATH-500 (General Math Problems)
Omni-MATH-Hard (Olympiad-level Math)
GPQA-Diamond (Graduate-Level Science QA)

Metrics:

Pass@1 Accuracy
Average Output Length (Tokens)
Efficiency Adjusted Accuracy (EAA)
Statistical methodology: Not explicitly reported in the paper
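The first two metrics are simple aggregates over per-problem records. A minimal sketch follows, assuming one generation per problem and a record layout with `correct` and `num_tokens` fields (both hypothetical). EAA is omitted because its exact formula is not given in this summary.

```python
def pass_at_1(results):
    # Percentage of problems whose single generated answer is correct.
    return 100.0 * sum(r["correct"] for r in results) / len(results)

def average_output_length(results):
    # Mean number of generated tokens per problem.
    return sum(r["num_tokens"] for r in results) / len(results)

# Hypothetical evaluation records, not data from the paper.
results = [
    {"correct": True, "num_tokens": 1200},
    {"correct": False, "num_tokens": 3000},
    {"correct": True, "num_tokens": 800},
    {"correct": False, "num_tokens": 5000},
]
accuracy = pass_at_1(results)
avg_len = average_output_length(results)
```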

Key Results

Performance of Frugal-Thinking models compared to baselines on key math benchmarks. Note: Baselines like QwQ-32B are significantly larger or specialized.

Benchmark	Metric	Baseline	This Paper	Δ
AIME25	Pass@1	50.0	70.0	+20.0
MATH-500	Pass@1	90.6	92.2	+1.6
AIME25	Pass@1	19.3	53.3	+34.0
AIME25	Avg Output Length (Tokens)	15462	7665	-7797
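As a quick arithmetic check on the table, the accuracy deltas and the ratio between the two reported AIME25 average lengths (15462 and 7665 tokens) can be recomputed directly. The variable names are illustrative.

```python
# (benchmark, baseline Pass@1, this paper Pass@1) as listed in the table
accuracy_rows = [
    ("AIME25", 50.0, 70.0),
    ("MATH-500", 90.6, 92.2),
    ("AIME25", 19.3, 53.3),
]
deltas = [round(ours - base, 1) for _, base, ours in accuracy_rows]

# Ratio of the longer to the shorter reported AIME25 average output length.
lengths = (15462, 7665)
length_ratio = max(lengths) / min(lengths)
```

The ratio is about 2.02, which matches the "nearly 2x" length reduction quoted in the evaluation highlights.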

Experiment Figures

Training dynamics of the 4B model during Stage 1 RLVR.

Main Takeaways

Including 'easy' problems in RLVR training acts as a powerful length regularizer, curbing the tendency of reasoning models to become overly verbose.
The method (Stage 1 'Emergent Brevity' + Stage 2 Curriculum) allows models to achieve high accuracy (70% on AIME25 for 30B) without wasting inference budget on redundant tokens.
Training dynamics show a sharp decrease in response length and clipping ratio when easy samples are introduced, stabilizing entropy early in training.
The Frugal-Thinking-30B model outperforms the QwQ-32B-Preview baseline on AIME25 by a large margin (+20.0 percentage points) while maintaining comparable or better performance on other benchmarks.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (RL) fundamentals
Large Language Models (LLMs) and Chain-of-Thought (CoT)
Proximal Policy Optimization (PPO) concepts

Key Terms

RLVR: Reinforcement Learning with Verifiable Rewards—training models using outcomes (like correct answers) rather than human preference labels

GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing multiple outputs for the same prompt against their group average, removing the need for a separate value function

Pass@1: The percentage of problems where the model's first generated answer is correct

MoE: Mixture of Experts—a model architecture where only a subset of parameters (experts) are active for each token, improving efficiency

EAA: Efficiency Adjusted Accuracy—a proposed metric that penalizes correct answers based on their token length relative to a maximum budget