
Hard Examples Are All You Need: Maximizing GRPO Post-Training Under Annotation Budgets

Benjamin Pikus, Pratyush Ranjan Tiwari, Burton Ye
Eternis Labs
arXiv (2025)
RL Reasoning Benchmark

📝 Paper Summary

LLM Post-training · Reinforcement Learning (RL) · Data Selection / Active Learning
When fine-tuning language models with GRPO under strict data budgets, selecting the hardest examples—those where the base model frequently fails—dramatically outperforms random or easy selection strategies.
Core Problem
Acquiring high-quality supervision data for LLM post-training is expensive, and it is unclear which subset of examples maximizes performance when annotation budgets are limited.
Why it matters:
  • Practical budgets often limit fine-tuning to a small fraction of available prompts, making selection strategy critical for ROI
  • Prior work lacks a systematic comparison of difficulty-based selection for group-based RL methods like GRPO
  • Inefficient data selection wastes compute on examples that provide zero learning signal (because the model already solves them deterministically)
Concrete Example: In GSM8K math problems, an 'easy' example might be a simple addition problem the model always gets right. Training on this yields zero reward variance (all correct) and thus zero gradients in GRPO. In contrast, a 'hard' multi-step problem yields mixed success/failure across group samples, providing the necessary contrastive signal for learning.
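The zero-gradient effect above falls directly out of GRPO's group-relative advantage. A minimal sketch (not the paper's code) of that normalization, showing why a group of identical rewards contributes nothing:

```python
# Minimal sketch: GRPO normalizes each reward against its group's mean and
# std. If every sampled response gets the same reward (e.g. the model
# always solves an easy prompt), all advantages are zero, so that prompt
# contributes zero policy gradient.

def grpo_advantages(rewards):
    """Group-relative advantages: (r - mean) / std, guarded when std == 0."""
    k = len(rewards)
    mean = sum(rewards) / k
    var = sum((r - mean) ** 2 for r in rewards) / k
    std = var ** 0.5
    if std == 0.0:
        # Zero reward variance -> zero learning signal for this prompt.
        return [0.0] * k
    return [(r - mean) / std for r in rewards]

easy_group = [1.0, 1.0, 1.0, 1.0]   # always correct -> no contrast
hard_group = [1.0, 0.0, 0.0, 1.0]   # mixed outcomes -> contrastive signal

print(grpo_advantages(easy_group))  # [0.0, 0.0, 0.0, 0.0]
print(grpo_advantages(hard_group))
```

The mixed-outcome group produces nonzero positive and negative advantages, which is exactly the contrastive signal the hard examples preserve.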
Key Novelty
Difficulty-Targeted GRPO Data Selection
  • Estimate prompt difficulty by sampling K responses from the base model and calculating the success rate (pass@K)
  • Select training subsets based on this difficulty: Hard (lowest pass rate), Easy (highest), Medium, or Random
  • Demonstrate that 'Hard' (or 'Base Wrong') examples are the only ones that provide a sustained gradient signal, because they maintain outcome variance longer than easy ones
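The selection recipe above can be sketched in a few lines. This is a hypothetical illustration, not the paper's code: `sample_and_grade` is a stand-in for running the base model once on a prompt and grading the response, and the 10% budget mirrors the hardest-10% setting evaluated below.

```python
# Hypothetical sketch of difficulty-targeted selection: score each prompt
# by the base model's empirical success rate over K samples, then keep the
# hardest fraction under the annotation budget.

def estimate_difficulty(prompts, sample_and_grade, k=8):
    """Return {prompt: success_rate in [0, 1]} estimated from K samples."""
    rates = {}
    for p in prompts:
        successes = sum(sample_and_grade(p) for _ in range(k))  # bool -> 0/1
        rates[p] = successes / k
    return rates

def select_hardest(rates, budget_frac=0.10):
    """Keep the budget_frac of prompts with the lowest success rate."""
    ranked = sorted(rates, key=rates.get)          # hardest (lowest) first
    n = max(1, int(len(ranked) * budget_frac))
    return ranked[:n]

# Toy usage with a fake grader that already knows each prompt's pass rate:
fake_rates = {"p_easy": 1.0, "p_medium": 0.5, "p_hard": 0.0}
hardest = select_hardest(fake_rates, budget_frac=0.34)
print(hardest)  # the single hardest prompt
```

Swapping the `sorted` key (or slicing from the other end) yields the 'Easy', 'Medium', and 'Random' baselines compared in the paper.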
Evaluation Highlights
  • Training on the hardest 10% of examples improves Qwen3-14B accuracy on GSM8K by +39.42 percentage points (pp), compared to just +8.26 pp for easy examples
  • Hard-example training is the only strategy yielding meaningful gains (+20% relative) on the out-of-distribution AIME2025 benchmark, while easy/random strategies show zero or negative transfer
  • Training exclusively on 'Base Wrong' examples (accuracy < 25%) consistently outperforms training on 'Base Right' examples by ~14% on average, even when the 'Base Right' set is significantly larger
Breakthrough Assessment
8/10
Provides a highly actionable, counter-intuitive finding for the specific (but popular) GRPO algorithm. The magnitude of improvement (>30pp gap) is significant, though the scope is limited to reasoning tasks.