Hard Examples Are All You Need: Maximizing GRPO Post-Training Under Annotation Budgets

📝 Paper Summary

LLM Post-training · Reinforcement Learning (RL) · Data Selection / Active Learning

When fine-tuning language models with GRPO under strict data budgets, selecting the hardest examples—those where the base model frequently fails—dramatically outperforms random or easy selection strategies.

Core Problem

Acquiring high-quality supervision data for LLM post-training is expensive, and it is unclear which subset of examples maximizes performance when annotation budgets are limited.

Why it matters:

Practical budgets often limit fine-tuning to a small fraction of available prompts, making selection strategy critical for ROI
Prior work lacks a systematic comparison of difficulty-based selection for group-based RL methods like GRPO
Inefficient data selection wastes compute on examples that provide zero learning signal (because the model already solves them deterministically)

Concrete Example: In GSM8K math problems, an 'easy' example might be a simple addition problem the model always gets right. Training on this yields zero reward variance (all correct) and thus zero gradients in GRPO. In contrast, a 'hard' multi-step problem yields mixed success/failure across group samples, providing the necessary contrastive signal for learning.

Key Novelty

Difficulty-Targeted GRPO Data Selection

Estimate prompt difficulty by sampling K responses from the base model and computing the empirical pass rate (the fraction of the K responses that are correct)
Select training subsets based on this difficulty: Hard (lowest pass rate), Easy (highest), Medium, or Random
Demonstrate that 'Hard' examples (or 'Base Wrong' examples) are the only ones providing sustained gradient signal because they maintain outcome variance longer than easy ones
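The variance argument above can be made concrete with a minimal sketch. It rests on an assumption that is not stated in this summary: each of the G responses in a group is an independent Bernoulli draw that is correct with probability p. Under that model, a group carries a learning signal only when its outcomes are mixed, which happens with probability 1 - p^G - (1 - p)^G. The group size of 8 is illustrative, not the paper's setting.

```python
def mixed_outcome_probability(p: float, group_size: int) -> float:
    """Probability that a group holds both correct and incorrect samples.

    Assumes i.i.d. Bernoulli(p) outcomes (a simplification for illustration).
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    all_correct = p ** group_size
    all_wrong = (1.0 - p) ** group_size
    return 1.0 - all_correct - all_wrong


# Illustrative sweep over base success rates with a hypothetical group size.
for p in (0.05, 0.25, 0.50, 0.95, 1.00):
    print(f"p={p:.2f}  P(mixed group)={mixed_outcome_probability(p, 8):.3f}")
```

The expression is symmetric in p and 1 - p, so under this simple model a prompt the model almost never solves is as uninformative at the start as one it almost always solves. The difference is direction: as training raises p, a hard prompt moves into the mixed region, while an easy prompt moves further out of it.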

Evaluation Highlights

Training on the hardest 10% of examples improves Qwen3-14B accuracy on GSM8K by +39.42 percentage points, compared to just +8.26pp for easy examples
Hard-example training is the only strategy yielding meaningful gains (+20% relative) on the out-of-distribution AIME2025 benchmark, while easy/random strategies show zero or negative transfer
Training exclusively on 'Base Wrong' examples (accuracy < 25%) consistently outperforms training on 'Base Right' examples by ~14% on average, even when the 'Base Right' set is significantly larger

Breakthrough Assessment

8/10

Provides a highly actionable, counter-intuitive finding for the specific (but popular) GRPO algorithm. The magnitude of improvement (>30pp gap) is significant, though the scope is limited to reasoning tasks.

⚙️ Technical Details

Problem Definition

Setting: Budget-constrained reinforcement learning fine-tuning for reasoning tasks, with the training subset selected offline before GRPO training begins

Inputs: A large pool of unlabeled prompts X and a budget constraint to select only p=10% for training

Outputs: A fine-tuned policy π_theta optimized for reasoning accuracy

Pipeline Flow

Difficulty Estimation (Sample K completions per prompt -> Compute pass rates)
Subset Selection (Filter top 10% hardest/easiest/etc.)
GRPO Training (Fine-tune model on selected subset)
Evaluation (Test on held-out in-distribution and OOD benchmarks)
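The first two pipeline steps can be sketched as follows. This is an illustrative outline, not the authors' code: the function names are hypothetical, the graded outcomes are taken as given (model sampling and answer checking are not shown), and "medium" is assumed to mean the middle slice of the difficulty ranking, which the summary does not specify.

```python
import random
from typing import Dict, List


def pass_rate(outcomes: List[bool]) -> float:
    """Fraction of sampled responses graded correct for one prompt."""
    return sum(outcomes) / len(outcomes)


def select_subset(
    pass_rates: Dict[str, float],
    strategy: str,
    fraction: float = 0.10,
    seed: int = 0,
) -> List[str]:
    """Pick a fraction of prompt ids according to a difficulty strategy."""
    n = max(1, int(len(pass_rates) * fraction))
    # Hardest first (lowest pass rate); ties broken by id for determinism.
    ranked = sorted(pass_rates, key=lambda pid: (pass_rates[pid], pid))
    if strategy == "hard":
        return ranked[:n]
    if strategy == "easy":
        return ranked[-n:]
    if strategy == "medium":  # assumed: the middle of the ranking
        start = (len(ranked) - n) // 2
        return ranked[start:start + n]
    if strategy == "random":
        return random.Random(seed).sample(ranked, n)
    raise ValueError(f"unknown strategy: {strategy}")


# Toy pool of ten prompts with made-up pass rates.
toy_rates = {f"p{i}": i / 10 for i in range(10)}
print(select_subset(toy_rates, "hard", fraction=0.2))
```

The selected ids would then be passed to whatever GRPO trainer is in use; that step depends on the training framework and is left out here.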

System Modules

Difficulty Estimator

Compute empirical difficulty of prompts

Model or implementation: Base Model (Qwen/Phi/Llama)

Policy Model

Generate responses and update weights via GRPO

Model or implementation: Qwen3-4B, Qwen3-14B, Phi-4, or Llama3.1-8B

Novel Architectural Elements

Integration of offline difficulty-based filtering directly into the GRPO data pipeline (methodological novelty rather than model architecture change)

Modeling

Base Model: Qwen3-4B, Qwen3-14B, Phi-4, Llama3.1-8B (Instruct versions)

Training Method: Group Relative Policy Optimization (GRPO)

Objective Functions:

Purpose: Maximize expected reward using importance sampling with clipping, normalized by group statistics.

Formally: Standard GRPO objective using group-normalized advantages A_i = (r_i - mean(r_group)) / std(r_group).
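A minimal sketch of the advantage formula above shows why uniform groups contribute nothing. Two details are assumptions rather than facts from the paper: the population standard deviation is used, and a small eps is added to the denominator to avoid division by zero, as many implementations do.

```python
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-normalized advantages: (r_i - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, see lead-in
    return [(r - mu) / (sigma + eps) for r in rewards]


# All samples correct: every advantage is zero, so no policy-gradient signal.
print(group_advantages([1.0, 1.0, 1.0, 1.0]))

# Mixed outcomes: the correct sample is pushed up, the failures pushed down.
print(group_advantages([1.0, 0.0, 0.0, 0.0]))
```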

Adaptation: Full fine-tuning (implied, no LoRA mentioned in core text)

Training Data:

GSM8K (7473 problems)
BIG-Bench Hard Tracking Shuffled Objects (250 problems)
Subsets created by selecting top/bottom 10% based on difficulty

Key Hyperparameters:

group_size: Not explicitly reported in text (likely standard GRPO default)
learning_rate: Not explicitly reported in text
sampling_temperature: 1.0 (for difficulty estimation)

Compute: Not reported in the paper

Comparison to Prior Work

vs. Standard GRPO: Adds a pre-processing step to filter data by difficulty rather than using the whole dataset
vs. Online PPO: Uses group-based variance rather than a learned value function, making difficulty selection more critical (as zero variance = zero learning)
vs. Rejection Sampling [not cited in paper]: Focuses on 'hard' failures for RL, whereas rejection-sampling fine-tuning (RFT) typically focuses on 'correct' successes for SFT
vs. PRIME [cited]: Focuses on offline outcome-based selection rather than online process-reward guidance

Limitations

Evaluation limited to reasoning tasks (Math, Logic); may not apply to creative writing or summarization
Depends on a 'gold' answer being available to compute difficulty (requires labeled data for the selection phase)
Computational cost of 'difficulty estimation' (generating K samples per prompt) is high before training starts
Limited exploration of hyperparameters (group size, KL penalty) interacting with difficulty selection

Reproducibility

Code is publicly available at https://github.com/EternisLabs/hard-examples-grpo. Hyperparameters for GRPO (LR, beta, etc.) are reported in the paper's appendix, which this summary does not cover. Datasets are standard public benchmarks.

📊 Experiments & Results

Evaluation Setup

Offline selection of 10% data subset followed by GRPO fine-tuning

Benchmarks:

GSM8K (Grade school math reasoning)
Tracking Shuffled Objects (BBH) (State tracking / Logic)
AIME2025-I (OOD Advanced Math Competition)

Metrics:

Accuracy (Exact Match)
Learnable Percentage (fraction of steps with non-zero reward variance)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Main results comparing different difficulty selection strategies (Hard, Easy, Medium, Random) on GSM8K accuracy gains.
GSM8K	Accuracy Gain (pp)	0.0	39.42	+39.42
GSM8K	Accuracy Gain (pp)	0.0	8.26	+8.26
GSM8K	Accuracy Gain (pp)	0.0	37.3	+37.3
Out-of-distribution generalization results on the harder AIME2025 benchmark.
AIME2025-I	Relative Improvement	0.0	20.0	+20.0
Analysis of 'Base Wrong' vs 'Base Right' training splits.
GSM8K	Relative Improvement	0.0	23.5	+23.5

Experiment Figures

Learning curves (accuracy vs training steps) for different selection strategies.

Correlation plot between '% Learnable' (x-axis) and Performance Improvement (y-axis).

Main Takeaways

Hard Examples >> Easy Examples: There is a massive gap (>30pp) in effectiveness between training on hard vs. easy examples for GRPO.
Mechanism is Reward Variance: Hard examples maintain 'learnability' (non-zero reward variance) for much longer; easy examples are solved quickly, causing gradients to vanish.
Base-Wrong Heuristic: Simply selecting examples the base model gets wrong (base accuracy < 25%) is a highly effective proxy for 'Hard' examples, requiring only a coarse accuracy threshold rather than a fine-grained difficulty ranking.
OOD Generalization: Only models trained on hard examples showed robust transfer to the much harder AIME2025 benchmark; others stagnated or regressed.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning from Human Feedback (RLHF) concepts
Group Relative Policy Optimization (GRPO) mechanics
Pass@k metrics for evaluating LLM reasoning

Key Terms

GRPO: Group Relative Policy Optimization—a PPO-style algorithm that normalizes advantages within a group of sampled outputs for the same prompt, removing the need for a separate value function

Pass@k: A metric estimating the probability that at least one of k generated samples is correct
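Pass@k should not be confused with the plain pass rate (the fraction of samples that are correct). The sketch below contrasts the two using the combinatorial estimator commonly used in code-generation evaluation, 1 - C(n - c, k) / C(n, k) for n samples with c correct. Which estimator the paper uses is not stated in this summary, so treat this as background rather than the authors' procedure.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k samples is correct) from n samples, c correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: any k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_rate(n: int, c: int) -> float:
    """Plain fraction of correct samples."""
    return c / n


# With 2 of 4 samples correct, the two quantities differ noticeably.
print(pass_rate(4, 2), pass_at_k(4, 2, 2))
```

For k = 1 the estimator reduces to the pass rate, which is why the two terms are sometimes used interchangeably.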

Outcome variance: The variation in rewards (correct vs. incorrect) within a group of samples; essential for GRPO to calculate non-zero advantages

Learnable percentage: The fraction of training steps where the within-group reward standard deviation is non-zero, indicating that the model can actually learn from that batch
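As a rough illustration of this metric, the sketch below counts how many reward groups have non-zero spread. The granularity is an assumption: each inner list stands for the rewards of one group at one training step, and the paper's exact bookkeeping may differ.

```python
from typing import Sequence


def learnable_percentage(groups: Sequence[Sequence[float]]) -> float:
    """Percentage of groups whose rewards are not all identical.

    A group has non-zero standard deviation exactly when it contains
    at least two distinct reward values.
    """
    if not groups:
        raise ValueError("need at least one group")
    learnable = sum(1 for rewards in groups if len(set(rewards)) > 1)
    return 100.0 * learnable / len(groups)


# Two of these four toy groups have mixed outcomes.
toy_groups = [[1, 1, 1], [1, 0, 1], [0, 0, 0], [0, 1, 1]]
print(learnable_percentage(toy_groups))
```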

Base Wrong: Examples where the base model (before fine-tuning) achieves < 25% accuracy across multiple samples

Base Right: Examples where the base model achieves >= 25% accuracy
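The two definitions above amount to a single threshold on base-model accuracy. A minimal sketch, assuming per-prompt accuracies have already been measured (the function name and input format are hypothetical):

```python
from typing import Dict, List, Tuple


def split_by_base_accuracy(
    base_accuracy: Dict[str, float],
    threshold: float = 0.25,
) -> Tuple[List[str], List[str]]:
    """Return (base_wrong_ids, base_right_ids) using accuracy < threshold."""
    base_wrong = sorted(pid for pid, acc in base_accuracy.items() if acc < threshold)
    base_right = sorted(pid for pid, acc in base_accuracy.items() if acc >= threshold)
    return base_wrong, base_right


# Toy accuracies; a prompt at exactly 0.25 falls on the 'Base Right' side.
wrong, right = split_by_base_accuracy({"a": 0.0, "b": 0.25, "c": 0.24, "d": 1.0})
print(wrong, right)
```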

Chain-of-Thought: Prompting technique where the model generates intermediate reasoning steps before the final answer