DARO: Difficulty-Aware Reweighting Policy Optimization

📝 Paper Summary

Reinforcement Learning with Verifiable Rewards (RLVR) Mathematical Reasoning

DARO improves mathematical reasoning in LLMs by dynamically learning loss weights for different problem difficulties during training, preventing the model from disproportionately focusing on specific difficulty levels.

Core Problem

Current RLVR methods (like GRPO and LIPO) use static weighting schemes based on empirical pass rates, causing a 'loss scale issue' where the training objective disproportionately focuses on certain difficulty levels.

Why it matters:

Static weights (e.g., variance-based) often downweight very easy or very hard samples to zero, potentially causing catastrophic forgetting of basic knowledge
Over-focusing on specific difficulty bands prevents the model from adapting as its capabilities evolve during the training process
The imbalance disrupts the exploration-exploitation trade-off, slowing down convergence and limiting final reasoning performance

Concrete Example: In GRPO, samples with a pass rate near 0 or 1 often have very low gradients. If a model finds most problems too hard (pass rate ≈ 0) or too easy (pass rate ≈ 1), the gradients vanish or become unbalanced, causing the model to ignore those samples rather than learning from the edge cases.
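The vanishing signal described above can be illustrated with a minimal sketch, assuming the usual GRPO group-normalized advantage (reward minus group mean, divided by group standard deviation) and binary rewards. The function name `group_advantages` and the `eps` stabilizer are illustrative choices, not taken from the paper.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages for one prompt's K rollouts (illustrative)."""
    mu = mean(rewards)        # empirical pass rate for binary rewards
    sigma = pstdev(rewards)   # zero when every rollout gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical reward vectors for K = 8 rollouts of a single prompt.
all_wrong = group_advantages([0, 0, 0, 0, 0, 0, 0, 0])  # pass rate 0
all_right = group_advantages([1, 1, 1, 1, 1, 1, 1, 1])  # pass rate 1
mixed = group_advantages([1, 0, 0, 0, 0, 0, 0, 0])      # pass rate 1/8
```

When every rollout receives the same reward, all advantages are zero and the prompt contributes no gradient; only mixed groups carry a learning signal.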

Key Novelty

Difficulty-Aware Reweighting Policy Optimization (DARO)

Treats groups of samples with different pass rates (difficulties) as distinct tasks in a multi-task learning framework
Introduces learnable weight parameters for each difficulty group that are optimized jointly with the model to balance the total loss contribution
Dynamically adjusts each group's weight as its loss magnitude changes during training, so that no single difficulty level dominates the objective

Architecture

The DARO training pipeline where empirical pass rates determine base losses, which are then modulated by dynamically learnable weights.

Evaluation Highlights

Achieves highest average accuracy across 6 math benchmarks on Qwen2.5-Math-7B (50.8%), outperforming GRPO (+1.4%) and DAPO (+2.4%)
Demonstrates significantly faster convergence: Llama-3.1-8B reaches 20% pass rate in half the training steps required by DAPO
Consistent improvements on Llama-3.1-8B (+2.7% avg vs GRPO) and Qwen2.5-Math-1.5B (+1.0% avg vs GRPO) across diverse datasets like MATH500 and AIME

Breakthrough Assessment

8/10

Identifies a fundamental mathematical flaw (loss scale issue) in the widely used GRPO framework and provides a theoretically grounded, adaptive solution that yields consistent empirical gains.

⚙️ Technical Details

Problem Definition

Setting: Reinforcement Learning with Verifiable Rewards (RLVR) for mathematical reasoning

Inputs: Math problem prompts q

Outputs: Generated solution paths and final answers o

Pipeline Flow

Prompt Sampling
Generation (Rollout)
Reward Verification
Difficulty Grouping
Dynamic Loss Calculation
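The verification and grouping steps of this flow might look roughly like the sketch below. It assumes binary rewards per rollout and skips prompts with a pass rate of exactly 0 or 1, mirroring the µ≠0,1 restriction in the objective; the prompt ids, reward vectors, and function names are hypothetical.

```python
from collections import defaultdict

def empirical_pass_rate(rewards):
    """Fraction of correct rollouts for one prompt (binary rewards assumed)."""
    return sum(rewards) / len(rewards)

def group_by_difficulty(batch_rewards):
    """Map each pass rate mu to its prompt ids, skipping mu = 0 and mu = 1."""
    groups = defaultdict(list)
    for prompt_id, rewards in batch_rewards.items():
        mu = empirical_pass_rate(rewards)
        if 0.0 < mu < 1.0:
            groups[mu].append(prompt_id)
    return dict(groups)

# Hypothetical verifier output for K = 8 rollouts per prompt.
batch = {
    "p1": [1, 0, 0, 0, 0, 0, 0, 0],
    "p2": [1, 1, 1, 1, 0, 0, 0, 0],
    "p3": [0] * 8,   # all wrong: excluded
    "p4": [1] * 8,   # all right: excluded
    "p5": [0, 1, 0, 0, 0, 0, 0, 0],
}
groups = group_by_difficulty(batch)
```

Each resulting group would then receive its own loss term and weight in the dynamic loss calculation.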

System Modules

Generator (Policy)

Generate K solutions for a given prompt

Model or implementation: Qwen2.5-Math or Llama-3.1

Verifier

Check correctness of answers to compute rewards and empirical pass rate

Model or implementation: Rule-based or exact match

Dynamic Weight Optimizer

Update difficulty weights w_µ based on current loss magnitudes to balance training

Model or implementation: Learnable scalar parameters
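A minimal sketch of how such scalar weights could be updated, assuming plain gradient descent on the per-group term w_µ * L_µ - ln w_µ with the group losses held fixed. The direct parameterization of w, the step size of 0.1, and the loss values are illustrative assumptions; the paper's actual update rule is not reproduced here.

```python
def weight_step(weights, losses, lr):
    """One gradient-descent step on sum_mu (w * L - ln w) with respect to w."""
    updated = {}
    for mu, w in weights.items():
        grad = losses[mu] - 1.0 / w             # d/dw (w * L - ln w)
        updated[mu] = max(w - lr * grad, 1e-8)  # keep w positive so ln w is defined
    return updated

# Hypothetical fixed per-group losses, keyed by pass rate mu.
losses = {0.25: 2.0, 0.5: 0.5}
weights = {mu: 1.0 for mu in losses}
for _ in range(500):
    weights = weight_step(weights, losses, lr=0.1)

# Weighted contribution of each group after the weights settle.
contributions = {mu: weights[mu] * losses[mu] for mu in losses}
```

In this toy setting the weights settle near 1 / L_µ, so both groups end up contributing roughly the same weighted loss despite a fourfold difference in raw loss.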

Modeling

Base Model: Qwen2.5-Math-1.5B, Qwen2.5-Math-7B, Llama-3.1-8B

Training Method: Difficulty-Aware Reweighting Policy Optimization (DARO)

Objective Functions:

Purpose: Dynamically reweight loss based on difficulty groups.

Formally: L = Σ_{µ≠0,1} (w_µ * L_µ - ln w_µ), where w_µ are learnable weights and L_µ is the GRPO loss for group µ.
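One consequence of this objective can be worked out directly. Assuming each L_µ is positive and held fixed while the weights are updated (an assumption made for this derivation, not a claim from the paper), setting the derivative with respect to w_µ to zero gives:

```latex
\frac{\partial \mathcal{L}}{\partial w_\mu}
  = L_\mu - \frac{1}{w_\mu} = 0
\quad\Longrightarrow\quad
w_\mu^{*} = \frac{1}{L_\mu},
\qquad
w_\mu^{*} \, L_\mu = 1 .
```

Under that assumption every difficulty group contributes the same weighted loss at the stationary point, and the -ln w_µ term keeps the weights from collapsing to zero.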

Training Data:

Subset of OpenR1-Math-220k (filtered to 45,000 prompts)
OpenR1-easy (11,000 prompts) for Llama-3.1-8B

Key Hyperparameters:

learning_rate_model: 1e-6
learning_rate_weights: 1e-3
batch_size: 128
mini_batch_size: 64
generation_samples_K: 8
clip_range: [0.2, 0.28]
total_steps: 300

Comparison to Prior Work

vs. GRPO: DARO uses learnable weights instead of uniform weights
vs. LIPO/Dr. GRPO: DARO adapts weights during training rather than using a fixed variance-based heuristic that drops to zero for easy/hard samples
vs. DAPO: DARO focuses on reweighting existing samples rather than just altering the sampling distribution

Limitations

Evaluation limited to mathematical reasoning tasks; generalization to other domains (coding, logic) not tested
Experiments conducted on relatively small models (up to 8B parameters) and limited training steps (300)
Does not address the binary reward sparsity problem directly, only the weighting of available signals

Reproducibility

Code availability is not explicitly provided in the paper. Dataset (OpenR1-Math-220k) is public on Hugging Face. Hyperparameters are detailed in Appendix B. Missing: Specific code for the dynamic weight update implementation.

📊 Experiments & Results

Evaluation Setup

Mathematical reasoning on varied difficulty benchmarks

Benchmarks:

MATH500 (Competition Math)
GSM8K (Grade School Math)
AIME24/25 (High-difficulty Math Competitions)
OlympiadBench (Olympiad-level Math)
Minerva (Science/Math Reasoning)

Metrics:

Accuracy (Pass@1)
Mean@32 (for AIME/AMC stability)
Statistical methodology: Not explicitly reported in the paper
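As a rough illustration of the Mean@k metric, the sketch below assumes it denotes per-problem accuracy averaged over k sampled answers and then averaged over problems. The function name and the toy data (k = 4 for brevity) are hypothetical.

```python
def mean_at_k(correct_by_problem):
    """Average per-problem accuracy, where each problem has k sampled answers."""
    per_problem = [sum(flags) / len(flags) for flags in correct_by_problem]
    return sum(per_problem) / len(per_problem)

# Two hypothetical problems with k = 4 samples each (1 = correct, 0 = incorrect).
samples = [
    [1, 0, 0, 0],  # solved in 1 of 4 samples
    [0, 0, 0, 0],  # never solved
]
score = mean_at_k(samples)
```

Averaging over many samples reduces the variance of the estimate on small benchmarks, where a single sampled answer per problem can swing the score noticeably.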

Key Results

Overall performance comparisons showing DARO consistently outperforming baselines across different base models.
Benchmark	Metric	Baseline	This Paper	Δ
Average (6 benchmarks)	Accuracy	49.4	50.8	+1.4
Average (6 benchmarks)	Accuracy	18.7	21.4	+2.7
MATH500	Accuracy	29.8	30.2	+0.4
AIME24	Accuracy (Mean@32)	0.0	1.0	+1.0

Ablation study demonstrating the specific contribution of the Dynamic Weights component.
Benchmark	Metric	Baseline	This Paper	Δ
Average	Accuracy	49.4	50.8	+1.4

Experiment Figures

Loss curves for GRPO showing the 'Loss Scale Issue' where loss magnitudes vary drastically across difficulty levels (µ).

Average pass rate curves during training for DARO vs baselines.

Main Takeaways

Dynamic weighting consistently outperforms static weighting strategies (GRPO, LIPO, Dr. GRPO) across multiple model sizes and difficulty levels.
The 'loss scale issue' is a real phenomenon where models focus excessively on specific difficulty bands; DARO's adaptive weights effectively mitigate this.
DARO accelerates convergence significantly, reaching comparable performance levels much earlier in training than baselines.
Static variance-based weighting (used in LIPO/Dr. GRPO) can be detrimental for weaker models (e.g., Llama-3.1-8B), leading to performance drops compared to unweighted GRPO.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (PPO/GRPO)
Gradient Descent Optimization
Importance Sampling

Key Terms

RLVR: Reinforcement Learning with Verifiable Rewards—fine-tuning LLMs using binary feedback (correct/incorrect) from verifiable answers (e.g., math problems)

GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by normalizing rewards within a group of outputs generated from the same prompt, removing the need for a separate value function

Empirical Pass Rate: The proportion of correct answers generated for a specific prompt within a sampled group (denoted as µ), used as a proxy for sample difficulty

Loss Scale Issue: The phenomenon where training losses disproportionately cluster at certain difficulty levels due to static weighting, causing the model to ignore other valid learning signals

Clip-higher: A technique that sets the upper clipping bound of the PPO objective higher than the lower one, making the clipping range asymmetric, to prevent entropy collapse
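A minimal sketch of an asymmetric clipped surrogate for a single token, assuming the clip_range [0.2, 0.28] listed under the hyperparameters is read as (eps_low, eps_high). The function name and parameter names are illustrative.

```python
def clipped_surrogate(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """PPO-style clipped objective with separate lower and upper clip bounds."""
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    # Pessimistic choice between the unclipped and clipped terms, as in PPO.
    return min(ratio * advantage, clipped * advantage)

inside_upper = clipped_surrogate(1.25, 1.0)   # a symmetric 0.2 clip would cap this at 1.2
above_upper = clipped_surrogate(1.50, 1.0)    # capped at 1 + eps_high
below_lower = clipped_surrogate(0.50, -1.0)   # capped at 1 - eps_low
```

The wider upper bound lets low-probability tokens with positive advantage grow more per update, which is the mechanism usually credited with preserving exploration.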

Token-mean loss aggregation: A method of calculating loss by averaging per-token losses rather than summing them, often used to stabilize training for variable-length reasoning chains
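The difference between aggregation schemes can be sketched as follows, assuming token-mean divides the summed token losses by the total token count in the batch, contrasted with averaging per-sequence means. Function names and loss values are hypothetical.

```python
def token_mean(token_losses):
    """Average over all tokens in the batch, so every token counts equally."""
    total = sum(sum(seq) for seq in token_losses)
    count = sum(len(seq) for seq in token_losses)
    return total / count

def sequence_mean(token_losses):
    """Average of per-sequence means, so every sequence counts equally."""
    per_seq = [sum(seq) / len(seq) for seq in token_losses]
    return sum(per_seq) / len(per_seq)

# Hypothetical per-token losses: one short and one long response.
batch = [[1.0, 1.0], [4.0, 4.0, 4.0, 4.0]]
tm = token_mean(batch)
sm = sequence_mean(batch)
```

With token-mean the longer response dominates in proportion to its length, whereas sequence-mean gives both responses equal influence regardless of length.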