
Harder Is Better: Boosting Mathematical Reasoning via Difficulty-Aware GRPO and Multi-Aspect Question Reformulation

Yanqi Dai, Yuxiang Ji, Xiao Zhang, Yong Wang, Xiangxiang Chu, Zhiwu Lu
Gaoling School of Artificial Intelligence, Renmin University of China, AMAP, Alibaba Group, Xiamen University, Dalian University of Technology
arXiv (2026)
Tags: RL, Reasoning, MM

📝 Paper Summary

Topics: Mathematical Reasoning, Reinforcement Learning with Verifiable Rewards (RLVR)
MathForge improves mathematical reasoning by upweighting harder questions during optimization (via DGPO) and synthetically creating harder training data that preserves original answers (via MQR).
Core Problem
Existing Group Relative Policy Optimization (GRPO) implicitly suppresses updates for harder questions, and standard data augmentation focuses on diversity rather than systematically increasing difficulty.
Why it matters:
  • Challenging but solvable questions are exactly where the model's mastery is incomplete, yet GRPO assigns them smaller update magnitudes
  • Current augmentation methods often generate new answers that are hard to verify, or rephrase questions without increasing the cognitive demand required to solve them
Concrete Example: In standard GRPO, a question where the model gets 50% of responses correct triggers the largest update magnitude. A harder question (e.g., 10% correct) produces smaller gradients, causing the model to neglect the very problems it needs to learn most.
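The imbalance can be made concrete with a few lines of Python. With binary rewards and group pass rate p, standard-deviation-normalized advantages (as in GRPO) have a total magnitude of 2·G·sqrt(p(1-p)) per group, which peaks at p = 0.5 and shrinks as questions get harder or easier. This is an illustrative sketch; the function name and group size are not from the paper.

```python
import math

def grpo_update_magnitude(p: float, group_size: int = 16) -> float:
    """Total |advantage| over a group of binary-reward rollouts when
    advantages are standardized by the group std, as in GRPO.

    With pass rate p: mean reward = p, std = sqrt(p*(1-p)), and the
    summed magnitude works out to 2 * group_size * sqrt(p*(1-p))."""
    k = p * group_size                 # expected number of correct rollouts
    std = math.sqrt(p * (1 - p))
    adv_correct = (1 - p) / std        # standardized advantage of a correct rollout
    adv_wrong = p / std                # |standardized advantage| of a wrong rollout
    return k * adv_correct + (group_size - k) * adv_wrong

for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(f"pass rate {p:.1f} -> total update magnitude {grpo_update_magnitude(p):.2f}")
```

Running the loop shows the magnitude is maximal at pass rate 0.5 (16.00 for a group of 16) and drops to 9.60 at pass rates 0.1 and 0.9, so the hardest questions receive the smallest updates.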
Key Novelty
MathForge: Difficulty-Aware Group Policy Optimization (DGPO) + Multi-Aspect Question Reformulation (MQR)
  • DGPO: Normalizes advantage estimation using Mean Absolute Deviation (MAD) instead of standard deviation to balance update magnitudes across all difficulties, then explicitly upweights harder questions
  • MQR: Uses a strong model to rewrite questions by adding story backgrounds, abstract terms, or sub-problems while keeping the original gold answer, creating a 'harder' training set without needing new solutions
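A minimal sketch of the DGPO idea described above: normalize group advantages by the mean absolute deviation (MAD) rather than the standard deviation, then apply a difficulty weight. With binary rewards, MAD normalization makes the unweighted total |advantage| equal the group size at every pass rate, balancing updates across difficulties. The specific weighting function below is a hypothetical placeholder, not the paper's formula.

```python
def dgpo_style_advantages(rewards, difficulty_weight=lambda p: 1.0 + (1.0 - p)):
    """Sketch of difficulty-aware advantages: MAD normalization plus an
    explicit upweighting of harder groups (lower pass rate p).
    The difficulty_weight default is a hypothetical choice for illustration."""
    p = sum(rewards) / len(rewards)                        # group pass rate
    mad = sum(abs(r - p) for r in rewards) / len(rewards)  # mean absolute deviation
    if mad == 0:                                           # all-correct or all-wrong: no learning signal
        return [0.0] * len(rewards)
    w = difficulty_weight(p)                               # harder group -> larger weight
    return [w * (r - p) / mad for r in rewards]

# With binary rewards, the unweighted total |advantage| equals the group
# size for any pass rate, unlike the std-normalized GRPO case.
easy = [1] * 12 + [0] * 4   # pass rate 0.75
hard = [1] * 4 + [0] * 12   # pass rate 0.25
for group in (easy, hard):
    advs = dgpo_style_advantages(group, difficulty_weight=lambda p: 1.0)
    print(round(sum(abs(a) for a in advs), 6))   # 16.0 for both groups
```

With the difficulty weight enabled, the hard group's total advantage magnitude exceeds the easy group's, steering optimization toward the questions the model has not yet mastered.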
Evaluation Highlights
  • Outperforms GRPO baseline by +2.18% accuracy on MATH dataset using Qwen2.5-Math-7B (39.79% vs 37.61%)
  • Achieves 56.40% on AIME 2024 with Qwen2.5-Math-7B, surpassing GRPO (52.08%) and other recent methods like GPG and DAPO
  • MQR augmentation alone improves performance by +1.6% on MATH compared to standard GRPO, validating the 'harder data' hypothesis
Breakthrough Assessment
8/10
Identifies and mathematically proves a fundamental flaw in GRPO (imbalance against hard questions) and provides a coherent dual solution (algorithm + data) with strong empirical gains.