GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing a group of outputs for the same input, avoiding a separate critic model
RLVR: Reinforcement Learning with Verifiable Rewards—using ground-truth correctness (like math answers) as rewards instead of a learned reward model
DGAE: Difficulty-Balanced Group Advantage Estimation—a component of DGPO that normalizes advantages using Mean Absolute Deviation to ensure constant update magnitude regardless of question difficulty
DQW: Difficulty-Aware Question-Level Weighting—a mechanism in DGPO that assigns higher loss weights to questions with lower average accuracy (harder questions)
MQR: Multi-Aspect Question Reformulation—a data augmentation strategy that rewrites questions to be harder (e.g., more abstract) while preserving the original answer
MAD: Mean Absolute Deviation—average distance between each data point and the mean
CoT: Chain-of-Thought—a prompting technique where the model generates intermediate reasoning steps before the final answer
PPO: Proximal Policy Optimization—a standard RL algorithm that constrains policy updates to prevent training instability
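The GRPO, DGAE, and MAD entries above can be combined into a short computation: for a group of sampled outputs to the same question, subtract the group-mean reward and divide by the group's Mean Absolute Deviation. This is a minimal illustrative sketch of that normalization idea only; the function name and structure are assumptions, not any paper's implementation.

```python
def mad_normalized_advantages(rewards):
    """For one question, turn a group's rewards into advantages:
    subtract the group mean, then divide by the group's MAD."""
    n = len(rewards)
    mean = sum(rewards) / n
    # Mean Absolute Deviation: average distance from the mean
    mad = sum(abs(r - mean) for r in rewards) / n
    if mad == 0:  # all rewards identical -> no learning signal for this group
        return [0.0] * n
    return [(r - mean) / mad for r in rewards]

# Example: verifiable 0/1 rewards (as in RLVR) for 4 sampled answers
print(mad_normalized_advantages([1, 0, 0, 1]))  # [1.0, -1.0, -1.0, 1.0]
```

Because MAD is computed per group, the resulting advantages have the same scale whether the question was mostly solved or mostly failed, which is the property the DGAE entry describes.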