MMR-GRPO: Accelerating GRPO-Style Training through Diversity-Aware Reward Reweighting

📝 Paper Summary

Mathematical Reasoning RLHF / RLAIF

MMR-GRPO accelerates mathematical reasoning training by penalizing semantically redundant completions during reward calculation, reducing wall-clock training time by 70% while maintaining performance.

Core Problem

GRPO-style training is computationally expensive because it generates multiple completions per prompt. Recent efficiency methods such as Dynamic Sampling cut the number of training steps but raise wall-clock time, because each step carries high overhead.

Why it matters:

Lengthy training times create barriers for academic researchers with limited GPU budgets
Existing efficiency methods (DAPO) create a paradox where fewer training steps result in longer actual training time (up to 3x longer per step)
High computational costs lead to excessive energy consumption and carbon footprints

Concrete Example: For a prompt with 6 sampled completions, suppose C1, C2, and C3 follow semantically identical reasoning paths. Vanilla GRPO treats all three equally. MMR-GRPO identifies the redundancy and downweights C2 and C3, so the model learns from the unique solution C1 and is pushed to explore other diverse paths such as C4 or C6.

Key Novelty

Diversity-Aware Reward Reweighting via Maximal Marginal Relevance (MMR)

Applies Information Retrieval principles to RL: treats generated completions like search results, where redundancy reduces marginal value
Reweights rewards within a group by subtracting a diversity penalty based on semantic similarity to already selected high-reward completions
Uses a parameter-free adaptive mechanism to automatically tune the diversity-relevance trade-off based on the reward variance of the group
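
The reweighting described above can be sketched as a greedy pass over one group. This is a minimal illustration, not the paper's implementation: the function name mmr_reweight, the choice to use each completion's score at selection time as its adjusted reward, and the toy similarity matrix are all assumptions. In practice the similarities would come from sentence embeddings.

```python
def mmr_reweight(rewards, sim, lam):
    """Greedy MMR pass over one group (illustrative sketch).

    rewards: list of raw rewards, one per completion.
    sim:     square matrix, sim[i][j] = semantic similarity in [0, 1].
    lam:     relevance/diversity trade-off in [0, 1].
    Returns a list of adjusted rewards in the original completion order.
    """
    remaining = list(range(len(rewards)))
    selected = []
    adjusted = [0.0] * len(rewards)
    while remaining:
        best_i, best_score = remaining[0], float("-inf")
        for i in remaining:
            # Penalty is the highest similarity to anything already selected.
            penalty = max((sim[i][j] for j in selected), default=0.0)
            score = lam * rewards[i] - (1.0 - lam) * penalty
            if score > best_score:
                best_i, best_score = i, score
        adjusted[best_i] = best_score  # assumption: score becomes the new reward
        selected.append(best_i)
        remaining.remove(best_i)
    return adjusted


# Toy group of 6: C1-C3 share one reasoning path, C5 is wrong.
redundant = {0, 1, 2}
sim = [[1.0 if (i == j or (i in redundant and j in redundant)) else 0.0
        for j in range(6)] for i in range(6)]
rewards = [1.0, 1.0, 1.0, 1.0, 0.0, 1.0]
adjusted = mmr_reweight(rewards, sim, lam=0.7)
print([round(a, 2) for a in adjusted])
```

With lam = 0.7 the toy group comes out at roughly 0.7 for C1, C4, and C6, 0.4 for the redundant C2 and C3, and 0.0 for the incorrect C5, so duplicated correct answers contribute less than distinct ones.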

Architecture

Conceptual comparison of vanilla GRPO vs. MMR-GRPO reward weighting

Evaluation Highlights

Reduces wall-clock training time by 70.2% on average across 1.5B, 7B, and 8B models compared to GRPO and DAPO baselines
Achieves peak performance in 47.9% fewer training steps on average across five mathematical benchmarks (including AIME 2024 and MATH-500)
Reduces training time for DeepSeek-R1-Distill-Llama-8B from 93.75 hours (DAPO) to 17.40 hours (MMR-DAPO-No-DS) while maintaining comparable accuracy

Breakthrough Assessment

8/10

Addresses a critical, practical inefficiency in current reasoning RL methods (wall-clock time vs. steps). The parameter-free adaptive mechanism makes it highly usable. Significant time savings (70%) with no performance loss.

⚙️ Technical Details

Problem Definition

Setting: Group Relative Policy Optimization (GRPO) for mathematical reasoning

Inputs: Mathematical problem prompt x

Outputs: Group of reasoning chains and answers {y1, y2, ..., yG}

Pipeline Flow

Input Prompt
LLM Generation
Answer Output

System Modules

Generator

Generate reasoning steps and final answer

Model or implementation: DeepSeek-R1-Distill-Qwen/Llama (1.5B, 7B, 8B)

Modeling

Base Model: DeepSeek-R1-Distill-Qwen-1.5B, DeepSeek-R1-Distill-Qwen-7B, DeepSeek-R1-Distill-Llama-8B

Training Method: MMR-GRPO (GRPO with Maximal Marginal Relevance reward reweighting)

Objective Functions:

Purpose: Optimize policy to maximize diversity-aware rewards.

Formally: L = -E[log π(y|x) * A(y) - β * D_KL]

Purpose: Adjust completion rewards using MMR.

Formally: score(yi) = λ*r(yi) - (1-λ)*max_{yj in S} similarity(yi, yj), where S is the set of already selected completions

Purpose: Adaptively tune lambda based on group variance.

Formally: λ_adapt = 1 / (1 + e^(-std(r)))
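
The adaptive lambda formula above is a sigmoid of the group's reward standard deviation, and it can be evaluated directly. The sketch below assumes the population standard deviation; the summary does not say whether the paper uses the population or the sample form, and the function name adaptive_lambda is hypothetical.

```python
import math
import statistics


def adaptive_lambda(rewards):
    """Return 1 / (1 + exp(-std(r))) for one group of rewards."""
    # Assumption: population standard deviation over the group.
    std = statistics.pstdev(rewards)
    return 1.0 / (1.0 + math.exp(-std))


uniform_group = [1.0, 1.0, 1.0, 1.0]   # no reward variance
mixed_group = [1.0, 0.0, 1.0, 0.0]     # maximal variance for 0/1 rewards
print(adaptive_lambda(uniform_group))
print(adaptive_lambda(mixed_group))
```

Under the formula as written here, a group with identical rewards gets lambda = 0.5, and higher reward variance pushes lambda upward, which shifts weight from the diversity penalty toward the raw reward. Because the standard deviation is never negative, lambda stays in the interval [0.5, 1).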

Adaptation: LoRA (Low-Rank Adaptation) for 7B/8B models; full tuning implied for 1.5B

Trainable Parameters: Not specifically reported in the paper (standard LoRA settings implied)

Training Data:

knoveleng/open-rs dataset
1.7k mathematical problems with step-by-step solutions

Key Hyperparameters:

group_size_n: 16 (for evaluation metrics), usually 6-16 for training
training_steps: 500 (max)
evaluation_frequency: Every 50 steps
per_step_overhead: 1-5% (MMR computation)

Compute: 2x NVIDIA H100 80GB GPUs

Comparison to Prior Work

vs. DAPO: MMR reweights samples within groups rather than discarding groups, saving 70-80% wall-clock time
vs. DR-GRPO: MMR adds an explicit diversity penalty to the reward signal
vs. Vanilla GRPO: Adds embedding-based similarity check to penalize redundant reasoning paths
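
The contrast with DAPO can be illustrated with a toy filter. The sketch below assumes Dynamic Sampling drops any group whose rewards show no spread; the function name dynamic_sampling_keep and the min_std threshold are hypothetical, and the refill step that regenerates replacements for dropped groups is not modeled.

```python
import statistics


def dynamic_sampling_keep(groups, min_std=0.0):
    """Keep only groups whose reward std exceeds min_std (hypothetical threshold)."""
    return [g for g in groups if statistics.pstdev(g) > min_std]


groups = [
    [1.0, 1.0, 1.0, 1.0],  # all correct: dropped
    [1.0, 0.0, 1.0, 0.0],  # mixed: kept
    [0.0, 0.0, 0.0, 0.0],  # all wrong: dropped
]
kept = dynamic_sampling_keep(groups)
print(len(kept), "of", len(groups), "groups kept")
```

In this toy batch two of three groups are discarded after their completions were already generated. MMR reweighting instead keeps every group and changes only the rewards inside it.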

Limitations

Greedy MMR selection has quadratic time complexity O(N^2), though negligible for small group sizes (N=6-16)
Experiments limited to models up to 8B parameters with LoRA; not tested on 70B+ or full fine-tuning
Evaluation restricted to English mathematical reasoning benchmarks; generalization to code/commonsense unproven
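
To put the quadratic cost from the first limitation in perspective, the number of distinct completion pairs in a group of size N is N(N-1)/2. The helper name below is hypothetical; the group sizes are the ones quoted above.

```python
def pairwise_similarity_count(n):
    """Number of unordered pairs among n completions."""
    return n * (n - 1) // 2


for n in (6, 16):
    print(n, pairwise_similarity_count(n))
```

That is 15 pairs for N=6 and 120 pairs for N=16.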

Reproducibility

Code and trained models not yet released. Dataset (knoveleng/open-rs) and embedding model (jina-embeddings-v2-small-en) are public. Adaptive lambda formula is explicitly provided.

📊 Experiments & Results

Evaluation Setup

Mathematical reasoning with Chain-of-Thought generation

Benchmarks:

MATH-500 (High School Competition Math)
AIME 2024 (Hard Competition Math)
AMC 2023 (High School Competition Math)
Minerva Math (Undergraduate Math)
OlympiadBench (Olympiad Math)

Metrics:

Pass@1 (n=16)
Wall-clock training time
Training steps to peak performance

Statistical methodology: Not explicitly reported in the paper

Key Results

Comparison of Peak Performance (Pass@1) and Training Costs (Time/Steps) on 7B Models.

Benchmark	Metric	Baseline	This Paper	Δ
Average across 5 benchmarks	Wall-clock Time (hours)	28.28	12.22	-16.06
Average across 5 benchmarks	Peak Training Step	350	150	-200
AIME 2024	Pass@1	0.554	0.560	+0.006

Comparison against DAPO (Dynamic Sampling) on 8B Models, highlighting the disconnect between step reduction and time reduction.

Benchmark	Metric	Baseline	This Paper	Δ
Average across 5 benchmarks	Wall-clock Time (hours)	93.75	17.40	-76.35
Average across 5 benchmarks	Time per Step (s)	2109	348	-1761
MATH-500	Pass@1	0.889	0.888	-0.001
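
The absolute deltas in the table can be converted into relative reductions and speedup factors. The snippet below uses only the baseline and result values listed in the rows above; the helper names and row labels are hypothetical.

```python
def pct_reduction(baseline, new):
    """Relative reduction from baseline to new, in percent."""
    return 100.0 * (baseline - new) / baseline


def speedup(baseline, new):
    """How many times smaller the new value is than the baseline."""
    return baseline / new


rows = {
    "7B wall-clock hours": (28.28, 12.22),
    "7B peak training step": (350, 150),
    "8B wall-clock hours": (93.75, 17.40),
    "8B seconds per step": (2109, 348),
}
for name, (base, new) in rows.items():
    print(f"{name}: {pct_reduction(base, new):.1f}% lower, {speedup(base, new):.2f}x")
```

For these rows the 7B wall-clock saving is about 57% and the 8B saving against DAPO is about 81%. The 70.2% figure quoted elsewhere in this summary is an average over a wider set of models and baselines.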

Experiment Figures

Training curves (Pass@1 vs Training Steps) for 7B models across GRPO, DR-GRPO, and DAPO variants.

Pass@k curves (k=1 to 16) for DAPO vs MMR-DAPO-No-DS across model sizes.

Main Takeaways

MMR reweighting consistently reduces training steps to convergence by ~48% across model sizes (1.5B to 8B) without degrading final performance.
The method eliminates the high per-step computational overhead found in Dynamic Sampling (DAPO), converting step reduction directly into wall-clock time savings (70% avg).
Adaptive lambda (parameter-free) performs comparably to or better than best fixed lambda values, eliminating the need for hyperparameter tuning.
Diversity-aware training does not harm the model's exploration capability at inference time (Pass@k curves remain overlapping).

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (Policy Gradients, PPO)
Sentence Embeddings
Information Retrieval (MMR)

Key Terms

GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes rewards within a sampled group of completions to estimate advantages without a value function
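
As a rough sketch of the group normalization in this definition, each reward is centered on the group mean and divided by the group standard deviation. The function name, the epsilon value, and the use of the population standard deviation are assumptions for illustration.

```python
import statistics


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


print(group_advantages([1.0, 0.0, 1.0, 0.0]))  # mixed group
print(group_advantages([1.0, 1.0, 1.0, 1.0]))  # uniform group
```

A group in which every completion has the same reward yields all-zero advantages and therefore no gradient signal, which is the case Dynamic Sampling is designed to filter out.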

MMR: Maximal Marginal Relevance—a ranking method from information retrieval that selects items maximizing a linear combination of relevance (reward) and novelty (dissimilarity to selected items)

DAPO: Decoupled Clip and Dynamic sAmpling Policy Optimization—a GRPO variant that uses dynamic sampling to discard low-variance groups, reducing steps but increasing compute

Pass@1: The probability that a single generated solution is correct, often estimated by averaging correctness over multiple samples
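
Pass@1 with n samples, and the Pass@k curves mentioned in the results, are commonly computed with the unbiased estimator 1 - C(n-c, k) / C(n, k), where c of the n samples are correct. The summary does not state which estimator the paper uses, so treat the sketch below as one standard choice rather than the authors' exact procedure.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if n - c < k:
        # Fewer than k incorrect samples: every size-k draw contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# With k=1 the estimator reduces to the fraction of correct samples, c / n.
print(pass_at_k(n=16, c=4, k=1))
print(pass_at_k(n=16, c=4, k=8))
```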

KL Divergence: A statistical measure used as a penalty in RL to prevent the trained model from drifting too far from the reference model
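
For two discrete distributions p and q over the same outcomes, the KL divergence is the sum of p_i * log(p_i / q_i). The sketch below is the textbook definition on toy distributions; RL training code typically estimates this quantity per token from sampled log-probabilities instead, and the function name is hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats; requires q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


policy = [0.5, 0.5]
reference = [0.9, 0.1]
print(kl_divergence(policy, policy))     # identical distributions
print(kl_divergence(policy, reference))  # drifted distributions
```

The penalty is zero when the trained policy matches the reference and grows as the two distributions move apart.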

LoRA: Low-Rank Adaptation, a parameter-efficient fine-tuning technique that freezes the pretrained weights and trains small low-rank matrices whose product is added to them
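
As a sketch of why LoRA is parameter-efficient: for a weight matrix W of shape d_out x d_in, LoRA trains B (d_out x r) and A (r x d_in) and uses W + (alpha / r) * B A, with W frozen. The layer size, rank, and alpha below are illustrative assumptions, since the summary does not report the paper's LoRA settings.

```python
def lora_param_count(d_out, d_in, rank):
    """Trainable parameters in the two low-rank factors B and A."""
    return rank * (d_out + d_in)


def merged_weight(W, B, A, alpha):
    """Return W + (alpha / rank) * B @ A using plain nested lists."""
    rank = len(A)  # A is rank x d_in, B is d_out x rank
    scale = alpha / rank
    return [[W[i][j] + scale * sum(B[i][k] * A[k][j] for k in range(rank))
             for j in range(len(W[0]))]
            for i in range(len(W))]


# Hypothetical 4096 x 4096 projection with rank 16.
full = 4096 * 4096
lora = lora_param_count(4096, 4096, 16)
print(lora, full, f"{100.0 * lora / full:.2f}%")

# Tiny rank-1 update on a 2 x 2 identity matrix.
print(merged_weight([[1, 0], [0, 1]], [[1], [2]], [[3, 4]], alpha=1))
```

For this hypothetical layer the adapter holds 131,072 trainable values against 16,777,216 in the full matrix, which is under 1%.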