Prompt Augmentation Scales up GRPO Training on Mathematical Reasoning

📝 Paper Summary

Reinforcement Learning with Verifiable Rewards · Mathematical Reasoning

Prompt augmentation trains models using diverse reasoning templates and format-specific rewards within a single run, preventing entropy collapse and enabling stable, long-horizon reinforcement learning for mathematical reasoning.

Core Problem

RL post-training for reasoning typically suffers from entropy collapse (monotonic decrease in policy entropy), leading to training instability and forcing short training horizons (5-20 epochs).

Why it matters:

Entropy collapse restricts exploration of diverse reasoning paths, causing performance to saturate or degrade quickly
Current methods rely on single fixed prompts, overfitting to one reasoning style and limiting the model's ability to generalize
Short training horizons prevent models from fully exploiting the available compute and data for sustained improvement

Concrete Example: When trained with a single standard template, a model's policy entropy drops to near zero after ~10 epochs, causing it to produce deterministic, repetitive responses. The proposed method maintains higher entropy, allowing training to continue productively for 50 epochs.

Key Novelty

Prompt Augmentation for RL Post-Training

Mixes 13 different reasoning templates (e.g., DeepSeek-style tags, free-form, explicit Chain-of-Thought, reflection) within a single training run to force diverse rollout generation
Applies template-specific format rewards to ensure the model adheres to the requested structure (e.g., rewarding usage of <think> tags only when the prompt demands it)
Acts as a lightweight data augmentation strategy that stabilizes training dynamics without needing expensive KL divergence penalties

Architecture

Overview of the template categories used for Prompt Augmentation.

Evaluation Highlights

Achieves 51.8% per-question accuracy (average across 5 benchmarks) using Qwen2.5-Math-1.5B, outperforming DAPO (48.4%) and GRPO (47.7%)
Enables stable training for 50 epochs, whereas baselines typically collapse or stop improving after 5-20 epochs
+4.0 percentage points on AIME24 over the DAPO baseline (46.0% vs. 42.0%)

Breakthrough Assessment

8/10

Simple yet highly effective solution to the pervasive entropy collapse problem in RLVR. By enabling significantly longer training horizons without complex regularization, it unlocks greater performance from existing models.

⚙️ Technical Details

Problem Definition

Setting: Reinforcement Learning with Verifiable Rewards (RLVR) on mathematical problems

Inputs: Mathematical question q

Outputs: Reasoning trace and final answer Ans_predict

Pipeline Flow

Prompt Sampler (Selects 1 of K templates)
Policy Model (Generates G rollouts per input)
Reward Engine (Calculates Accuracy + Format Rewards)
GRPO Updater (Updates weights using group-relative advantages)

System Modules

Prompt Sampler

Augments the input question with a randomly selected reasoning template

Model or implementation: Uniform distribution over K=13 templates
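
A minimal sketch of the sampling step, assuming a small dictionary of templates. The three template strings and the names `TEMPLATES` and `augment_prompt` are hypothetical illustrations, not the paper's 13 templates or its code.

```python
import random

# Hypothetical templates for illustration only; the paper mixes K=13 templates
# and these strings are NOT its exact wording.
TEMPLATES = {
    "think_tags": (
        "Reason inside <think></think>, then state the final answer.\n"
        "Question: {question}"
    ),
    "free_form": "{question}",
    "explicit_cot": (
        "{question}\nLet's think step by step and put the final answer in \\boxed{{}}."
    ),
}

def augment_prompt(question, rng):
    """Pick one template uniformly at random and fill in the question."""
    name = rng.choice(sorted(TEMPLATES))  # uniform over the available templates
    return name, TEMPLATES[name].format(question=question)

rng = random.Random(0)  # seeded so the run is repeatable
name, prompt = augment_prompt("What is 2 + 2?", rng)
```

Returning the template name alongside the prompt lets a downstream reward function check the format that this particular prompt requested.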

Policy Model

Generates reasoning traces and answers

Model or implementation: Qwen2.5-Math-1.5B

Reward Engine

Computes rewards based on correctness and format adherence

Model or implementation: Rule-based verifiers (Math-Verify, SymPy) + String matching
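
The reward computation could look roughly like the sketch below. It swaps the real verifiers (Math-Verify, SymPy) for a plain exact-match check on the last `\boxed{...}` answer, and the function names, the `"think_tags"` template name, and the `format_weight` value are all assumptions made for illustration.

```python
import re

def extract_boxed(text):
    """Return the content of the last brace-free \\boxed{...}, or None."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

def accuracy_reward(response, gold):
    # Stand-in for a real verifier such as Math-Verify: exact string match only.
    pred = extract_boxed(response)
    return 1.0 if pred is not None and pred == gold.strip() else 0.0

def format_reward(response, template_name):
    # Template-conditional: <think> tags are rewarded only when the sampled
    # template asked for them. Other templates get no format bonus here.
    if template_name == "think_tags":
        found = re.search(r"<think>.+?</think>", response, flags=re.DOTALL)
        return 1.0 if found else 0.0
    return 0.0

def total_reward(response, gold, template_name, format_weight=0.1):
    # format_weight is an assumed value, not taken from the paper.
    return accuracy_reward(response, gold) + format_weight * format_reward(
        response, template_name
    )
```

Keeping the format check conditional on the template is what separates this from a single global format reward: the same response can earn the bonus under one template and not under another.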

Novel Architectural Elements

Integration of template-conditional format rewards directly into the GRPO loop
Multi-template rollout generation within a single batch update

Modeling

Base Model: Qwen2.5-Math-1.5B

Training Method: Group Relative Policy Optimization (GRPO) with Prompt Augmentation

Objective Functions:

Purpose: Maximize expected reward while staying within trust region.

Formally: Token-level GRPO objective over uniformly sampled templates k ~ U({1..K}) with decoupled clipping.

Purpose: Encourage diverse reasoning formats.

Formally: Reward function r includes template-specific format reward r_format alongside accuracy reward r_acc.
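
One plausible way to write the two objectives above in full is shown below. This is a reconstruction from the description, not the paper's stated equation: the notation p_k(q) for "question q wrapped in template k", the symbol ρ for the importance ratio, the standard-deviation normalization of the advantage, and the unweighted sum of the two rewards are all assumptions. No KL term appears because beta_kl is 0.

```latex
\mathcal{J}(\theta) =
\mathbb{E}_{q \sim \mathcal{D},\; k \sim \mathcal{U}(\{1,\dots,K\}),\;
            \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}(\cdot \mid p_k(q))}
\left[
  \frac{1}{\sum_{i=1}^{G} |o_i|}
  \sum_{i=1}^{G} \sum_{t=1}^{|o_i|}
  \min\!\Big(
    \rho_{i,t}\, \hat{A}_i,\;
    \operatorname{clip}\big(\rho_{i,t},\, 1-\epsilon_{\text{low}},\, 1+\epsilon_{\text{high}}\big)\, \hat{A}_i
  \Big)
\right]

\rho_{i,t} =
\frac{\pi_\theta(o_{i,t} \mid p_k(q),\, o_{i,<t})}
     {\pi_{\theta_{\text{old}}}(o_{i,t} \mid p_k(q),\, o_{i,<t})},
\qquad
\hat{A}_i =
\frac{r_i - \operatorname{mean}(\{r_j\}_{j=1}^{G})}
     {\operatorname{std}(\{r_j\}_{j=1}^{G})},
\qquad
r_i = r_{\text{acc}}(o_i) + r_{\text{format}}^{(k)}(o_i)
```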

Adaptation: Full model update

Trainable Parameters: 1.5B

Training Data:

MATH Level 3-5 dataset (subset of 8,523 hard problems)

Key Hyperparameters:

learning_rate: 1e-6
batch_size: Prompt batch size 128, Mini-batch 32
group_size: 8
epochs: 50 (standard baselines use 5-20)
beta_kl: 0 (KL penalty dropped)
epsilon_high: 0.28
epsilon_low: 0.20
max_prompt_length: 1024
max_output_length: 3072
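
To give a feel for the training scale these settings imply, here is a rough back-of-the-envelope calculation. It assumes the last partial batch of each epoch is kept and that the mini-batch size counts prompts rather than rollouts; neither detail is stated in this summary.

```python
import math

num_problems = 8523   # MATH Level 3-5 subset
prompt_batch = 128    # prompts per rollout step
mini_batch = 32       # assumed to be counted in prompts
group_size = 8        # rollouts per prompt
epochs = 50

# Rollout steps needed to see every problem once (last partial batch kept).
steps_per_epoch = math.ceil(num_problems / prompt_batch)
# Responses generated at each rollout step.
rollouts_per_step = prompt_batch * group_size
# Gradient updates per rollout step under the mini-batch assumption.
updates_per_step = prompt_batch // mini_batch
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, rollouts_per_step, updates_per_step, total_steps)
```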

Compute: 8 L40S GPUs

Comparison to Prior Work

vs. DAPO: Augments prompts with 13 diverse templates vs. single fixed format; extends stable training to 50 epochs vs. ~15.
vs. Dr. GRPO: Actively trains on multiple templates simultaneously vs. analyzing sensitivity to single templates.
vs. Question Augmentation (Li et al., 2025a) [not cited in paper]: Augments prompt instructions/templates rather than reformulating the question content itself.

Limitations

Computational cost of reflection-based templates is high, limiting their frequency in the mix
Currently validated primarily on the 1.5B parameter scale due to compute constraints
Relies on rule-based verifiers for math, limiting applicability to open-ended tasks without clear ground truth

Reproducibility

Code: https://github.com/wenquanlu/prompt-augmentation-GRPO

Code and model checkpoints are publicly available. Training uses the open MATH dataset. The evaluation framework matches that of Dr. GRPO and SEED GRPO. The 13 templates are described conceptually, with their exact strings referenced from open sources.

📊 Experiments & Results

Evaluation Setup

Mathematical reasoning on competition-level problems

Benchmarks:

AIME24 (Math Competition)
AMC (Math Competition)
MATH500 (Math Problem Solving)
Minerva Math (Math Problem Solving)
OlympiadBench (Olympiad Math)

Metrics:

Per-benchmark average accuracy (Pass@1)
Per-question average accuracy (Pass@1)
Statistical methodology: Report average accuracy over 5 inference rounds to account for vLLM stochasticity
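
The two averages differ in what they weight equally, which is why they can sit several points apart. The sketch below illustrates this with made-up (correct, total) counts. The benchmark names and numbers are hypothetical and are not the paper's results.

```python
def per_benchmark_average(results):
    """Mean of each benchmark's accuracy: every benchmark counts equally."""
    return sum(correct / total for correct, total in results.values()) / len(results)

def per_question_average(results):
    """Pooled accuracy: every question counts equally."""
    correct = sum(c for c, _ in results.values())
    total = sum(n for _, n in results.values())
    return correct / total

# Hypothetical counts: a small hard benchmark and a large easier one.
results = {"small_hard": (3, 30), "large_easy": (400, 500)}

pb = per_benchmark_average(results)  # (0.10 + 0.80) / 2
pq = per_question_average(results)   # 403 / 530
```

Because large benchmarks dominate the pooled figure, the per-question average tends to track the biggest datasets, while the per-benchmark average gives small sets such as AIME24 the same weight as the rest.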

Key Results

Main results comparing Prompt Augmentation against GRPO and DAPO baselines on Qwen2.5-Math-1.5B.
Benchmark	Metric	Baseline	This Paper	Δ
Average (Per-Question)	Accuracy	48.4	51.8	+3.4
Average (Per-Benchmark)	Accuracy	41.8	45.2	+3.4
AIME24	Accuracy	42.0	46.0	+4.0
MATH500	Accuracy	59.9	64.8	+4.9
OlympiadBench	Accuracy	24.9	27.5	+2.6

Experiment Figures

Comparison of Per-Benchmark and Per-Question Accuracy for GRPO, DAPO, and Prompt Augmentation.

Main Takeaways

Prompt Augmentation effectively delays entropy collapse, allowing models to train for 50 epochs while maintaining performance improvements.
The method stabilizes training in low-entropy regimes; unlike baselines that collapse when entropy drops, Prompt Augmentation continues to improve the policy.
Template-specific format rewards are critical; without them, the model ignores the diverse instructions and collapses much as it does under standard single-template training.
Achieves state-of-the-art performance for the 1.5B scale on hard math benchmarks (AIME, OlympiadBench).

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (PPO/GRPO)
Language Model Post-training
Chain-of-Thought Prompting

Key Terms

GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes rewards within a group of outputs for the same input, removing the need for a value function critic
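
As a small illustration of the group normalization, the sketch below turns the rewards of one group of rollouts into advantages. Using the population standard deviation and the `eps` constant are assumptions; implementations differ on both.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumed choice
    # eps avoids division by zero when every rollout gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]

# One group of G=8 rollouts with binary correctness rewards.
adv = group_relative_advantages([1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0])
```

When all rollouts in a group receive the same reward, every advantage is zero and the group contributes no gradient, which is one reason diverse rollouts matter.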

RLVR: Reinforcement Learning with Verifiable Rewards—using ground-truth correctness (e.g., math answers) as the reward signal for RL training

Entropy collapse: A phenomenon where a model becomes overly confident (deterministic) during training, reducing diversity and causing instability

KL divergence: A measure of how far one probability distribution is from another. In RL it is often used as a penalty term to keep the trained model close to the original reference model; it is often removed in reasoning tasks to allow a larger shift

Chain-of-Thought: Prompting technique where the model generates intermediate reasoning steps before the final answer

vLLM: A high-throughput library for LLM inference and serving

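Decoupled clipping: A technique from DAPO that gives the PPO clipping range separate lower and upper bounds (epsilon_low and epsilon_high). The upper bound limits updates on positive advantages and the lower bound limits updates on negative advantages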
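
A per-token sketch of the clipped surrogate with separate bounds, using the epsilon_low and epsilon_high values listed in this summary as defaults. The function name is hypothetical, and a real implementation would operate on tensors rather than scalars.

```python
def clipped_surrogate(ratio, advantage, eps_low=0.20, eps_high=0.28):
    """PPO-style clipped objective for one token with asymmetric bounds."""
    # Clip the importance ratio to [1 - eps_low, 1 + eps_high].
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    # Take the pessimistic (smaller) of the unclipped and clipped terms.
    return min(ratio * advantage, clipped * advantage)

up = clipped_surrogate(1.5, 1.0)     # capped by the upper bound
down = clipped_surrogate(0.5, -1.0)  # capped by the lower bound
inside = clipped_surrogate(1.1, 1.0) # within the range, left unchanged
```

A larger upper bound leaves more room to raise the probability of low-probability tokens that earned a positive advantage, which is the exploration-friendly side of the asymmetry.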

DAPO: An RL algorithm (Decoupled Clip and Dynamic sAmpling Policy Optimization) that modifies GRPO with decoupled clipping and other stability tricks

Pass@1: The probability that a single generated solution is correct

Policy entropy: A measure of the randomness in the model's token predictions; higher entropy means more diverse outputs
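
For concreteness, the sketch below computes Shannon entropy (in nats) for individual next-token distributions and averages it over a sequence. The function names are hypothetical, and the distributions are toy four-token vocabularies rather than real model outputs.

```python
import math

def token_entropy(probs):
    """Shannon entropy in nats of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_policy_entropy(token_distributions):
    """Average per-token entropy over a generated sequence."""
    return sum(token_entropy(p) for p in token_distributions) / len(token_distributions)

collapsed = token_entropy([1.0, 0.0, 0.0, 0.0])     # fully deterministic
uniform = token_entropy([0.25, 0.25, 0.25, 0.25])   # maximally uncertain
avg = mean_policy_entropy([[1.0, 0.0, 0.0, 0.0], [0.25, 0.25, 0.25, 0.25]])
```

A deterministic distribution has zero entropy, so a policy whose average approaches zero is producing nearly the same tokens every time, which is the collapsed regime described above.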