iGRPO: Self-Feedback-Driven LLM Reasoning

📝 Paper Summary

Mathematical Reasoning, Reinforcement Learning for LLMs

iGRPO improves mathematical reasoning by training models to refine their own best generated drafts, creating a self-improving feedback loop within the reinforcement learning optimization process.

Core Problem

Standard Reinforcement Learning (RL) treats reasoning generation as a single-pass process, failing to leverage the iterative refinement and self-correction strategies that characterize effective human problem-solving.

Why it matters:

Humans rarely solve complex problems in one attempt; they iterate and refine based on internal feedback
Existing RL methods such as GRPO optimize independent generations, missing the opportunity to learn from the model's own best prior attempts
Single-pass optimization limits the model's ability to correct errors or deepen reasoning chains during the learning process

Concrete Example: When training on a complex math problem, standard GRPO treats every attempt as independent. In contrast, iGRPO first generates several drafts, identifies the one that reached the correct answer (even if its reasoning was messy), and then feeds this best draft back to the model as part of the prompt, training it to refine and perfect that specific solution path.

Key Novelty

Iterative Group Relative Policy Optimization (iGRPO)

Introduces a two-stage training loop: Stage 1 samples exploratory drafts and selects the best one using the reward signal; Stage 2 optimizes the policy to generate refinements conditioned on that best draft.
Uses 'dynamic self-conditioning', in which the prompting context evolves (bootstraps) as the policy improves, so the model always trains on refining its current best outputs.
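The two-stage loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `sample_fn`, `reward_fn`, and the template in `build_refinement_prompt` are hypothetical stand-ins for the policy's sampler, the reward signal, and the paper's prompt templates, and the policy update itself is left out.

```python
from typing import Callable, List, Tuple


def select_best_draft(drafts: List[str], rewards: List[float]) -> str:
    """Return the highest-reward draft (the earliest one wins ties)."""
    best_index = max(range(len(drafts)), key=lambda i: rewards[i])
    return drafts[best_index]


def build_refinement_prompt(question: str, best_draft: str) -> str:
    """Hypothetical template: attach the best draft to the original question."""
    return (
        f"Problem:\n{question}\n\n"
        f"Previous attempt:\n{best_draft}\n\n"
        "Refine the attempt above into a final solution."
    )


def igrpo_rollouts(
    question: str,
    sample_fn: Callable[[str], str],
    reward_fn: Callable[[str], float],
    n_drafts: int = 8,
    n_refinements: int = 8,
) -> Tuple[str, List[str], List[float]]:
    """Collect the rollouts for one prompt; the policy update is omitted."""
    # Stage 1: sample exploratory drafts from the bare question and score them.
    drafts = [sample_fn(question) for _ in range(n_drafts)]
    best_draft = select_best_draft(drafts, [reward_fn(d) for d in drafts])

    # Stage 2: sample refinements conditioned on the best draft and score them.
    # Only these Stage 2 rewards feed the group-relative advantages.
    prompt = build_refinement_prompt(question, best_draft)
    refinements = [sample_fn(prompt) for _ in range(n_refinements)]
    rewards = [reward_fn(r) for r in refinements]
    return best_draft, refinements, rewards
```

Because Stage 1 is re-run with the current policy at every training step, the draft placed in the prompt changes as the policy improves, which is what makes the conditioning dynamic.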

Architecture

The two-stage training workflow of iGRPO.

Evaluation Highlights

Achieves 85.62% accuracy on AIME24 with OpenReasoning-Nemotron-7B, setting a new state-of-the-art result.
Achieves 79.64% accuracy on AIME25 with OpenReasoning-Nemotron-7B.
Consistently outperforms standard GRPO baselines across 7B and 14B parameter models on benchmarks like MATH and GSM8K under matched rollout budgets.

Breakthrough Assessment

8/10

Offers a simple yet logically grounded extension to GRPO that aligns training with iterative human reasoning. The reported gains on difficult benchmarks like AIME are significant, though full baseline numbers for direct comparison are needed.

⚙️ Technical Details

Problem Definition

Setting: Reinforcement Learning from verifiable rewards (math problems)

Inputs: Prompt q (e.g., math problem)

Outputs: Reasoning trace and final answer a

Pipeline Flow

Input Processing (Prompt q)
Generation (Single-shot inference)
Output (Answer a)

System Modules

Base LLM

Generate reasoning trace and answer from prompt

Model or implementation: OpenReasoning-Nemotron-7B (inference is single-shot)

Modeling

Base Model: OpenReasoning-Nemotron-7B / DeepSeek-R1 Distilled / OpenMath-Nemotron (7B/14B)

Training Method: Iterative Group Relative Policy Optimization (iGRPO)

Objective Functions:

Purpose: Optimize policy to improve refinements of best drafts.

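Formally: Maximize E[min(ratio * A, clip(ratio, 1 - epsilon, 1 + epsilon) * A)] - beta * D_KL, where ratio = pi_theta / pi_old is the per-token probability ratio between the current policy and the policy that produced the rollouts, and the advantages A are computed from Stage 2 completions relative to the group mean.

Purpose: Penalize deviation from reference policy.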

Formally: Per-token KL divergence estimator D_KL = (pi_ref / pi_theta) - log(pi_ref / pi_theta) - 1.
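A small NumPy sketch of how the clipped objective above could be evaluated for one group of Stage 2 completions. Several details are assumptions made for illustration: dividing by the group standard deviation follows standard GRPO, `clip_eps=0.2` and `beta=0.01` are placeholder values rather than the paper's settings, all completions are taken to have the same length (no padding mask), and the per-token KL penalty is passed in precomputed.

```python
import numpy as np


def group_relative_advantages(rewards, eps: float = 1e-8) -> np.ndarray:
    """Center rewards on the group mean and scale by the group std.

    If every completion gets the same reward, all advantages are zero.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)


def clipped_objective(
    logp_new: np.ndarray,       # (G, T) token log-probs under the current policy
    logp_old: np.ndarray,       # (G, T) token log-probs under the rollout policy
    advantages: np.ndarray,     # (G,)   one advantage per completion
    kl_per_token: np.ndarray,   # (G, T) precomputed KL penalty per token
    clip_eps: float = 0.2,      # illustrative value
    beta: float = 0.01,         # illustrative value
) -> float:
    """Mean over tokens of min(ratio * A, clip(ratio) * A) - beta * KL."""
    ratio = np.exp(logp_new - logp_old)
    adv = advantages[:, None]   # broadcast each advantage over its tokens
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    surrogate = np.minimum(unclipped, clipped)
    return float((surrogate - beta * kl_per_token).mean())
```

Taking the minimum of the clipped and unclipped terms means a completion with positive advantage stops contributing extra gain once its ratio exceeds 1 + epsilon, while a completion with negative advantage is still penalized in full.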

Trainable Parameters: Full model parameters (policy)

Training Data:

AceReason-Math dataset
MATH dataset

Key Hyperparameters:

stage_1_samples_N: 8
stage_2_samples_G: 8
comparison_baseline_G_GRPO: 16
rollout_budget: Fixed (N + G = G_GRPO)

Compute: Matched rollout budget to standard GRPO (no increase in generation cost per step)
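The matched-budget constraint is simple arithmetic: the drafts and refinements sampled per prompt together equal the GRPO baseline's group size. A minimal sketch using the hyperparameter values listed above; the class and field names are illustrative, not taken from the paper's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RolloutBudget:
    stage_1_samples: int   # N: exploratory drafts per prompt
    stage_2_samples: int   # G: refinements per prompt
    grpo_group_size: int   # G_GRPO: completions per prompt in the GRPO baseline

    @property
    def igrpo_rollouts_per_prompt(self) -> int:
        return self.stage_1_samples + self.stage_2_samples

    def is_matched(self) -> bool:
        """True when iGRPO samples exactly as many completions as the baseline."""
        return self.igrpo_rollouts_per_prompt == self.grpo_group_size


budget = RolloutBudget(stage_1_samples=8, stage_2_samples=8, grpo_group_size=16)
print(budget.igrpo_rollouts_per_prompt, budget.is_matched())  # 16 True
```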

Comparison to Prior Work

vs. GRPO: iGRPO adds a draft selection stage and conditions the optimization step on the best self-generated draft.
vs. Critique-GRPO: iGRPO uses the best draft itself as context rather than generating a separate critique.
vs. STaR: iGRPO integrates the selection and refinement directly into the on-policy RL loop (GRPO) rather than an offline fine-tuning loop.

Limitations

Dynamic self-conditioning is used only during training; inference remains single-shot, potentially creating a train-test mismatch.
Requires a verifiable reward signal (e.g., correct answer) to select the best draft in Stage 1, limiting applicability to tasks with ground truth.
Depends on the base model's ability to generate at least one correct draft in the initial exploration stage.

Reproducibility

Prompt templates are provided in the supplementary materials. Code is not explicitly linked. OpenReasoning-Nemotron-7B is named as a base model, but it is unclear whether iGRPO-trained weights are available.

📊 Experiments & Results

Evaluation Setup

Mathematical reasoning tasks with ground-truth answers

Benchmarks:

AIME24 (High-school math competition)
AIME25 (High-school math competition)
MATH (Competition math problems)
GSM8K (Grade school math)
AMC23 (High-school math competition)

Metrics:

Accuracy (Pass@1)
Statistical methodology: Not explicitly reported in the paper
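For reference, Pass@1 accuracy can be computed as the percentage of problems whose single sampled answer matches the ground truth. The sketch below is a simplified assumption: it uses exact string matching after a hypothetical `normalize_answer` step, whereas real math evaluations typically use math-aware answer checking, and the paper's exact protocol is not given here.

```python
from typing import Sequence


def normalize_answer(answer: str) -> str:
    """Hypothetical normalization: trim whitespace and drop a trailing period."""
    return answer.strip().rstrip(".").strip()


def pass_at_1(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Percentage of problems whose single sampled answer matches the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot score an empty benchmark")
    correct = sum(
        normalize_answer(p) == normalize_answer(r)
        for p, r in zip(predictions, references)
    )
    return 100.0 * correct / len(references)
```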

Main Takeaways

iGRPO achieves state-of-the-art results on AIME24 (85.62%) and AIME25 (79.64%) using the OpenReasoning-Nemotron-7B base model.
The method consistently outperforms standard GRPO on 7B and 14B models across multiple math benchmarks (MATH, GSM8K, AMC23) when using matched rollout budgets.
Ablations suggest the benefits come from the iterative refinement process ('bootstrapping') where better policies generate better drafts, which in turn enable better learning.
The approach effectively delays entropy collapse compared to standard baselines, maintaining exploration longer during training.
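Entropy collapse refers to the policy's next-token distributions becoming nearly deterministic, which reduces exploration. One common way to track it is the mean token-level entropy of the policy over sampled positions; the sketch below assumes access to raw logits and is not necessarily the measure used in the paper.

```python
import numpy as np


def mean_token_entropy(logits: np.ndarray) -> float:
    """Mean Shannon entropy (in nats) of softmax(logits) over token positions.

    logits has shape (num_tokens, vocab_size).
    """
    # Subtract the row-wise max before exponentiating for numerical stability.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    entropy = -(np.exp(log_probs) * log_probs).sum(axis=-1)
    return float(entropy.mean())
```

A uniform distribution over a vocabulary of size V gives the maximum value log(V), and a near one-hot distribution gives a value close to zero, so a curve that stays higher for longer indicates that exploration is being maintained.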

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (RL)
Proximal Policy Optimization (PPO)
Group Relative Policy Optimization (GRPO)

Key Terms

GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by normalizing rewards within a group of sampled outputs rather than using a learned value function

iGRPO: Iterative Group Relative Policy Optimization—the proposed method that adds a draft-generation and self-conditioning stage to GRPO

AIME: American Invitational Mathematics Examination—a challenging high-school level mathematics competition used as a benchmark

bootstrapping: A process where a system improves itself by using its own outputs (e.g., best drafts) as training signals

dynamic self-conditioning: Conditioning the model generation on its own previous best outputs, which evolve (change) as the model learns

PPO: Proximal Policy Optimization—a standard RL algorithm that limits how much the policy can change in one step to ensure stability

rollout: A single complete generation (completion) produced by the model during the RL training process

KL divergence: A statistical measure of how one probability distribution differs from another, used here to prevent the model from drifting too far from its original behavior