GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing a group of outputs for the same input, avoiding a separate value function
Pass@1: The percentage of problems for which the model's single generated answer is correct
Pass@k: The probability that at least one of k generated samples is correct
CoT: Chain-of-Thought—a prompting strategy where the model generates intermediate reasoning steps before the final answer
SFT: Supervised Fine-Tuning—training on a fixed dataset of input-output pairs
PPO: Proximal Policy Optimization—a standard RL algorithm using a clipped objective to ensure stable policy updates
R1-Zero: A paradigm for training reasoning models via RL on base models without supervised fine-tuning data, relying on self-evolution
Shaping function: A mechanism that reweights per-token gradients, assigning higher importance to tokens in successful refinements to which the current policy assigns low probability
Eluder dimension: A measure of the complexity of a hypothesis space, used here to analyze the sample efficiency of learning
Critic/Critique: In this paper, a natural language explanation of why an answer is correct or incorrect, distinct from a scalar reward value
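To make the GRPO entry concrete, here is a minimal sketch of group-relative advantage estimation: each sampled output's reward is normalized by the mean and standard deviation of its group, so no learned value function is needed. The function name and exact normalization are illustrative; real implementations typically operate on tensors and may add a small epsilon to the denominator.

```python
import math

def grpo_advantages(rewards):
    """Group-relative advantages for one group of outputs sharing an input:
    normalize each reward by the group's mean and standard deviation,
    replacing the value-function baseline used in PPO."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = math.sqrt(var)
    if std == 0.0:  # all rewards equal: no relative signal in this group
        return [0.0] * n
    return [(r - mean) / std for r in rewards]
```

With binary correctness rewards such as `[1.0, 0.0, 0.0, 1.0]`, the correct outputs receive positive advantages and the incorrect ones negative, summing to zero across the group.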
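The Pass@k entry is commonly estimated with the standard unbiased estimator used in code-generation evaluation: draw n samples, count c correct, and compute 1 - C(n-c, k)/C(n, k), the probability that a random subset of k samples contains at least one correct answer. A minimal sketch:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n total samples with c correct:
    the probability that at least one of k drawn samples is correct."""
    if n - c < k:  # too few incorrect samples to fill k slots: always passes
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with n=4 samples of which c=2 are correct, pass@1 is 0.5 and pass@2 is 5/6.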