Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate

📝 Paper Summary

Mathematical Reasoning Instruction Tuning / Fine-Tuning

Critique Fine-Tuning improves reasoning by training models to critique noisy responses rather than imitating correct ones, achieving state-of-the-art efficiency and performance on math benchmarks.

Core Problem

Standard Supervised Fine-Tuning (SFT) forces models to passively imitate annotated responses, which yields diminishing returns or even performance degradation on already-strong base models that require deep reasoning rather than surface-level pattern matching.

Why it matters:

Strong base models like Qwen2.5-Math already possess extensive domain knowledge; SFT on noisy or simple data can actively damage their reasoning capabilities
Building high-quality SFT datasets typically requires millions of samples, which is costly in both data collection and training compute
Imitation learning overlooks the 'critical thinking' process—analyzing flaws and verifying correctness—that is essential for robust reasoning

Concrete Example: When the strong base model Qwen2.5-Math-7B is trained via standard SFT on the WebInstruct dataset, its average math accuracy actually drops from 37.8% (base) to 35.1% due to noise and imitation issues. In contrast, training it to critique those same noisy responses (CFT) boosts accuracy to 57.1%.

Key Novelty

Critique Fine-Tuning (CFT)

Shifts training objective from maximizing likelihood of the correct answer P(y|x) to maximizing likelihood of a critique P(c|x,y) given a question and a noisy response
Uses a teacher model (GPT-4o) to generate critiques that identify errors, suggest improvements, and verify correctness for noisy data points
Enables the model to learn from both correct and incorrect attempts, mimicking human learning through critical analysis rather than rote memorization

Architecture

Comparison of Supervised Fine-Tuning (SFT) vs. Critique Fine-Tuning (CFT) workflows.

Evaluation Highlights

Outperforms strong SFT baselines by 4–10 absolute percentage points across six mathematical reasoning benchmarks (including MATH and AIME24)
Matches the performance of SimpleRL (DeepSeek-R1 replication) while using about 140x less compute (8 H100 hours vs 1152 H100 hours)
Achieves superior performance with only 50K training samples, beating official instruct models trained on over 2 million samples (e.g., Qwen2.5-Math-Instruct)
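
The efficiency figures above can be checked with a few lines of arithmetic. This is only a sanity check on the numbers quoted in this summary; the variable names are illustrative, and the 2,000,000 figure is a lower bound taken from the phrase "over 2 million samples".

```python
# GPU-hour figures as quoted in the summary.
cft_gpu_hours = 8           # 8 H100 hours for CFT
simplerl_gpu_hours = 1152   # 1152 H100 hours for SimpleRL

compute_ratio = simplerl_gpu_hours / cft_gpu_hours

# Sample counts as quoted in the summary.
cft_samples = 50_000
instruct_samples_lower_bound = 2_000_000  # "over 2 million", so a lower bound

data_ratio_lower_bound = instruct_samples_lower_bound / cft_samples

print(f"compute ratio: {compute_ratio:.0f}x")
print(f"data ratio: at least {data_ratio_lower_bound:.0f}x")
```

The exact compute ratio is 144, which the summary rounds to 140x.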

Breakthrough Assessment

9/10

Offers a highly efficient alternative to SFT and RL for reasoning, achieving SOTA results with a fraction of the data and compute. Effectively addresses the 'diminishing returns of SFT' problem.

⚙️ Technical Details

Problem Definition

Setting: Conditional text generation where the model learns to generate a critique given a query and a potential response

Inputs: Concatenation of Question x and Noisy Response y: [x; y]

Outputs: Critique c (which analyzes the correctness of y)
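
As a minimal sketch of this input/output definition, one training example might be assembled as follows. The prompt template, the `CFTExample` class, and the `build_cft_example` helper are illustrative assumptions; this summary does not give the paper's actual template.

```python
from dataclasses import dataclass


@dataclass
class CFTExample:
    prompt: str  # model input: question x followed by noisy response y
    target: str  # supervised output: critique c


def build_cft_example(question: str, noisy_response: str, critique: str) -> CFTExample:
    """Concatenate [x; y] as the input and use the critique c as the target."""
    # Hypothetical template; the exact wording used in the paper is not known here.
    prompt = (
        f"Question:\n{question.strip()}\n\n"
        f"Response:\n{noisy_response.strip()}\n\n"
        "Critique:\n"
    )
    return CFTExample(prompt=prompt, target=critique.strip())


example = build_cft_example(
    question="What is 7 * 8?",
    noisy_response="7 * 8 = 54",
    critique="The multiplication is wrong: 7 * 8 = 56, not 54. Conclusion: incorrect.",
)
```

Note that the corrected value appears only in the target, never in the prompt: the model sees the flawed response and must produce the analysis.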

Pipeline Flow

Data Construction (Pairing Questions with Noisy Responses)
Teacher Critique Generation (GPT-4o generates critiques)
Critique Fine-Tuning (Model trains on [Question; Response] -> Critique)
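
The first two stages of this flow can be sketched as a small data-construction loop. The teacher is passed in as a plain callable so the sketch stays runnable offline; `build_cft_records` and `stub_teacher` are hypothetical names, and the stub merely stands in for a real GPT-4o call.

```python
from typing import Callable, Dict, Iterable, List

# A teacher maps (question, noisy response) to a critique string.
Teacher = Callable[[str, str], str]


def build_cft_records(
    pairs: Iterable[Dict[str, str]], teacher: Teacher
) -> List[Dict[str, str]]:
    """Attach a teacher critique to every (question, response) pair."""
    records = []
    for pair in pairs:
        critique = teacher(pair["question"], pair["response"])
        if not critique.strip():
            continue  # drop pairs for which the teacher returned nothing usable
        records.append(
            {
                "question": pair["question"],
                "response": pair["response"],
                "critique": critique,
            }
        )
    return records


def stub_teacher(question: str, response: str) -> str:
    """Stand-in for the teacher model; returns a canned critique."""
    return f"Checked the response to '{question}'. Conclusion: needs verification."


records = build_cft_records([{"question": "2 + 2?", "response": "5"}], stub_teacher)
```

The resulting records feed the third stage, where the model is fine-tuned to produce the `critique` field from the other two.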

System Modules

Noisy Response Generator (Data Construction)

Provide the 'student' response to be critiqued (can be from the dataset itself or generated by a weaker model)

Model or implementation: Original WebInstruct responses OR Qwen2.5-Base generations

Teacher Critique Model (Data Construction)

Generate high-quality critiques (labels) for training

Model or implementation: GPT-4o-1120

Student Model (Target)

Learn to generate critiques

Model or implementation: Qwen2.5-Math-7B (Base)

Novel Architectural Elements

Shift in the training data format rather than the network architecture: the input includes the (noisy) answer and the output is the analysis (critique), unlike standard SFT (Input=Question, Output=Answer)
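
One common way to realize this input/output split in a causal language model is to mask the input tokens out of the loss so that only critique tokens are supervised. Whether the paper masks the question and response tokens is not stated in this summary, so treat the sketch below as an assumption. The value -100 is the default `ignore_index` of PyTorch's `CrossEntropyLoss`; the token ids are made up.

```python
from typing import List, Tuple

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def make_inputs_and_labels(
    prompt_ids: List[int], critique_ids: List[int]
) -> Tuple[List[int], List[int]]:
    """Build one sequence where only the critique tokens contribute to the loss."""
    input_ids = list(prompt_ids) + list(critique_ids)
    # Positions covering [question; response] are ignored by the loss.
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(critique_ids)
    return input_ids, labels


# Made-up token ids: three prompt tokens, two critique tokens.
input_ids, labels = make_inputs_and_labels([11, 12, 13], [21, 22])
```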

Modeling

Base Model: Qwen2.5-Math-7B, Qwen2.5-7B, DeepSeek-Math-7B

Training Method: Supervised Fine-Tuning on Critique Data (CFT)

Objective Functions:

Purpose: Maximize likelihood of the critique given the query and noisy response.

Formally: argmax_theta log P(c | [x; y]; theta)
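
Written out over a training set, the objective takes the following form. The dataset symbol \mathcal{D} and the token index t are notation introduced here for illustration; the second line is the standard autoregressive factorization, which the summary implies but does not state.

```latex
\theta^{*} = \arg\max_{\theta} \sum_{(x,\,y,\,c) \in \mathcal{D}} \log P\bigl(c \mid [x; y];\, \theta\bigr),
\qquad
\log P\bigl(c \mid [x; y];\, \theta\bigr) = \sum_{t=1}^{|c|} \log P\bigl(c_t \mid [x; y],\, c_{<t};\, \theta\bigr)
```

For comparison, standard SFT would instead maximize \log P(y \mid x; \theta), with the response y as the target rather than as part of the conditioning context.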

Adaptation: Full fine-tuning

Trainable Parameters: 7B parameters (Full model)

Training Data:

WebInstruct-CFT (50K samples): Derived from WebInstruct, with critiques generated by GPT-4o
Validation set: MATH-500

Key Hyperparameters:

learning_rate: 5e-6
batch_size: 512 (global)
epochs: 1
scheduler: cosine decay
warmup_ratio: 0.1
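
To make these settings concrete, the sketch below derives the number of optimizer steps and evaluates a warmup-plus-cosine learning-rate schedule. It assumes exactly 50,000 samples, linear warmup, and decay to zero; none of these details are confirmed by this summary, and `lr_at_step` is a hypothetical helper rather than the authors' code.

```python
import math


def lr_at_step(step: int, total_steps: int, peak_lr: float = 5e-6,
               warmup_ratio: float = 0.1) -> float:
    """Linear warmup to peak_lr, then cosine decay toward zero (assumed shape)."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


# One epoch over 50,000 samples at a global batch size of 512.
total_steps = math.ceil(50_000 / 512)

for step in (0, 8, 50, total_steps - 1):
    print(step, f"{lr_at_step(step, total_steps):.3e}")
```

Under these assumptions a full run is fewer than 100 optimizer steps, which is consistent with the short training time reported.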

Compute: 8xH100 GPUs for 1 hour (for 50K examples)

Comparison to Prior Work

vs. SFT: CFT learns to critique input-output pairs rather than just mapping input to output
vs. SimpleRL: CFT is supervised (offline) and requires significantly less compute (140x less) than online RL exploration
vs. Self-Correction [not cited in paper]: CFT embeds the critique capability into the weights via fine-tuning rather than relying on prompt-based inference-time correction

Limitations

Relies on a strong teacher model (GPT-4o) to synthesize critiques
Performance gains might be specific to reasoning-heavy tasks (Math/STEM)
Does not explore iterative critique or multi-turn refinement during inference

Reproducibility

Code: https://tiger-ai-lab.github.io/CritiqueFineTuning/

Project page provided. Dataset construction (WebInstruct-CFT) fully described using GPT-4o. Base models are open weights (Qwen, DeepSeek). Training costs are very low (1 hour on 8 H100s), facilitating replication.

📊 Experiments & Results

Evaluation Setup

Zero-shot evaluation on mathematical and STEM reasoning benchmarks

Benchmarks:

MATH (Mathematical reasoning)
GSM8K (Grade school math)
AIME 2024 (Competition math)
GPQA (Scientific reasoning)
MT-Bench (General instruction following)

Metrics:

Accuracy (Pass@1)
Statistical methodology: Not explicitly reported in the paper
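
The accuracy metric listed above can be illustrated with a simple exact-match scorer over one sampled answer per problem. The whitespace-and-case normalization here is a deliberately naive assumption; real math evaluation harnesses usually check symbolic equivalence, and the paper's own scoring code is not described in this summary.

```python
from typing import Sequence


def normalize(answer: str) -> str:
    """Naive normalization: drop all whitespace and lowercase."""
    return "".join(answer.split()).lower()


def pass_at_1_accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Percentage of problems whose single prediction matches the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one problem is required")
    correct = sum(
        normalize(pred) == normalize(ref)
        for pred, ref in zip(predictions, references)
    )
    return 100.0 * correct / len(references)


# Two of the three predictions match their references after normalization.
score = pass_at_1_accuracy(["56", " 1/2", "x = 3"], ["56", "1/2", "x=4"])
```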

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Main comparison showing CFT consistently outperforming SFT variants on the Qwen2.5-Math-7B base model across standard math benchmarks.
Average (MATH, GSM8K, Minerva, AIME24, AMC23, OlympiadBench)	Accuracy	50.4	57.1	+6.7
MATH	Accuracy	73.9	80.6	+6.7
AIME 2024	Accuracy	30.0	50.0	+20.0
Efficiency comparison showing CFT matching heavy-compute RL methods.
Average (5 math benchmarks)	Accuracy	60.4	60.4	0.0
Data efficiency comparison against official instruct models trained on millions of samples.
Average (All 9 STEM benchmarks)	Accuracy	47.7	48.1	+0.4

Main Takeaways

CFT is significantly more data-efficient than SFT; 50K CFT samples outperform 2.5M SFT samples on Qwen2.5-Math.
Critique-based training generalizes well: improving math reasoning also boosts scores on general instruction following (MT-Bench) and strict format following (IFEval).
The method is robust to the source of noisy responses (using original dataset noise vs model-generated noise yields similar results).
Even using a weaker teacher for critiques (GPT-4o-mini) yields substantial gains (+11.9% on MATH vs SFT-verified), though a stronger teacher (GPT-4o) is better.

📚 Prerequisite Knowledge

Prerequisites

Supervised Fine-Tuning (SFT)
Language Model Pre-training
Reinforcement Learning (RL) concepts (for comparison)

Key Terms

CFT: Critique Fine-Tuning—training a model to generate critiques of noisy answers instead of generating the answers directly

SFT: Supervised Fine-Tuning—training a model to imitate reference answers given questions

WebInstruct: A dataset of instruction-response pairs collected from online educational resources, used as the source for noisy responses

SimpleRL: An open replication of the DeepSeek-R1 reinforcement learning method for reasoning models

Zero-shot: Testing a model without providing any example inputs in the prompt

CoT: Chain-of-Thought—a prompting method where models generate intermediate reasoning steps before the final answer