Teaching Language Models to Critique via Reinforcement Learning

Z Xie, L Chen, W Mao, J Xu, L Kong
The University of Hong Kong, Bytedance Seed
arXiv preprint, February 2025
RL Reasoning

📝 Paper Summary

LLM Self-Improvement · Code Generation · Reinforcement Learning (RL) · Automated Feedback
CTRL trains a dedicated critic model using reinforcement learning to provide actionable feedback that maximizes a generator's ability to fix code errors, outperforming self-correction methods.
Core Problem
Existing LLM self-improvement methods fail because models struggle to produce accurate, actionable feedback on their own outputs (the "feedback bottleneck"), which often causes performance to degrade during iterative refinement.
Why it matters:
  • Without external feedback, self-improvement loops in LLMs often degrade rather than improve performance (e.g., correct solutions are revised into incorrect ones).
  • Current reward models only give numerical scores, and verification tools give low-level traces; neither provides the high-level actionable guidance needed for fixing complex code bugs.
Concrete Example: In a coding problem about finding the k-th nearest obstacle, a standard assistant implementation might incorrectly access a min-heap by index. A standard critic might fail to spot this or give vague advice. CTRL identifies the specific logic error (heaps don't maintain sorted order) and suggests replacing it with a max-heap strategy, leading to a correct solution.
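The heap fix described above can be sketched as follows. This is a minimal illustration of the corrected strategy, not code from the paper: `kth_nearest_after_each` is a hypothetical function name, and the problem is simplified to reporting the k-th smallest Manhattan distance after each new obstacle. The key point the critique makes is that a heap only guarantees its root, so indexing into it (e.g. `heap[k-1]`) is a logic error; a bounded max-heap makes the k-th nearest distance the root.

```python
import heapq

def kth_nearest_after_each(obstacles, k):
    """After each new obstacle (x, y), report the k-th smallest
    Manhattan distance seen so far, or -1 if fewer than k exist.

    A heap does NOT keep its elements in sorted order, so indexing
    into it for the k-th element is incorrect. Instead, keep a
    max-heap of the k smallest distances (heapq is a min-heap, so
    distances are stored negated): the root is the k-th nearest.
    """
    heap = []  # negated distances; at most k entries
    results = []
    for x, y in obstacles:
        d = abs(x) + abs(y)
        if len(heap) < k:
            heapq.heappush(heap, -d)
        elif d < -heap[0]:
            # New obstacle is closer than the current k-th nearest.
            heapq.heapreplace(heap, -d)
        results.append(-heap[0] if len(heap) == k else -1)
    return results
```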
Key Novelty
Critic Training via Reinforcement Learning (CTRL)
  • Decouples the critic from the generator and trains the critic specifically to maximize the probability that the *generator* produces a correct solution after receiving feedback.
  • Uses a two-stage process: first, critiques are synthesized from ground-truth execution feedback to warm-start the critic via supervised fine-tuning; then the critic is refined with Group Relative Policy Optimization (GRPO), whose group-relative baseline handles the high variance of feedback quality.
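The RL stage above can be made concrete with a short sketch. Everything here is a simplified illustration under stated assumptions, not the paper's implementation: the reward is binary (1 if the frozen generator's revision passes the unit tests after seeing the critique, else 0), `generator_fix` is a hypothetical callable standing in for the generator, and GRPO is reduced to its group-relative advantage computation (the KL penalty and clipped policy update are omitted).

```python
from statistics import mean, pstdev

def critique_reward(generator_fix, solution, critique, unit_tests):
    """Binary reward for one sampled critique: did the generator's
    revision pass all tests after receiving the feedback?"""
    revised = generator_fix(solution, critique)
    return 1.0 if all(test(revised) for test in unit_tests) else 0.0

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: normalize each sampled critique's
    reward by the group's mean and standard deviation. This replaces
    a learned value baseline and tames the high variance of
    critique-quality rewards within a group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

In training, several critiques are sampled per problem, each is scored with `critique_reward`, and the resulting `grpo_advantages` weight the policy-gradient update on the critic only; the generator stays frozen.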
Evaluation Highlights
  • +106.1% relative improvement in Pass@1 on CodeContests when using CTRL with Qwen2.5-Coder compared to zero-shot generation.
  • Achieves 23.03% Pass@1 on CodeContests when guiding GPT-4o, outperforming GPT-4o's self-critique (20.97%) despite the critic being a smaller model.
  • Reduces regression rate (correct solutions becoming incorrect) to 0.85% compared to 3.03% for SFT baselines, enabling stable multi-turn refinement.
Breakthrough Assessment
8/10
Significant because it demonstrates weak-to-strong generalization, where a smaller critic improves a larger model (GPT-4o), and because RL training of the critic resolves the instability of iterative refinement.