Teaching Language Models to Critique via Reinforcement Learning

Z Xie, L Chen, W Mao, J Xu, L Kong
The University of Hong Kong, Bytedance Seed
arXiv preprint, February 2025
RL Reasoning

📝 Paper Summary

LLM Self-Improvement · Code Generation · Reinforcement Learning (RL) · Automated Feedback
CTRL trains a dedicated critic model using reinforcement learning to provide actionable feedback that maximizes a generator's ability to fix code errors, outperforming self-correction methods.
Core Problem
Existing LLM self-improvement methods fail because models struggle to produce accurate, actionable feedback on their own outputs (the "feedback bottleneck"), which often causes performance to degrade during iterative refinement.
Why it matters:
  • Without external feedback, self-improvement loops in LLMs often degrade rather than improve performance (e.g., correct solutions are revised into incorrect ones).
  • Current reward models only give numerical scores, and verification tools give low-level traces; neither provides the high-level actionable guidance needed for fixing complex code bugs.
Concrete Example: In a coding problem about finding the k-th nearest obstacle, a standard assistant implementation might incorrectly access a min-heap by index. A standard critic might fail to spot this or give vague advice. CTRL identifies the specific logic error (heaps don't maintain sorted order) and suggests replacing it with a max-heap strategy, leading to a correct solution.
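The heap fix described above can be sketched as follows. This is a minimal illustration of the corrected strategy, not code from the paper: `kth_nearest_after_each` is a hypothetical function name, and the problem is simplified to reporting the k-th smallest Manhattan distance after each new obstacle. The key point the critique makes is that a heap only guarantees its root, so indexing into it (e.g. `heap[k-1]`) is a logic error; a bounded max-heap makes the k-th nearest distance the root.

```python
import heapq

def kth_nearest_after_each(obstacles, k):
    """After each new obstacle (x, y), report the k-th smallest
    Manhattan distance seen so far, or -1 if fewer than k exist.

    A heap does NOT keep its elements in sorted order, so indexing
    into it for the k-th element is incorrect. Instead, keep a
    max-heap of the k smallest distances (heapq is a min-heap, so
    distances are stored negated): the root is the k-th nearest.
    """
    heap = []  # negated distances; at most k entries
    results = []
    for x, y in obstacles:
        d = abs(x) + abs(y)
        if len(heap) < k:
            heapq.heappush(heap, -d)
        elif d < -heap[0]:
            # New obstacle is closer than the current k-th nearest.
            heapq.heapreplace(heap, -d)
        results.append(-heap[0] if len(heap) == k else -1)
    return results
```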
Key Novelty
Critic Training via Reinforcement Learning (CTRL)
  • Decouples the critic from the generator and trains the critic specifically to maximize the probability that the *generator* produces a correct solution after receiving feedback.
  • Uses a two-stage process: first, critiques are synthesized from ground-truth execution feedback to warm-start the critic via supervised fine-tuning; then the critic is refined with Group Relative Policy Optimization (GRPO), whose group-relative baseline handles the high variance of feedback quality.
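The RL stage above can be made concrete with a short sketch. Everything here is a simplified illustration under stated assumptions, not the paper's implementation: the reward is binary (1 if the frozen generator's revision passes the unit tests after seeing the critique, else 0), `generator_fix` is a hypothetical callable standing in for the generator, and GRPO is reduced to its group-relative advantage computation (the KL penalty and clipped policy update are omitted).

```python
from statistics import mean, pstdev

def critique_reward(generator_fix, solution, critique, unit_tests):
    """Binary reward for one sampled critique: did the generator's
    revision pass all tests after receiving the feedback?"""
    revised = generator_fix(solution, critique)
    return 1.0 if all(test(revised) for test in unit_tests) else 0.0

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: normalize each sampled critique's
    reward by the group's mean and standard deviation. This replaces
    a learned value baseline and tames the high variance of
    critique-quality rewards within a group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

In training, several critiques are sampled per problem, each is scored with `critique_reward`, and the resulting `grpo_advantages` weight the policy-gradient update on the critic only; the generator stays frozen.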
Evaluation Highlights
  • +106.1% relative improvement in Pass@1 on CodeContests when using CTRL with Qwen2.5-Coder compared to zero-shot generation.
  • Achieves 23.03% Pass@1 on CodeContests when guiding GPT-4o, outperforming GPT-4o's self-critique (20.97%) despite the critic being a smaller model.
  • Reduces regression rate (correct solutions becoming incorrect) to 0.85% compared to 3.03% for SFT baselines, enabling stable multi-turn refinement.
Breakthrough Assessment
8/10
Significant because it demonstrates weak-to-strong generalization, where a smaller critic improves a larger model (GPT-4o), and because RL training of the critic resolves the instability of iterative refinement.