
DisCO: Reinforcing Large Reasoning Models with Discriminative Constrained Optimization

Gang Li, Ming Lin, Tomer Galanti, Zhengzhong Tu, Tianbao Yang
Texas A&M University
arXiv (2025)

📝 Paper Summary

Reinforcement Learning for LLMs · Mathematical Reasoning · Policy Optimization
DisCO replaces the variance-normalized advantage in Group Relative Policy Optimization (GRPO) with a discriminative objective and a squared-hinge KL constraint to eliminate difficulty bias and improve stability.
Core Problem
GRPO suffers from a 'difficulty bias': its variance-normalized advantage inherently down-weights questions that are too hard or too easy, and its PPO-style clipping mechanism leads to entropy collapse.
Why it matters:
  • Current methods waste valuable training signals from very hard or very easy questions due to aggressive variance normalization
  • Entropy collapse in existing RL methods (like PPO/GRPO) causes models to lose exploration capabilities and produce repetitive outputs
  • Heuristic fixes like DAPO introduce new instabilities or excessive entropy growth without solving the root mathematical limitations
Concrete Example: If a model answers a hard question correctly in only 1 out of 10 rollouts (p=0.1), GRPO's group-level variance normalization shrinks the total learning signal that question contributes, effectively discounting a crucial learning opportunity. DisCO instead treats the success simply as a positive instance to be reinforced, independent of difficulty.
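The down-weighting can be made concrete with a short sketch (plain Python; the helper names are illustrative, and it assumes binary 0/1 rewards with GRPO's group-normalized advantage (r - mean)/std):

```python
import math

def grpo_advantages(rewards):
    """Group-normalized advantages as in GRPO: (r - mean) / std."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + 1e-8) for r in rewards]

def positive_signal(rewards):
    """Total weight placed on correct answers in one group of rollouts."""
    return sum(a for a in grpo_advantages(rewards) if a > 0)

hard = [1] + [0] * 9        # p = 0.1: one success in ten rollouts
medium = [1] * 5 + [0] * 5  # p = 0.5: five successes in ten

# For binary rewards, a group's total positive signal works out to
# n * sqrt(p * (1 - p)), which vanishes as p -> 0 or p -> 1.
print(positive_signal(hard))    # ≈ 3.0
print(positive_signal(medium))  # ≈ 5.0
```

Even though the single correct rollout on the hard question gets a large per-sample advantage, the question as a whole contributes less total signal than a medium-difficulty one, and the gap widens as p approaches 0 or 1.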
Key Novelty
Discriminative Constrained Optimization (DisCO)
  • Reframes RL fine-tuning as a discriminative learning problem (similar to AUC maximization), increasing scores for correct answers and decreasing them for incorrect ones regardless of question difficulty
  • Replaces unstable clipping (PPO-style) with a squared-hinge penalty function that strictly enforces a KL divergence trust region, ensuring stability without vanishing gradients
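The two ideas above can be sketched together in a few lines of plain Python. This is a minimal illustration, not the paper's exact objective: the scoring function (e.g. mean log-likelihood per generation) and the delta/beta values are hypothetical placeholders.

```python
def squared_hinge_kl_penalty(kl, delta=0.01, beta=100.0):
    """Zero inside the trust region KL <= delta, growing quadratically
    outside it, so the gradient does not vanish abruptly the way
    PPO-style clipping can. delta and beta are illustrative values."""
    return beta * max(0.0, kl - delta) ** 2

def disco_style_objective(pos_scores, neg_scores, kl):
    """Discriminative sketch: push scores of correct generations up and
    of incorrect ones down with equal weight, independent of question
    difficulty, subject to the soft KL trust region."""
    pos = sum(pos_scores) / len(pos_scores) if pos_scores else 0.0
    neg = sum(neg_scores) / len(neg_scores) if neg_scores else 0.0
    return (pos - neg) - squared_hinge_kl_penalty(kl)

print(disco_style_objective([1.0], [0.0], kl=0.005))  # inside trust region: no penalty
print(disco_style_objective([1.0], [0.0], kl=0.02))   # outside: quadratic penalty applied
```

The design choice the bullets describe is visible here: every correct generation contributes equally to the score gap (no variance term), and the penalty is a smooth constraint rather than a hard clip.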
Evaluation Highlights
  • +7% average improvement over GRPO on 1.5B-parameter models across six mathematical reasoning benchmarks
  • +6% average improvement over DAPO (a recent GRPO variant) on the same benchmarks
  • Outperforms DeepScaleR-1.5B (trained with 24k context length) while using only 8k context length for both training and inference
Breakthrough Assessment
8/10
Offers a principled theoretical correction to GRPO's difficulty bias and a robust optimization strategy. The gains are significant (+7%) and the removal of clipping addresses a fundamental RLHF stability issue.