
DisCO: Reinforcing Large Reasoning Models with Discriminative Constrained Optimization

Gang Li, Ming Lin, Tomer Galanti, Zhengzhong Tu, Tianbao Yang
Texas A&M University
arXiv (2025)

📝 Paper Summary

Reinforcement Learning for LLMs · Mathematical Reasoning · Policy Optimization
DisCO replaces the variance-normalized advantage in Group Relative Policy Optimization (GRPO) with a discriminative objective and a squared-hinge KL constraint to eliminate difficulty bias and improve stability.
Core Problem
GRPO suffers from a 'difficulty bias': its variance-normalized advantage inherently down-weights questions that are too hard or too easy, and its PPO-style clipping mechanism leads to entropy collapse.
Why it matters:
  • Current methods waste valuable training signals from very hard or very easy questions due to aggressive variance normalization
  • Entropy collapse in existing RL methods (like PPO/GRPO) causes models to lose exploration capabilities and produce repetitive outputs
  • Heuristic fixes like DAPO introduce new instabilities or excessive entropy growth without solving the root mathematical limitations
Concrete Example: If a model answers a hard question correctly in only 1 out of 10 rollouts (p=0.1), GRPO's group-level variance normalization shrinks the total learning signal that question contributes, effectively discounting a crucial learning opportunity. DisCO instead treats the success simply as a positive instance to be reinforced, independent of difficulty.
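The down-weighting can be made concrete with a short sketch (plain Python; the helper names are illustrative, and it assumes binary 0/1 rewards with GRPO's group-normalized advantage (r - mean)/std):

```python
import math

def grpo_advantages(rewards):
    """Group-normalized advantages as in GRPO: (r - mean) / std."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + 1e-8) for r in rewards]

def positive_signal(rewards):
    """Total weight placed on correct answers in one group of rollouts."""
    return sum(a for a in grpo_advantages(rewards) if a > 0)

hard = [1] + [0] * 9        # p = 0.1: one success in ten rollouts
medium = [1] * 5 + [0] * 5  # p = 0.5: five successes in ten

# For binary rewards, a group's total positive signal works out to
# n * sqrt(p * (1 - p)), which vanishes as p -> 0 or p -> 1.
print(positive_signal(hard))    # ≈ 3.0
print(positive_signal(medium))  # ≈ 5.0
```

Even though the single correct rollout on the hard question gets a large per-sample advantage, the question as a whole contributes less total signal than a medium-difficulty one, and the gap widens as p approaches 0 or 1.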
Key Novelty
Discriminative Constrained Optimization (DisCO)
  • Reframes RL fine-tuning as a discriminative learning problem (similar to AUC maximization), increasing scores for correct answers and decreasing them for incorrect ones regardless of question difficulty
  • Replaces unstable clipping (PPO-style) with a squared-hinge penalty function that strictly enforces a KL divergence trust region, ensuring stability without vanishing gradients
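The two ideas above can be sketched together in a few lines of plain Python. This is a minimal illustration, not the paper's exact objective: the scoring function (e.g. mean log-likelihood per generation) and the delta/beta values are hypothetical placeholders.

```python
def squared_hinge_kl_penalty(kl, delta=0.01, beta=100.0):
    """Zero inside the trust region KL <= delta, growing quadratically
    outside it, so the gradient does not vanish abruptly the way
    PPO-style clipping can. delta and beta are illustrative values."""
    return beta * max(0.0, kl - delta) ** 2

def disco_style_objective(pos_scores, neg_scores, kl):
    """Discriminative sketch: push scores of correct generations up and
    of incorrect ones down with equal weight, independent of question
    difficulty, subject to the soft KL trust region."""
    pos = sum(pos_scores) / len(pos_scores) if pos_scores else 0.0
    neg = sum(neg_scores) / len(neg_scores) if neg_scores else 0.0
    return (pos - neg) - squared_hinge_kl_penalty(kl)

print(disco_style_objective([1.0], [0.0], kl=0.005))  # inside trust region: no penalty
print(disco_style_objective([1.0], [0.0], kl=0.02))   # outside: quadratic penalty applied
```

The design choice the bullets describe is visible here: every correct generation contributes equally to the score gap (no variance term), and the penalty is a smooth constraint rather than a hard clip.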
Evaluation Highlights
  • +7% average improvement over GRPO on 1.5B-parameter models across six mathematical reasoning benchmarks
  • +6% average improvement over DAPO (a recent GRPO variant) on the same benchmarks
  • Outperforms DeepScaleR-1.5B (trained with 24k context length) while using only 8k context length for both training and inference
Breakthrough Assessment
8/10
Offers a principled theoretical correction to GRPO's difficulty bias and a robust optimization strategy. The gains are significant (+7%) and the removal of clipping addresses a fundamental RLHF stability issue.