CoRPO: Adding a Correctness Bias to GRPO Improves Generalization

📝 Paper Summary

Reinforcement Learning from Verifiable Rewards (RLVR)
Large Language Model Reasoning

CoRPO modifies Group-Relative Policy Optimization (GRPO) by clipping the baseline at a fixed correctness threshold, preventing incorrect solutions from receiving positive reinforcement simply because they outperform a poor group average.

Core Problem

GRPO's group-mean baseline can assign positive advantages to objectively incorrect solutions if they outperform other failures in a sampled group, leading to the reinforcement of bad behaviors.

Why it matters:

Reinforcing incorrect trajectories that happen to be 'less bad' than peers inverts the desired learning signal, effectively teaching the model to fail in specific ways.
GRPO exhibits 'distribution sharpening,' where it prematurely exploits specific solution paths rather than exploring, degrading diversity and robustness.
Standard relative baselines fail when rewards are ordinal (graded) rather than binary, as they measure rank rather than objective correctness.

Concrete Example: In a coding task where a model generates 4 incorrect solutions with rewards -1, -0.8, -0.9, and -0.7, the group mean is -0.85. The solution with reward -0.7 is objectively wrong (failed test cases) but receives a positive advantage (+0.15), reinforcing a failed attempt.
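The arithmetic in this example can be checked with a short sketch. It assumes the advantage is simply the reward minus the group mean, as the example does; GRPO implementations commonly also divide by the group standard deviation, which rescales the values but does not change their signs. The function name `grpo_advantages` is illustrative and does not come from the paper.

```python
from statistics import mean

def grpo_advantages(rewards):
    """Mean-centered advantages: each reward minus the group-mean baseline."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]

# The four failed solutions from the example above.
rewards = [-1.0, -0.8, -0.9, -0.7]
advantages = grpo_advantages(rewards)

for r, a in zip(rewards, advantages):
    sign = "reinforced" if a > 0 else "penalized"
    print(f"reward={r:+.2f}  advantage={a:+.2f}  -> {sign}")
```

Under this simplification the solution with reward -0.8 also ends up with a small positive advantage (about +0.05), so two of the four failures are reinforced.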

Key Novelty

Correctness-Relative Policy Optimization (CoRPO)

Modifies the advantage estimation by clipping the group-mean baseline at a minimum correctness threshold (e.g., the passing score).
Creates a dual-regime baseline: it acts as a static correctness threshold when the group mean falls below the threshold (correctness-seeking) and reverts to the relative group mean when the group mean exceeds it (quality-seeking).
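A minimal sketch of the dual-regime behavior, assuming graded rewards in [-1, 1] and a passing score of 0.0. Both the reward scale and the threshold value are hypothetical choices for illustration, as are the function names.

```python
from statistics import mean

def corpo_baseline(rewards, r_min_correct):
    """Group-mean baseline clipped from below at the correctness threshold."""
    return max(mean(rewards), r_min_correct)

def corpo_advantages(rewards, r_min_correct):
    """Reward minus the clipped baseline for every sample in the group."""
    b = corpo_baseline(rewards, r_min_correct)
    return [r - b for r in rewards]

THRESHOLD = 0.0  # assumed passing score, not taken from the paper

# Correctness-seeking regime: group mean (-0.85) is below the threshold,
# so the baseline is the threshold and every failure gets a negative advantage.
poor_group = [-1.0, -0.8, -0.9, -0.7]

# Quality-seeking regime: group mean (0.5) is above the threshold,
# so the baseline is the group mean, exactly as in mean-centered GRPO.
good_group = [0.2, 0.6, 0.4, 0.8]

print(corpo_advantages(poor_group, THRESHOLD))
print(corpo_advantages(good_group, THRESHOLD))
```

In the second group every solution passes, so the relative baseline only ranks correct solutions against each other.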

Architecture

Conceptual illustration of the CoRPO baseline clipping mechanism compared to GRPO

Evaluation Highlights

CoRPO outperforms GRPO on out-of-domain (OOD) tasks, indicating better generalization of reasoning patterns.
Analysis of training dynamics shows CoRPO mitigates 'distribution sharpening,' maintaining higher entropy/exploration compared to GRPO's rapid collapse.
Demonstrates cross-domain transfer: CoRPO models trained on code improve on math tasks, whereas GRPO models often fail to transfer effectively.

Breakthrough Assessment

7/10

Identifies a subtle but critical flaw in the widely used GRPO baseline for reasoning tasks. The solution is mathematically simple, theoretically grounded, and addresses the specific issue of ordinal rewards in RLVR.

⚙️ Technical Details

Problem Definition

Setting: Reinforcement Learning from Verifiable Rewards (RLVR) for Large Language Models using Policy Gradients

Inputs: Prompt x (e.g., math problem or coding specification)

Outputs: Generated trajectory y (reasoning chain and final answer)

Pipeline Flow

Policy Sampling (Generate group of outputs)
Reward Computation (External Verifier/Judge)
Advantage Estimation (CoRPO Baseline Calculation)
Policy Update (Gradient Ascent)
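The four stages can be sketched end to end on a toy problem. This is not the paper's training setup: the "policy" here is a softmax over three candidate answers rather than an LLM, the verifier is a fixed lookup table, and the update is a plain REINFORCE-style step without PPO-style clipping or a KL term. All constants are hypothetical.

```python
import math
import random

REWARDS = [-1.0, -0.5, 1.0]   # hypothetical verifier scores for three candidate answers
R_MIN_CORRECT = 0.0           # assumed passing score; only answer 2 counts as correct
G, LR, STEPS = 8, 0.5, 200    # illustrative group size, learning rate, iterations

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def train(seed=0):
    rng = random.Random(seed)
    logits = [0.0, 0.0, 0.0]
    for _ in range(STEPS):
        probs = softmax(logits)
        # 1. Policy sampling: draw a group of G outputs for the same prompt.
        group = rng.choices(range(len(logits)), weights=probs, k=G)
        # 2. Reward computation: look up the verifier's score for each output.
        rewards = [REWARDS[a] for a in group]
        # 3. Advantage estimation: CoRPO baseline = max(group mean, threshold).
        baseline = max(sum(rewards) / G, R_MIN_CORRECT)
        advantages = [r - baseline for r in rewards]
        # 4. Policy update: gradient ascent on the mean of A_i * log pi(y_i).
        grad = [0.0] * len(logits)
        for a, adv in zip(group, advantages):
            for j in range(len(logits)):
                grad[j] += adv * ((1.0 if j == a else 0.0) - probs[j]) / G
        logits = [l + LR * g for l, g in zip(logits, grad)]
    return softmax(logits)

final_probs = train()
print([round(p, 3) for p in final_probs])
```

In this toy setting an all-failure group still produces a learning signal, because every sampled failure has a negative advantage. A mean-centered baseline would instead give some of those failures positive advantages.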

System Modules

Policy Model

Generate G trajectories for a given prompt

Model or implementation: Large Language Model (implied, specific architecture not detailed in text)

Reward Function

Assign ordinal/graded rewards to trajectories based on correctness

Model or implementation: Deterministic verifier or LLM-as-a-judge

Advantage Estimator

Compute advantages using the CoRPO baseline logic

Model or implementation: Analytical Function

Novel Architectural Elements

Adaptive baseline mechanism that switches between static correctness thresholding (when group mean is low) and relative group mean (when group mean is high)

Modeling

Base Model: Large Language Model (specific model family/size not explicitly reported in this text)

Training Method: Correctness-Relative Policy Optimization (CoRPO)

Objective Functions:

Purpose: Estimate the advantage of a trajectory while preventing positive reinforcement of incorrect answers.

Formally: A(y_i) = R(y_i) - max(b_mean, R_min_correct), where b_mean is the average reward of the group.
Purpose: Maximize expected advantage.

Formally: standard policy gradient objective using the CoRPO advantage.
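Written out, the two objectives might look as follows. The advantage is taken from the formula above; the symbol b_CoRPO and the REINFORCE-style form of J(θ) are assumptions for illustration, since the text only says a standard policy gradient objective is used, and the advantage is treated as a constant with respect to θ.

```latex
\begin{align*}
b_{\text{mean}} &= \frac{1}{G} \sum_{j=1}^{G} R(y_j) \\
b_{\text{CoRPO}} &= \max\bigl(b_{\text{mean}},\, R_{\text{min\_correct}}\bigr) \\
A(y_i) &= R(y_i) - b_{\text{CoRPO}} \\
J(\theta) &= \mathbb{E}_{x,\; y_{1:G} \sim \pi_\theta(\cdot \mid x)}
  \left[ \frac{1}{G} \sum_{i=1}^{G} A(y_i)\, \log \pi_\theta(y_i \mid x) \right]
\end{align*}
```

Two consequences follow directly. Since b_CoRPO ≥ R_min_correct, any trajectory with R(y_i) < R_min_correct has A(y_i) < 0. And when b_mean ≥ R_min_correct, the advantage reduces to the mean-centered form R(y_i) - b_mean.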

Key Hyperparameters:

group_size_G: 4-16 (typical range mentioned)
R_min_correct: Task-dependent correctness threshold

Compute: Preserves GRPO efficiency (no value function/critic network required)

Comparison to Prior Work

vs. GRPO: Adds a max() operation to the baseline to enforce correctness constraints
vs. PPO: Eliminates the critic model while retaining stability via the group baseline
vs. NSR (Negative Sample Reinforcement): CoRPO adaptively switches between penalizing failure (like NSR) and refining success (like GRPO), rather than only penalizing failure

Limitations

Relies on the definition of a valid 'correctness threshold' (R_min_correct), which might be hard to tune for purely subjective tasks
Introduces advantage underestimation in certain regimes (when the group mean is high but still below the threshold), though the authors argue this is a 'protective bias'
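The underestimation can be seen on a small hypothetical group in which one solution passes and the group mean stays below the threshold. The rewards, the threshold of 0.0, and the helper name `advantage` are assumptions for illustration.

```python
from statistics import mean

def advantage(reward, group, r_min_correct=None):
    """Mean-centered advantage if r_min_correct is None, else the CoRPO-style one."""
    baseline = mean(group)
    if r_min_correct is not None:
        baseline = max(baseline, r_min_correct)
    return reward - baseline

# One passing solution (0.4) among three failures; group mean is -0.4.
group = [-1.0, -0.5, -0.5, 0.4]

grpo_adv = advantage(0.4, group)                       # 0.4 - (-0.4) = 0.8
corpo_adv = advantage(0.4, group, r_min_correct=0.0)   # 0.4 - 0.0    = 0.4

print(f"GRPO advantage:  {grpo_adv:.2f}")
print(f"CoRPO advantage: {corpo_adv:.2f}")
```

The passing solution is still reinforced under the clipped baseline, but with half the weight it would receive from the plain group mean.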

Reproducibility

The method is clearly defined mathematically (Eq. 12). No code URL is provided in the text. Model architectures and hyperparameters for the experiments are only implied by context (e.g., the OpenAI, 2025 citations); exact setup details such as learning rates are not in this excerpt.

📊 Experiments & Results

Evaluation Setup

RL training on reasoning tasks with ordinal rewards

Benchmarks:

Coding Tasks (Code Generation)
Mathematical Reasoning (Math Problem Solving)

Metrics:

Out-of-Domain (OOD) Generalization performance
Training Dynamics (Distribution Entropy/Sharpening)
Statistical methodology: Not explicitly reported in the paper
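The summary does not say how distribution entropy is measured. One possible proxy, shown here purely as an assumption, is the Shannon entropy of the empirical distribution over distinct sampled solutions; token-level policy entropy would be another option.

```python
import math
from collections import Counter

def empirical_entropy(samples):
    """Shannon entropy (in nats) of the empirical distribution over distinct samples."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

# Hypothetical groups of sampled solutions, identified by a label.
diverse = ["a", "b", "c", "d"]      # four distinct solutions
sharpened = ["a", "a", "a", "b"]    # mass concentrated on one solution

print(f"diverse:   {empirical_entropy(diverse):.3f}")
print(f"sharpened: {empirical_entropy(sharpened):.3f}")
```

A falling value over training steps would indicate that the policy is concentrating on fewer solutions.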

Main Takeaways

CoRPO prevents the reinforcement of incorrect behaviors that GRPO inadvertently encourages in low-performing groups.
The method acts as a regularizer against 'distribution sharpening', preventing the model from collapsing into a narrow set of solutions too early.
Cross-domain experiments suggest CoRPO learns more robust, transferable reasoning capabilities (e.g., code training benefiting math performance) compared to GRPO's task-specific overfitting.

📚 Prerequisite Knowledge

Prerequisites

Policy Gradient methods (REINFORCE, PPO)
Baselines in Reinforcement Learning (Variance reduction)
Group-Relative Policy Optimization (GRPO)

Key Terms

GRPO: Group-Relative Policy Optimization—an RL algorithm that estimates advantages by comparing a sample's reward to the mean reward of a group of samples from the same prompt, eliminating the need for a critic model

RLVR: Reinforcement Learning from Verifiable Rewards—training LLMs on tasks where correctness can be automatically checked (e.g., math, code) rather than relying on human preference models

Ordinal Rewards: Rewards that have a ranked order/graded scale (e.g., 0.0 to 1.0) rather than just binary pass/fail, often enabling partial credit

CoRPO: Correctness-Relative Policy Optimization—the proposed method which clips the GRPO baseline at a minimum correctness threshold to prevent reinforcing failures

Distribution Sharpening: The tendency of a policy to concentrate probability mass on a narrow set of high-reward solutions, reducing exploration and diversity

Baseline Clipping: The mechanism of enforcing a minimum value for the baseline (in this case, the correctness threshold) to ensure advantages for incorrect samples are never positive

OOD: Out-of-Domain (also expanded as Out-of-Distribution); evaluating the model on tasks or domains not seen during training to test generalization

Pass@k: A metric measuring the probability that at least one of k generated solutions is correct
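Pass@k is often estimated from n samples per problem, of which c are correct, using the combinatorial estimator 1 - C(n-c, k) / C(n, k). The sketch below implements that estimator; whether the paper computes Pass@k this way is not stated in this summary.

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimate pass@k from n samples with c correct: 1 - C(n-c, k) / C(n, k)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when k > n - c, i.e. every size-k subset contains a correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical counts: 10 samples for one problem, 3 of them correct.
print(f"pass@1 = {pass_at_k(n=10, c=3, k=1):.3f}")
print(f"pass@5 = {pass_at_k(n=10, c=3, k=5):.3f}")
```

For a benchmark, the per-problem estimates are typically averaged over all problems.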