Co-rewarding: Stable Self-supervised RL for Eliciting Reasoning in Large Language Models

📝 Paper Summary

Reinforcement Learning for Reasoning, Self-Supervised Learning, LLM Post-training

Co-rewarding stabilizes self-supervised reinforcement learning by deriving rewards from cross-view supervision—either through data augmentation or a slowly updating teacher model—rather than single-view self-consistency.

Core Problem

Self-rewarding RL methods often suffer from 'training collapse', in which the model hacks the reward by converging on trivial answers that are self-consistent but incorrect (the 'self-consistent illusion').

Why it matters:

Reliance on ground-truth labels (RLVR) scales poorly to complex reasoning tasks where labeled data is scarce
Current label-free methods (entropy, self-consistency) encourage the model to reduce uncertainty without ensuring correctness, leading to repetitive or delusional outputs
Collapse limits the scalability of self-supervised reasoning elicitation, preventing models from improving beyond their initial capabilities

Concrete Example: In consensus-based rewarding, a model might converge to generating the same incorrect answer for a math problem every time. Because the answers are consistent (high consensus), the model receives high rewards, reinforcing the error and causing the policy to collapse into this incorrect local optimum.
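
The failure mode above can be illustrated with a minimal sketch of a single-view consensus reward. The function name `self_consensus_rewards`, the +1/-1 reward values, and the toy answers are assumptions made for illustration, not the paper's code. The point is only that such a reward measures agreement among the policy's own rollouts and never looks at correctness.

```python
from collections import Counter

def self_consensus_rewards(answers):
    """Single-view consensus reward (hypothetical helper).

    Each rollout gets +1 if it agrees with the majority answer of the
    same group of rollouts, else -1. The +1/-1 values are an assumption.
    """
    majority, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority else -1.0 for a in answers]

# Toy case: the true answer is "42", but the policy has collapsed onto "17".
true_answer = "42"
collapsed = ["17"] * 8

rewards = self_consensus_rewards(collapsed)
mean_reward = sum(rewards) / len(rewards)
accuracy = sum(a == true_answer for a in collapsed) / len(collapsed)

print(mean_reward)  # every rollout agrees with the majority
print(accuracy)     # yet none of them is correct
```

In this toy case the reward is maximal while accuracy is zero, which is the self-consistent illusion in miniature.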

Key Novelty

Co-rewarding Framework (Cross-view Supervision)

Replaces single-view self-consistency with 'invariance' across views to verify reasoning validity
Co-rewarding-I (Data-side): Checks if reasoning remains consistent when the question is rephrased (analogy-invariance)
Co-rewarding-II (Model-side): Checks if the current policy's output matches a 'teacher' reference model that updates slowly via EMA (temporal invariance)

Architecture

The Co-rewarding framework illustrating two instantiations: Co-rewarding-I (Data-side) and Co-rewarding-II (Model-side).

Evaluation Highlights

Achieves 94.01% Pass@1 on GSM8K with Qwen3-8B-Base using Co-rewarding-II, surpassing the Ground-Truth Reward baseline
Outperforms self-rewarding baselines by +7.49% on average on Llama-3.2-3B-Instruct across multiple reasoning benchmarks
Co-rewarding-I delivers +4.42% average relative gain over best baselines on MATH benchmarks

Breakthrough Assessment

8/10

Significant because it demonstrates that self-supervised signals can match or exceed ground-truth supervision in specific reasoning tasks by effectively solving the stability/collapse problem.

⚙️ Technical Details

Problem Definition

Setting: Self-supervised Reinforcement Learning (RL) for reasoning tasks without ground-truth labels

Inputs: Reasoning question x

Outputs: Reasoning trace and final answer y

Pipeline Flow

Input Processing: Question x → [Rephrasing (optional)]
Generation: Policy π generates rollouts y
Reference/Teacher Generation: Teacher π_ref generates rollouts (for Co-rewarding-II/III)
Reward Computation: Cross-view consistency check
Optimization: GRPO update

System Modules

Rephraser

Generates semantically equivalent but superficially different versions of the input question (used in Co-rewarding-I and III)

Model or implementation: Qwen3-32B

Policy Model (Student)

Generates reasoning traces and answers; the model being optimized

Model or implementation: Target LLM (e.g., Qwen3-8B-Base, Llama-3.2-3B-Instruct)

Teacher Model (Reference)

Provides stable pseudo-labels to guide the student; updates slowly to prevent chasing unstable policy outputs

Model or implementation: Copy of Policy Model, updated via EMA

Novel Architectural Elements

Dual-path rewarding architecture where supervision comes from a separate view (rephrased data or EMA teacher) rather than the policy's own current outputs
Integration of EMA-based teacher (typically used in vision/representation learning) directly into the GRPO reward loop
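
As a rough illustration of the EMA teacher, the sketch below applies the standard exponential-moving-average rule to a toy set of weights. Plain Python floats in a dict stand in for model tensors, and the name `ema_update` is a hypothetical helper. How the released code applies the update (per step, per parameter group, and so on) is not described in this summary.

```python
def ema_update(teacher, student, decay):
    """Return updated teacher weights.

    Standard EMA rule: teacher <- decay * teacher + (1 - decay) * student.
    `teacher` and `student` are dicts mapping parameter names to floats,
    standing in for real model tensors (an assumption for illustration).
    """
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }

teacher = {"w": 1.0, "b": 0.0}
student = {"w": 2.0, "b": 1.0}

# With a decay close to 1, the teacher moves only slightly toward the student.
teacher = ema_update(teacher, student, decay=0.99)
print(teacher)
```

A decay near 1 keeps the teacher close to its previous state, which is what lets it provide pseudo-labels that lag behind the fast-moving policy.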

Modeling

Base Model: Qwen2.5-3B/7B, Qwen3-1.7B/4B/8B-Base, Llama-3.2-3B-Instruct

Training Method: Group Relative Policy Optimization (GRPO)

Objective Functions:

Purpose: Maximize expected reward while staying close to the reference policy (GRPO objective).

Formally: Maximize E[min(ratio * A, clip(ratio, 1-eps, 1+eps) * A)] - beta * KL(pi || pi_ref), where ratio = pi(y|x) / pi_old(y|x) and A is the group-relative advantage

Purpose: Co-rewarding-I Reward (Data-side).

Formally: Reward = 1 if answer matches majority vote of rephrased question's outputs, else -1 (or softened)

Purpose: Co-rewarding-II Reward (Model-side).

Formally: Reward = 1 if answer matches majority vote of teacher (EMA) model's outputs, else -1
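
The two reward rules above share one shape: a pseudo-label is formed by majority vote over rollouts from the other view, and the policy's rollouts are scored against it. The sketch below follows that description under stated assumptions. The helper names are hypothetical, answers are compared by exact string match, and ties in the vote go to the answer seen first. The softened variant is not shown.

```python
from collections import Counter

def majority_vote(answers):
    """Most common answer; ties go to the answer encountered first."""
    return Counter(answers).most_common(1)[0][0]

def cross_view_rewards(policy_answers, other_view_answers):
    """Score policy rollouts against a pseudo-label from another view.

    Co-rewarding-I: `other_view_answers` come from the rephrased question.
    Co-rewarding-II: `other_view_answers` come from the EMA teacher.
    Returns +1 for a match with the pseudo-label, -1 otherwise.
    """
    pseudo_label = majority_vote(other_view_answers)
    return [1.0 if a == pseudo_label else -1.0 for a in policy_answers]

# Toy example: the original-question rollouts mostly say "17",
# while the other view mostly says "42".
original_view = ["17", "17", "42", "17"]
other_view = ["42", "42", "42", "17"]

rewards = cross_view_rewards(original_view, other_view)
print(rewards)  # [-1.0, -1.0, 1.0, -1.0]
```

Unlike a single-view consensus reward, the majority among the policy's own rollouts ("17") earns no credit here unless the other view agrees with it.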

Training Data:

MATH (7,500 questions)
DAPO-14k (14.1k questions)
OpenRS (7,000 questions)

Key Hyperparameters:

global_batch_size: 128
rollouts_per_question: 8
learning_rate: 3e-6
optimizer: AdamW
ema_decay_start: 0.99
ema_decay_end: 0.9999
ema_schedule: cosine annealing
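
The EMA decay values listed above suggest a schedule that moves from 0.99 to 0.9999 over training. The summary does not give the exact formula, so the sketch below assumes the common half-cosine interpolation. The function name `ema_decay` and the `total_steps` argument are hypothetical.

```python
import math

def ema_decay(step, total_steps, start=0.99, end=0.9999):
    """Cosine-annealed EMA decay (assumed form, not taken from the paper).

    Returns `start` at step 0 and `end` at `total_steps`, following a
    half-cosine curve in between. `total_steps` must be positive.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    return end - (end - start) * 0.5 * (1.0 + math.cos(math.pi * progress))

for step in (0, 50, 100):
    print(step, ema_decay(step, total_steps=100))
```

Under this assumed form the teacher tracks the student more closely early in training and becomes progressively more inert as the decay approaches 0.9999.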

Compute: 4x H100-80GB GPUs

Comparison to Prior Work

vs. Majority Voting: Co-rewarding uses rephrased questions (I) or a temporal teacher (II) to generate the consensus, decoupling the reward signal from the immediate instability of the current policy
vs. Entropy/Self-Certainty: Co-rewarding enforces correctness through cross-view consistency rather than just minimizing uncertainty, preventing the model from collapsing to confident but wrong answers
vs. RLVR with GT: Co-rewarding does not require ground-truth labels, yet achieves comparable performance and, on GSM8K, surpasses the ground-truth reward baseline

Limitations

Dependency on rephrasing quality (Co-rewarding-I): Poor paraphrases may alter problem semantics
Computational cost: Co-rewarding-II requires maintaining a teacher model and generating extra rollouts
Generalization limits: While effective on math/code, applicability to open-ended creative tasks is less clear

Reproducibility

Code: https://github.com/tmlr-group/Co-rewarding

Framework implemented on VeRL. Qwen3-32B is used for rephrasing. The EMA schedule and hyperparameters are explicitly detailed.

📊 Experiments & Results

Evaluation Setup

Post-training on math datasets (MATH, DAPO-14k, OpenRS), with evaluation on math, code, and general instruction-following benchmarks.

Benchmarks:

MATH500 (Mathematical reasoning)
GSM8K (Grade school math)
AMC / AIME24 (Competition math)
LiveCodeBench / CRUX (Code generation)
MMLU-Pro / IFEval (General multi-task & Instruction following)

Metrics:

Pass@1
Statistical methodology: Not explicitly reported in the paper

Key Results

Co-rewarding outperforms self-rewarding baselines and approaches ground-truth performance on mathematical reasoning benchmarks.

Benchmark	Metric	Baseline	This Paper	Δ
Average across 4 Math Benchmarks (Table 1)	Pass@1	Not reported in the paper	Not reported in the paper	+4.42%
Average across 4 Math Benchmarks (Table 2)	Pass@1	Not reported in the paper	Not reported in the paper	+12.90%
Llama-3.2-3B-Instruct (Multiple Benchmarks)	Pass@1	Not reported in the paper	Not reported in the paper	+7.49%
GSM8K	Pass@1	Not reported in the paper	94.01	Not reported in the paper
Average Performance	Relative Gain	Not reported in the paper	Not reported in the paper	+1.72%

Experiment Figures

Training curves illustrating the collapse phenomenon in baseline methods vs. stability in Co-rewarding.

Main Takeaways

Co-rewarding mitigates training collapse: Unlike baselines that plateau or degrade due to reward hacking, Co-rewarding maintains stable improvements.
Cross-view supervision is effective: Both data-side analogy (Co-rewarding-I) and model-side teacher distillation (Co-rewarding-II) provide robust signals.
Can surpass Ground Truth: In some easier tasks like GSM8K, self-generated signals allow for better exploration than strict GT supervision.
Generalization: Improvements in math reasoning transfer to code generation (CRUX) without degrading general instruction following capabilities (IFEval, MMLU-Pro).

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning with Verifiable Rewards (RLVR)
Group Relative Policy Optimization (GRPO)
Self-supervised learning concepts (Contrastive Learning, Self-distillation)

Key Terms

RLVR: Reinforcement Learning with Verifiable Rewards—RL training using objective signals like correct math answers or passing code tests

GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing a sample's reward to the average reward of a group of samples for the same input
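
A minimal sketch of the group-relative advantage follows. Subtracting the group mean matches the definition above. Dividing by the group standard deviation follows the common GRPO formulation and is an assumption here, as are the function name and the `eps` constant.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Advantage of each rollout relative to its group.

    Computes (reward - group mean) / (group std + eps). The std
    normalization and `eps` are assumptions based on the usual GRPO
    formulation, not details stated in this summary.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Rewards for four rollouts of the same question.
advantages = group_relative_advantages([1.0, -1.0, -1.0, 1.0])
print(advantages)
```

Rollouts that beat the group average receive a positive advantage and the rest a negative one, so no separate value network is needed to form a baseline.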

EMA: Exponential Moving Average—a technique to update model weights slowly over time, creating a stable 'teacher' model from the 'student' policy

Training Collapse: A failure mode where the model optimizes the reward metric (e.g., consistency) via trivial solutions (e.g., repetition) rather than solving the task

Reward Hacking: When an RL agent exploits flaws in the reward function to get high scores without achieving the intended goal

Pass@1: The percentage of problems where the model's first generated answer is correct
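
As a small worked example, Pass@1 can be computed as below. Exact string matching against the reference answer is an assumption made for brevity, since real evaluation harnesses typically normalize answers first. The name `pass_at_1` is hypothetical.

```python
def pass_at_1(first_answers, references):
    """Percentage of problems whose first generated answer is correct.

    Uses exact string match, which is a simplifying assumption.
    """
    correct = sum(a == ref for a, ref in zip(first_answers, references))
    return 100.0 * correct / len(references)

# Two of three first answers match their references.
score = pass_at_1(["42", "7", "3"], ["42", "8", "3"])
print(round(score, 2))  # 66.67
```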

Self-consistent illusion: The phenomenon where a model becomes confidently consistent about an incorrect answer, fooling self-consistency reward mechanisms