Evolving Language Models without Labels: Majority Drives Selection, Novelty Promotes Variation

📝 Paper Summary

Label-free Self-Improvement Reinforcement Learning for Reasoning

Evol-RL prevents diversity collapse in label-free self-training by combining a stability reward (majority vote) with a novelty reward (reasoning path embedding similarity) to mimic evolutionary selection and variation.

Core Problem

Label-free self-improvement relying on internal consistency signals (like majority voting) drives models toward over-confident, narrow solutions, causing 'entropy collapse' where solution diversity and reasoning complexity degrade.

Why it matters:

Real-world deployment requires learning from unlabeled data where ground truth verifiers are unavailable
Current majority-driven methods actively punish correct but non-mainstream reasoning, reducing the model's ability to explore and causing performance on multi-attempt metrics (pass@n) to drop over time
Models trained this way exhibit shorter, less complex reasoning chains, effectively memorizing simple paths rather than learning robust reasoning

Concrete Example: Under traditional Test-Time Reinforcement Learning (TTRL), a model trained on math problems might slightly improve its single-attempt accuracy (pass@1) but see its pass@16 drop significantly because it converges to a single, repetitive solution path, losing the ability to find alternative correct answers.

Key Novelty

Evol-RL (Evolution-Oriented Label-free RL)

Applies evolutionary principles to RL: uses majority voting as 'Selection' to anchor correctness and a new intrinsic reward as 'Variation' to drive exploration
Calculates a 'novelty score' for each response from how semantically distant its reasoning trace is from the other responses sampled for the same prompt (measured via embedding similarity), rewarding unique reasoning paths even if they yield the same final answer

Architecture

The Evol-RL framework: a policy generates a group of responses, which are scored by a majority vote (Selection) and a novelty estimator (Variation) based on embedding similarity, then updated via GRPO.

Evaluation Highlights

More than triples pass@1 accuracy on AIME25 (4.6% → 16.4%) and doubles pass@16 (18.5% → 37.9%) with Qwen3-4B-Base trained on label-free AIME24
Outperforms the majority-only baseline (TTRL) by 24.2 points on AIME24 pass@16 with the 4B model, reversing the typical diversity decline
Achieves strong out-of-domain generalization: 4B model trained on simple MATH-500 matches the AIME24 performance of models trained directly on AIME24

Breakthrough Assessment

8/10

Addresses the critical 'entropy collapse' failure mode in self-training with a theoretically grounded, bio-inspired solution that delivers large empirical gains on hard reasoning benchmarks (for example, AIME25 pass@1 rising from 4.6% to 16.4%).

⚙️ Technical Details

Problem Definition

Setting: Label-free Reinforcement Learning for Reasoning

Inputs: Natural language problem statement q (without ground truth label)

Outputs: Reasoning chain and final answer o

Pipeline Flow

Policy Sampling: Generate G responses for prompt q
Reward Calculation: Selection (Majority Vote) + Variation (Novelty Score)
Optimization: GRPO update with asymmetric clipping
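
The Selection step in the pipeline above can be illustrated with a small majority-vote routine. This is a minimal sketch, not the paper's implementation: the function name `majority_vote` and the convention that `None` marks an unparseable answer are assumptions made for illustration.

```python
from collections import Counter

def majority_vote(answers):
    """Return (majority_answer, in_majority flags) for one group of final answers.

    Assumption: `None` marks a response whose final answer could not be parsed.
    """
    valid = [a for a in answers if a is not None]
    if not valid:
        # No parseable answer in the group: no pseudo-label can be formed.
        return None, [False] * len(answers)
    majority, _ = Counter(valid).most_common(1)[0]
    return majority, [a == majority for a in answers]

# G = 6 sampled responses for one prompt (toy data).
answers = ["42", "42", "17", None, "42", "17"]
majority, flags = majority_vote(answers)
print(majority, flags)
```

The majority answer acts as a pseudo-label, so no ground truth is needed; the flags feed the reward calculation in the next step.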

System Modules

Policy Model

Generate a group of G reasoning paths and answers for a given prompt

Model or implementation: Qwen3-4B-Base

Reward Calculator

Compute rewards based on validity, majority agreement, and semantic novelty

Model or implementation: Deterministic algorithm + Embedding Model

Optimizer

Update policy weights using relative advantages

Model or implementation: GRPO (Group Relative Policy Optimization)
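
As a rough illustration of "relative advantages", the sketch below standardizes each response's reward against the mean and standard deviation of its own group, which is how GRPO is commonly described. The function name and the `eps` stabilizer are assumptions, not details taken from the paper.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    # eps (an assumed stabilizer) avoids division by zero when all rewards match.
    return [(r - mean) / (std + eps) for r in rewards]

advantages = group_relative_advantages([1.0, 1.0, -1.0, -1.0])
print(advantages)
```

Because the baseline is the group mean, no learned value function is required, and a group whose rewards are all identical yields zero advantage for every response.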

Novel Architectural Elements

Integration of semantic novelty (embedding similarity) directly into the reward function to penalize redundancy within the generated group
Three-tier reward banding: Invalid < Minority < Majority (with novelty refining the score within Majority/Minority bands)
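
The three-tier banding can be sketched as a mapping from (validity, majority membership, novelty) to a scalar reward. The numeric band limits below are hypothetical placeholders chosen only so that Invalid < Minority < Majority always holds; the summary does not state the values used in the paper.

```python
INVALID_REWARD = -1.0           # hypothetical value
MINORITY_BAND = (-0.75, -0.25)  # hypothetical bounds
MAJORITY_BAND = (0.5, 1.0)      # hypothetical bounds

def banded_reward(is_valid, in_majority, novelty):
    """Map a response to a reward; novelty in [0, 1] only moves it within its band."""
    if not is_valid:
        return INVALID_REWARD
    lo, hi = MAJORITY_BAND if in_majority else MINORITY_BAND
    novelty = min(max(novelty, 0.0), 1.0)  # clamp defensively
    return lo + (hi - lo) * novelty

print(banded_reward(False, False, 1.0))  # invalid response
print(banded_reward(True, False, 1.0))   # most novel minority response
print(banded_reward(True, True, 0.0))    # least novel majority response
```

With non-overlapping bands, even the most novel minority response scores below the least novel majority response, so novelty refines the ranking without overriding the majority signal.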

Modeling

Base Model: Qwen3-4B-Base

Training Method: Group Relative Policy Optimization (GRPO)

Objective Functions:

Purpose: Maximize expected reward relative to group average.
Formally: GRPO clipped surrogate objective, in which clipped probability ratios are weighted by group-relative advantages.

Purpose: Incentivize semantic diversity.
Formally: r_novelty = 1 - norm(alpha * mean_sim + beta * max_sim), where mean_sim and max_sim are the mean and maximum cosine similarity between a response's reasoning embedding and those of the other responses in its group.

Purpose: Maintain token-level diversity.
Formally: Token-level entropy regularization added to the total loss.
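
The novelty formula above can be made concrete with a short sketch. Several details are assumptions: `norm` is taken to be min-max normalization over the group, `alpha = beta = 0.5` are placeholder weights, embeddings are plain lists of floats with nonzero norm, and the group holds at least two responses.

```python
import math

def cosine(u, v):
    """Cosine similarity of two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def novelty_scores(embeddings, alpha=0.5, beta=0.5):
    """r_novelty = 1 - norm(alpha * mean_sim + beta * max_sim) for each response."""
    n = len(embeddings)
    if n < 2:
        raise ValueError("need at least two responses to compare")
    raw = []
    for i in range(n):
        sims = [cosine(embeddings[i], embeddings[j]) for j in range(n) if j != i]
        raw.append(alpha * sum(sims) / len(sims) + beta * max(sims))
    lo, hi = min(raw), max(raw)
    if hi - lo < 1e-12:
        # Assumed convention: if no response stands out, nobody earns a bonus.
        return [0.0] * n
    return [1.0 - (r - lo) / (hi - lo) for r in raw]

# Two near-duplicate reasoning traces and one distinct trace (toy 2-D embeddings).
scores = novelty_scores([[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]])
print(scores)
```

The mean term penalizes responses that resemble the group as a whole, while the max term penalizes a response that is a near-copy of even one other response.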

Adaptation: Full fine-tuning (implied by context of RL training)

Training Data:

MATH-TRAIN (standard large training set)
MATH-500 (small subset)
AIME24 (competition-level small set)

Key Hyperparameters:

clipping: Asymmetric (epsilon_high > epsilon_low)
optimization_algorithm: GRPO
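
To show what asymmetric clipping means in practice, here is a per-token sketch of a clipped surrogate term in which the upper bound is looser than the lower bound. The default values `eps_low=0.2` and `eps_high=0.28` are hypothetical; the summary only states that epsilon_high > epsilon_low.

```python
def clipped_surrogate(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """PPO-style clipped term with separate lower and upper clip ranges.

    ratio is pi_new(token) / pi_old(token); eps values are illustrative only.
    """
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    # Taking the minimum keeps the objective pessimistic, as in standard PPO.
    return min(ratio * advantage, clipped * advantage)

print(clipped_surrogate(1.5, 1.0))   # positive advantage, capped at the upper bound
print(clipped_surrogate(0.5, -1.0))  # negative advantage, capped at the lower bound
```

A larger upper bound lets the probability of a positively rewarded, initially unlikely token grow more per update, which is one way to leave room for exploration.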

Compute: Not reported in the paper

Comparison to Prior Work

vs. TTRL: Evol-RL adds a novelty reward term to penalize semantic redundancy, preventing the diversity collapse observed in TTRL.
vs. Entropy Minimization: Evol-RL explicitly encourages diversity (entropy) via novelty rewards rather than minimizing it.
vs. Quality-Diversity (MAP-Elites) [not cited in paper]: Evol-RL integrates diversity directly into the gradient-based update of a single policy via rewards, rather than maintaining an archive of diverse solutions.

Limitations

The correctness anchor is only as reliable as the majority vote; if the majority is wrong, the model may drift (though novelty helps escape local optima).
Computation of embeddings for every response in the group adds overhead compared to simple text-matching rewards (analysis in Appendix B.5).
Requires asymmetric clipping and entropy regularization as enablers on harder tasks; novelty reward alone is less effective on complex domains.

Reproducibility

The paper text does not state whether code is available. Detailed reward formulations and algorithm logic are described in Section 3. The base model (Qwen3-4B-Base) is publicly available.

📊 Experiments & Results

Evaluation Setup

Label-free training on math problem statements, evaluated on held-out math and reasoning benchmarks.

Benchmarks:

AIME24 (Competition Math)
AIME25 (Competition Math)
MATH-500 (General Math)
GPQA-Diamond (General Science/Reasoning)

Metrics:

pass@1
pass@16
Statistical methodology: Not explicitly reported in the paper

Key Results

Results on AIME25 using Qwen3-4B-Base trained on AIME24 (Label-Free) show Evol-RL dramatically improving over the TTRL baseline.

Benchmark	Metric	Baseline	This Paper	Δ
AIME25	pass@1	4.6	16.4	+11.8
AIME25	pass@16	18.5	37.9	+19.4
AIME24	pass@16	Not reported in the paper	Not reported in the paper	+24.2
GPQA-Diamond	pass@16	Not reported in the paper	Not reported in the paper	+15.0

Experiment Figures

Comparison of pass@1, pass@16, and response length trends during training for TTRL vs. Evol-RL.

Main Takeaways

Evol-RL reverses the 'pass@n' decline observed in TTRL, where models optimize for a single majority answer at the expense of diversity.
The method improves out-of-domain generalization: models trained on simple MATH-500 transfer effectively to hard AIME tasks, unlike baselines.
Novelty rewards are most critical on easier datasets to prevent early lock-in, while entropy regularization is essential for harder tasks to enable initial exploration.
Evol-RL maintains response length and complexity, whereas majority-only baselines tend to output shorter, simpler responses over time.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (PPO/GRPO)
Self-Consistency / Majority Voting
Entropy Regularization
Vector Embeddings and Cosine Similarity

Key Terms

GRPO: Group Relative Policy Optimization—a policy gradient algorithm that evaluates responses relative to a group of peers rather than using a learned value function

RLVR: Reinforcement Learning from Verifiable Rewards—using RL to train models on tasks where correctness can be automatically checked (e.g., math, code)

TTRL: Test-Time Reinforcement Learning—used here as a baseline referring to iterative self-improvement using majority voting on unlabeled data

pass@n: A metric measuring the probability that at least one correct solution exists among n generated samples

entropy collapse: A phenomenon where a model's probability distribution concentrates on a single output or narrow mode, losing the diversity needed for exploration

Reasoning Trace: The step-by-step logical text generated by the model before the final answer
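
For the pass@n definition above, a widely used unbiased estimator computes the metric from n samples of which c are correct: pass@k = 1 - C(n-c, k) / C(n, k). The sketch below implements that estimator; whether the paper computes pass@16 this way is not stated in the summary, so treat it as an illustration of the metric rather than the authors' evaluation code.

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimate P(at least one of k samples is correct) from n samples with c correct.

    Requires 0 <= c <= n and 1 <= k <= n.
    """
    if n - c < k:
        # Fewer than k incorrect samples: every k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(16, 1, 1))   # one correct sample out of 16, single attempt
print(pass_at_k(16, 1, 16))  # same samples, 16 attempts
```

The gap between the two calls shows why pass@1 and pass@16 can move in opposite directions: a policy that collapses onto one answer can raise its single-attempt accuracy while losing the rare correct alternatives that multi-attempt metrics reward.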