
Right Question is Already Half the Answer: Fully Unsupervised LLM Reasoning Incentivization

Qingyang Zhang, Haitao Wu, Changqing Zhang, Peilin Zhao, Yatao Bian
Tianjin University, Tencent AI Lab, National University of Singapore
arXiv.org (2025)
Reasoning · RL · Factuality

📝 Paper Summary

Unsupervised Learning · Reinforcement Learning (RL) · Mathematical Reasoning
EMPO improves LLM reasoning capabilities without any supervision by using reinforcement learning to minimize the semantic entropy of generated answers to unlabeled questions.
Core Problem
Enhancing LLM reasoning typically requires expensive supervised data (labeled traces, golden answers) or reward models, limiting scalability.
Why it matters:
  • Human annotation for complex reasoning tasks is time-consuming and costly
  • Existing self-supervised methods like self-consistency often suffer from limited performance gains or model collapse
  • Methods relying on ground-truth verifiers cannot generalize to open-ended tasks where answers are not deterministic
Concrete Example: When a base model answers a complex math question, it might generate five diverse, incorrect reasoning paths (high entropy). Current SFT methods require a human to write the correct path to fix this. EMPO instead penalizes the model for having high semantic uncertainty on unlabeled questions, forcing it to converge on consistent reasoning paths using its own latent capabilities.
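The semantic uncertainty described above can be made concrete. A minimal sketch of semantic entropy, assuming answers are grouped into meaning-equivalent clusters and entropy is computed over the empirical cluster distribution (the `equivalent` predicate is a hypothetical stand-in for the paper's semantic clustering, e.g. normalized exact match or an NLI model):

```python
import math

def semantic_entropy(answers, equivalent):
    """Estimate semantic entropy over sampled answers to one question.

    answers:    list of answer strings sampled from the model.
    equivalent: predicate deciding whether two answers mean the same
                thing (stand-in for semantic clustering).
    """
    clusters = []  # each cluster holds semantically equivalent answers
    for a in answers:
        for c in clusters:
            if equivalent(a, c[0]):
                c.append(a)
                break
        else:
            clusters.append([a])
    n = len(answers)
    # entropy over the empirical distribution of semantic clusters
    return -sum((len(c) / n) * math.log(len(c) / n) for c in clusters)
```

Five samples that all land in one semantic cluster give zero entropy (consistent reasoning); five mutually inconsistent answers give maximal entropy, which is what EMPO penalizes.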
Key Novelty
Entropy-Minimized Policy Optimization (EMPO)
  • Uses Semantic Entropy as an intrinsic reward signal for Reinforcement Learning (RL), removing the need for external verifiers or golden answers
  • Optimizes the model to favor reasoning traces that yield semantically consistent answers across multiple samples
  • Employs an entropy-thresholding mechanism to filter out questions whose uncertainty is overly high (unreliable) or overly low (trivial)
Evaluation Highlights
  • +17.4% accuracy improvement on mathematical benchmarks using Qwen2.5-Math-7B Base (30.7% → 48.1%) without any supervised signals
  • +18.0% accuracy improvement on MMLU-Pro using Qwen2.5-7B Base (32.1% → 50.1%)
  • Demonstrates that semantic entropy has a strong negative correlation with model accuracy, validating it as a robust unsupervised proxy for correctness
Breakthrough Assessment
8/10
Significant unsupervised gains (+17-18%) on hard reasoning benchmarks. Proposes a principled way to do RL without ground-truth verifiers, addressing a major bottleneck in scaling post-training.