
Right Question is Already Half the Answer: Fully Unsupervised LLM Reasoning Incentivization

Qingyang Zhang, Haitao Wu, Changqing Zhang, Peilin Zhao, Yatao Bian
Tianjin University, Tencent AI Lab, National University of Singapore
arXiv.org (2025)
Reasoning · RL · Factuality

📝 Paper Summary

Unsupervised Learning · Reinforcement Learning (RL) · Mathematical Reasoning
EMPO improves LLM reasoning capabilities without any supervision by using reinforcement learning to minimize the semantic entropy of generated answers to unlabeled questions.
Core Problem
Enhancing LLM reasoning typically requires expensive supervised data (labeled traces, golden answers) or reward models, limiting scalability.
Why it matters:
  • Human annotation for complex reasoning tasks is time-consuming and costly
  • Existing self-supervised methods like self-consistency often suffer from limited performance gains or model collapse
  • Methods relying on ground-truth verifiers cannot generalize to open-ended tasks where answers are not deterministic
Concrete Example: When a base model answers a complex math question, it might generate five diverse, incorrect reasoning paths (high entropy). Current SFT methods require a human to write the correct path to fix this. EMPO instead penalizes the model for having high semantic uncertainty on unlabeled questions, forcing it to converge on consistent reasoning paths using its own latent capabilities.
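The semantic uncertainty described above can be made concrete. A minimal sketch of semantic entropy, assuming answers are grouped into meaning-equivalent clusters and entropy is computed over the empirical cluster distribution (the `equivalent` predicate is a hypothetical stand-in for the paper's semantic clustering, e.g. normalized exact match or an NLI model):

```python
import math

def semantic_entropy(answers, equivalent):
    """Estimate semantic entropy over sampled answers to one question.

    answers:    list of answer strings sampled from the model.
    equivalent: predicate deciding whether two answers mean the same
                thing (stand-in for semantic clustering).
    """
    clusters = []  # each cluster holds semantically equivalent answers
    for a in answers:
        for c in clusters:
            if equivalent(a, c[0]):
                c.append(a)
                break
        else:
            clusters.append([a])
    n = len(answers)
    # entropy over the empirical distribution of semantic clusters
    return -sum((len(c) / n) * math.log(len(c) / n) for c in clusters)
```

Five samples that all land in one semantic cluster give zero entropy (consistent reasoning); five mutually inconsistent answers give maximal entropy, which is what EMPO penalizes.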
Key Novelty
Entropy-Minimized Policy Optimization (EMPO)
  • Uses Semantic Entropy as an intrinsic reward signal for Reinforcement Learning (RL), removing the need for external verifiers or golden answers
  • Optimizes the model to favor reasoning traces that yield semantically consistent answers across multiple samples
  • Employs an entropy-thresholding mechanism to filter out questions whose uncertainty is overly high (unreliable) or overly low (trivial)
Evaluation Highlights
  • +17.4% accuracy improvement on mathematical benchmarks using Qwen2.5-Math-7B Base (30.7% → 48.1%) without any supervised signals
  • +18.0% accuracy improvement on MMLU-Pro using Qwen2.5-7B Base (32.1% → 50.1%)
  • Demonstrates that semantic entropy has a strong negative correlation with model accuracy, validating it as a robust unsupervised proxy for correctness
Breakthrough Assessment
8/10
Significant unsupervised gains (+17-18%) on hard reasoning benchmarks. Proposes a principled way to do RL without ground-truth verifiers, addressing a major bottleneck in scaling post-training.