
Native Reasoning Models: Training Language Models to Reason on Unverifiable Data

Y Wang, Z Liu, X Li, C Lu, C Yang
Shanghai Artificial Intelligence Laboratory, Shanghai Jiao Tong University
arXiv, February 2026
Tags: Reasoning, RL, QA

📝 Paper Summary

Keywords: Large Reasoning Models (LRMs), Reinforcement Learning (RL), Chain-of-Thought (CoT) Reasoning
NRT trains reasoning models using only question-answer pairs by treating the reasoning trace as a latent variable and reinforcing traces that raise the model's own likelihood of the correct final answer.
Core Problem
Training strong reasoning models typically relies on expensive human-annotated reasoning traces (SFT) or on external verifiers (RLVR), which restricts training to domains such as math and code where correctness can be checked objectively.
Why it matters:
  • Dependency on human data is costly and embeds human biases, constraining the model's search for better strategies.
  • Reliance on external verifiers excludes vast domains like open-ended QA, creative writing, and summarization where correctness is subjective.
  • Existing verifier-free methods often suffer from policy collapse, converging to simple, low-entropy outputs.
Concrete Example: In verifiable domains like math, a model can be rewarded if the final answer matches a number. In open-ended QA, no simple check exists. Standard self-rewarding methods might just reward the model for being confident, leading it to output short, trivial nonsense that it is 'sure' about, rather than actual reasoning.
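A toy sketch of the failure mode described above, using hand-picked numbers (illustrative only, not from the paper): if the self-reward is just the model's mean token confidence, a short, trivial output beats a longer reasoning trace that necessarily passes through uncertain steps.

```python
import math

def confidence_reward(token_logprobs):
    # Naive self-reward: exponentiated mean token log-probability of the
    # model's own output (higher = more confident).
    return math.exp(sum(token_logprobs) / len(token_logprobs))

# Hypothetical per-token log-probs:
trivial = [-0.05]                    # one short, high-confidence token
reasoned = [-0.9, -1.2, -0.7, -0.4]  # longer trace with uncertain steps

# The degenerate output wins under confidence-only rewarding,
# which is exactly the policy-collapse pressure noted above.
assert confidence_reward(trivial) > confidence_reward(reasoned)
```

Rewarding confidence alone therefore selects for brevity and certainty, not for reasoning quality, which motivates anchoring the reward to the ground-truth answer instead.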
Key Novelty
Native Reasoning Training (NRT)
  • Treats the reasoning trace as a latent variable to be discovered rather than imitated from humans.
  • Uses a unified framework where reasoning is intrinsically rewarded if it increases the model's likelihood of generating the correct ground-truth answer.
  • Introduces novel weighted-sum reward schemes that prioritize 'hard' tokens (where the model is uncertain), forcing the model to reason through difficulties rather than shortcutting.
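A minimal sketch of the reward structure the bullets describe, under assumed forms (the function names and the exact weighting scheme are illustrative, not the paper's implementation): a trace earns reward when it raises the log-likelihood of the ground-truth answer, and a weighted-sum variant up-weights hard answer tokens by their difficulty, -log p.

```python
def nrt_reward(ans_logp_with_trace, ans_logp_no_trace):
    # Latent-variable view (sketch): reinforce the trace by how much it
    # increases log p(answer | question, trace) over log p(answer | question).
    return sum(ans_logp_with_trace) - sum(ans_logp_no_trace)

def nrt_ws_reward(ans_logp_with_trace):
    # Weighted-sum variant (assumed form): weight each ground-truth answer
    # token by its difficulty -log p, so uncertain ("hard") tokens dominate
    # the reward and cannot be shortcut by easy tokens.
    weights = [-lp for lp in ans_logp_with_trace]
    total = sum(weights)
    return sum(w * lp for w, lp in zip(weights, ans_logp_with_trace)) / total
```

A trace that lifts the answer's per-token log-probs from, say, [-1.0, -1.5, -0.8] to [-0.2, -0.1, -0.3] receives a positive `nrt_reward`; the weighted variant then scores the with-trace answer by a difficulty-weighted mean log-prob rather than a plain average.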
Evaluation Highlights
  • NRT-WS(-log p) achieves 56.2 average score on Llama-3.1-8B across 9 benchmarks, outperforming the SFT baseline (46.0) by +10.2 points.
  • On GSM8K (math), NRT boosts Llama-3.1-8B from 29.0 (SFT) to 76.0, significantly surpassing the strongest prior verifier-free method (RLPR) which scored 65.0.
  • Robust to policy collapse: unlike baselines that degenerate into short, low-quality traces, NRT maintains high entropy and semantic quality throughout training.
Breakthrough Assessment
9/10
Eliminates the need for both reasoning demonstrations and external verifiers while achieving SOTA results. The shift to latent variable modeling with uncertainty-based rewards is a significant methodological advance.