
Native Reasoning Models: Training Language Models to Reason on Unverifiable Data

Y Wang, Z Liu, X Li, C Lu, C Yang
Shanghai Artificial Intelligence Laboratory, Shanghai Jiao Tong University
arXiv, February 2026
Tags: Reasoning, RL, QA

📝 Paper Summary

Keywords: Large Reasoning Models (LRMs), Reinforcement Learning (RL), Chain-of-Thought (CoT) Reasoning
NRT trains reasoning models using only question-answer pairs by treating the reasoning trace as a latent variable and reinforcing traces that raise the model's own likelihood of the correct final answer.
Core Problem
Training strong reasoning models typically relies on expensive human-annotated reasoning traces (SFT) or on external verifiers (RLVR), which restricts training to domains such as math and code where correctness can be checked objectively.
Why it matters:
  • Dependency on human data is costly and embeds human biases, constraining the model's search for better strategies.
  • Reliance on external verifiers excludes vast domains like open-ended QA, creative writing, and summarization where correctness is subjective.
  • Existing verifier-free methods often suffer from policy collapse, converging to simple, low-entropy outputs.
Concrete Example: In verifiable domains like math, a model can be rewarded if the final answer matches a number. In open-ended QA, no simple check exists. Standard self-rewarding methods might just reward the model for being confident, leading it to output short, trivial nonsense that it is 'sure' about, rather than actual reasoning.
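A toy sketch of the failure mode described above, using hand-picked numbers (illustrative only, not from the paper): if the self-reward is just the model's mean token confidence, a short, trivial output beats a longer reasoning trace that necessarily passes through uncertain steps.

```python
import math

def confidence_reward(token_logprobs):
    # Naive self-reward: exponentiated mean token log-probability of the
    # model's own output (higher = more confident).
    return math.exp(sum(token_logprobs) / len(token_logprobs))

# Hypothetical per-token log-probs:
trivial = [-0.05]                    # one short, high-confidence token
reasoned = [-0.9, -1.2, -0.7, -0.4]  # longer trace with uncertain steps

# The degenerate output wins under confidence-only rewarding,
# which is exactly the policy-collapse pressure noted above.
assert confidence_reward(trivial) > confidence_reward(reasoned)
```

Rewarding confidence alone therefore selects for brevity and certainty, not for reasoning quality, which motivates anchoring the reward to the ground-truth answer instead.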
Key Novelty
Native Reasoning Training (NRT)
  • Treats the reasoning trace as a latent variable to be discovered rather than imitated from humans.
  • Uses a unified framework where reasoning is intrinsically rewarded if it increases the model's likelihood of generating the correct ground-truth answer.
  • Introduces novel weighted-sum reward schemes that prioritize 'hard' tokens (where the model is uncertain), forcing the model to reason through difficulties rather than shortcutting.
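A minimal sketch of the reward structure the bullets describe, under assumed forms (the function names and the exact weighting scheme are illustrative, not the paper's implementation): a trace earns reward when it raises the log-likelihood of the ground-truth answer, and a weighted-sum variant up-weights hard answer tokens by their difficulty, -log p.

```python
def nrt_reward(ans_logp_with_trace, ans_logp_no_trace):
    # Latent-variable view (sketch): reinforce the trace by how much it
    # increases log p(answer | question, trace) over log p(answer | question).
    return sum(ans_logp_with_trace) - sum(ans_logp_no_trace)

def nrt_ws_reward(ans_logp_with_trace):
    # Weighted-sum variant (assumed form): weight each ground-truth answer
    # token by its difficulty -log p, so uncertain ("hard") tokens dominate
    # the reward and cannot be shortcut by easy tokens.
    weights = [-lp for lp in ans_logp_with_trace]
    total = sum(weights)
    return sum(w * lp for w, lp in zip(weights, ans_logp_with_trace)) / total
```

A trace that lifts the answer's per-token log-probs from, say, [-1.0, -1.5, -0.8] to [-0.2, -0.1, -0.3] receives a positive `nrt_reward`; the weighted variant then scores the with-trace answer by a difficulty-weighted mean log-prob rather than a plain average.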
Evaluation Highlights
  • NRT-WS(-log p) achieves 56.2 average score on Llama-3.1-8B across 9 benchmarks, outperforming the SFT baseline (46.0) by +10.2 points.
  • On GSM8K (math), NRT boosts Llama-3.1-8B from 29.0 (SFT) to 76.0, significantly surpassing the strongest prior verifier-free method (RLPR) which scored 65.0.
  • Robust to policy collapse: unlike baselines that degenerate into short, low-quality traces, NRT maintains high entropy and semantic quality throughout training.
Breakthrough Assessment
9/10
Eliminates the need for both reasoning demonstrations and external verifiers while achieving SOTA results. The shift to latent variable modeling with uncertainty-based rewards is a significant methodological advance.