Learning to Reason without External Rewards

📝 Paper Summary

Reinforcement Learning for LLM Reasoning

Intuitor improves LLM reasoning by using the model's own internal confidence (self-certainty) as the sole reward signal in reinforcement learning, eliminating the need for ground-truth labels or external verifiers.

Core Problem

Current RL methods for reasoning (RLVR) rely on expensive domain-specific verifiable rewards (like gold solutions or test cases), which limits scalability and generalization to open-ended tasks.

Why it matters:

Gold-standard solutions and formal verification environments are unavailable for many real-world domains
Outcome-based rewards (correct/incorrect) fail to incentivize the underlying reasoning process, limiting transferability
Reliance on human supervision (RLHF) or crafted verifiers constrains autonomous self-improvement for super-human AI

Concrete Example: A model trained with standard RL on math problems might overfit to getting the final answer '42' correct without understanding the method. Intuitor rewards the model for being 'confident' in its generation steps, leading it to spontaneously develop detailed reasoning chains (like pre-code explanations) to increase its own certainty, even when not explicitly prompted to do so.

Key Novelty

Reinforcement Learning from Internal Feedback (RLIF) using Self-Certainty

Replaces external correctness rewards with 'self-certainty'—a metric measuring how confident the model is in its own token predictions relative to a uniform guess
Uses Group Relative Policy Optimization (GRPO) to encourage generation trajectories that maximize this intrinsic confidence, effectively rewarding the model for 'convincing itself'

Architecture

The Intuitor training pipeline integrating self-certainty with GRPO.

Evaluation Highlights

Fine-tuned Qwen2.5-1.5B improves from 0% to 9.9% accuracy on LiveCodeBench solely by training on MATH data with intrinsic rewards
Achieves 65% relative improvement on LiveCodeBench with Qwen2.5-3B compared to the base model, while supervised GRPO shows no improvement on this out-of-domain task
Matches the performance of supervised RL (GRPO with gold answers) on in-domain MATH benchmarks without using any ground truth labels

Breakthrough Assessment

8/10

Demonstrates that internal model signals can substitute for external supervision in reasoning tasks, achieving strong generalization and competitive performance. A significant step toward autonomous self-improving AI.

⚙️ Technical Details

Problem Definition

Setting: Reinforcement Learning from Internal Feedback (RLIF) where the policy optimizes an intrinsic reward signal without external labels

Inputs: Input query q (e.g., a math problem)

Outputs: Generated output o (reasoning chain and answer)

Pipeline Flow

Behavior Policy Sampling
Self-Certainty Scoring
Advantage Estimation
Policy Update

System Modules

Behavior Policy

Generate a group of candidate outputs for a given query

Model or implementation: Qwen2.5-1.5B or Qwen2.5-3B

Self-Certainty Scorer

Compute the intrinsic reward for each output based on model confidence

Model or implementation: Policy Model (Self-evaluation)
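A minimal sketch of how such a scorer could be computed from raw next-token logits, using only the Python standard library. It assumes the KL divergence is taken from the uniform distribution to the model's distribution, KL(U || p), averaged over generated positions; the summary does not state the direction, and the function names here are hypothetical.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def token_self_certainty(logits: Sequence[float]) -> float:
    """KL(U || p) for a single next-token distribution p = softmax(logits).

    Assumption: the divergence is measured from uniform U to p, so
    KL(U || p) = -(1/V) * sum_j log(V * p_j).
    """
    v = len(logits)
    log_p = log_softmax(logits)
    return -sum(math.log(v) + lp for lp in log_p) / v


def self_certainty(step_logits: Sequence[Sequence[float]]) -> float:
    """Average the per-token score over all generated positions."""
    return sum(token_self_certainty(row) for row in step_logits) / len(step_logits)


# Toy 3-token vocabulary: a peaked distribution scores higher than a flat one.
peaked = [[5.0, 0.0, 0.0], [4.0, 0.0, 0.0]]
flat = [[0.0, 0.0, 0.0], [0.1, 0.0, 0.0]]
print(self_certainty(peaked) > self_certainty(flat))
```

A uniform next-token distribution scores exactly zero under this definition, so the score grows only as the model concentrates probability mass on fewer tokens.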

GRPO Optimizer

Update the policy to maximize self-certainty using relative advantages within the group

Model or implementation: Policy Gradient Optimizer
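The group-relative advantage step can be sketched as follows: each output's self-certainty score is standardized against the other outputs sampled for the same query. This is an illustrative sketch, not the authors' code; the use of the population standard deviation and the `eps` stabilizer are assumptions.

```python
import math
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Standardize rewards within one group of outputs for the same query.

    Assumption: population standard deviation plus a small eps to avoid
    division by zero when every output receives the same reward.
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical self-certainty scores for a group of G = 7 sampled outputs.
scores = [2.1, 1.8, 2.6, 1.9, 2.4, 2.0, 2.2]
advantages = group_relative_advantages(scores)
print([round(a, 3) for a in advantages])
```

Because the advantages are centered within the group, outputs with above-average self-certainty are reinforced and below-average ones are discouraged, with no separate value function involved.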

Novel Architectural Elements

Replacement of external reward mechanism with an internal computation (self-certainty) integrated directly into the GRPO advantage estimation loop

Modeling

Base Model: Qwen2.5-1.5B and Qwen2.5-3B

Training Method: Group Relative Policy Optimization (GRPO)

Objective Functions:

Purpose: Maximize expected intrinsic reward (self-certainty) while staying close to reference policy.

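Formally: Maximize E[min(ratio * A, clip(ratio) * A) - beta * KL(policy || ref)], where ratio is the probability ratio between the current policy and the behavior policy that sampled the outputs, and A is the group-relative advantage derived from self-certainty.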
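Written out in the usual GRPO notation, the objective might look like the following. The clipping range \epsilon, the per-token averaging, and the symbol u_i for the self-certainty of output o_i are assumptions carried over from the standard GRPO formulation; they are not stated in this summary.

```latex
\mathcal{J}(\theta) =
\mathbb{E}_{q,\ \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}(\cdot \mid q)}
\left[
  \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|}
  \Big(
    \min\big( r_{i,t}(\theta)\, \hat{A}_i,\
              \operatorname{clip}\big(r_{i,t}(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, \hat{A}_i \big)
    - \beta\, \mathrm{KL}\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big)
  \Big)
\right]

r_{i,t}(\theta) =
\frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t} \mid q, o_{i,<t})},
\qquad
\hat{A}_i = \frac{u_i - \operatorname{mean}(u_1, \dots, u_G)}{\operatorname{std}(u_1, \dots, u_G)}
```

Here r_{i,t} plays the role of "ratio", \hat{A}_i the role of "A", and \beta is the KL penalty coefficient.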

Training Data:

MATH dataset training split (7,500 problems)
Codeforces dataset (3,200 problems) for code experiments

Key Hyperparameters:

group_size_G: 7 (Math), 14 (Code)
kl_penalty_beta: 0.005 (Math), 0.01 (Code)
learning_rate: 3e-5 (Math), 1e-5 (Code)
batch_size: 128 queries per update
max_steps: 50 (Code experiments)
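For reference, the listed hyperparameters could be collected into a small configuration object like the one below. The class and field names are hypothetical, and the sketch assumes the batch size of 128 queries applies to both the math and code runs; the number of steps for the math runs is not given here, so it is left unset.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IntuitorConfig:
    """Hypothetical container for the hyperparameters listed above."""
    group_size: int                   # G: outputs sampled per query
    kl_beta: float                    # KL penalty coefficient
    learning_rate: float
    batch_size: int = 128             # queries per update (assumed shared by both setups)
    max_steps: Optional[int] = None   # only stated for the code experiments


MATH_CONFIG = IntuitorConfig(group_size=7, kl_beta=0.005, learning_rate=3e-5)
CODE_CONFIG = IntuitorConfig(group_size=14, kl_beta=0.01, learning_rate=1e-5, max_steps=50)


def samples_per_update(cfg: IntuitorConfig) -> int:
    """Total sampled outputs per update if every query yields G outputs."""
    return cfg.batch_size * cfg.group_size


print(samples_per_update(MATH_CONFIG), samples_per_update(CODE_CONFIG))
```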

Compute: NVIDIA A100 GPUs (40GB)

Comparison to Prior Work

vs. GRPO: Intuitor uses intrinsic self-certainty rewards instead of ground-truth verification
vs. TTRL: Intuitor rewards process confidence rather than consensus answers, enabling better out-of-domain generalization
vs. STaR: Does not require any correctness check to filter data; learns purely from confidence maximization

Limitations

Performance is slightly lower than supervised GRPO on in-domain tasks (MATH)
Requires careful tuning of KL penalty to prevent mode collapse or reward hacking
Offline reward computation (scoring outputs with a fixed model) invites exploitation, with response length exploding; rewards must be computed online with the current policy

Reproducibility

Code: https://github.com/sunblaze-ucb/Intuitor

Uses the Open-R1 framework. Hyperparameters are explicitly listed.

📊 Experiments & Results

Evaluation Setup

Models trained on MATH dataset and evaluated on math (in-domain) and code/instruction-following (out-of-domain) benchmarks.

Benchmarks:

MATH (Mathematical Reasoning)
GSM8K (Grade School Math)
LiveCodeBench (LCB) (Code Generation)
CRUXEval-O (Code Reasoning)
AlpacaEval 2.0 (Instruction Following)

Metrics:

Accuracy (Pass@1)
Length Controlled Win Rate (AlpacaEval)
Statistical methodology: Mann-Whitney U tests used to analyze self-certainty distributions.
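As a rough illustration of the test named above, the sketch below computes a Mann-Whitney U statistic and a two-sided p-value for two samples of self-certainty scores. It uses the normal approximation without tie or continuity corrections, which is a simplification; the sample values are synthetic, and which groups the paper actually compared is not stated in this summary. For real analyses an established statistics library would be the better choice.

```python
import math
from typing import Sequence, Tuple


def mann_whitney_u(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Return (U statistic for x, two-sided p-value).

    Simplifications: normal approximation, no tie or continuity correction.
    """
    n1, n2 = len(x), len(y)
    # U counts the pairs in which an x value beats a y value (ties count half).
    u = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return u, p_two_sided


# Synthetic self-certainty scores for two hypothetical groups of responses.
group_a = [0.9, 1.1, 1.3, 1.5, 1.7, 1.2]
group_b = [0.2, 0.4, 0.5, 0.6, 0.8, 0.3]
u_stat, p_value = mann_whitney_u(group_a, group_b)
print(u_stat, p_value < 0.05)
```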

Key Results

Intuitor enables models to learn non-trivial reasoning capabilities from scratch (or near-scratch) without ground truth, significantly outperforming baselines on transfer tasks.

Benchmark	Metric	Baseline	This Paper	Δ
LiveCodeBench (Qwen2.5-1.5B)	Accuracy (%)	0	9.9	+9.9

Experiment Figures

Evolution of output types on LiveCodeBench during training.

Comparison of Online vs. Offline self-certainty rewards.

Main Takeaways

Intrinsic 'self-certainty' rewards are sufficient to drive learning of complex reasoning behaviors, matching supervised RL on in-domain math tasks.
Intuitor generalizes significantly better than outcome-based RL (GRPO) to out-of-domain tasks (Code Generation), likely because it rewards the reasoning process (confidence) rather than just the final answer.
Qualitative analysis shows emergent 'pre-reasoning' behaviors: models trained with Intuitor spontaneously generate detailed natural language explanations before writing code to increase their own confidence.
Online reward computation is critical; offline rewards lead to reward hacking where the model generates gibberish to inflate confidence scores.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning with Verifiable Rewards (RLVR)
KL Divergence
Policy Gradients

Key Terms

RLIF: Reinforcement Learning from Internal Feedback—a paradigm where models learn from intrinsic signals derived from their own state rather than external rewards

Self-certainty: A confidence metric defined as the average KL divergence between the model's output distribution and a uniform distribution; higher values indicate the model is more sure of its choice
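In symbols, the definition can be written as below. The direction of the divergence (from the uniform distribution U to the model's distribution) is an assumption, since the summary only says "between"; V denotes the vocabulary and o_{<i} the tokens generated before position i.

```latex
\text{Self-certainty}(o \mid q)
= \frac{1}{|o|} \sum_{i=1}^{|o|}
  \mathrm{KL}\big( U \,\|\, p_{\pi_\theta}(\cdot \mid q, o_{<i}) \big)
= -\frac{1}{|o|\,|V|} \sum_{i=1}^{|o|} \sum_{j=1}^{|V|}
  \log\big( |V| \cdot p_{\pi_\theta}(j \mid q, o_{<i}) \big)
```

Under this form, a uniform next-token distribution contributes zero, and the score grows as the distribution becomes more peaked.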

GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing multiple outputs sampled from the same input query, removing the need for a separate value function

RLVR: Reinforcement Learning with Verifiable Rewards—using objective, programmatic checks (like compiling code or matching math answers) as reward signals

RLHF: Reinforcement Learning from Human Feedback—using a reward model trained on human preferences to guide LLM generation

KL penalty: A regularization term preventing the trained model from drifting too far from the reference model's distribution