Reasoning Models Hallucinate More: Factuality-Aware Reinforcement Learning for Large Reasoning Models

📝 Paper Summary

Reinforcement Learning for Reasoning Hallucination Suppression

FSPO mitigates hallucinations in reasoning models by verifying intermediate steps against evidence and adjusting token-level RL advantages, rather than relying solely on final-answer correctness.

Core Problem

Outcome-based RL fine-tuning for reasoning tasks exacerbates hallucinations. Because the reward depends only on the final answer, the policy can be reinforced for incorrect intermediate reasoning steps that coincidentally lead to correct answers, can become trapped in spurious local optima, or can state errors confidently when sampling under high entropy.

Why it matters:

Models trained with standard RL (like DeepSeek-R1) show significantly higher rates of fabricated statements than their non-RL counterparts on benchmarks like TruthfulQA and HaluEval
Optimizing only for the final answer creates sparse rewards and high-variance gradients, making it difficult for models to learn faithful reasoning patterns
Unreliable reasoning chains undermine trust, even if the final answer is correct, as the model may justify its output with false claims

Concrete Example: In a HaluEval-QA case, DeepSeek-R1 answers correctly but generates a reasoning chain with fabricated facts, whereas the base model DeepSeek-V3 does not. The RL-tuned model is 'confidently wrong' in its reasoning path because training scored only the final outcome, so fabricated intermediate steps were never penalized.

Key Novelty

Factuality-aware Step-wise Policy Optimization (FSPO)

Integrates an automated verifier into the RL loop that checks each generated reasoning sentence against external evidence (e.g., Wikipedia)
Modifies the standard advantage function by re-weighting token-level advantages based on step-wise factuality scores (entailed vs. contradicted)
Provides dense feedback signals to the policy, ensuring that valid reasoning steps are rewarded even if the final answer is wrong, and fabricated steps are penalized even if the final answer is right

Architecture

Overview of the Factuality-aware Step-wise Policy Optimization (FSPO) framework.

Evaluation Highlights

FSPO significantly reduces hallucination rates on TruthfulQA compared to standard RL baselines using Qwen2.5 and Llama-3.1 models
Improves mathematical reasoning accuracy on challenging benchmarks while maintaining higher factuality than purely outcome-driven RL methods
Enhances the reliability of intermediate reasoning steps without compromising the fluency or quality of the generated text

Breakthrough Assessment

8/10

Addresses a critical and timely failure mode of current reasoning models (hallucination induction via RL) with a theoretically grounded and effective solution. The step-wise verification approach directly tackles the sparsity of outcome-based rewards.

⚙️ Technical Details

Problem Definition

Setting: Token-level Markov decision process for language generation where the policy generates a sequence of reasoning steps followed by a final answer

Inputs: Input prompt x (question) and associated evidence K (e.g., Wikipedia snippets)

Outputs: Generated response y containing intermediate reasoning steps z and final answer

Pipeline Flow

Policy Model (generates reasoning + answer)
Step-wise Factuality Verifier (checks reasoning steps against evidence)
Answer Correctness Evaluator (checks final answer)
Advantage Adjustment (combines signals for RL update)
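
The four stages above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `toy_verify` is a hypothetical string-matching stand-in for the LLM-based verifier, the score encoding (+1 entailed, -1 contradicted, 0 neutral) is an assumption, and the answer check is a simple case-insensitive match.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass
class StepVerdict:
    sentence: str
    score: int  # assumed encoding: +1 entailed, -1 contradicted, 0 neutral


def split_steps(reasoning: str) -> list[str]:
    """Naive sentence splitter on ., !, ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", reasoning.strip())
    return [p for p in parts if p]


def toy_verify(sentence: str, evidence: list[str]) -> int:
    """Hypothetical verifier: exact match is entailed; same three leading
    words but different text is treated as contradicted; otherwise neutral."""
    if sentence in evidence:
        return 1
    subject = sentence.split()[:3]
    if any(e.split()[:3] == subject for e in evidence):
        return -1
    return 0


def run_pipeline(
    reasoning: str,
    answer: str,
    gold: str,
    evidence: list[str],
    verify: Callable[[str, list[str]], int] = toy_verify,
) -> tuple[list[StepVerdict], float]:
    """Verify each reasoning step, then check the final answer."""
    verdicts = [StepVerdict(s, verify(s, evidence)) for s in split_steps(reasoning)]
    outcome = 1.0 if answer.strip().lower() == gold.strip().lower() else 0.0
    return verdicts, outcome


evidence = [
    "Paris is the capital of France.",
    "The Eiffel Tower was completed in 1889.",
]
verdicts, outcome = run_pipeline(
    "Paris is the capital of France. The Eiffel Tower was built in 1750.",
    "Paris",
    "paris",
    evidence,
)
print([v.score for v in verdicts], outcome)
```

The example response gets full outcome reward while its second step is flagged, which is exactly the case an outcome-only reward cannot see.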

System Modules

Policy Model

Generate reasoning steps and final answer for the given prompt

Model or implementation: Qwen2.5-7B-Instruct or Llama-3.1-8B-Instruct

Step-wise Verifier (Evaluation)

Determine if each reasoning sentence is entailed by evidence

Model or implementation: LLM-based verifier (the specific model is not detailed in the text; it is likely of similar size to the policy model)

Answer Evaluator (Evaluation)

Check correctness of final answer

Model or implementation: Rule-based checker

Novel Architectural Elements

Integration of a step-wise factuality verification loop within the GRPO optimization process
Dynamic token-level advantage adjustment mechanism that re-weights tokens based on the verification of the specific sentence they belong to
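
Re-weighting tokens by sentence requires knowing which sentence each token belongs to. The sketch below shows one way to do that bookkeeping, assuming tokens are plain strings and a sentence ends at a token that finishes with `.`, `!`, or `?`. Both function names are hypothetical; a real tokenizer would need offset-based alignment instead.

```python
def sentence_ids_for_tokens(tokens: list[str]) -> list[int]:
    """Assign each token the index of the sentence it belongs to."""
    ids: list[int] = []
    current = 0
    for tok in tokens:
        ids.append(current)
        # Assumption: sentence-final punctuation closes the current sentence.
        if tok.strip().endswith((".", "!", "?")):
            current += 1
    return ids


def per_token_scores(tokens: list[str], sentence_scores: list[int]) -> list[int]:
    """Broadcast one factuality score per sentence to every token in it.
    Tokens past the last scored sentence default to 0 (neutral)."""
    ids = sentence_ids_for_tokens(tokens)
    return [sentence_scores[i] if i < len(sentence_scores) else 0 for i in ids]


tokens = ["Paris", " is", " big", ".", " It", " rains", "."]
print(per_token_scores(tokens, [1, -1]))
```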

Modeling

Base Model: Qwen2.5-7B-Base/Instruct and Llama-3.1-8B-Instruct

Training Method: Factuality-aware Step-wise Policy Optimization (FSPO) building on GRPO

Objective Functions:

Purpose: Optimize policy to maximize expected reward using group-relative advantages.

Formally: GRPO objective maximizing sum of min(ratio * A, clip(ratio) * A) - beta * KL.

Purpose: Calculate total reward combining outcome and factuality.

Formally: R(y) = R_outcome + sum_j R_factuality(z_j)

Purpose: Adjust advantage for specific tokens based on local factuality.

Formally: A_hat_{i,t} = A_i + lambda * R_factuality(z_j) for tokens t in sentence z_j
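
The reward and advantage formulas, taken exactly as written in this summary, can be worked through in a short sketch. The value of `lam` is a placeholder since the summary does not give lambda, and the step scores are assumed to take values in {+1, 0, -1}.

```python
def total_reward(r_outcome: float, step_scores: list[int]) -> float:
    """R(y) = R_outcome + sum_j R_factuality(z_j)."""
    return r_outcome + sum(step_scores)


def adjusted_token_advantages(
    a_i: float,
    token_sentence_ids: list[int],
    step_scores: list[int],
    lam: float = 0.5,  # hypothetical value; lambda is not specified in the summary
) -> list[float]:
    """A_hat_{i,t} = A_i + lambda * R_factuality(z_j) for each token t in z_j."""
    return [a_i + lam * step_scores[j] for j in token_sentence_ids]


# Response i: sequence-level advantage 0.2, two sentences,
# the first entailed (+1) and the second contradicted (-1).
step_scores = [1, -1]
token_sentence_ids = [0, 0, 1]  # first two tokens in sentence 0, third in sentence 1
print(total_reward(1.0, step_scores))
print(adjusted_token_advantages(0.2, token_sentence_ids, step_scores))
```

With these numbers the tokens of the contradicted sentence end up with a negative advantage even though the response as a whole had a positive one.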

Adaptation: Full fine-tuning

Training Data:

Synthesized long-CoT data from DeepSeek-R1 used for distillation
Standard math and hallucination benchmarks for evaluation

Key Hyperparameters:

learning_rate: Not reported in the paper
batch_size: Not reported in the paper

Compute: Not reported in the paper

Comparison to Prior Work

vs. DeepSeek-R1: Adds explicit step-wise factuality verification and token-level advantage adjustment to prevent hallucination, whereas R1 relies on outcome rewards which can induce hallucinations.
vs. Standard GRPO: Modifies the advantage calculation to be token-specific based on local factuality, rather than assigning a single advantage score to the entire sequence.
vs. Kongzi [cited in paper]: Addresses the specific issue of fluent but hallucinated reasoning chains in RL-aligned models.

Limitations

Relies on the availability of high-quality evidence (K) for verification, which may not always be present for all tasks.
Dependent on the accuracy of the automated verifier; verifier errors could mislead the policy.
Computational cost is higher than standard RL due to the need for step-wise verification during training.
Hyperparameters for the advantage adjustment (lambda) and training details are not fully specified.

Reproducibility

No replication artifacts mentioned in the paper. Code URL, hyperparameters (learning rate, batch size), and specific verifier model details are missing.

📊 Experiments & Results

Evaluation Setup

Evaluation on mathematical reasoning and hallucination benchmarks

Benchmarks:

TruthfulQA (hallucination evaluation, generation task)
HaluEval (hallucination evaluation, QA subset)
HalluQA (hallucination evaluation)
Mathematical reasoning benchmarks (math reasoning)

Metrics:

Hallucination Rate (lower is better)
Accuracy (mathematical reasoning)
Statistical methodology: Not explicitly reported in the paper
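
The summary does not say how the two metrics are computed. A minimal sketch under common definitions: hallucination rate is the fraction of responses that a judge flags as hallucinated, and accuracy is the exact-match fraction of final answers. The per-response flags are assumed to come from a separate judging step that is not shown here.

```python
def hallucination_rate(flags: list[bool]) -> float:
    """Fraction of responses flagged as hallucinated (lower is better)."""
    if not flags:
        raise ValueError("need at least one response")
    return sum(flags) / len(flags)


def accuracy(predictions: list[str], golds: list[str]) -> float:
    """Exact-match accuracy after trimming surrounding whitespace."""
    if len(predictions) != len(golds) or not golds:
        raise ValueError("predictions and golds must be non-empty and aligned")
    hits = sum(p.strip() == g.strip() for p, g in zip(predictions, golds))
    return hits / len(golds)


print(hallucination_rate([True, False, False, True]))
print(accuracy(["4", " 7"], ["4", "8"]))
```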

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Preliminary experiments showing the negative impact of standard RL on factuality.
TruthfulQA	Hallucination Rate	Not reported in the paper	Not reported in the paper	Not reported in the paper
Main results demonstrating FSPO's effectiveness.
Mathematical Reasoning	Performance	Not reported in the paper	Not reported in the paper	Not reported in the paper
Hallucination Benchmarks	Hallucination Rate	Not reported in the paper	Not reported in the paper	Not reported in the paper

Experiment Figures

Comparison of hallucination rates on TruthfulQA, HaluEval, and HalluQA for models with and without RL/CoT training (DeepSeek-V3 vs R1, Qwen vs QwQ, etc.).

Analysis of error sources in HaluEval-QA for DeepSeek-R1.

Main Takeaways

Standard outcome-based RL (like in DeepSeek-R1) significantly increases hallucination rates compared to base models, as models learn to fabricate reasoning to satisfy the final answer.
Hallucinations in RL-trained models primarily stem from incorrect intermediate reasoning steps rather than just the final answer generation.
FSPO effectively mitigates this by providing dense, step-wise factuality signals, preventing the model from learning 'confidently wrong' patterns.
The method improves both reasoning accuracy and factuality simultaneously, suggesting that faithful reasoning is compatible with high performance.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (RL) fundamentals (Policy Gradient, REINFORCE)
Large Language Models (LLMs) and Chain-of-Thought (CoT) prompting
Understanding of reward modeling in RLHF

Key Terms

CoT: Chain-of-Thought—a prompting technique where the model generates intermediate reasoning steps before the final answer

RL: Reinforcement Learning—a machine learning paradigm where an agent learns to make decisions by performing actions and receiving rewards

GRPO: Group Relative Policy Optimization—a policy gradient algorithm that estimates advantages by normalizing rewards within a group of sampled outputs for the same input, removing the need for a separate value network
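
A small sketch of the group normalization described in the GRPO entry: each sampled response's reward is standardized against the other responses drawn for the same prompt. Using the population standard deviation and the `eps` stabilizer are assumptions made for this illustration.

```python
import statistics


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Standardize rewards within one group of samples for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption here
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled responses to one prompt, two rewarded and two not.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
# If every sample gets the same reward, all advantages are zero and
# the group contributes no learning signal.
print(group_relative_advantages([1.0, 1.0, 1.0]))
```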

PPO: Proximal Policy Optimization—a standard RL algorithm that updates policies using a clipped objective to prevent large, unstable updates
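
The clipped objective can be illustrated for a single token. This sketch computes the standard PPO surrogate term min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A). The clip range of 0.2 is a commonly used default and is not taken from the paper.

```python
import math


def clipped_surrogate(
    logp_new: float, logp_old: float, advantage: float, clip_eps: float = 0.2
) -> float:
    """Per-token PPO surrogate: min(ratio * A, clip(ratio) * A)."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped * advantage)


# The new policy makes the token 1.5x more likely than the old policy did.
logp_old = math.log(0.2)
logp_new = math.log(0.3)
print(clipped_surrogate(logp_new, logp_old, advantage=1.0))   # gain is capped by the clip
print(clipped_surrogate(logp_new, logp_old, advantage=-1.0))  # penalty is not capped
```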

hallucination: The generation of factually incorrect, nonsensical, or unfaithful content by a language model

policy gradient: An optimization technique in RL that updates the policy parameters in the direction of the gradient of expected reward

advantage: A value measuring how much better a specific action is compared to the average action in a given state

entropy: A measure of randomness or uncertainty in the model's predictions; high entropy implies the model is exploring many possibilities
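
For a concrete sense of scale, the sketch below computes Shannon entropy (in nats) of a next-token distribution. The two example distributions are made up for illustration.

```python
import math


def entropy(probs: list[float]) -> float:
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


peaked = [0.97, 0.01, 0.01, 0.01]   # model is nearly certain of one token
uniform = [0.25, 0.25, 0.25, 0.25]  # model is maximally uncertain over four tokens
print(entropy(peaked), entropy(uniform))
```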

spurious local optima: Suboptimal solutions where the model converges to a behavior (such as confidently outputting a wrong answer) that yields zero reward yet provides no gradient signal for correcting it

REINFORCE: A fundamental policy gradient algorithm that uses Monte Carlo sampling to estimate gradients

entailment: A logical relationship where the truth of one statement (evidence) guarantees the truth of another (generated text)