
Learning to Reason for Hallucination Span Detection

Hsuan Su, Ting-Yao Hu, Hema Swetha Koppula, Kundan Krishna, Hadi Pouransari, Cheng-Yu Hsieh, Cem Koc, Joseph Yitan Cheng, Oncel Tuzel, Raviteja Vemulapalli
National Taiwan University, Apple
arXiv (2025)
Factuality RL Reasoning Benchmark

📝 Paper Summary

Hallucination Detection Factuality Alignment
RL4HS improves fine-grained hallucination detection by training a reasoning model with reinforcement learning using a span-level F1 reward, outperforming standard supervised baselines.
Core Problem
Most hallucination detection work treats the problem as binary classification (does the output contain a hallucination or not), but real-world applications need to identify the specific hallucinated text spans.
Why it matters:
  • Binary detection is too coarse; users need to know exactly which parts of a summary or answer are unsupported to trust the output
  • Standard supervised fine-tuning often fails to learn the complex, multi-step reasoning required to verify facts against context
  • Existing general-purpose reasoning models (such as math and code specialists) do not transfer well to the specific task of factual consistency checking
Concrete Example: A restaurant review summary claims a venue 'provides catering services.' A standard model might miss this subtle error if the structured business data lists many attributes but omits 'catering.' RL4HS learns to cross-check this specific claim against the data schema and correctly identifies the span 'provides catering services' as a hallucination.
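To make the difference between the two output formats concrete, here is a minimal sketch. The summary sentence and the `highlight` helper are hypothetical and exist only for illustration; only the quoted claim 'provides catering services' comes from the example above.

```python
# Hypothetical summary text; only the quoted claim is taken from the example above.
summary = "The venue offers outdoor seating and provides catering services."
claim = "provides catering services"

# Binary detection: a single label for the whole output (assumed annotation).
binary_label = True  # "this summary contains a hallucination"

# Span detection: half-open character offsets of each unsupported claim.
start = summary.find(claim)
spans = [(start, start + len(claim))]


def highlight(text, spans):
    """Wrap each (start, end) span in [[ ]] so the flagged text is visible."""
    pieces, cursor = [], 0
    for s, e in sorted(spans):
        pieces.append(text[cursor:s])
        pieces.append("[[" + text[s:e] + "]]")
        cursor = e
    pieces.append(text[cursor:])
    return "".join(pieces)


print(highlight(summary, spans))
```

The binary label tells a user only that something is wrong; the span output points at the exact phrase that the source data does not support.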
Key Novelty
Reinforcement Learning for Hallucination Spans (RL4HS)
  • Treats hallucination span detection as a reasoning task where the model must generate intermediate thought steps before predicting spans
  • Optimizes the model using Reinforcement Learning (RL) with a reward based on the Span-F1 score (overlap between predicted and ground-truth error spans)
  • Introduces Class-Aware Policy Optimization (CAPO) to correct a reward-hacking bias in which the model learns to over-predict 'no hallucination' because that answer earns easy rewards
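The span-level reward described above can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: it assumes spans are half-open character offsets, that overlap is counted per character, and that an empty prediction against an empty ground truth earns full reward. The function names `span_chars` and `span_f1` are hypothetical.

```python
def span_chars(spans):
    """Expand half-open (start, end) spans into a set of character indices."""
    chars = set()
    for start, end in spans:
        chars.update(range(start, end))
    return chars


def span_f1(pred_spans, gold_spans):
    """Character-overlap F1 between predicted and ground-truth spans."""
    pred, gold = span_chars(pred_spans), span_chars(gold_spans)
    if not pred and not gold:
        # Assumption: correctly predicting "no hallucination" gets full reward.
        return 1.0
    if not pred or not gold:
        return 0.0
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


# A prediction that covers half of the gold span and spills over by the same amount.
reward = span_f1([(0, 10)], [(5, 15)])
```

Under these assumptions, an empty prediction is an all-or-nothing bet that pays the maximum reward on every clean example, while a non-empty prediction must match span boundaries to score well. That asymmetry is one plausible reading of the 'easy rewards' bias that CAPO is meant to address.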
Evaluation Highlights
  • RL4HS-14B achieves 57.6 F1 on Summarization and 62.6 F1 on Data-to-Text, surpassing both supervised fine-tuning and proprietary models like GPT-5
  • RL4HS-7B outperforms the larger QwQ-32B reasoning model by a wide margin (e.g., 50.9 vs 19.4 average F1), showing that general reasoning ability does not by itself confer hallucination detection skill
  • The proposed Class-Aware Policy Optimization (CAPO) significantly improves recall over standard Group Relative Policy Optimization (GRPO), yielding a better precision-recall balance
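The summary does not spell out how CAPO works, so the following is only a sketch of one way a class-aware correction could sit on top of GRPO-style advantages. Group normalization of rewards (subtract the group mean, divide by the group standard deviation) is the standard GRPO step; scaling down the advantages of samples that predict 'no hallucination' by a factor `alpha` is an assumption made here for illustration, and the names `group_advantages`, `class_aware_advantages`, and the default `alpha=0.5` are hypothetical.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize rewards within one sampled group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def class_aware_advantages(rewards, predicts_no_hallucination, alpha=0.5):
    """Assumed class-aware variant: shrink advantages of empty predictions.

    predicts_no_hallucination[i] is True when sample i predicted no spans.
    alpha < 1 reduces how strongly those samples drive the policy update.
    """
    advantages = group_advantages(rewards)
    return [
        a * alpha if is_empty else a
        for a, is_empty in zip(advantages, predicts_no_hallucination)
    ]


# Four sampled responses for one prompt: rewards and whether each predicted no spans.
rewards = [1.0, 0.0, 1.0, 0.0]
is_empty = [True, False, False, True]
scaled = class_aware_advantages(rewards, is_empty)
```

In this sketch the two responses with reward 1.0 start with the same advantage, but the one that predicted no spans contributes half as much to the update, which weakens the pull toward always answering 'no hallucination'.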
Breakthrough Assessment
8/10
Successfully applies RL to a discriminative/verification task (span detection) rather than just generation. The identification of reward hacking in F1-based RL and the CAPO solution are valuable contributions.