Learning to Reason for Hallucination Span Detection

📝 Paper Summary

Hallucination Detection Factuality Alignment

RL4HS improves fine-grained hallucination detection by training a reasoning model with reinforcement learning using a span-level F1 reward, outperforming standard supervised baselines.

Core Problem

Most hallucination detection work treats the problem as binary classification, but real-world applications need to identify the specific hallucinated text spans.

Why it matters:

Binary detection is too coarse; users need to know exactly which parts of a summary or answer are unsupported to trust the output
Standard supervised fine-tuning often fails to learn the complex, multi-step reasoning required to verify facts against context
Existing general-purpose reasoning models (like math/code specialists) do not transfer well to the specific task of factual consistency checking

Concrete Example: A restaurant review summary claims a venue 'provides catering services.' A standard model might miss this subtle error if the structured business data lists many attributes but omits 'catering.' RL4HS learns to cross-check this specific claim against the data schema and correctly identifies the span 'provides catering services' as a hallucination.

Key Novelty

Reinforcement Learning for Hallucination Spans (RL4HS)

Treats hallucination span detection as a reasoning task where the model must generate intermediate thought steps before predicting spans
Optimizes the model using Reinforcement Learning (RL) with a reward based on the Span-F1 score (overlap between predicted and ground-truth error spans)
Introduces Class-Aware Policy Optimization (CAPO) to fix a bias where the model learns to predict 'no hallucination' just to get easy rewards

Evaluation Highlights

RL4HS-14B achieves 57.6 F1 on Summarization and 62.6 F1 on Data-to-Text, surpassing both supervised fine-tuning and proprietary models like GPT-5
RL4HS-7B outperforms the larger QwQ-32B reasoning model by a wide margin (e.g., 50.9 vs 19.4 average F1), showing general reasoning doesn't equal hallucination detection skill
Proposed Class-Aware Policy Optimization (CAPO) improves recall significantly over standard GRPO, balancing the precision-recall trade-off

Breakthrough Assessment

8/10

Successfully applies RL to a discriminative/verification task (span detection) rather than just generation. The identification of reward hacking in F1-based RL and the CAPO solution are valuable contributions.

⚙️ Technical Details

Problem Definition

Setting: Conditional Natural Language Generation (CNLG) verification (Hallucination Span Detection)

Inputs: Input context c (e.g., source document) and generated response y

Outputs: List of hallucinated spans S = {[start, end]} in y that are not supported by c
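
As a minimal sketch of this input/output contract, the snippet below models one instance as a context, a response, and a list of character-offset spans. The class name `DetectionExample`, the end-exclusive offset convention, and the restaurant record are illustrative assumptions, not details from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class DetectionExample:
    """One verification instance: context c, response y, hallucinated spans S."""
    context: str
    response: str
    # Each span is a (start, end) pair of character offsets into `response`.
    # `end` is exclusive here; that convention is an assumption of this sketch.
    spans: list[tuple[int, int]] = field(default_factory=list)

    def span_texts(self) -> list[str]:
        """Return the substrings of the response covered by each span."""
        return [self.response[start:end] for start, end in self.spans]


# Hypothetical data modelled on the catering example in this summary.
context = "name: Blue Door Cafe; attributes: outdoor seating, takeout, wifi"
response = "Blue Door Cafe offers outdoor seating and provides catering services."
claim = "provides catering services"
start = response.index(claim)

example = DetectionExample(context, response, [(start, start + len(claim))])
print(example.span_texts())
```

An empty `spans` list represents a response judged fully supported by the context.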

Pipeline Flow

Input Processing (Context + Response)
Reasoning Generation (Chain-of-Thought)
Span Prediction
Reward Calculation (Training only)
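
The summary does not specify the output format the model uses, so the following is only a hypothetical sketch of the step between reasoning generation and span prediction. It assumes the completion contains free-form reasoning, a `</think>` delimiter, and then a JSON list of hallucinated substrings; the delimiter, the JSON format, and the function name `parse_completion` are all assumptions made for illustration.

```python
import json


def parse_completion(completion: str, response: str, delimiter: str = "</think>"):
    """Split a completion into (reasoning, spans).

    Spans are (start, end) character offsets into `response`, end exclusive.
    """
    reasoning, _, answer = completion.partition(delimiter)
    try:
        predicted_texts = json.loads(answer.strip())
    except json.JSONDecodeError:
        predicted_texts = []  # malformed or missing answer: treat as "no hallucination"
    if not isinstance(predicted_texts, list):
        predicted_texts = []

    spans = []
    for text in predicted_texts:
        if not isinstance(text, str) or not text:
            continue
        start = response.find(text)  # first occurrence only, for simplicity
        if start != -1:  # drop predictions that do not occur in the response
            spans.append((start, start + len(text)))
    return reasoning.strip(), spans


completion = 'The data lists no catering attribute.</think>["provides catering services"]'
response = "The venue provides catering services."
reasoning, spans = parse_completion(completion, response)
print(reasoning, spans)
```

Mapping predicted text back to offsets keeps the reward computation independent of how the model phrases its answer, at the cost of ambiguity when the same substring appears more than once.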

System Modules

Reasoning Generator

Analyze the response against the context to identify inconsistencies using Chain-of-Thought

Model or implementation: Qwen2.5-7B/14B-Instruct (fine-tuned)

Span Predictor

Output the specific text segments identified as hallucinations

Model or implementation: Same Qwen model (generates spans after reasoning)

Novel Architectural Elements

Implementation of Class-Aware Policy Optimization (CAPO) within the GRPO framework to handle class imbalance in reward distribution

Modeling

Base Model: Qwen2.5-7B-Instruct and Qwen2.5-14B-Instruct

Training Method: Class-Aware Policy Optimization (CAPO), a variant of Group Relative Policy Optimization (GRPO)

Objective Functions:

Purpose: Optimize policy to maximize span-level F1 reward while balancing class contributions.

Formally: Maximizes advantage A_i, where A_i is scaled by alpha if the sample is a non-hallucination, and unscaled otherwise.
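
The class-aware scaling can be sketched as follows. This assumes the usual GRPO group-relative advantage (reward minus group mean, divided by group standard deviation) and applies alpha only to rollouts flagged as non-hallucination. How a rollout is assigned to that class (by its prediction or by the gold label), the use of the population standard deviation, and the `eps` term are assumptions of this sketch rather than details confirmed by the summary.

```python
from statistics import mean, pstdev


def capo_advantages(rewards, is_non_hallucination, alpha=0.5, eps=1e-6):
    """Group-relative advantages with class-aware scaling.

    rewards: one scalar reward per rollout in the sampled group.
    is_non_hallucination: one bool per rollout; True marks the
        non-hallucination class (the class criterion is an assumption).
    alpha: scaling factor applied to the non-hallucination class.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    advantages = []
    for reward, non_hal in zip(rewards, is_non_hallucination):
        advantage = (reward - mu) / (sigma + eps)  # standard GRPO normalization
        advantages.append(alpha * advantage if non_hal else advantage)
    return advantages


# Two rollouts in the non-hallucination class score high; two others score lower.
group_rewards = [1.0, 1.0, 0.0, 0.4]
group_flags = [True, True, False, False]
print(capo_advantages(group_rewards, group_flags))
```

With alpha below 1, rollouts in the non-hallucination class contribute smaller updates than they would under plain GRPO, which is the intended counterweight to the easy rewards that class receives.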

Purpose: Define the reward based on overlap with ground truth.

Formally: Reward = Span-F1(Predicted_Spans, Ground_Truth_Spans).
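
A minimal sketch of a character-level Span-F1 reward is given below. Spans are taken to be end-exclusive `(start, end)` character offsets, and the case where both prediction and ground truth are empty is scored as 1.0; both conventions are assumptions of this sketch, and the paper's exact handling of empty cases may differ.

```python
def span_chars(spans):
    """Expand (start, end) spans into the set of character positions they cover."""
    chars = set()
    for start, end in spans:
        chars.update(range(start, end))
    return chars


def span_prf(predicted, gold):
    """Return (precision, recall, f1) over character positions."""
    pred_chars = span_chars(predicted)
    gold_chars = span_chars(gold)
    if not pred_chars and not gold_chars:
        # Assumed convention: correctly predicting "no hallucination" is a perfect score.
        return 1.0, 1.0, 1.0
    overlap = len(pred_chars & gold_chars)
    precision = overlap / len(pred_chars) if pred_chars else 0.0
    recall = overlap / len(gold_chars) if gold_chars else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


def span_f1_reward(predicted, gold):
    """Scalar reward used for RL: the F1 component only."""
    return span_prf(predicted, gold)[2]


print(span_prf([(0, 10)], [(5, 15)]))
```

Under this convention an empty prediction earns full reward on every hallucination-free example, which illustrates why an unweighted F1 reward can favor predicting no spans.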

Adaptation: Full fine-tuning

Key Hyperparameters:

alpha (CAPO scaling factor): 0.5
top_p: 0.95
top_k: 20
temperature: 0.6
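
To illustrate what these three sampling settings do, the sketch below applies temperature scaling, then top-k truncation, then top-p (nucleus) truncation to a list of logits. The order of operations and the renormalization after top-k mirror a common convention but are assumptions here; the summary does not state which inference stack was used, and `filtered_distribution` is a hypothetical helper name.

```python
import math


def filtered_distribution(logits, temperature=0.6, top_k=20, top_p=0.95):
    """Return {token_index: probability} after temperature, top-k, and top-p filtering."""
    # Temperature below 1 sharpens the distribution before filtering.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract max for numerical stability
    total = sum(exps)

    # Keep the top_k most probable tokens and renormalize.
    ranked = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)[:top_k]
    k_mass = sum(p for p, _ in ranked)
    ranked = [(p / k_mass, i) for p, i in ranked]

    # Keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for p, i in ranked:
        kept.append((p, i))
        cumulative += p
        if cumulative >= top_p:
            break

    norm = sum(p for p, _ in kept)
    return {i: p / norm for p, i in kept}


dist = filtered_distribution([5.0, 1.0, 0.0, -1.0], temperature=1.0, top_k=2, top_p=0.95)
print(dist)
```

The next token is then sampled from the returned distribution, so only tokens that survive both filters can be generated.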

Compute: Not reported in the paper

Comparison to Prior Work

vs. Multi-View Attention: RL4HS uses generative reasoning (CoT) and RL optimization rather than attention-based classification features
vs. Standard GRPO: RL4HS introduces class-aware scaling (CAPO) to prevent the model from collapsing to the 'no hallucination' majority class
vs. General Reasoning Models (QwQ, o3): RL4HS is specifically fine-tuned for the verification task, showing that general math/logic reasoning doesn't automatically solve textual hallucination detection

Limitations

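Relies on Span-F1 as the reward signal, which is asymmetric across classes: predicting no spans on a hallucination-free example earns reward easily, while hallucinated examples require accurate span localization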
Requires ground truth span annotations for training (supervised or RL)
Standard GRPO encourages reward hacking (predicting empty spans) without the proposed CAPO modification
Evaluation is limited to RAGTruth benchmark tasks (Summarization, QA, Data-to-Text)

Reproducibility

The RAGTruth dataset is public. Code is not provided. Model weights are not explicitly linked, but the base models (Qwen2.5) are open-weight. Generation hyperparameters (top-p, top-k, temperature) and the CAPO alpha are provided.

📊 Experiments & Results

Evaluation Setup

Span-level hallucination detection on three CNLG tasks

Benchmarks:

RAGTruth: Hallucination Span Detection (Summarization, QA, Data-to-Text)

Metrics:

Span-F1
Span-Precision
Span-Recall

Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
RL4HS consistently outperforms baselines across different model sizes and task types.
RAGTruth (Avg)	F1	50.1	55.9	+5.8
RAGTruth (Avg)	F1	27.3	58.3	+31.0
RAGTruth (Avg)	F1	51.2	58.3	+7.1
Comparison against general-purpose reasoning models shows domain-specific training is superior.
RAGTruth (Avg)	F1	15.3	55.9	+40.6
Ablation of the optimization strategy shows CAPO balances precision and recall better than GRPO.
RAGTruth (Avg)	F1	54.2	55.9	+1.7

Experiment Figures

Span-F1@K performance of pretrained models with and without Chain-of-Thought (CoT) as K (number of samples) increases.

Training dynamics (Recall, Precision, F1) comparing GRPO vs CAPO.

Main Takeaways

Explicit reasoning (CoT) elicited by prompting alone provides limited gains for hallucination detection; RL training is needed for the reasoning to translate into better span predictions.
Standard RL (GRPO) with F1 rewards leads to reward hacking where the model defaults to 'no hallucination' to maximize precision at the cost of recall.
The proposed CAPO method mitigates this reward hacking by scaling down (alpha = 0.5) the advantages of the easier non-hallucination class.
In-domain reasoning training is essential; even much larger general reasoning models (like QwQ-32B) perform poorly compared to smaller, task-specific RL4HS models.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (PPO/GRPO)
Chain-of-Thought (CoT) Reasoning
F1 Score calculation

Key Terms

Hallucination Span Detection: Identifying the exact start and end indices of text in a model's output that are not supported by the source content

RL4HS: Reinforcement Learning for Hallucination Spans—the authors' proposed framework

GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes rewards within a group of samples to estimate advantages without a critic network

Span-F1: A metric measuring the character-level overlap between predicted error spans and ground-truth error spans

CAPO: Class-Aware Policy Optimization—the authors' modification to GRPO that scales advantages for non-hallucination classes to prevent the model from ignoring hallucinations

CoT: Chain-of-Thought—a prompting technique where the model generates intermediate reasoning steps before the final answer

SFT: Supervised Fine-Tuning—training the model on labeled input-output pairs using standard cross-entropy loss

Reward Hacking: When an RL agent finds a loophole to maximize the reward function (e.g., predicting 'no error' everywhere) without actually solving the task

RAGTruth: A benchmark dataset containing source documents, model responses, and human-annotated hallucination spans