Hallucination Span Detection: Identifying the exact start and end indices of text in a model's output that are not supported by the source content
RL4HS: Reinforcement Learning for Hallucination Spans—the authors' proposed framework
GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes rewards within a group of samples to estimate advantages without a critic network
Span-F1: A metric measuring the character-level overlap between predicted error spans and ground-truth error spans
CAPO: Class-Aware Policy Optimization—the authors' modification to GRPO that scales advantages for non-hallucination classes to prevent the model from ignoring hallucinations
CoT: Chain-of-Thought—a prompting technique where the model generates intermediate reasoning steps before the final answer
SFT: Supervised Fine-Tuning—training the model on labeled input-output pairs using standard cross-entropy loss
Reward Hacking: When an RL agent finds a loophole to maximize the reward function (e.g., predicting 'no error' everywhere) without actually solving the task
RAGTruth: A benchmark dataset containing source documents, model responses, and human-annotated hallucination spans
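
The Span-F1, GRPO, and CAPO entries above describe computations, so a minimal Python sketch may help make them concrete. It is an illustration under stated assumptions, not the authors' implementation: the function names, the (start, end) span format with an exclusive end, the convention that two empty span lists score 1.0, the small eps constant, the alpha value of 0.5, and the way a sample is flagged as belonging to the non-hallucination class are all hypothetical choices made for this example.

```python
from statistics import mean, pstdev


def span_f1(pred_spans, gold_spans):
    """Character-level F1 between two lists of (start, end) spans, end exclusive."""
    pred_chars = {i for start, end in pred_spans for i in range(start, end)}
    gold_chars = {i for start, end in gold_spans for i in range(start, end)}
    if not pred_chars and not gold_chars:
        # Assumption: predicting no spans when none are annotated is a perfect match.
        return 1.0
    overlap = len(pred_chars & gold_chars)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_chars)
    recall = overlap / len(gold_chars)
    return 2 * precision * recall / (precision + recall)


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within one group of samples."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def class_aware_advantages(rewards, is_non_hallucination, alpha=0.5, eps=1e-6):
    """CAPO-style variant: rescale advantages of non-hallucination samples.

    `is_non_hallucination` holds one boolean per sample. How a sample is
    assigned to that class, and the value of `alpha`, are assumptions here.
    """
    advantages = group_relative_advantages(rewards, eps)
    return [
        adv * alpha if flag else adv
        for adv, flag in zip(advantages, is_non_hallucination)
    ]


if __name__ == "__main__":
    # Predicted span covers characters 0-9, annotated span covers 5-14.
    print(span_f1([(0, 10)], [(5, 15)]))
    # Four sampled outputs for one prompt; the first and third predict no hallucination.
    print(class_aware_advantages([1.0, 0.0, 1.0, 0.0], [True, False, True, False]))
```

In this sketch the Span-F1 score doubles as the reward, and no critic network appears anywhere: the group mean and standard deviation serve as the baseline, which is the property the GRPO entry describes.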