Hallucination Detection in LLMs with Topological Divergence on Attention Graphs

📝 Paper Summary

Hallucination suppression

TOHA detects hallucinations in RAG systems by analyzing the topological divergence between prompt and response subgraphs within specific 'hallucination-aware' attention heads.

Core Problem

Existing hallucination detection methods are either computationally expensive (requiring multiple generations) or require large annotated datasets for supervised training, which are often scarce.

Why it matters:

Hallucinations undermine user trust in sensitive applications, necessitating reliable detection mechanisms for safe deployment
Computational overhead of sampling-based methods (like SelfCheckGPT) limits real-time applicability
Supervised methods struggle with domain transfer due to the scarcity of high-quality annotated hallucination datasets

Concrete Example: In a RAG scenario, a model may hallucinate an answer that is absent from the retrieved context while its token probabilities remain high, so probability-based confidence scores can miss the error. TOHA flags such a response because, in specific attention heads, the attention graph shows a high topological divergence (novelty) between the prompt and the generated response, signaling that the response is not grounded in the prompt.

Key Novelty

TOpology-based HAllucination detector (TOHA)

Adapts Manifold Topology Divergence to graph structures (MTop-Div) to measure the topological dissimilarity between prompt and response tokens in attention maps
Identifies a small set of 'hallucination-aware' attention heads that consistently show higher divergence for hallucinations, regardless of the dataset
Interprets the divergence score as a measure of informational novelty: high divergence implies the response introduces information not topologically grounded in the prompt

Architecture

Conceptual illustration of Attention Graph construction and MTop-Div calculation.

Evaluation Highlights

+11.7 AUROC points on MS MARCO (long-form QA) for Mistral-7B compared to state-of-the-art baselines (73.2 to 84.9)
+21.6 AUROC points on CoQA (conversational QA) for LLaMA-2-7B compared to baselines (57.4 to 79.0)
Operates ~7x faster than SelfCheckGPT (with 1 additional sample) and >70x faster than standard sampling-based configurations

Breakthrough Assessment

8/10

Offers a highly efficient, training-free method that matches or beats computationally expensive baselines. The application of TDA to attention graphs for this purpose is novel and theoretically grounded.

⚙️ Technical Details

Problem Definition

Setting: Binary classification of model responses as either 'hallucinated' or 'grounded' in a RAG setting

Inputs: Prompt P, Response R, and internal attention maps from the LLM

Outputs: Hallucination score (scalar)

Pipeline Flow

Attention Extraction
Head Selection (Calibration Phase)
Divergence Computation (Inference Phase)
Scoring

System Modules

Attention Extractor

Extracts attention maps from the LLM for a given prompt and response

Model or implementation: Target LLM (e.g., LLaMA-2, Mistral)
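As a rough illustration of what happens to an extracted attention map, the sketch below turns one head's attention matrix into the pseudo-distance matrix of an attention graph, using the "1 - attention weight" rule given under Key Terms. The function name `attention_to_distances`, the toy matrix, and the choice to symmetrize by taking the larger of the two directed weights are assumptions made for this example; the summary does not specify them.

```python
def attention_to_distances(att):
    """Convert one head's attention matrix into symmetric pseudo-distances.

    att[i][j] is the attention weight from token i to token j.
    Assumption: the undirected edge weight is 1 - max(att[i][j], att[j][i]).
    """
    n = len(att)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i):
            weight = max(att[i][j], att[j][i])
            dist[i][j] = dist[j][i] = 1.0 - weight
    return dist


# Toy causal attention map for three tokens (each row sums to 1).
att = [
    [1.0, 0.0, 0.0],
    [0.3, 0.7, 0.0],
    [0.1, 0.6, 0.3],
]
dist = attention_to_distances(att)
```

Strong attention between two tokens yields a short edge, so tokens that attend heavily to each other end up close together in the graph.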

Head Selector

Identifies 'hallucination-aware' heads using a small annotated probe set

Model or implementation: Statistical ranking algorithm
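The summary does not spell out the ranking statistic, so the following is only one plausible sketch of head selection. It assumes each head is ranked by how much higher its mean divergence is on hallucinated probe samples than on grounded ones, and keeps at most `n_max` heads. The function name `select_heads`, the `(layer, head)` keys, and the toy scores are hypothetical.

```python
from statistics import mean


def select_heads(scores, labels, n_max=10):
    """Rank heads by mean divergence gap (hallucinated minus grounded).

    scores: dict mapping a head id to a list of divergences, one per probe sample.
    labels: list with 1 for hallucinated and 0 for grounded probe samples.
    Assumption: the gap in means is the ranking statistic.
    """
    gaps = {}
    for head, values in scores.items():
        hallucinated = [v for v, y in zip(values, labels) if y == 1]
        grounded = [v for v, y in zip(values, labels) if y == 0]
        gaps[head] = mean(hallucinated) - mean(grounded)
    ranked = sorted(gaps, key=gaps.get, reverse=True)
    return ranked[:n_max]


# Toy probe set: four samples, the first two hallucinated.
probe_labels = [1, 1, 0, 0]
probe_scores = {
    (0, 1): [0.9, 0.8, 0.2, 0.1],
    (3, 5): [0.5, 0.5, 0.5, 0.4],
    (7, 2): [0.6, 0.7, 0.3, 0.3],
}
selected = select_heads(probe_scores, probe_labels, n_max=2)
```

Because selection only ranks precomputed divergences, it needs no gradient updates, which is in line with the method being described as training-free.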

Topology Analyzer (Inference)

Computes MTop-Div score for the selected heads

Model or implementation: 0-th order homology calculation (equivalent to MSF length)
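Since the 0-th order computation is described as equivalent to an MSF length, the score for one head can be sketched as a Prim-style pass that treats the whole prompt as one already-connected component and attaches response tokens one at a time by their cheapest edge. This is a minimal sketch under that reading; the function name `msf_length` and the toy distance matrix are assumptions, and it expects at least one prompt token.

```python
def msf_length(dist, n_prompt):
    """Length of the minimum spanning forest attaching response tokens to the prompt.

    dist: symmetric pseudo-distance matrix over all tokens.
    n_prompt: the first n_prompt tokens are the prompt, the rest the response.
    """
    n = len(dist)
    # Cheapest edge from each unattached response token into the attached set,
    # which initially is the entire prompt (treated as one zero-cost component).
    best = {r: min(dist[r][p] for p in range(n_prompt)) for r in range(n_prompt, n)}
    total = 0.0
    while best:
        r = min(best, key=best.get)   # closest unattached response token
        total += best.pop(r)          # attach it and pay for its edge
        for other in best:            # r may now offer cheaper attachments
            best[other] = min(best[other], dist[r][other])
    return total


# Two prompt tokens (0, 1) followed by two response tokens (2, 3).
toy_dist = [
    [0.0, 0.5, 0.9, 0.8],
    [0.5, 0.0, 0.2, 0.7],
    [0.9, 0.2, 0.0, 0.1],
    [0.8, 0.7, 0.1, 0.0],
]
divergence = msf_length(toy_dist, n_prompt=2)  # approximately 0.3
```

In the toy example, token 2 attaches to the prompt at cost 0.2 and token 3 then attaches through token 2 at cost 0.1. A response whose tokens attend only weakly to the prompt would need longer edges and so receive a larger divergence.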

Scorer (Inference)

Aggregates divergence scores into a final hallucination metric

Model or implementation: Averaging
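A minimal sketch of the averaging step, assuming the per-head divergences for one response are already available as a dictionary. The function name `hallucination_score`, the head ids, and the decision threshold are hypothetical; the summary only states that scores are averaged.

```python
from statistics import mean


def hallucination_score(head_divergences, selected_heads):
    """Average the divergence over the selected heads for one response."""
    return mean(head_divergences[h] for h in selected_heads)


# Hypothetical per-head divergences for a single response.
divergences = {(0, 1): 0.8, (3, 5): 0.1, (7, 2): 0.6}
score = hallucination_score(divergences, [(0, 1), (7, 2)])

# Assumed threshold, for illustration only; a real threshold would be tuned.
THRESHOLD = 0.5
is_hallucinated = score > THRESHOLD
```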

Novel Architectural Elements

Use of Topological Divergence (MTop-Div) on attention graphs as a hallucination proxy
Head selection strategy based on topological separation capability

Modeling

Base Model: Evaluated on LLaMA-2-7B-chat, LLaMA-2-13B-chat, LLaMA-3.1-8B-Instruct, Mistral-7B-Instruct-v0.1, Qwen2.5-7B-Instruct

Training Method: Training-free (calibration only)

Adaptation: None (inference-time analysis)

Training Data:

Uses small probe sets (e.g., 50 samples) for head selection

Key Hyperparameters:

N_max: 10 (maximum number of heads to select)

Compute: Inference-only and computationally light (computing a minimum spanning forest on small attention graphs); roughly 7x faster than SelfCheckGPT with 1 additional sample

Comparison to Prior Work

vs. SelfCheckGPT: TOHA is training-free and single-generation (much faster), whereas SelfCheckGPT requires multiple expensive generations
vs. HaloScope: TOHA uses topological features of attention specifically, rather than general hidden state features
vs. Supervised Classifiers: TOHA requires minimal data (probe set) and is robust to distribution shifts, unlike supervised methods that often fail to transfer

Limitations

Requires white-box access to attention maps (not applicable to closed APIs like GPT-4)
Performance depends on the existence of 'hallucination-aware' heads, which must be identified via a probe set
Quadratic complexity of attention graph construction with respect to sequence length (though usually manageable for standard context windows)

Reproducibility

Code: https://anonymous.4open.science/r/tda4hallucinations-C449

Code publicly available. Method relies on access to internal attention weights, which restricts use with black-box APIs. Experiments use standard open datasets (RAGTruth, CoQA, SQuAD, etc.).

📊 Experiments & Results

Evaluation Setup

Hallucination detection on RAG and QA tasks using open-source LLMs

Benchmarks:

MS MARCO (RAGTruth) (Long-form QA)
CNN/DM (RAGTruth) (Summarization)
CoQA (Conversational QA)
SQuAD (Reading Comprehension)
XSum (Extreme Summarization)
HotpotQA (Multi-hop QA)

Metrics:

AUROC (Area Under Receiver Operating Characteristic)
Statistical methodology: Results averaged over 5 runs with different data splits
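For readers less familiar with the metric, AUROC can be read as the probability that a randomly chosen hallucinated sample receives a higher score than a randomly chosen grounded one, with ties counted as half. The sketch below computes it directly from that definition; the function name `auroc` and the toy scores are illustrative and not taken from the paper.

```python
def auroc(scores, labels):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    labels: 1 for hallucinated (positive), 0 for grounded (negative).
    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Toy example: three of the four positive/negative pairs are ordered correctly.
toy_auroc = auroc([0.9, 0.4, 0.6, 0.2], [1, 1, 0, 0])
```

The pairwise form is quadratic in the number of samples, which is acceptable for small evaluation sets; rank-based implementations are preferable at scale.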

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Main comparison on RAGTruth benchmarks showing TOHA's performance against baselines.
MS MARCO (Mistral-7B)	AUROC	73.2	84.9	+11.7
CoQA (LLaMA-2-7B)	AUROC	57.4	79.0	+21.6
SQuAD (LLaMA-2-13B)	AUROC	78.4	85.8	+7.4
Efficiency comparison showing speedups.
Runtime Analysis	Seconds per sample	1.37	0.19	-1.18
Transferability experiments checking robustness across datasets.
XSum (Transfer from CNN/DM)	AUROC	76.4	75.8	-0.6
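The Δ column and the quoted speedup can be checked with simple arithmetic on the table's own numbers. The snippet below is only a sanity check; the variable names are arbitrary.

```python
# (row label, baseline, this paper), copied from the table above.
rows = [
    ("MS MARCO (Mistral-7B)", 73.2, 84.9),
    ("CoQA (LLaMA-2-7B)", 57.4, 79.0),
    ("SQuAD (LLaMA-2-13B)", 78.4, 85.8),
    ("Runtime Analysis", 1.37, 0.19),
    ("XSum (Transfer from CNN/DM)", 76.4, 75.8),
]

# Difference between the paper's value and the baseline, rounded as in the table.
deltas = {name: round(ours - base, 2) for name, base, ours in rows}

# Runtime is in seconds per sample, so the speedup is the ratio of the two.
speedup = 1.37 / 0.19
```

The runtime ratio comes out at roughly 7.2, consistent with the "~7x faster" figure reported in the highlights.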

Experiment Figures

Scatter plot of attention heads based on their ability to separate hallucinated vs. grounded samples across different datasets.

Runtime comparison and sensitivity analysis for N_max.

Main Takeaways

TOHA consistently matches or outperforms state-of-the-art baselines (including expensive consistency checks) while being much faster.
Identified 'hallucination-aware' heads are stable across datasets and often correlate with 'copying' heads (heads that attend to previous tokens).
The method is robust to the size of the probe set, maintaining performance even with as few as 50 annotated samples.
Topological divergence effectively captures the 'novelty' of generated content relative to the prompt, which serves as a strong proxy for hallucination in RAG.

📚 Prerequisite Knowledge

Prerequisites

Self-attention mechanism in Transformers
Graph theory (Minimum Spanning Forest)
Topological Data Analysis (TDA) basics (barcodes, homology)

Key Terms

MTop-Div: Manifold Topology Divergence—a metric measuring the topological difference between two datasets (here, prompt and response tokens) by analyzing the connectivity of their union

Attention Graph: A complete weighted graph where nodes are tokens and edge weights are derived from attention scores (1 - attention weight), representing pseudo-distances

Vietoris-Rips complex: A sequence of simplicial complexes built from a metric space at increasing distance thresholds, used to compute topological features like connected components

MSF: Minimum Spanning Forest—a collection of Minimum Spanning Trees, here used to connect response tokens to the prompt set in the attention graph

Barcode: A graphical representation of topological features (like connected components) persisting across different scales; the sum of interval lengths serves as the divergence score

RAG: Retrieval-Augmented Generation—AI systems that answer questions by first searching for relevant documents

SelfCheckGPT: A consistency-based hallucination detection method that samples multiple responses to check for factual agreement

TDA: Topological Data Analysis—a field of data analysis using techniques from topology to study the shape and structure of data
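The terms MTop-Div, Attention Graph, Barcode, and MSF above fit together in one formula. The notation below (P for prompt tokens, R for response tokens, a(u,v) for the attention weight, and the modified distance written with a tilde) is introduced here for illustration and may differ from the paper's own symbols. The idea is to set distances inside the prompt to zero so that the prompt forms a single component at scale 0, then sum the lengths of the finite 0-dimensional bars.

```latex
\begin{align*}
d(u,v) &= 1 - a(u,v), \\
\tilde{d}(u,v) &=
  \begin{cases}
    0 & \text{if } u, v \in P, \\
    d(u,v) & \text{otherwise,}
  \end{cases} \\
\operatorname{MTop\text{-}Div}(R, P)
  &= \sum_{[0,\, b_i) \,\in\, \mathcal{B}_0(\tilde{d})} b_i
   \;=\; \sum_{e \,\in\, \operatorname{MSF}(R \to P)} d(e).
\end{align*}
```

Here \mathcal{B}_0(\tilde{d}) denotes the finite bars of the 0-dimensional barcode of the Vietoris-Rips filtration built from \tilde{d}, and MSF(R \to P) is the minimum spanning forest attaching the response tokens to the prompt set. The second equality reflects the standard fact that the death times of finite 0-dimensional bars are the edge weights of a minimum spanning tree.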