Beyond In-Domain Detection: SpikeScore for Cross-Domain Hallucination Detection

📝 Paper Summary

Hallucination suppression, Factuality detection

SpikeScore detects hallucinations by measuring abrupt confidence fluctuations in multi-turn dialogues, exploiting the fact that hallucinated answers lead to unstable, self-contradictory trajectories when probed.

Core Problem

Existing training-based hallucination detection methods suffer from poor cross-domain generalization, failing when the test domain distribution shifts from the training domain.

Why it matters:

Training-based detectors rely on domain-specific features, making them brittle in real-world deployments where test distributions vary (e.g., from commonsense to medical data)
Hallucinations in high-stakes fields like healthcare and finance undermine trust and safety, requiring robust detection across diverse topics
Current methods prioritize in-domain separability but neglect the challenge of maintaining that separability consistently across unseen domains

Concrete Example: A model hallucinates a book author. When asked follow-up questions about the author's other works, the model rapidly contradicts itself or shifts stance, causing its internal confidence scores (SAPLMA) to exhibit sharp 'spikes' (rise and fall). Standard single-turn detectors miss this dynamic instability.

Key Novelty

SpikeScore: Curvature-based Instability Detection in Multi-turn Dialogue

Constructs a multi-turn 'self-dialogue' by feeding the model's initial answer back as context for follow-up questions
Quantifies instability using the maximum second-order difference (curvature) of confidence scores along this dialogue path
Leverages the intuition that hallucinated answers trigger frequent self-correction and stance-shifting when probed, creating distinct 'spikes' in confidence not seen in factual answers

Architecture

Illustration of the multi-turn probing mechanism and the resulting score trajectories.

Evaluation Highlights

Outperforms state-of-the-art cross-domain methods (PRISM, ICR Probe) in average AUROC across 4 LLMs and 6 benchmarks
Achieves ~0.775 AUROC on Llama-3.1-8B (average across 5 unseen domains), ahead of the best baseline (reported margins are listed under Key Results)
Generalizes effectively to RAG pipelines, outperforming baselines on TriviaQA and RAGTruth even when trained only on standard dialogue data (CoQA)

Breakthrough Assessment

8/10

Simple yet theoretically grounded approach that addresses a critical failure mode (generalization). Consistently outperforms complex baselines across diverse models and RAG settings.

⚙️ Technical Details

Problem Definition

Setting: Generalizable Hallucination Detection (GHD): Train a detector on a single source domain to identify hallucinations in N related but unseen target domains.

Inputs: Question Q and generated Answer A

Outputs: Binary label (0 for truthful, 1 for hallucinated)

Pipeline Flow

Initial Generation (User Q -> Model A)
Recursive Continuation (A + Prompt -> Follow-up Qs -> Model As)
Score Sequence Extraction (Compute SAPLMA scores for each turn)
SpikeScore Computation (Calculate max second-order difference)
Thresholding (Detect hallucination if score > lambda)
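
The five steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `generate` (text in, text out), `score` (question and answer in, confidence out), the `probes` list, and the default threshold `lam=0.5` are all hypothetical stand-ins. In the paper's setup the scorer reads LLM hidden states (SAPLMA or SEP) rather than raw text, and the use of the absolute value in `spike_score` is an assumption so that both peaks and dips count as spikes.

```python
from typing import Callable, List, Sequence, Tuple


def spike_score(scores: Sequence[float]) -> float:
    """Largest absolute second-order difference of a score sequence."""
    if len(scores) < 3:
        return 0.0  # curvature is undefined for fewer than three points
    return max(
        abs(scores[t + 1] - 2.0 * scores[t] + scores[t - 1])
        for t in range(1, len(scores) - 1)
    )


def self_dialogue(
    question: str,
    generate: Callable[[str], str],  # hypothetical LLM wrapper
    probes: Sequence[str],           # hypothetical prompt library
    k: int,
) -> List[str]:
    """Return the initial answer followed by k continuation answers."""
    context = f"Q: {question}\nA:"
    answers = [generate(context)]
    context += f" {answers[-1]}"
    for step in range(k):
        probe = probes[step % len(probes)]
        # The model writes its own follow-up question from the dialogue so far...
        follow_up = generate(f"{context}\n{probe}")
        # ...and then answers it with its earlier answers still in context.
        context += f"\nQ: {follow_up}\nA:"
        answers.append(generate(context))
        context += f" {answers[-1]}"
    return answers


def detect(
    question: str,
    generate: Callable[[str], str],
    score: Callable[[str, str], float],  # hypothetical base scorer
    probes: Sequence[str],
    k: int = 20,
    lam: float = 0.5,  # illustrative threshold, not a value from the paper
) -> Tuple[float, bool]:
    """Run the self-dialogue, score every turn, and threshold the SpikeScore."""
    answers = self_dialogue(question, generate, probes, k)
    scores = [score(question, answer) for answer in answers]
    value = spike_score(scores)
    return value, value > lam
```

With `k=20` this makes one call for the initial answer plus two calls per continuation turn, which is where the linear inference cost comes from.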

System Modules

Dialogue Simulator

Generate multi-turn continuation by feeding the initial answer back as context

Model or implementation: Same LLM as the one being evaluated (e.g., Llama-3.2-3B)

Base Scorer

Assign a confidence/truthfulness score to each answer in the sequence

Model or implementation: SAPLMA (MLP classifier on hidden states) or SEP

SpikeScore Calculator

Compute the maximum local fluctuation (curvature) of the score sequence

Model or implementation: Deterministic Formula
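
One plausible way to write the formula, given the description of SpikeScore as the maximum second-order difference of the score sequence, is shown below. The indexing (s_0 for the initial answer, s_1 to s_K for the continuation turns) and the absolute value are assumptions of this sketch; the summary does not state whether the paper takes the signed or absolute difference.

```latex
% s_0 is the score of the initial answer; s_1, ..., s_K are the scores of the K continuation turns.
\[
\Delta^2 s_t = s_{t+1} - 2 s_t + s_{t-1}, \qquad t = 1, \dots, K-1,
\]
\[
\mathrm{SpikeScore}(s_0, \dots, s_K) = \max_{1 \le t \le K-1} \bigl| \Delta^2 s_t \bigr| .
\]
% Toy example: (0.2, 0.9, 0.1) gives |0.1 - 1.8 + 0.2| = 1.5,
% while (0.90, 0.88, 0.90) gives |0.90 - 1.76 + 0.90| = 0.04.
```

A trajectory that rises and falls sharply therefore scores much higher than one that drifts smoothly, even if both have the same average confidence.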

Novel Architectural Elements

Post-hoc multi-turn probing pipeline that uses the model's own output to induce self-contradiction loops
Second-order difference metric (SpikeScore) applied to confidence trajectories to quantify 'instability' rather than absolute uncertainty

Modeling

Base Model: Llama-3.2-3B-Instruct, Llama-3.1-8B-Instruct, Qwen3-8B-Instruct, Qwen3-14B-Instruct

Training Method: Training of the Base Scorer (SAPLMA)

Objective Functions:

Purpose: Train the MLP classifier to distinguish truthful vs. hallucinated internal states.

Formally: Cross-entropy loss over labeled training data D_l.

Adaptation: MLP probe on top of frozen LLM hidden states
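
For reference, the standard binary cross-entropy objective for such a probe can be written as follows. The notation is an assumption of this sketch rather than the paper's: h_i is the frozen hidden state for example i, y_i is its label, and f_\theta is the MLP probe outputting the probability of the class labeled 1.

```latex
\[
\mathcal{L}(\theta) = -\frac{1}{|D_l|} \sum_{(h_i,\, y_i) \in D_l}
\Bigl[ y_i \log f_\theta(h_i) + (1 - y_i) \log\bigl(1 - f_\theta(h_i)\bigr) \Bigr]
\]
% Only the probe parameters \theta are updated; the LLM that produces h_i stays frozen.
```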

Training Data:

Trained on the CoQA dataset (source domain)
Tested on TriviaQA, CommonsenseQA, Belebele, Math, SVAMP

Key Hyperparameters:

continuation_steps_K: 20
prompt_library_size: Not specifically detailed, see Appendix G

Compute: Not reported in the paper

Comparison to Prior Work

vs. SAPLMA/SEP: SpikeScore uses temporal dynamics (fluctuations) rather than static snapshots, improving cross-domain robustness
vs. PRISM/ICR Probe: SpikeScore requires no specialized prompt tuning or retrieval during inference (unless applied in RAG context), relying on intrinsic instability dynamics
vs. SelfCheckGPT [not cited in paper]: SelfCheckGPT checks consistency across multiple samples; SpikeScore checks consistency across a single multi-turn continuation path

Limitations

Computational cost increases linearly with the number of continuation turns (K=20)
Relies on the availability of a base scoring method (like SAPLMA) which requires some training data
Performance depends on the quality of the prompt library to successfully induce self-contradiction

Reproducibility

Code: https://github.com/YongxinDeng/SpikeScore

Prompt library provided in Appendix G. Uses standard open-source models.

📊 Experiments & Results

Evaluation Setup

Cross-domain hallucination detection: Train on one dataset (e.g., CoQA), then test on a mixed pool of the other five.

Benchmarks:

TriviaQA (Knowledge-intensive QA)
CommonsenseQA (Commonsense reasoning)
Belebele (Reading comprehension)
CoQA (Conversational QA)
Math (Mathematical reasoning)
SVAMP (Math word problems)

Metrics:

AUROC
Statistical methodology: Not explicitly reported in the paper
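
As a reminder of what the metric measures, AUROC equals the probability that a randomly chosen hallucinated example receives a higher detector score than a randomly chosen truthful one, with ties counted as half. The function below is a minimal pure-Python sketch of that pairwise definition; the name `auroc` and the label convention (1 for hallucinated, matching the Outputs definition above) are choices made for this illustration, and a library routine would normally be used on real data.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Pairwise AUROC: P(score of a positive > score of a negative), ties = 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

Because the metric depends only on the ranking of scores, it needs no detection threshold, which makes it convenient for comparing detectors across domains.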

Key Results

Main cross-domain generalization results showing SpikeScore's superior average AUROC across multiple LLMs compared to baselines.
Benchmark	Metric	Baseline	This Paper	Δ
Average across 6 datasets	AUROC	0.7397	0.7550	+0.0153
Average across 6 datasets	AUROC	0.7602	0.7780	+0.0178
Average across 6 datasets	AUROC	0.7186	0.7712	+0.0526

RAG scenario evaluation (TriviaQA and RAGTruth) demonstrating robustness when applied to retrieval pipelines.
Benchmark	Metric	Baseline	This Paper	Δ
TriviaQA (RAG)	AUROC	0.7412	0.7731	+0.0319
RAGTruth	AUROC	0.7208	0.7490	+0.0282

Experiment Figures

Comparison of Expectation and Standard Deviation of SpikeScore between hallucinated and non-hallucinated domains.

Coefficient of Variation of SpikeScore across different experimental groups.
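
The coefficient of variation in the second figure is the ratio of standard deviation to mean, which makes the spread of SpikeScore comparable across groups with different average levels. A minimal sketch follows; the function name is hypothetical, and the use of the population standard deviation is an assumption, since the summary does not say which estimator the paper uses.

```python
import statistics
from typing import Sequence


def coefficient_of_variation(values: Sequence[float]) -> float:
    """Population standard deviation divided by the absolute mean."""
    mean = statistics.fmean(values)
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return statistics.pstdev(values) / abs(mean)
```

A low value across domains would indicate that a single detection threshold is more likely to transfer between them.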

Main Takeaways

SpikeScore consistently outperforms baselines in cross-domain settings, suggesting that 'instability' is a largely domain-invariant signal of hallucination.
The method scales well: performance gains increase with larger model sizes (e.g., Qwen3-14B), likely due to stronger self-correction mechanisms in larger models.
Robust to RAG noise: SpikeScore remains effective even when hallucinations stem from imperfect retrieval, unlike baselines which degrade significantly.
The theoretical analysis derives a probabilistic lower bound on the separability between hallucinated and factual responses under curvature-based scoring.

📚 Prerequisite Knowledge

Prerequisites

Large Language Models (LLMs) and token generation probabilities
Hallucination detection metrics (AUROC)
Basic calculus (second-order differences/curvature)

Key Terms

GHD: Generalizable Hallucination Detection—training on one domain while ensuring performance on unseen related domains

SAPLMA: A training-based method that uses an MLP on LLM internal hidden states to predict truthfulness probabilities

SpikeScore: The proposed metric measuring the maximum second-order difference (local curvature) of a score sequence in a multi-turn dialogue

AUROC: Area Under the Receiver Operating Characteristic curve—a standard metric for binary classification performance

RAG: Retrieval-Augmented Generation—AI systems that answer questions by first searching for relevant documents

CoQA: Conversational Question Answering—a dataset used here as the primary training domain

TriviaQA: A knowledge-intensive question-answering dataset, used here as a cross-domain test set

SEP: Semantic Entropy Probe, a probe trained on LLM hidden states to predict semantic entropy (uncertainty measured over clusters of semantically equivalent answers) without sampling multiple generations