
Cross-Layer Attention Probing for Fine-Grained Hallucination Detection

Malavika Suresh, Rahaf Aljundi, Ikechukwu Nkisi-Orji, Nirmalie Wiratunga
Robert Gordon University, Toyota Motor Europe
arXiv (2025)
Factuality, QA, Reasoning

📝 Paper Summary

Hallucination detection, Activation probing, Hallucination mitigation
CLAP detects hallucinations by processing the entire residual stream of an LLM as a sequence using an attention mechanism, enabling fine-grained detection and reliable mitigation.
Core Problem
Existing activation probing methods examine individual layers in isolation, missing information distributed across the residual stream, and fail to distinguish fine-grained hallucinations among different sampled responses.
Why it matters:
  • LLMs frequently generate hallucinations, posing risks of misinformation in critical applications like healthcare and search
  • Direct mitigation methods (like DoLa) often degrade valid responses by modifying activations indiscriminately
  • Current probes struggle to generalize when prompts fall outside the training domain (out-of-distribution)
Concrete Example: When an LLM is asked a question, it might generate a correct answer in a greedy decode but hallucinate in a sampled response. A single-layer probe might miss the hallucination in the sampled response if the error signal is distributed across multiple layers or if the probe overfits to the greedy decoding pattern.
Key Novelty
Cross-Layer Attention Probing (CLAP)
  • Treats the set of activations from all LLM layers as a sequence of input tokens rather than isolated vectors
  • Uses a small transformer encoder with an attention mechanism to learn which layers contain the most relevant signals for detecting hallucinations
  • Incorporates responses sampled at high temperatures during training to learn fine-grained boundaries between hallucinated and non-hallucinated outputs
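The idea of attending over layers can be illustrated with a toy probe. This is a minimal sketch, not the paper's architecture: it replaces the small transformer encoder with a single learned query that attention-pools one activation vector per layer, followed by a logistic classifier. The class name `CrossLayerAttentionProbe`, its parameters, and the random initialization are all assumptions made for illustration, and training is omitted.

```python
import math
import random


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class CrossLayerAttentionProbe:
    """Hypothetical toy probe: treats per-layer activations as a sequence,
    attention-pools them with one learned query, then classifies.
    A simplification of the transformer encoder described in the summary."""

    def __init__(self, hidden_dim, seed=0):
        rng = random.Random(seed)
        # Untrained parameters; in practice these would be learned from
        # activations labeled as hallucinated / non-hallucinated.
        self.query = [rng.gauss(0.0, 0.1) for _ in range(hidden_dim)]
        self.weights = [rng.gauss(0.0, 0.1) for _ in range(hidden_dim)]
        self.bias = 0.0
        self.scale = math.sqrt(hidden_dim)

    def layer_attention(self, layer_acts):
        """Return one attention weight per layer (weights sum to 1)."""
        scores = [dot(self.query, h) / self.scale for h in layer_acts]
        return softmax(scores)

    def predict_proba(self, layer_acts):
        """Return a hallucination probability for one response."""
        attn = self.layer_attention(layer_acts)
        dim = len(layer_acts[0])
        # Weighted sum of layer vectors, using the attention weights.
        pooled = [sum(a * h[i] for a, h in zip(attn, layer_acts))
                  for i in range(dim)]
        logit = dot(self.weights, pooled) + self.bias
        return 1.0 / (1.0 + math.exp(-logit))
```

Calling `layer_attention` on a list of per-layer vectors shows which layers the probe weights most heavily, which is the property that distinguishes this setup from a probe fixed to a single layer.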
Evaluation Highlights
  • Outperforms single-layer probes on out-of-distribution tasks, with AUC gains such as +6.5% over the Last Layer probe on Llama-3-8B-Instruct
  • Reduces abstention rate by 24.5% on average compared to baseline mitigation strategies while maintaining high non-hallucination rates
  • Combining CLAP with DoLa mitigation significantly reduces the rate at which correct responses are wrongly modified (non-hallucinated to hallucinated, or NH->H, errors) compared to using DoLa alone
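One plausible way a detector can protect correct responses is to gate mitigation on the probe's score, so that a mitigation method is applied only to responses flagged as hallucinated. The sketch below illustrates that gating logic under stated assumptions: `generate`, `mitigate`, and `hallucination_score` are hypothetical callables supplied by the caller, the 0.5 threshold is arbitrary, and the summary does not confirm that the paper's pipeline follows these exact steps.

```python
def gated_mitigation(prompt, generate, mitigate, hallucination_score,
                     threshold=0.5):
    """Hypothetical probe-gated mitigation.

    generate(prompt) -> str              : default decoding (assumed)
    mitigate(prompt) -> str              : mitigated decoding, e.g. a
                                           DoLa-style method (assumed)
    hallucination_score(prompt, resp)    : probe probability in [0, 1]
    Returns (response_or_None, decision_label).
    """
    response = generate(prompt)
    # Responses the probe considers safe are returned untouched, so a
    # correct answer cannot be turned into a hallucination (NH->H).
    if hallucination_score(prompt, response) < threshold:
        return response, "kept"

    # Only flagged responses are sent through the mitigation method.
    revised = mitigate(prompt)
    if hallucination_score(prompt, revised) < threshold:
        return revised, "mitigated"

    # If the revised response is still flagged, abstain.
    return None, "abstained"
```

The design choice here is that mitigation is never applied indiscriminately: the probe decides whether to keep, revise, or abstain, and abstention is the last resort rather than the default for flagged responses.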
Breakthrough Assessment
7/10
Novel architectural approach to probing that effectively uses the whole residual stream. Strong empirical results on out-of-distribution generalization and mitigation integration.