AggTruth: Contextual Hallucination Detection using Aggregated Attention Scores in LLMs

📝 Paper Summary

Hallucination suppression, Modularized RAG pipeline

AggTruth detects hallucinations in RAG systems by aggregating internal attention scores over retrieved passages into lightweight features, enabling real-time classification without requiring multiple generations.

Core Problem

RAG systems still hallucinate when context is noisy or misused, and existing detection methods either require expensive multiple generations or lack robustness across different tasks.

Why it matters:

Hallucinations prevent the deployment of LLMs in high-stakes real-world applications where reliability is critical
Current state-of-the-art methods like Lookback-Lens rely on attention to the entire prompt (including system instructions), making them brittle to input changes
Generating multiple answers for consistency checks (e.g., self-consistency) is too slow and computationally expensive for real-time applications

Concrete Example: When an LLM answers a question based on a retrieved passage, it might generate a plausible but false entity. AggTruth detects this by observing that the model's internal attention heads fail to focus consistently on the relevant passage tokens during the generation of the false entity, unlike when generating factual content.

Key Novelty

AggTruth (Attention Aggregation for Truthfulness)

Instead of analyzing the full attention matrix, AggTruth focuses specifically on attention scores directed at the retrieved passage (context) during token generation
Proposes four distinct mathematical techniques to aggregate these sparse attention scores into dense feature vectors (Sum, Cosine Similarity, Entropy, Jensen-Shannon Divergence)
Introduces a 'Passage Percentage' feature to correct for the natural dilution of attention as generated sequences get longer
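
To make the four aggregation operators more concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the paper's implementation: the summary does not give the exact formulas, so the choice to compare each head against the per-layer mean head (for Cosine Similarity and Jensen-Shannon Divergence) and the renormalization of passage attention before computing Entropy are assumptions. All function names are hypothetical.

```python
import math

def normalize(scores):
    """Rescale non-negative scores so they sum to 1 (uniform if all are zero)."""
    total = sum(scores)
    if total == 0:
        return [1.0 / len(scores)] * len(scores)
    return [s / total for s in scores]

def agg_sum(passage_attn):
    """Total attention mass one head places on the passage tokens."""
    return sum(passage_attn)

def agg_entropy(passage_attn):
    """Shannon entropy (in nats) of the head's renormalized passage attention."""
    return -sum(p * math.log(p) for p in normalize(passage_attn) if p > 0)

def cosine_similarity(u, v):
    """Cosine of the angle between two attention vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) for distributions with q > 0 wherever p > 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence between two (renormalized) attention vectors."""
    p, q = normalize(p), normalize(q)
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def layer_mean(heads):
    """Element-wise mean over the heads of one layer (assumed reference vector)."""
    return [sum(col) / len(heads) for col in zip(*heads)]

# Toy passage attention for two heads of one layer at a single generated token.
heads = [
    [0.30, 0.25, 0.05],  # head that attends strongly to the passage
    [0.02, 0.01, 0.02],  # head that mostly looks elsewhere
]
mean_head = layer_mean(heads)
for h in heads:
    print(agg_sum(h), agg_entropy(h),
          cosine_similarity(h, mean_head), js_divergence(h, mean_head))
```

Each operator turns a variable-length vector of passage attention into one scalar per head, which is what makes the resulting feature vector fixed-size regardless of passage length.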

Architecture

The conceptual framework of AggTruth. It illustrates how attention scores from generated tokens towards context tokens are extracted and aggregated.

Evaluation Highlights

Outperforms SOTA (Lookback-Lens) on summarization tasks (CNN/DM, XSum) using Llama-3-8B-Instruct
Achieves competitive performance on QA tasks (Natural Questions, HotPotQA) while using significantly fewer features than hidden-state methods
Demonstrates robust cross-task generalization, maintaining high detection performance when trained on summarization and tested on QA (and vice versa)

Breakthrough Assessment

7/10

Solid methodological improvement for online hallucination detection. It simplifies feature extraction while improving robustness compared to Lookback-Lens, though it relies on standard classifiers.

⚙️ Technical Details

Problem Definition

Setting: Binary classification of generated tokens/spans as 'hallucinated' or 'faithful' given the retrieved context

Inputs: Retrieval-Augmented Generation context (passage), generated response tokens, and internal attention maps

Outputs: Binary label (1 for hallucinated, 0 for non-hallucinated) for a window of generated tokens

Pipeline Flow

LLM Generation (produce response + attention maps)
Attention Extraction (isolate attention to passage tokens)
Aggregation (compress attention scores into features)
Feature Selection (select best attention heads)
Classification (Logistic Regression)
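
The extraction and aggregation steps above can be sketched as follows. This is a simplified illustration rather than the authors' code: it assumes the attention for one generated token is available as nested lists indexed as attn[layer][head][position], that the passage occupies a known half-open span of prompt positions, and it uses the Sum operator only. The function name and the toy prompt layout are hypothetical. In Hugging Face Transformers, such tensors are typically requested with output_attentions=True.

```python
def passage_sum_features(attn, passage_start, passage_end):
    """Return one Sum feature per (layer, head) for a single generated token.

    attn[layer][head] is the attention row of the generated token over all
    earlier positions; only positions in [passage_start, passage_end) are kept,
    so attention to the system prompt and the query is ignored.
    """
    features = []
    for layer in attn:
        for head_row in layer:
            features.append(sum(head_row[passage_start:passage_end]))
    return features

# Hypothetical layout: positions 0-1 = system prompt, 2-4 = passage, 5 = query.
PASSAGE_START, PASSAGE_END = 2, 5

attn = [
    [  # layer 0
        [0.10, 0.10, 0.30, 0.20, 0.10, 0.20],  # head 0
        [0.40, 0.40, 0.05, 0.05, 0.05, 0.05],  # head 1
    ],
    [  # layer 1
        [0.05, 0.05, 0.20, 0.30, 0.30, 0.10],  # head 0
        [0.25, 0.25, 0.10, 0.10, 0.10, 0.20],  # head 1
    ],
]

features = passage_sum_features(attn, PASSAGE_START, PASSAGE_END)
print(features)  # one value per (layer, head): roughly 0.6, 0.15, 0.8, 0.3
```

The flattened list has one entry per attention head, which is the vector that the later selection and classification stages operate on.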

System Modules

Generator

Generate response and output raw attention tensors

Model or implementation: Various (e.g., Llama-3-8B-Instruct, Mistral-7B-Instruct-v0.3)

Aggregator

Convert raw attention maps into fixed-size feature vectors

Model or implementation: Mathematical Operators (Sum, CosSim, Entropy, JS-Div)

Selector

Identify the most predictive attention heads to reduce dimensionality

Model or implementation: Spearman Correlation / Lasso / Random-based
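
A rough sketch of Spearman-based head selection is shown below, using only the standard library. It is an assumption-laden illustration: the p-value comes from a large-sample Fisher z-transform approximation rather than whatever exact test the authors used, and the function names and data layout (features[i][h] is the feature of head h for example i) are hypothetical.

```python
import math

def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0

def spearman(x, y):
    """Spearman rank correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def approx_p_value(rho, n):
    """Two-sided p-value from the Fisher z-transform (approximation only)."""
    if n <= 3:
        return 1.0
    if abs(rho) >= 1.0:
        return 0.0
    z = math.atanh(rho) * math.sqrt((n - 3) / 1.06)
    return math.erfc(abs(z) / math.sqrt(2))

def select_heads(features, labels, p_threshold=0.001):
    """Keep head indices whose correlation with the labels is significant."""
    n = len(labels)
    kept = []
    for h in range(len(features[0])):
        column = [row[h] for row in features]
        if approx_p_value(spearman(column, labels), n) < p_threshold:
            kept.append(h)
    return kept

# Toy data: head 0 tracks the label, head 1 is constant and carries no signal.
labels = [i % 2 for i in range(40)]
features = [[lab * 10 + i * 0.01, 0.5] for i, lab in enumerate(labels)]
print(select_heads(features, labels))
```

With a real feature matrix, a library routine such as SciPy's spearmanr would normally replace the hand-rolled correlation and p-value.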

Detector

Classify token windows as hallucinated or not

Model or implementation: Logistic Regression
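
For intuition about the Detector, the following is a bare-bones logistic regression trained by batch gradient descent on toy per-head Sum features. It is a sketch, not the paper's training setup: the learning rate, epoch count, toy data, and the assumption that low passage attention goes with the hallucinated class are all illustrative, and in practice a library implementation with regularization would be the usual choice.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(w, b, x):
    """Probability that feature vector x belongs to class 1 (hallucinated)."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def train_logreg(X, y, lr=1.0, epochs=1000):
    """Fit weights and bias by minimizing mean log-loss with gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

# Toy windows with two selected heads; label 1 = hallucinated (assumed polarity).
X = [[0.8, 0.7], [0.9, 0.6], [0.7, 0.8],   # strong passage attention
     [0.2, 0.1], [0.1, 0.3], [0.3, 0.2]]   # weak passage attention
y = [0, 0, 0, 1, 1, 1]

w, b = train_logreg(X, y)
print(predict_proba(w, b, [0.85, 0.75]))  # high passage attention
print(predict_proba(w, b, [0.15, 0.20]))  # low passage attention
```

The only trainable quantities are the weight vector and the bias, which matches the summary's point that the LLM itself stays frozen.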

Novel Architectural Elements

Hierarchical attention aggregation specifically over retrieved context (ignoring system prompt/query attention)
Passage Percentage Feature: A dynamic feature tracking the ratio of passage length to total input length to correct for attention dilution over time
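
A small sketch of how such a Passage Percentage feature could be computed is given below. It assumes that "total input length" means the prompt length plus the number of tokens generated so far, which is what would make the feature change over time; the summary does not spell this out, and the function name and arguments are hypothetical.

```python
def passage_percentage(passage_len, prompt_len, generated_so_far):
    """Share of the current input (prompt + generated tokens) taken by the passage."""
    total = prompt_len + generated_so_far
    if total <= 0:
        raise ValueError("total input length must be positive")
    return passage_len / total

# A 300-token passage inside a 400-token prompt.
print(passage_percentage(300, 400, 0))    # before any token is generated
print(passage_percentage(300, 400, 100))  # after 100 generated tokens
```

As generation proceeds, the ratio shrinks, giving the classifier a signal it can use to discount the matching drop in attention mass on the passage.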

Modeling

Base Model: Evaluated on Llama-3-8B-Instruct, Mistral-7B-Instruct-v0.3, and others

Training Method: A Logistic Regression classifier is trained on the extracted features

Adaptation: None (LLM is frozen; only the external classifier is trained)

Trainable Parameters: Coefficients of the Logistic Regression classifier

Training Data:

QA: 1,535 examples from Natural Questions, 896 from HotPotQA
Summarization: 1,000 examples from CNN/Daily Mail, 1,000 from XSum
Labels generated via GPT-4o acting as a judge (token-level binary labels)

Key Hyperparameters:

window_size: 8 tokens
spearman_p_value_threshold: 0.001
context_length_limit: 4096 tokens
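
To illustrate how window_size might be applied, the sketch below groups token-level features and labels into windows of 8 tokens. Several details are assumptions not stated in the summary: window features are taken as the mean of the per-token features, a window is labeled hallucinated if any of its tokens is, and the stride parameter is hypothetical.

```python
def window_examples(token_feats, token_labels, window_size=8, stride=1):
    """Turn per-token features and labels into (mean_features, label) windows."""
    examples = []
    for start in range(0, len(token_feats) - window_size + 1, stride):
        feats = token_feats[start:start + window_size]
        labels = token_labels[start:start + window_size]
        # Average each feature dimension over the tokens in the window.
        mean_feat = [sum(col) / window_size for col in zip(*feats)]
        # Assumed labeling rule: any hallucinated token marks the whole window.
        examples.append((mean_feat, int(any(labels))))
    return examples

# 16 generated tokens, one feature each; token 10 is labeled hallucinated.
token_feats = [[float(i)] for i in range(16)]
token_labels = [0] * 16
token_labels[10] = 1

for feat, label in window_examples(token_feats, token_labels, stride=8):
    print(feat, label)
```

With stride equal to the window size the windows do not overlap; a stride of 1 yields one training example per possible window position.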

Compute: Not reported in the paper

Comparison to Prior Work

vs. Lookback-Lens: AggTruth focuses only on the passage context rather than the whole prompt, and introduces aggregation metrics beyond simple ratios
vs. Hidden States-based: AggTruth uses interpretable attention features rather than opaque hidden states
vs. HEAD-ACHE [not cited in paper]: AggTruth uses aggregation metrics (entropy, JS-div) similar to HEAD-ACHE but applies them specifically for RAG context verification

Limitations

Relies on GPT-4o for ground truth labeling, which may introduce bias or errors
Analysis is restricted to a context length of 4,096 tokens because of sliding-window attention limitations in some models
Requires access to internal attention weights, limiting applicability to open-weights models only (no API-only models)
Performance depends heavily on the specific feature selection method used

Reproducibility

Code: https://github.com/piotrmatys/AggTruth

Code is publicly available. Dataset construction pipeline is detailed (using GPT-4o as judge). Evaluation uses standard open datasets (NQ, HotPotQA, CNN/DM, XSum).

📊 Experiments & Results

Evaluation Setup

Detection of hallucinations in RAG outputs across QA and Summarization tasks

Benchmarks:

Natural Questions (NQ) (Question Answering)
HotPotQA (Multi-hop Question Answering)
CNN/Daily Mail (Summarization)
XSum (Summarization)

Metrics:

AUC-ROC (Area Under the Receiver Operating Characteristic Curve)
PR-AUC (Precision-Recall Area Under Curve)
Statistical methodology: 5-fold cross-validation for training; statistical significance threshold p=0.001 used for Spearman feature selection
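
As a reminder of what AUC-ROC measures, the sketch below computes it directly from its pairwise definition: the probability that a randomly chosen hallucinated example receives a higher score than a randomly chosen faithful one, with ties counted as one half. The function name is hypothetical, and the quadratic loop is meant for clarity, not for large datasets.

```python
def roc_auc(labels, scores):
    """AUC-ROC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Four examples: label 1 = hallucinated, scores are classifier probabilities.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, which gives context for the scores in the 0.7 to 0.8 range reported below.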

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Performance on Summarization Tasks (Llama-3-8B-Instruct): AggTruth variants generally outperform baselines.
CNN/Daily Mail	AUC-ROC	0.77	0.79	+0.02
XSum	AUC-ROC	0.71	0.75	+0.04
Performance on Question Answering Tasks (Llama-3-8B-Instruct): AggTruth is competitive but Hidden States performs slightly better on NQ.
Natural Questions (NQ)	AUC-ROC	0.83	0.82	-0.01
HotPotQA	AUC-ROC	0.76	0.77	+0.01
Cross-Task Generalization: Training on Summarization and testing on QA (and vice versa) demonstrates robustness.
Average across tasks	AUC-ROC	0.75	0.77	+0.02

Experiment Figures

The complete data processing pipeline from Prompt to Feature Selection.

Main Takeaways

Careful selection of attention heads is critical; selecting heads based on Spearman correlation with the target works best.
AggTruth variants (especially Sum) demonstrate stable performance across both same-task and cross-task setups, indicating better generalization than baselines.
The method is efficient as it relies on a simple Logistic Regression classifier over aggregated features, avoiding the cost of heavy neural classifiers.
The 'Passage Percentage' feature is important for correcting the attention distribution shift that occurs as generation length increases.

📚 Prerequisite Knowledge

Prerequisites

Transformer architecture (specifically Attention Mechanism)
Retrieval-Augmented Generation (RAG)
Feature selection methods (Lasso, Boruta)

Key Terms

RAG: Retrieval-Augmented Generation—AI systems that answer questions by first searching for relevant documents

Attention Map: A matrix representing how much focus a model places on different parts of the input when generating a specific token

Contextual Hallucination: A specific type of error where the model's output contradicts or is not supported by the provided retrieved source text

Lookback-Lens: A baseline method that detects hallucinations by analyzing the ratio of attention placed on the prompt versus newly generated tokens

Jensen-Shannon Divergence: A method of measuring the similarity between two probability distributions; used here to measure how much individual attention heads deviate from the layer average

Boruta: A feature selection algorithm that compares features' importance against randomized 'shadow' features to find statistically significant predictors

Lasso: Least Absolute Shrinkage and Selection Operator—a regression analysis method that performs both variable selection and regularization

Intrinsic Hallucination: Model output that directly conflicts with the provided source material

Greedy Decoding: A generation strategy where the model always selects the highest-probability token at each step