Detecting Contextual Hallucinations in LLMs with Frequency-Aware Attention

📝 Paper Summary

Hallucination suppression · Factuality

Hallucinated tokens in LLMs exhibit rapid, high-frequency fluctuations in attention weights, which can be detected by treating attention as a discrete signal and extracting high-frequency energy components.

Core Problem

Existing attention-based hallucination detectors rely on coarse summary statistics (like entropy or total mass) that fail to capture the fine-grained sequential instability characteristic of ungrounded generation.

Why it matters:

Hallucinations in context-based generation (like RAG or summarization) undermine trust in LLM systems expected to be grounded in source material
Post-hoc verification methods are computationally expensive and do not reflect the model's internal generation dynamics
Current internal metrics miss the structural 'jaggedness' of attention that signals when a model is confused or ungrounded

Concrete Example: When LLaMA-2-7B-Chat hallucinates 'December' (not in context), its attention distribution shows sharp, rapid peaks and drops across token positions. A standard entropy metric might look normal, but the signal oscillates wildly compared to the smooth attention of a grounded token.

Key Novelty

Frequency-Aware Attention Analysis

Treats the sequence of attention weights over context tokens as a discrete time-series signal indexed by token position
Applies signal processing operators (Fourier Transform, Wavelets, Laplacian) to isolate high-frequency components that represent rapid local changes
Uses the energy (L2 norm) of these high-frequency components as a feature vector to classify tokens as hallucinated or grounded

Architecture

The Frequency-Aware Attention framework pipeline.

Evaluation Highlights

Fourier-high features improve AUROC by 6.6 points (73.2 → 79.8) over Lookback-Lens on the RAGTruth summarization task with LLaMA-13B
Consistent gains achieved across 3 models (LLaMA-7B, LLaMA-13B, Mistral-7B) and 2 benchmarks (RAGTruth, HalluRAG)
Span-level detection improves AUROC by 10.1 points (69.0 → 79.1) on summarization (LLaMA-7B) over Lookback-Lens, a strong attention-based baseline

Breakthrough Assessment

7/10

Offers a novel, theoretically grounded perspective (signal processing) on attention analysis. While the method is a feature engineering step for a classifier rather than a new architecture, the consistent empirical gains and cross-task robustness are significant.

⚙️ Technical Details

Problem Definition

Setting: Context-based generation where a model generates response tokens conditioned on a retrieved context

Inputs: Context tokens ctx and previously generated tokens gen_<i

Outputs: Binary classification r_i for the current token t_i (hallucinated vs. grounded)

Pipeline Flow

Attention Extraction: Extract weights from all layers/heads for current token
Signal Construction: Treat attention over context as a discrete signal x
High-Frequency Extraction: Apply High-Pass Filter (DFT/DWT/Laplacian)
Energy Calculation: Compute L2 norm of the high-frequency component
Classification: Concatenate energy features and feed to logistic regression
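
The first four steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `cutoff_ratio` parameter, and the choice to treat the upper half of the spectrum as "high frequency" are all hypothetical, and a naive O(N²) DFT stands in for an FFT library.

```python
import cmath
import math


def dft(signal):
    """Naive O(N^2) discrete Fourier transform of a real-valued sequence."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]


def high_freq_energy(attention, cutoff_ratio=0.5):
    """L2 norm of the high-pass-filtered attention signal.

    cutoff_ratio (hypothetical): fraction of the one-sided spectrum
    that is treated as low frequency and discarded.
    """
    n = len(attention)
    spectrum = dft(attention)
    cutoff = int(cutoff_ratio * (n // 2))
    energy = 0.0
    for k, coeff in enumerate(spectrum):
        freq = min(k, n - k)  # fold the two-sided spectrum onto 0..n//2
        if freq > cutoff:
            energy += abs(coeff) ** 2
    # Dividing by n converts spectral energy to time-domain energy (Parseval).
    return math.sqrt(energy / n)


def energy_features(heads):
    """One scalar per attention head; the list is the classifier input."""
    return [high_freq_energy(head) for head in heads]


# Toy attention rows over 8 context tokens (each sums to 1).
smooth = [0.125] * 8
jagged = [0.25, 0.0, 0.25, 0.0, 0.25, 0.0, 0.25, 0.0]
features = energy_features([smooth, jagged])
```

In this toy case the flat row yields essentially zero high-frequency energy, while the alternating row concentrates its non-constant energy at the highest frequency, which is the contrast the pipeline is designed to pick up.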

System Modules

Attention Extractor

Extracts raw attention weights from the LLM during generation

Model or implementation: Target LLM (e.g., LLaMA-2-7B-Chat)

Frequency Filter (Feature Engineering)

Isolates rapid fluctuations in the attention signal

Model or implementation: Signal Processing Operators (DFT, DWT, or Laplacian)

Energy Calculator (Feature Engineering)

Quantifies the instability into a scalar feature per head

Model or implementation: L2 Norm

Hallucination Classifier

Predicts if the token is a hallucination based on aggregated energy features

Model or implementation: Logistic Regression
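
The final module can be illustrated with a small logistic-regression classifier trained by batch gradient descent. This is a sketch only: the toy feature values, learning rate, epoch count, and function names are assumptions made for the example, and the source does not say which solver or regularization the paper uses.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(w, b, x):
    """Probability that a token with energy features x is hallucinated."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_logreg(features, labels, lr=1.0, epochs=1000):
    """Batch gradient descent on the mean log loss (no regularization)."""
    dim = len(features[0])
    n = len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            err = predict_proba(w, b, x) - y  # derivative of log loss w.r.t. the logit
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Made-up per-head energy features: label 0 = grounded, 1 = hallucinated.
X = [[0.05, 0.10], [0.10, 0.05], [0.08, 0.07],
     [0.60, 0.70], [0.70, 0.50], [0.65, 0.80]]
y = [0, 0, 0, 1, 1, 1]
w, b = train_logreg(X, y)
```

Because the LLM stays frozen, only `w` and `b` are learned, which is what keeps the detector lightweight.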

Novel Architectural Elements

Integration of signal processing operators (DFT, DWT, Laplacian) directly into the attention analysis pipeline for feature extraction
Formulation of attention weights as discrete temporal signals indexed by token position specifically for instability detection

Modeling

Base Model: Evaluated on LLaMA-2-7B-Chat, LLaMA-2-13B-Chat, and Mistral-7B-Instruct-v0.2

Training Method: Supervised training of a lightweight classifier on extracted features

Objective Functions:

Purpose: Minimize classification error for hallucination detection.

Formally: Standard Logistic Regression objective (Log Loss).
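
For reference, the standard log loss can be written out as follows. The notation is an assumption for illustration: $e_i$ denotes the concatenated energy feature vector for token $t_i$, $r_i \in \{0, 1\}$ is its label, and $w$, $b$ are the classifier parameters. The source does not state whether a regularization term is added.

```latex
\mathcal{L}(w, b) = -\frac{1}{N} \sum_{i=1}^{N}
  \Big[ r_i \log \sigma(w^\top e_i + b)
      + (1 - r_i) \log\big(1 - \sigma(w^\top e_i + b)\big) \Big],
\qquad
\sigma(z) = \frac{1}{1 + e^{-z}}
```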

Adaptation: None (Base LLM is frozen; only the external classifier is trained)

Training Data:

RAGTruth: ~18k examples with token-level annotations
HalluRAG: QA pairs with token-level annotations

Compute: Not reported in the paper

Comparison to Prior Work

vs. Lookback-Lens: Focuses on the *shape/fluctuation* of attention (high-frequency energy) rather than just the *amount* (mass) allocated to context
vs. EigenScore: Analyzes attention weights directly rather than output probability distributions
vs. Entropy-based methods: Captures sequential structure and local instability, whereas entropy is permutation-invariant and misses jagged patterns
vs. HEADING [not cited in paper]: HEADING uses attribution to identify hallucination, but relies on costly gradient computations, whereas this method uses forward-pass attention weights only
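
The contrast with entropy-based methods can be made concrete with a toy example. The two attention rows below are invented for illustration and contain exactly the same values in a different order, so their entropy is identical, while a second-difference (Laplacian) energy separates them. The helper names are hypothetical.

```python
import math


def entropy(p):
    """Shannon entropy; depends only on the multiset of values, not their order."""
    return -sum(x * math.log(x) for x in p if x > 0)


def laplacian_energy(signal):
    """L2 norm of the second-order difference over interior positions."""
    diffs = [
        signal[t - 1] - 2 * signal[t] + signal[t + 1]
        for t in range(1, len(signal) - 1)
    ]
    return math.sqrt(sum(d * d for d in diffs))


# Same eight weights, arranged smoothly vs. shuffled into a jagged pattern.
smooth = [0.05, 0.10, 0.15, 0.20, 0.20, 0.15, 0.10, 0.05]
jagged = [0.20, 0.05, 0.15, 0.10, 0.20, 0.05, 0.15, 0.10]

same_entropy = abs(entropy(smooth) - entropy(jagged)) < 1e-12
energy_gap = laplacian_energy(jagged) / laplacian_energy(smooth)
```

Here `same_entropy` is true while `energy_gap` is well above 1, which is the sense in which a permutation-invariant statistic cannot see jaggedness.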

Limitations

Relies on the assumption that semantic heterogeneity maps to attention instability, which is a simplified view of complex LLM mechanisms
Requires access to internal attention weights, making it inapplicable to black-box API models
Performance depends on a trained linear classifier, requiring labeled hallucination data for the specific task/domain
Span-level aggregation uses a fixed sliding window, which may not align perfectly with semantic boundaries
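
The fixed sliding window mentioned in the last limitation can be sketched as below. The source does not say how token features are pooled inside a window, so mean pooling, the `window` and `stride` defaults, and the function name are assumptions made for this example.

```python
def window_features(token_features, window=4, stride=1):
    """Average per-token feature vectors over fixed-size sliding windows.

    Sequences shorter than `window` produce a single span covering all tokens.
    """
    if not token_features:
        return []
    dim = len(token_features[0])
    spans = []
    last_start = max(len(token_features) - window, 0)
    for start in range(0, last_start + 1, stride):
        chunk = token_features[start:start + window]
        spans.append([sum(row[j] for row in chunk) / len(chunk) for j in range(dim)])
    return spans


# Five tokens with one energy feature each, pooled in windows of two.
spans = window_features([[1.0], [2.0], [3.0], [4.0], [5.0]], window=2)
```

Because window boundaries are fixed by position, a span can straddle a grounded phrase and a hallucinated one, which is the misalignment the limitation describes.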

Reproducibility

Code: https://github.com/siyaqi/FrequencyAwareHallucination

Code and data are publicly available. The paper specifies the exact baselines and benchmarks used (RAGTruth, HalluRAG).

📊 Experiments & Results

Evaluation Setup

Token-level and Span-level hallucination detection on context-based generation tasks

Benchmarks:

RAGTruth (QA, Data-to-Text, Summarization)
HalluRAG (Context-based QA)

Metrics:

AUROC
F1 score
Statistical methodology: Not explicitly reported in the paper

Key Results

Token-level detection results show Frequency-Aware features (specifically Fourier-based) outperforming baselines across LLaMA-7B.
Benchmark	Metric	Baseline	This Paper	Δ (points)
RAGTruth (Summarization)	AUROC	74.8	77.5	+2.7
RAGTruth (Summarization)	AUROC	73.2	79.8	+6.6
RAGTruth (QA)	AUROC	79.7	81.9	+2.2

Span-level detection results demonstrate that aggregating frequency features over sliding windows is highly effective.
Benchmark	Metric	Baseline	This Paper	Δ (points)
RAGTruth (Summarization)	AUROC	69.0	79.1	+10.1
RAGTruth (Avg)	AUROC	73.9	79.2	+5.3

Experiment Figures

Comparison of attention distributions for a grounded token vs. a hallucinated token.

Main Takeaways

Explicitly modeling attention as a signal reveals that hallucinated tokens are associated with high-frequency energy (rapid fluctuations).
Fourier-based features generally perform best, followed by Wavelets and Laplacian, suggesting global frequency analysis is slightly more robust than local.
The method generalizes well from token-level to span-level detection without architectural changes, showing robustness to aggregation.
Cross-task transfer experiments indicate better generalization than Lookback-Lens, suggesting the frequency signature of hallucination is more universal than simple attention mass.

📚 Prerequisite Knowledge

Prerequisites

Transformer attention mechanism (Key, Query, Value)
Basic signal processing (Fourier Transform, High-pass filtering)
Hallucination detection metrics (AUROC)

Key Terms

DFT: Discrete Fourier Transform—a mathematical technique that decomposes a signal into its constituent frequencies (global decomposition)

DWT: Discrete Wavelet Transform—a signal processing technique using localized basis functions to capture changes at different scales and positions
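
As a minimal illustration of wavelet detail coefficients, the sketch below uses a one-level Haar transform. Haar is chosen only because it is the simplest wavelet; the source does not say which wavelet family or decomposition depth the paper uses, and the function names are hypothetical.

```python
import math


def haar_detail(signal):
    """One-level Haar DWT detail coefficients over adjacent sample pairs."""
    values = list(signal)
    if len(values) % 2:
        values.append(values[-1])  # pad odd lengths by repeating the last sample
    return [
        (values[i] - values[i + 1]) / math.sqrt(2)
        for i in range(0, len(values), 2)
    ]


def detail_energy(signal):
    """L2 norm of the detail coefficients (local high-frequency content)."""
    return math.sqrt(sum(d * d for d in haar_detail(signal)))


flat = detail_energy([0.25, 0.25, 0.25, 0.25])
spiky = detail_energy([0.5, 0.0, 0.5, 0.0])
```

Each detail coefficient depends only on one pair of neighbouring positions, which is what makes the wavelet view local where the DFT is global.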

Laplacian operator: A local operator that calculates the second-order difference between adjacent points, effectively acting as a simple high-pass filter
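
Written out for an attention signal $x_t$ indexed by token position $t$ (notation assumed here), the second-order difference and its frequency response show why it behaves as a high-pass filter: the response vanishes at $\omega = 0$ (constant signals) and peaks at $\omega = \pi$ (sample-to-sample alternation).

```latex
(\Delta x)_t = x_{t-1} - 2x_t + x_{t+1},
\qquad
\Delta e^{i\omega t} = (2\cos\omega - 2)\, e^{i\omega t},
\qquad
|H(\omega)| = 2 - 2\cos\omega = 4\sin^2\!\left(\tfrac{\omega}{2}\right)
```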

Parseval's theorem: A theorem stating that the total energy of a signal is the same whether computed in the time domain or the frequency domain
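
For a length-$N$ signal under the unnormalized DFT convention, the theorem reads as below. The last expression is an illustrative consequence rather than a formula quoted from the paper: the L2 norm of a high-pass-filtered signal can be computed directly from the retained coefficients, where $\mathcal{H}$ is an assumed set of high-frequency indices.

```latex
X_k = \sum_{t=0}^{N-1} x_t \, e^{-2\pi i k t / N},
\qquad
\sum_{t=0}^{N-1} |x_t|^2 = \frac{1}{N} \sum_{k=0}^{N-1} |X_k|^2,
\qquad
E_{\text{high}} = \sqrt{\frac{1}{N} \sum_{k \in \mathcal{H}} |X_k|^2}
```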

Lookback-Lens: A baseline method that detects hallucinations by analyzing the ratio of attention weights assigned to context versus generated tokens
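
A rough sketch of such a ratio for a single head at a single decoding step is shown below. The exact form used here, mean attention on context divided by the sum of mean attention on context and on generated tokens, is an assumption for illustration, as is the function name.

```python
def lookback_ratio(attn_context, attn_generated):
    """Share of mean attention that falls on context tokens (assumed form)."""
    ctx = sum(attn_context) / len(attn_context)
    new = sum(attn_generated) / len(attn_generated) if attn_generated else 0.0
    total = ctx + new
    return ctx / total if total > 0 else 0.0


# Equal mean attention on both parts gives a ratio of one half.
balanced = lookback_ratio([0.2, 0.2, 0.2], [0.1, 0.3])
```

Note that this statistic depends only on how much attention lands on the context, not on how that attention is distributed across positions.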

RAGTruth: A benchmark dataset for hallucination detection in RAG scenarios, covering QA, summarization, and data-to-text tasks

AUROC: Area Under the Receiver Operating Characteristic curve—a performance metric for binary classification where 0.5 is random guessing and 1.0 is perfect
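
Equivalently, AUROC is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it by direct pairwise comparison, which is fine for small examples but quadratic in the number of samples; the function name and toy scores are illustrative.

```python
def auroc(scores, labels):
    """Pairwise-ranking AUROC: fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


perfect = auroc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
partial = auroc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0])
```

The results table in this summary reports AUROC on a 0 to 100 scale, so a value of 0.75 from this function would appear there as 75.0.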

EigenScore: A baseline method using spectral analysis of the semantic consistency graph constructed from output probability distributions

SelfCheckGPT: A baseline method that detects hallucinations by checking consistency across multiple sampled responses from the model

high-pass filter: A filter that allows high-frequency signals (rapid changes) to pass through while attenuating low-frequency signals (smooth trends)