H-Neurons: On the Existence, Impact, and Origin of Hallucination-Associated Neurons in LLMs

📝 Paper Summary

Hallucination suppression, Mechanistic Interpretability

Hallucinations in LLMs are controlled by a tiny subset of neurons (<0.1%) originating in pre-training, which encode a general tendency toward over-compliance rather than just factual errors.

Core Problem

LLMs frequently generate hallucinations, yet most research treats models as black boxes, neglecting the microscopic neuron-level mechanisms that drive these errors.

Why it matters:

Hallucinations undermine reliability in critical tasks, with GPT-4 still hallucinating in ~28.6% of citation-based evaluations
Current mitigation strategies (data cleaning, RLHF) operate at a macroscopic level without understanding the internal computational roots of the problem
Understanding specific neurons allows for precise detection and potential intervention without retraining the entire model

Concrete Example: When a model is asked about a non-existent entity like 'volor pri octacap', it often confidently fabricates an answer (e.g., describing it as a medicine). H-Neurons activate during this process, signaling the transition from factual recall to fabrication.

Key Novelty

Identification and Causal Analysis of Hallucination-Associated Neurons (H-Neurons)

Identifies a sparse set of neurons (less than 0.1% of total) in feedforward networks that reliably predict hallucinations using sparse logistic regression on activation patterns
Demonstrates that these neurons represent a general 'over-compliance' behavior (agreeing with false premises or harmful instructions) rather than just factual incorrectness
Traces the origin of these neurons back to the pre-training phase, showing they are not merely artifacts of instruction tuning or alignment

Architecture

The workflow for identifying H-Neurons using sparse linear probing.

Evaluation Highlights

Detects hallucinations with high accuracy using fewer than 0.1% of neurons (e.g., >86% AUROC on TriviaQA with Mistral models)
Generalizes effectively to completely fabricated questions about non-existent entities (NonExist dataset) and cross-domain biomedical questions (BioASQ)
Amplifying H-Neurons systematically increases compliance with harmful instructions (Jailbreak) and false premises (FalseQA), establishing a causal link

Breakthrough Assessment

8/10

Strong mechanistic finding linking hallucinations to specific, sparse neurons and tracing them to pre-training. The connection between hallucination and 'over-compliance' offers a significant theoretical reframing of the problem.

⚙️ Technical Details

Problem Definition

Setting: Binary classification of model outputs (Faithful vs. Hallucinated) based on internal neuron activations

Inputs: Neuron activation values from Feedforward Networks (FFNs) during response generation

Outputs: Probability that the current response is a hallucination

Pipeline Flow

Data Collection (sample faithful/hallucinated responses)
Activation Extraction (record FFN neuron values)
Sparse Classifier Training (identify H-Neurons)
Inference/Intervention (monitor or scale H-Neuron activations)

System Modules

Data Collector (Identification Phase)

Generate responses to QA datasets and label them as Faithful or Hallucinated

Model or implementation: Target LLM (e.g., Llama-3, Mistral)

Feature Extractor (Identification Phase)

Calculate neuron contribution metrics for each response

Model or implementation: Target LLM
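
This summary does not spell out the CETT formula, so the following is only an illustrative stand-in for a "neuron contribution metric": each neuron's share of the FFN output norm at one token, averaged over the tokens of a response. The function names, the toy weights, and the norm-ratio definition are all assumptions made for this sketch, not the paper's method.

```python
import math

def neuron_contributions(activations, w_down):
    """Relative contribution of each FFN neuron to the layer output for one token.

    activations: d_ff post-activation values, one per neuron.
    w_down: d_ff rows; row j is the output-space vector written by neuron j.
    Returns d_ff non-negative scores (an assumed stand-in, not the paper's CETT).
    """
    d_model = len(w_down[0])
    # Neuron j adds the vector activations[j] * w_down[j] to the FFN output.
    per_neuron = [[a * w for w in row] for a, row in zip(activations, w_down)]
    output = [sum(vec[k] for vec in per_neuron) for k in range(d_model)]
    out_norm = math.sqrt(sum(v * v for v in output)) or 1.0  # guard against 0
    return [math.sqrt(sum(v * v for v in vec)) / out_norm for vec in per_neuron]

def mean_over_tokens(per_token_scores):
    """Average per-token scores into one feature vector per response."""
    n = len(per_token_scores)
    return [sum(col) / n for col in zip(*per_token_scores)]

# Toy example: 3 neurons, 2 output dimensions (hypothetical values).
acts = [2.0, 0.0, -1.0]
w_down = [[1.0, 0.0], [0.0, 1.0], [0.0, 2.0]]
scores = neuron_contributions(acts, w_down)
features = mean_over_tokens([[1.0, 3.0], [3.0, 5.0]])
```

The resulting per-response feature vector is what a downstream sparse classifier would consume, with one feature per neuron.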

Sparse Classifier (Identification Phase)

Select the most predictive neurons for hallucination

Model or implementation: Logistic Regression with L1 Regularization
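
A minimal sketch of how L1-regularized logistic regression can select a sparse set of predictive features. It minimizes mean log-loss plus lam * ||w||_1 with proximal gradient descent; neurons whose weight ends at exactly zero are discarded. The synthetic data, the solver, and every hyperparameter value here (lam, lr, steps) are assumptions for illustration, since the summary does not report the paper's optimizer or regularization strength.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def soft_threshold(v, t):
    """Proximal step for the L1 penalty: shrink v toward zero by t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def fit_l1_logreg(X, y, lam=0.1, lr=0.5, steps=300):
    """Minimize mean log-loss + lam * ||w||_1 by proximal gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            grad_b += err / n
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj / n
        # Gradient step on the smooth loss, then shrinkage for the L1 term.
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

# Synthetic stand-in for per-response neuron features: 12 "neurons",
# of which only indices 3 and 7 carry signal about the label
# (1 = hallucinated, 0 = faithful).
rng = random.Random(0)
X = [[rng.gauss(0.0, 1.0) for _ in range(12)] for _ in range(300)]
y = [1 if row[3] - row[7] + 0.3 * rng.gauss(0.0, 1.0) > 0 else 0 for row in X]

w, b = fit_l1_logreg(X, y)
selected = [j for j, wj in enumerate(w) if wj != 0.0]  # candidate "H-Neurons"
p_halluc = sigmoid(sum(wj * xj for wj, xj in zip(w, X[0])) + b)
```

Raising lam shrinks the selected set further; in practice a library solver such as scikit-learn's LogisticRegression with an L1 penalty would likely replace this hand-written loop.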

Intervention Mechanism

Scale the activation of identified H-Neurons during inference

Model or implementation: Target LLM (modified forward pass)
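
The scaling intervention can be sketched on a toy FFN: hidden activations at the selected indices are multiplied by alpha before the down-projection, so alpha = 0 silences those neurons, alpha = 1 leaves the layer unchanged, and alpha > 1 amplifies them. The ReLU activation, the tiny weight matrices, and the function names are assumptions for illustration; production LLMs typically use gated FFN variants, and the paper's exact hook point is not described in this summary.

```python
def matvec(rows, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(r * x for r, x in zip(row, v)) for row in rows]

def ffn_forward(x, w_up, w_down, h_neurons=(), alpha=1.0):
    """Toy FFN: w_down @ relu(w_up @ x), with selected hidden units scaled.

    w_up has one row per hidden neuron; w_down has one row per output dim.
    h_neurons: indices of hidden units to scale by alpha.
    """
    hidden = [max(0.0, h) for h in matvec(w_up, x)]
    scaled = [alpha * h if j in h_neurons else h for j, h in enumerate(hidden)]
    return matvec(w_down, scaled)

# Hypothetical 2 -> 3 -> 2 layer.
x = [1.0, 2.0]
w_up = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_down = [[1.0, 1.0, 1.0], [0.0, 1.0, 0.0]]

baseline = ffn_forward(x, w_up, w_down)
suppressed = ffn_forward(x, w_up, w_down, h_neurons={1}, alpha=0.0)
amplified = ffn_forward(x, w_up, w_down, h_neurons={1}, alpha=3.0)
```

Sweeping alpha over the reported range [0, 3] and measuring compliance at each value is the shape of the intervention experiment described here.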

Novel Architectural Elements

Integration of sparse linear probing on FFN activations specifically for hallucination detection
Direct scaling mechanism for specific neuron subsets (H-Neurons) to modulate compliance behavior

Modeling

Base Model: Evaluated on multiple families: Llama-3 (8B, 70B), Mistral (7B-v0.3, Small-24B), Gemma-2 (9B, 27B)

Training Data:

TriviaQA for training the sparse classifier
Evaluation on NQ, BioASQ, NonExist, FalseQA, FaithEval, Sycophancy, Jailbreak

Key Hyperparameters:

sparsity_ratio: <0.1% of total neurons selected
scaling_factor_alpha: Range [0, 3] for intervention experiments

Compute: Not reported in the paper

Comparison to Prior Work

vs. Hidden State Probes: Focuses specifically on individual neurons in FFNs rather than aggregate hidden states, enabling precise intervention
vs. SAEs: Uses supervised sparse probing (labels from TriviaQA) to find task-specific neurons rather than unsupervised feature discovery
vs. Macroscopic Analysis: Provides a microscopic, neuron-level causal mechanism for hallucinations

Limitations

Intervention (scaling neurons) is not monotonic for all models; some show fluctuations at intermediate scaling factors
Analysis focuses primarily on Feedforward Networks (FFNs), potentially missing contributions from attention heads
Simple suppression of H-Neurons helps factuality but might negatively impact helpfulness if not carefully balanced

Reproducibility

The paper does not explicitly provide a code repository URL. Datasets used (TriviaQA, NQ, BioASQ, FalseQA, etc.) are public. The method for identifying H-Neurons (L1-regularized logistic regression on CETT metrics) is described in detail.

📊 Experiments & Results

Evaluation Setup

Hallucination detection via binary classification and behavioral analysis via intervention

Benchmarks:

TriviaQA (In-Domain Knowledge Recall)
Natural Questions (NQ) (In-Domain Knowledge Recall)
BioASQ (Cross-Domain Robustness, Biomedical)
NonExist (Fabricated Knowledge Detection) [New]
FalseQA (Compliance with invalid premises)
Jailbreak (Compliance with harmful instructions)

Metrics:

Classification Accuracy
AUROC (Area Under Receiver Operating Characteristic)
Compliance Rate
Statistical methodology: Not explicitly reported in the paper
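
For reference, the metrics listed above can be computed as follows. AUROC is taken here as the fraction of (hallucinated, faithful) pairs in which the hallucinated response gets the higher score, with ties counted as half; a score of 0.5 corresponds to chance. The function names and the toy inputs are assumptions for illustration, and the paper's own evaluation code is not available in this summary.

```python
def auroc(labels, scores):
    """Pairwise AUROC: P(score of a positive > score of a negative)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))

def accuracy(labels, scores, threshold=0.5):
    """Fraction of responses whose thresholded score matches the label."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)

def compliance_rate(complied):
    """Fraction of prompts on which the model complied (list of booleans)."""
    return sum(1 for c in complied if c) / len(complied)

labels = [1, 0, 1, 0]          # 1 = hallucinated, 0 = faithful
scores = [0.9, 0.8, 0.7, 0.1]  # classifier probabilities (toy values)

auc = auroc(labels, scores)
acc = accuracy(labels, scores)
rate = compliance_rate([True, False, True, True])
```

The quadratic pairwise loop is fine for small evaluation sets; a rank-based formulation gives the same value in O(n log n) for larger ones.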

Key Results

Hallucination detection accuracy: H-Neurons outperform random-neuron baselines by 12.6 to 15.0 points across multiple models.

Benchmark	Metric	Baseline	This Paper	Δ
TriviaQA	Accuracy	63.8	76.4	+12.6
TriviaQA	Accuracy	68.6	83.6	+15.0
BioASQ	Accuracy	60.4	73.2	+12.8
NonExist	Accuracy	61.3	75.0	+13.7

Origin analysis: H-Neurons identified in instruction-tuned models remain effective predictors in base models, indicating a pre-training origin.

Benchmark	Metric	Baseline	This Paper	Δ
TriviaQA	AUROC	50.0	86.0	+36.0

Experiment Figures

The causal effect of scaling H-Neuron activations on various 'over-compliance' behaviors.

Analysis of H-Neuron origins and stability across training stages.

Main Takeaways

A very sparse subset of neurons (<0.1%) is strongly associated with hallucinations and is sufficient to detect them reliably.
H-Neurons generalize well: classifiers trained on general knowledge (TriviaQA) work on biomedical (BioASQ) and fabricated (NonExist) data.
Causal link to over-compliance: Amplifying H-Neurons makes models more likely to agree with false premises and harmful instructions.
Origin in pre-training: H-Neurons are established in the base model and preserved during instruction tuning, rather than being created by alignment.
Smaller models (e.g., Gemma-4B) are more susceptible to behavioral shifts from neuron perturbation than larger models.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Transformer architecture (specifically Feedforward Networks)
Familiarity with sparse linear classifiers (L1 regularization)
Basic knowledge of LLM training stages (Pre-training vs. Instruction Tuning)

Key Terms

H-Neurons: Hallucination-associated neurons—a sparse subset of neurons whose activation patterns reliably predict hallucinatory outputs

FFN: Feedforward Network, a component within each Transformer block that is applied to every token position independently; its hidden units are the "neurons" analyzed here

CETT: A metric (Contribution to the Emerging Token) used to quantify a neuron's activation level and importance during generation

L1 regularization: A technique in machine learning that penalizes the sum of absolute weights, encouraging the model to select a sparse set of features (neurons) by driving most weights to zero

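AUROC: Area Under the Receiver Operating Characteristic Curve, a threshold-independent classification metric equal to the probability that a randomly chosen positive example is scored above a randomly chosen negative one (0.5 is chance, 1.0 is perfect)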

SFT: Supervised Fine-Tuning—the process of training a pre-trained base model on instruction-response pairs

over-compliance: The tendency of a model to satisfy user requests or agree with premises even when doing so compromises truthfulness or safety (e.g., answering a question based on a false assumption)