Evaluation Setup
Hallucination detection via binary classification and behavioral analysis via intervention
Benchmarks:
- TriviaQA (In-Domain Knowledge Recall)
- Natural Questions (NQ) (In-Domain Knowledge Recall)
- BioASQ (Cross-Domain Robustness, Biomedical)
- NonExist (Fabricated Knowledge Detection) [New]
- FalseQA (Compliance with Invalid Premises)
- Jailbreak (Compliance with Harmful Instructions)
Metrics:
- Classification Accuracy
- AUROC (Area Under the Receiver Operating Characteristic Curve)
- Compliance Rate
- Statistical methodology: Not explicitly reported in the paper
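
As a reference for how the three metrics above are conventionally computed, here is a minimal sketch in plain Python. The function names and the toy inputs are illustrative assumptions, not the paper's evaluation code; AUROC is computed by its pairwise definition (the probability that a random positive example outscores a random negative one), which is why a score carrying no information yields 0.5.

```python
from typing import Sequence


def accuracy(labels: Sequence[int], preds: Sequence[int]) -> float:
    """Fraction of predictions that match the binary labels."""
    if not labels or len(labels) != len(preds):
        raise ValueError("labels and preds must be non-empty and equal length")
    return sum(int(y == p) for y, p in zip(labels, preds)) / len(labels)


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Pairwise AUROC: P(score of a positive > score of a negative).

    Ties count as half a win. This is O(n_pos * n_neg), which is fine
    for a sketch but slow for large evaluation sets.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def compliance_rate(complied: Sequence[bool]) -> float:
    """Fraction of prompts on which the model complied.

    How a response is judged as compliant is outside this sketch.
    """
    if not complied:
        raise ValueError("complied must be non-empty")
    return sum(1 for c in complied if c) / len(complied)
```

For example, `auroc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])` compares four positive/negative pairs, of which three are ranked correctly.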
Key Results

Hallucination detection accuracy comparisons show H-Neurons significantly outperform random neuron baselines across multiple models.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| TriviaQA | Accuracy | 63.8 | 76.4 | +12.6 |
| TriviaQA | Accuracy | 68.6 | 83.6 | +15.0 |
| BioASQ | Accuracy | 60.4 | 73.2 | +12.8 |
| NonExist | Accuracy | 61.3 | 75.0 | +13.7 |

Origin analysis shows H-Neurons identified in instruction-tuned models are effective predictors in base models, indicating pre-training origin.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| TriviaQA | AUROC | 50.0 | 86.0 | +36.0 |

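
To make the comparison between selected neurons and random neurons concrete, the sketch below fits a probe on a chosen subset of neuron activations and scores it on held-out examples. Everything here is an assumption for illustration: the activations are toy values, the nearest-centroid probe is a simple stand-in rather than the classifier used in the paper, and the names `fit_probe`, `predict`, and `probe_accuracy` are hypothetical.

```python
from typing import List, Sequence, Tuple

Probe = Tuple[List[float], List[float]]


def centroid(rows: Sequence[Sequence[float]], ids: Sequence[int]) -> List[float]:
    """Mean activation of each selected neuron over the given examples."""
    return [sum(r[i] for r in rows) / len(rows) for i in ids]


def fit_probe(X: Sequence[Sequence[float]], y: Sequence[int],
              neuron_ids: Sequence[int]) -> Probe:
    """Nearest-centroid probe restricted to `neuron_ids`.

    Label 0 = faithful answer, label 1 = hallucinated answer (toy convention).
    """
    faithful = [x for x, label in zip(X, y) if label == 0]
    halluc = [x for x, label in zip(X, y) if label == 1]
    if not faithful or not halluc:
        raise ValueError("need examples of both classes")
    return centroid(faithful, neuron_ids), centroid(halluc, neuron_ids)


def predict(probe: Probe, x: Sequence[float], neuron_ids: Sequence[int]) -> int:
    """Predict 1 only if x is strictly closer to the hallucination centroid."""
    c0, c1 = probe
    feats = [x[i] for i in neuron_ids]
    d0 = sum((f - c) ** 2 for f, c in zip(feats, c0))
    d1 = sum((f - c) ** 2 for f, c in zip(feats, c1))
    return 1 if d1 < d0 else 0


# Toy activations: only neuron 2 carries information about the label.
X_train = [
    [0.1, 0.5, 0.0, 0.3],
    [0.2, 0.4, 0.1, 0.3],
    [0.1, 0.5, 0.9, 0.3],
    [0.2, 0.4, 1.0, 0.3],
]
y_train = [0, 0, 1, 1]
X_test = [[0.15, 0.45, 0.05, 0.3], [0.15, 0.45, 0.95, 0.3]]
y_test = [0, 1]


def probe_accuracy(neuron_ids: Sequence[int]) -> float:
    """Train on the toy training set and report held-out accuracy."""
    probe = fit_probe(X_train, y_train, neuron_ids)
    preds = [predict(probe, x, neuron_ids) for x in X_test]
    return sum(int(p == t) for p, t in zip(preds, y_test)) / len(y_test)
```

On this toy data a probe restricted to the informative neuron (`probe_accuracy([2])`) separates the two test examples, while one restricted to uninformative neurons (`probe_accuracy([0, 1])`) falls back to a constant prediction. The same pattern extends to cross-domain evaluation by swapping in a test set from a different benchmark.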
Main Takeaways
- A very sparse subset of neurons (<0.1% of all neurons) is responsible for hallucinations and can be used to reliably detect them.
- H-Neurons generalize well: classifiers trained on general knowledge (TriviaQA) work on biomedical (BioASQ) and fabricated (NonExist) data.
- Causal link to over-compliance: Amplifying H-Neurons makes models more likely to agree with false premises and harmful instructions.
- Origin in pre-training: H-Neurons are established in the base model and preserved during instruction tuning, rather than being created by alignment.
- Smaller models (e.g., Gemma-4B) are more susceptible to behavioral shifts from neuron perturbation than larger models.
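
The amplification and perturbation experiments mentioned in the takeaways can be pictured as rescaling a handful of activations while leaving the rest untouched. The sketch below assumes a multiplicative intervention on a flat activation vector; the function name `scale_neurons`, the parameter `alpha`, and the multiplicative form itself are assumptions, since this section does not specify the exact intervention procedure.

```python
from typing import Iterable, List, Sequence


def scale_neurons(activations: Sequence[float],
                  neuron_ids: Iterable[int],
                  alpha: float) -> List[float]:
    """Return a copy of `activations` with the selected neurons scaled by alpha.

    alpha > 1 amplifies the selected neurons, 0 <= alpha < 1 suppresses them,
    and alpha == 1 leaves the vector unchanged. The input is not modified.
    """
    selected = set(neuron_ids)
    for i in selected:
        if not 0 <= i < len(activations):
            raise IndexError(f"neuron index {i} out of range")
    return [a * alpha if i in selected else a
            for i, a in enumerate(activations)]
```

In a real model this rescaling would be applied inside the forward pass (for instance through a hook on the relevant layer), and the compliance rate would then be measured on the outputs generated with and without the intervention.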