Evaluation Setup
Binary classification of generated text segments as hallucinated or truthful
Benchmarks:
- HELM (Hallucination Detection) [New]
Metrics:
- Classification Accuracy
- Detection Latency/Overhead
- Statistical methodology: Not explicitly reported in the paper
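The two metrics above can be computed with very little machinery. The sketch below is illustrative only: `classification_accuracy` and `mean_latency_ms` are hypothetical helper names, the label convention (1 = hallucinated, 0 = truthful) is an assumption, and the paper's own measurement protocol is not described in this summary.

```python
import time
from typing import Callable, Sequence


def classification_accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Percentage of segments whose predicted label matches the reference label.

    Assumed convention: 1 = hallucinated, 0 = truthful.
    """
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if len(labels) == 0:
        raise ValueError("at least one example is required")
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return 100.0 * correct / len(labels)


def mean_latency_ms(detector: Callable[[object], int], inputs: Sequence[object]) -> float:
    """Average wall-clock time per detector call, in milliseconds."""
    if len(inputs) == 0:
        raise ValueError("at least one input is required")
    start = time.perf_counter()
    for item in inputs:
        detector(item)
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / len(inputs)


# Toy usage with a placeholder detector that always predicts "truthful".
toy_accuracy = classification_accuracy([1, 0, 1, 1], [1, 0, 0, 1])
toy_latency = mean_latency_ms(lambda segment: 0, ["a", "b", "c"])
```

Overhead would then be the difference between generation time with and without the detector attached, measured the same way.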
Key Results
Preliminary feasibility study on LLaMA2-13B-Chat (5k samples) using different internal state features.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Custom 5k sample set | Accuracy (%) | 50.00 | 72.40 | +22.40 |
| Custom 5k sample set | Accuracy (%) | 50.00 | 69.00 | +19.00 |
| Custom 5k sample set | Accuracy (%) | 50.00 | 73.20 | +23.20 |
| Custom 5k sample set | Accuracy (%) | 50.00 | 73.60 | +23.60 |
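The Δ values are the reported accuracy minus the baseline, in percentage points. The short check below recomputes them from the four reported figures. The 50.00 baseline matches chance-level guessing on a balanced binary task, but that reading is an assumption; the summary does not say how the baseline was obtained.

```python
# Reported accuracies (%) from the feasibility study, in table order.
BASELINE = 50.00
reported = [72.40, 69.00, 73.20, 73.60]

# Delta in percentage points; round() removes floating-point noise.
deltas = [round(accuracy - BASELINE, 2) for accuracy in reported]
best_accuracy = max(reported)

for accuracy, delta in zip(reported, deltas):
    print(f"{accuracy:.2f} - {BASELINE:.2f} = {delta:+.2f}")
```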
Main Takeaways
- MIND detects hallucinations using only the model's internal states, which supports the claim that hallucinated and truthful outputs leave distinct signatures in those states.
- The embedding of the last token in the final layer is the most discriminative single feature, suggesting the model 'knows' its uncertainty or error state at the moment of output.
- Adding features from earlier layers or previous tokens yields diminishing returns relative to the added computational cost, which makes the 'last token, last layer' feature the practical choice for real-time use.
- The method works with a simple MLP classifier, indicating that a lightweight (though not necessarily linear) decision boundary separates hallucinated from truthful outputs in the embedding space.
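The 'last token, last layer' feature and the small MLP classifier described in the takeaways above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it uses NumPy with synthetic hidden states in place of LLaMA2-13B-Chat activations, and `last_token_final_layer`, `TinyMLPProbe`, the layer sizes, and the training settings are all hypothetical choices made for the example.

```python
import numpy as np


def last_token_final_layer(hidden_states: np.ndarray) -> np.ndarray:
    """Pick the final-layer embedding of the last token.

    hidden_states has shape (num_layers, seq_len, hidden_dim).
    """
    return hidden_states[-1, -1, :]


class TinyMLPProbe:
    """One-hidden-layer MLP trained with full-batch gradient descent (assumed setup)."""

    def __init__(self, in_dim: int, hidden_dim: int = 8, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, np.sqrt(2.0 / in_dim), size=(in_dim, hidden_dim))
        self.b1 = np.zeros(hidden_dim)
        self.W2 = rng.normal(0.0, np.sqrt(1.0 / hidden_dim), size=(hidden_dim, 1))
        self.b2 = np.zeros(1)

    def _forward(self, X: np.ndarray):
        H = np.maximum(0.0, X @ self.W1 + self.b1)           # ReLU hidden layer
        logits = np.clip((H @ self.W2 + self.b2).ravel(), -30.0, 30.0)
        return H, 1.0 / (1.0 + np.exp(-logits))              # sigmoid output

    def predict_proba(self, X: np.ndarray) -> np.ndarray:
        """Probability that each row is hallucinated (label 1)."""
        return self._forward(X)[1]

    def fit(self, X: np.ndarray, y: np.ndarray, lr: float = 0.1, steps: int = 300):
        n = X.shape[0]
        for _ in range(steps):
            H, p = self._forward(X)
            d_logits = ((p - y) / n)[:, None]                # grad of mean cross-entropy
            dH = d_logits @ self.W2.T
            dH[H <= 0.0] = 0.0                               # ReLU gradient mask
            self.W2 -= lr * (H.T @ d_logits)
            self.b2 -= lr * d_logits.sum(axis=0)
            self.W1 -= lr * (X.T @ dH)
            self.b1 -= lr * dH.sum(axis=0)
        return self


# Synthetic stand-in data: the class signal is injected only into the
# last token of the final layer (an assumption made for this demo).
rng = np.random.default_rng(0)
num_layers, seq_len, hidden_dim = 4, 6, 16
labels = np.array([i % 2 for i in range(200)])               # 1 = hallucinated


def fake_hidden_states(label: int) -> np.ndarray:
    h = rng.normal(size=(num_layers, seq_len, hidden_dim))
    h[-1, -1] += 1.0 if label == 1 else -1.0
    return h


X = np.stack([last_token_final_layer(fake_hidden_states(y)) for y in labels])
probe = TinyMLPProbe(in_dim=hidden_dim).fit(X, labels.astype(float))
predictions = (probe.predict_proba(X) >= 0.5).astype(int)
train_accuracy = float((predictions == labels).mean())
```

With a real model, the feature vector would come from the language model's hidden states rather than from `fake_hidden_states`, and accuracy should be measured on held-out data rather than on the training set as done here for brevity.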