Institute of Information Engineering, Chinese Academy of Sciences,
School of Cyber Security, University of Chinese Academy of Sciences
arXiv (2023)
Factuality, Benchmark
📝 Paper Summary
Hallucination suppression, Internal state analysis
LLM Factoscope detects factual errors in LLM outputs without external knowledge bases by analyzing patterns in the model's inner states (activations, ranks, and probabilities) using a Siamese network.
Core Problem
LLMs frequently hallucinate non-factual content, and current detection methods rely on expensive external database cross-referencing or computationally heavy sampling.
Why it matters:
Non-factual outputs in critical domains like medicine and law can cause harm to users
Dependency on external knowledge bases introduces complexity and latency
Concrete Example: When asked to complete 'The film titled The Shining was directed by', an LLM might confidently output 'Stanley Kubrick' (factual). Given the same kind of prompt about 'The Beekeeper', it might instead produce a vague, hallucinated answer. Factoscope aims to distinguish the two cases using only the model's internal neural activations and output probabilities.
Key Novelty
LLM Lie Detector via Multi-view Inner States
Treats LLM hallucination detection like a human lie detector test by monitoring 'physiological' signals: activation maps and output dynamics
Combines static features (neuron activation intensity) and dynamic features (evolution of output ranks/probabilities across layers) into a unified detection pipeline
Uses a Siamese network with triplet margin loss to learn a metric space where factual and non-factual internal states are separable
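The triplet margin objective named above can be sketched in a few lines. This is a minimal illustration in plain Python, not the paper's training code: the Euclidean distance, the `margin=1.0` default, and the function names `euclidean` and `triplet_margin_loss` are assumptions chosen for the example, and the toy 2-D vectors stand in for embeddings that the Siamese network would produce from inner states.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss: max(d(a, p) - d(a, n) + margin, 0).

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is.
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(d_pos - d_neg + margin, 0.0)


# Toy embeddings (hypothetical): anchor and positive share a label
# (e.g. both factual), the negative has the opposite label.
anchor = [0.0, 0.0]
positive = [0.0, 1.0]
negative = [3.0, 4.0]

well_separated = triplet_margin_loss(anchor, positive, negative)
violating = triplet_margin_loss(anchor, negative, positive)
```

In the first call the negative already lies well beyond the margin, so the triplet contributes no loss; swapping the roles of positive and negative produces a positive loss, which is the signal that would push the embeddings apart during training.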
Architecture
The LLM Factoscope pipeline: Data Collection -> Inner States Extraction -> Siamese Network Model.
Evaluation Highlights
Achieves >96% accuracy on a custom-collected factual detection dataset across multiple architectures (Llama2, Vicuna, GPT2-XL)
Outperforms unsupervised baselines like Min-k% Prob and unexpectedness metrics by significant margins
Demonstrates strong generalization: a detector trained on Llama2-7B transfers effectively to other models like Vicuna-7B
Breakthrough Assessment
7/10
Strong empirical results (>96% accuracy) and a novel, resource-efficient approach that avoids external retrieval. However, the method is evaluated on a dataset of relation-triplet facts, which may limit how well it applies to free-form generation.
⚙️ Technical Details
Problem Definition
Setting: Binary classification of LLM output factuality based on internal model states during generation
Inputs: Input prompt X and the LLM's internal states (activations, logits) corresponding to the generated token
Code and dataset released (https://github.com/JenniferHo97/llm_factoscope). Kaggle source datasets are public. Exact hyperparameters for the ResNet/GRU training (learning rate, batch size) are not detailed in the text.
📊 Experiments & Results
Evaluation Setup
Binary classification of generated outputs (Factual vs. Non-Factual)
Logit Lens: A technique to decode intermediate transformer layers using the final unembedding matrix to see what token the model predicts at that specific layer
Siamese Network: A neural network architecture that uses the same weights to process two different inputs and compares their output vectors (embeddings)
Triplet Margin Loss: A loss function that pulls an anchor toward a positive sample and pushes it away from a negative sample until the negative is farther from the anchor than the positive by at least a fixed margin
Activation Map: The set of activation values across all neurons in the LLM layers for a specific input
Top-k Output Indices: The indices of the k tokens with the highest probability scores at a specific layer
Factoscope: The proposed pipeline comprising data collection, inner state extraction, and a Siamese network classifier
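The Logit Lens and Top-k Output Indices entries above can be made concrete with a small sketch. It is an illustration under stated assumptions rather than the authors' extraction code: the unembedding matrix and per-layer hidden states are tiny hand-written lists instead of real LLM tensors, the final layer normalization that real models typically apply before unembedding is omitted, and the helper names (`logit_lens`, `top_k_indices`, `track_token`) are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logit_lens(hidden, unembed):
    """Decode one layer's hidden state with the final unembedding matrix.

    `unembed` has one row per vocabulary token; each logit is the dot
    product of that row with the hidden state.
    """
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in unembed]
    return softmax(logits)


def top_k_indices(probs, k):
    """Indices of the k most probable tokens, highest probability first."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return order[:k]


def track_token(hidden_states, unembed, token_id, k=1):
    """Follow one token's rank and probability across layers."""
    trace = []
    for layer, hidden in enumerate(hidden_states):
        probs = logit_lens(hidden, unembed)
        order = top_k_indices(probs, len(probs))
        trace.append({
            "layer": layer,
            "top_k": order[:k],
            "rank": order.index(token_id),  # 0 means most probable
            "prob": probs[token_id],
        })
    return trace


# Toy setup (hypothetical): vocabulary of 3 tokens, hidden size 2, 3 layers.
unembed = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hidden_states = [[2.0, -1.0], [-1.0, 2.0], [2.0, 3.0]]
trace = track_token(hidden_states, unembed, token_id=2)
```

Each entry of `trace` records, for one layer, the top-k token indices plus the rank and probability of the token that is eventually generated. Sequences like these, taken across layers, are the kind of dynamic signal the summary describes as "evolution of output ranks/probabilities".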