LLM Factoscope: Uncovering LLMs' Factual Discernment through Inner States Analysis

📝 Paper Summary

Hallucination suppression, Internal state analysis

LLM Factoscope detects factual errors in LLM outputs without external knowledge bases by analyzing patterns in the model's inner states (activations, ranks, and probabilities) using a Siamese network.

Core Problem

LLMs frequently hallucinate non-factual content, and current detection methods rely on expensive external database cross-referencing or computationally heavy sampling.

Why it matters:

Non-factual outputs in critical domains like medicine and law can cause harm to users
Dependency on external knowledge bases introduces complexity and latency
Existing internal methods like semantic uncertainty require repeated sampling, increasing computational overhead

Concrete Example: Given the prompt 'The film titled The Shining was directed by', an LLM might confidently complete it with 'Stanley Kubrick' (factual). Given the same kind of prompt for 'The Beekeeper', it might instead produce a hallucinated answer (non-factual). Factoscope aims to tell these two cases apart using only the model's internal neural activations and output probabilities.

Key Novelty

LLM Lie Detector via Multi-view Inner States

Treats LLM hallucination detection like a human lie detector test by monitoring 'physiological' signals: activation maps and output dynamics
Combines static features (neuron activation intensity) and dynamic features (evolution of output ranks/probabilities across layers) into a unified detection pipeline
Uses a Siamese network with triplet margin loss to learn a metric space where factual and non-factual internal states are separable

Architecture

The LLM Factoscope pipeline: Data Collection -> Inner States Extraction -> Siamese Network Model.

Evaluation Highlights

Achieves >96% accuracy on a custom-collected factual detection dataset across multiple architectures (Llama2, Vicuna, GPT2-XL)
Outperforms unsupervised baselines like Min-k% Prob and unexpectedness metrics by significant margins
Demonstrates strong generalization: a detector trained on Llama2-7B transfers effectively to other models like Vicuna-7B

Breakthrough Assessment

7/10

Strong empirical results (>96% accuracy) and a novel, resource-efficient approach that avoids external retrieval. However, reliance on a specific type of triplet-relation fact dataset may limit scope compared to free-form generation.

⚙️ Technical Details

Problem Definition

Setting: Binary classification of LLM output factuality based on internal model states during generation

Inputs: Input prompt X and the LLM's internal states (activations, logits) corresponding to the generated token

Outputs: Binary label: Factual or Non-factual

Pipeline Flow

Data Collection (Kaggle datasets → Prompts → LLM Generation → Labels)
Inner State Extraction (Activations, Ranks, Top-k Indices, Probabilities)
Feature Preprocessing (Normalization, Distance Calculation)
Classification Model (Siamese Network with CNN/GRU encoders)
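
The inner state extraction step can be illustrated with a small sketch. This is not the authors' code: it assumes a toy model whose per-layer hidden states and unembedding matrix are given as plain Python lists, and it decodes every layer Logit-Lens style to track the final output token's rank and probability, plus the top-k token indices, across layers. The names `layer_states`, `unembed`, and `extract_dynamics` are hypothetical.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def decode(hidden, unembed):
    # Project a hidden state onto the vocabulary (one unembed row per token).
    return [sum(h * w for h, w in zip(hidden, row)) for row in unembed]

def extract_dynamics(layer_states, unembed, k=2):
    # The output token is the argmax of the last layer's distribution.
    final_probs = softmax(decode(layer_states[-1], unembed))
    out_token = max(range(len(final_probs)), key=final_probs.__getitem__)
    ranks, probs, topk = [], [], []
    for hidden in layer_states:
        p = softmax(decode(hidden, unembed))
        order = sorted(range(len(p)), key=lambda i: -p[i])
        ranks.append(order.index(out_token) + 1)  # 1 = most likely token
        probs.append(p[out_token])
        topk.append(order[:k])
    return out_token, ranks, probs, topk

# Toy data (assumption): 3 layers, hidden size 2, vocabulary of 3 tokens.
unembed = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
layer_states = [[0.1, 0.5], [0.6, 0.4], [2.0, 0.1]]
out_token, ranks, probs, topk = extract_dynamics(layer_states, unembed)
print(out_token, ranks, topk)
```

In this toy run the output token starts at rank 2 in the first layer and settles at rank 1 afterwards, which is the kind of cross-layer trajectory the pipeline feeds to the classifier.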

System Modules

Inner State Extractor

Captures raw internal data from the LLM during inference

Model or implementation: Target LLM (e.g., Llama2-7B)

Factoscope Classifier

Encodes extracted features into embeddings and classifies factuality

Model or implementation: Siamese Network (4 sub-models: 3 ResNet18 CNNs + 1 GRU)

Novel Architectural Elements

Integration of four distinct inner state modalities (activations, ranks, indices, probabilities) into a single mixed representation embedding
Use of ResNet18 to process activation maps and probability distributions as 'images/maps' across layers
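
A rough sketch of how four per-modality embeddings could be fused and used for a decision follows. Two points are assumptions rather than details stated in this summary: fusion by simple concatenation, and classification by nearest labeled reference embedding under Euclidean distance. The embeddings are hand-written toy vectors standing in for the outputs of the ResNet18 and GRU encoders, and the function names are hypothetical.

```python
import math

def mix(act_emb, rank_emb, index_emb, prob_emb):
    # Assumption: the mixed representation is a plain concatenation.
    return act_emb + rank_emb + index_emb + prob_emb

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify(query, support):
    # Assumption: label of the nearest labeled reference embedding.
    return min(support, key=lambda item: euclidean(query, item[0]))[1]

# Toy reference set: (mixed embedding, label).
support = [
    (mix([0.9], [0.8], [0.7], [0.9]), "factual"),
    (mix([0.1], [0.3], [0.2], [0.1]), "non-factual"),
]
query = mix([0.8], [0.7], [0.9], [0.8])
label = classify(query, support)
print(label)
```

A nearest-neighbor rule fits naturally with a metric-learning objective, since training only shapes distances and does not produce a class score directly.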

Modeling

Base Model: Evaluated on GPT2-XL-1.5B, Llama2-7B/13B, Vicuna-7B/13B, StableLM-7B

Training Method: Supervised learning via Triplet Margin Loss on extracted features

Objective Functions:

Purpose: Learn an embedding space where samples of the same class (factual or non-factual) lie close together and samples of different classes lie far apart.

Formally: L = max(Dist(Anchor, Positive) - Dist(Anchor, Negative) + alpha, 0)
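
The formula can be checked by hand with a minimal sketch. It uses Euclidean distance, as listed under Key Hyperparameters, and a placeholder margin `alpha=1.0`, since the paper's margin is not reported; the vectors are toy values chosen so the distances come out to 1 and 5.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_margin_loss(anchor, positive, negative, alpha=1.0):
    # L = max(Dist(Anchor, Positive) - Dist(Anchor, Negative) + alpha, 0)
    # alpha=1.0 is a placeholder, not the paper's value.
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(d_pos - d_neg + alpha, 0.0)

anchor = [0.0, 0.0]
positive = [0.0, 1.0]  # distance 1 from the anchor
negative = [3.0, 4.0]  # distance 5 from the anchor

# Well-separated triplet: max(1 - 5 + 1, 0) = 0, so no gradient signal.
satisfied = triplet_margin_loss(anchor, positive, negative)
# Roles swapped: max(5 - 1 + 1, 0) = 5, so the triplet is penalized.
violated = triplet_margin_loss(anchor, negative, positive)
print(satisfied, violated)
```

The loss is zero once the negative is at least `alpha` farther from the anchor than the positive, so only triplets that violate the margin contribute to training.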

Training Data:

Custom dataset derived from Kaggle (Wikidata, IMDB, etc.)
Prompts generated from (Subject, Relation, Object) triplets
Labels determined by checking if LLM output matches ground truth object
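
The three steps above can be sketched as follows. The template wording is taken from the concrete example earlier in this summary; the relation key `directed_by`, the case-insensitive substring match, and the sample generations are illustrative assumptions, not the authors' exact procedure.

```python
# Hypothetical relation -> prompt template mapping.
TEMPLATES = {
    "directed_by": "The film titled {subject} was directed by",
}

def build_prompt(subject, relation):
    return TEMPLATES[relation].format(subject=subject)

def label_output(generated, gold_object):
    # Assumption: the output counts as factual if it contains the
    # ground-truth object, ignoring case.
    if gold_object.lower() in generated.lower():
        return "factual"
    return "non-factual"

triplet = ("The Shining", "directed_by", "Stanley Kubrick")
prompt = build_prompt(triplet[0], triplet[1])

# Made-up generations standing in for real LLM output.
match_label = label_output(" Stanley Kubrick.", triplet[2])
mismatch_label = label_output(" Steven Spielberg.", triplet[2])
print(prompt, match_label, mismatch_label)
```

A substring match is the simplest possible labeling rule; it would mislabel outputs that use aliases or partial names, so a real pipeline may need a more tolerant comparison.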

Key Hyperparameters:

margin_alpha: Not explicitly reported in the paper
distance_metric: Euclidean distance

Compute: Not reported in the paper

Comparison to Prior Work

vs. SAT Probe: Factoscope uses a non-linear deep Siamese network and multiple feature types (ranks, dynamics) rather than a simple linear probe
vs. Min-k% Prob: Factoscope leverages internal states across layers, not just final output probabilities
vs. CCS: Factoscope uses supervised training with known labels; the authors found CCS unreliable in their preliminary tests
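
For contrast with the multi-layer features above, here is a minimal sketch of a Min-k% Prob style score as it is commonly described: the average log-probability of the k% least likely tokens in the output. How the paper adapts this baseline to factuality detection is not specified in this summary, and the token probabilities below are made up.

```python
import math

def min_k_percent_prob(token_probs, k=0.2):
    # Average log-probability over the k fraction of least likely tokens.
    log_probs = sorted(math.log(p) for p in token_probs)
    n = max(1, int(len(log_probs) * k))
    return sum(log_probs[:n]) / n

# Made-up final-layer probabilities of each generated token.
confident = min_k_percent_prob([0.9, 0.8, 0.95, 0.85, 0.9])
uncertain = min_k_percent_prob([0.9, 0.05, 0.95, 0.85, 0.9])
print(confident, uncertain)
```

The score depends only on final output probabilities, so two generations with the same token probabilities receive the same score regardless of how the prediction evolved across layers.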

Limitations

Relies on the assumption that factual and non-factual outputs produce distinguishable inner-state patterns, which may not hold for all types of facts
Dataset construction focuses on triplet-based knowledge (Subject-Relation-Object), which may not cover long-form reasoning hallucinations
Requires access to model weights and internal states (white-box access)

Reproducibility

Code: https://github.com/JenniferHo97/llm_factoscope

Code and dataset released (https://github.com/JenniferHo97/llm_factoscope). Kaggle source datasets are public. Exact hyperparameters for the ResNet/GRU training (learning rate, batch size) are not detailed in the text.

📊 Experiments & Results

Evaluation Setup

Binary classification of generated outputs (Factual vs. Non-Factual)

Benchmarks:

Custom Factual Dataset (knowledge probing, relation extraction) [New]

Metrics:

Accuracy
AUROC
Statistical methodology: Not explicitly reported in the paper
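
For readers unfamiliar with the two metrics, a small self-contained sketch follows. The labels and scores are toy values, not results from the paper; the 0.5 decision threshold is an assumption, and AUROC is computed by its pairwise definition (the fraction of positive/negative pairs where the positive gets the higher score, with ties counted as half).

```python
def accuracy(labels, preds):
    return sum(int(y == p) for y, p in zip(labels, preds)) / len(labels)

def auroc(labels, scores):
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Toy data: 1 = factual, 0 = non-factual; scores are detector confidences.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]
preds = [1 if s >= 0.5 else 0 for s in scores]

acc = accuracy(labels, preds)  # 2 of 4 predictions are correct
auc = auroc(labels, scores)    # 3 of 4 positive/negative pairs are ordered correctly
print(acc, auc)
```

Accuracy depends on the chosen threshold, while AUROC summarizes ranking quality across all thresholds, which is why the two can diverge as they do here.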

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Custom Factual Dataset	Accuracy	0.85	0.98	+0.13
Custom Factual Dataset	Accuracy	0.58	0.98	+0.40
Custom Factual Dataset	Accuracy	Not applicable	0.96	Not applicable

Experiment Figures

Visualization of activation maps and output rank evolution for factual vs non-factual outputs

Top-k output indices and probabilities across layers

Main Takeaways

Factual outputs exhibit higher activation intensity in specific neurons and more stable output ranks across layers compared to non-factual outputs
Semantic similarity of top-1 candidates across layers is higher for factual outputs
The method generalizes across model families (Llama, Vicuna, GPT-2) with consistently high accuracy (>96%)
Combining multiple inner state features (activations + dynamics) yields better performance than single-feature approaches

📚 Prerequisite Knowledge

Prerequisites

Transformer architecture (layers, hidden states, unembedding)
Siamese Neural Networks
Triplet Loss
Logit Lens / Probing

Key Terms

Logit Lens: A technique to decode intermediate transformer layers using the final unembedding matrix to see what token the model predicts at that specific layer

Siamese Network: A neural network architecture that applies the same weights to two or more inputs (here an anchor, a positive, and a negative sample) and compares the resulting output vectors (embeddings)

Triplet Margin Loss: A loss function that pulls an anchor toward a positive sample (same class) and pushes it away from a negative sample (different class) until the negative is at least a margin alpha farther from the anchor than the positive

Activation Map: The set of activation values across all neurons in the LLM layers for a specific input

Top-k Output Indices: The indices of the k tokens with the highest probability scores at a specific layer

Factoscope: The proposed pipeline comprising data collection, inner state extraction, and a Siamese network classifier