Steer LLM Latents for Hallucination Detection

📝 Paper Summary

Hallucination detection · Internal state analysis

The paper introduces a lightweight, learnable steering vector added during inference to reshape LLM latent spaces, separating truthful from hallucinated representations using minimal labeled data and optimal transport.

Core Problem

Pre-trained LLM embeddings are optimized for linguistic coherence rather than factual accuracy, resulting in latent spaces where truthful and hallucinated content are not clearly separated.

Why it matters:

Hallucinations undermine user trust and cause harm in high-stakes applications, making detection critical for safe deployment
Existing detection methods relying on default embeddings fail because the model prioritizes fluency over truthfulness during pre-training
Fully supervised approaches require expensive large-scale human annotations, which are impractical for many real-world applications

Concrete Example: When an LLM generates a bio for a real person, it might mix true facts with plausible-sounding lies. Default embeddings for both truthful and hallucinated sentences look similar because both are grammatically fluent. TSV pushes these embeddings apart so a simple classifier can tell them apart.

Key Novelty

Truthfulness Separator Vector (TSV)

Learns a single vector added to hidden states during inference that pushes truthful and hallucinated embeddings into distinct clusters without changing model weights
Uses a two-stage training process: first clustering with a tiny labeled set, then refining with large-scale unlabeled data via optimal transport pseudo-labeling

Architecture

Illustration of the TSV intervention during inference and the training pipeline.

Evaluation Highlights

+12.8-point improvement in hallucination detection AUROC on TruthfulQA (71.4 to 84.2) compared to state-of-the-art methods
Achieves 84.2% AUROC on TruthfulQA with only 32 labeled examples, comparable to the fully supervised upper bound of 85.5%
Demonstrates strong generalization to unseen datasets, maintaining competitive performance even when transferred across different domains

Breakthrough Assessment

8/10

Significant improvement in detection accuracy with minimal supervision. The approach of steering representations specifically for detection (separation) rather than generation (mitigation) is a novel and practical distinction.

⚙️ Technical Details

Problem Definition

Setting: Binary classification of LLM generations as truthful or hallucinated based on internal hidden states

Inputs: Input prompt x_prompt and generated response x_tilde

Outputs: Binary label (0 for hallucinated, 1 for truthful)

Pipeline Flow

Input Processing (Prompt + Generation)
Latent Steering (Add TSV)
Feature Extraction (Last Token)
Classification (Prototype Distance)

System Modules

LLM Backbone

Process input prompt and generation to produce hidden states

Model or implementation: Llama-2-7B-Chat (and others in experiments)

Steering Mechanism

Inject the learned Truthfulness Separator Vector (TSV) into hidden states

Model or implementation: Vector Addition

Classifier

Classify the steered representation as truthful or hallucinated

Model or implementation: Prototype-based Classifier
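
The three modules above can be sketched end to end in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: a toy array stands in for the hidden states at the layer being read (in a real model the vector is added inside the network and its effect propagates through later layers), and the function names, the concentration `kappa`, and the value of `lam` are hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so only direction matters."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def steer(hidden, tsv, lam):
    """Add lam * tsv to every token's hidden state; hidden has shape (T, d)."""
    return hidden + lam * tsv

def truthful_probability(last_token, prototypes, kappa=10.0):
    """Softmax over scaled cosine similarity to the two class prototypes.

    prototypes has shape (2, d): row 0 = hallucinated, row 1 = truthful.
    kappa is an assumed concentration parameter, not a value from the paper.
    """
    r = l2_normalize(last_token)
    mu = l2_normalize(prototypes)
    logits = kappa * (mu @ r)
    logits = logits - logits.max()          # numerical stability
    p = np.exp(logits) / np.exp(logits).sum()
    return p[1]

# Toy stand-ins for real hidden states, a learned TSV, and learned prototypes.
rng = np.random.default_rng(0)
d = 8
hidden = rng.normal(size=(5, d))            # 5 tokens (prompt + generation)
tsv = rng.normal(size=d)
prototypes = rng.normal(size=(2, d))

steered = steer(hidden, tsv, lam=5.0)       # lam = 5.0 is a hypothetical value
score = truthful_probability(steered[-1], prototypes)  # last-token feature
label = int(score >= 0.5)                   # 1 = truthful, 0 = hallucinated
```

Only `tsv` (and the prototypes) would be learned; the backbone weights are never touched, which is what keeps the method lightweight.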

Novel Architectural Elements

Inference-time injection of a learned separator vector (TSV) specifically optimized for classification separation rather than generation steering

Modeling

Base Model: Llama-2-7B-Chat (primary), Vicuna-7B-v1.5, Llama-3-8B-Instruct

Training Method: Vector learning via Maximum Likelihood Estimation (MLE) with Optimal Transport pseudo-labeling

Objective Functions:

Purpose: Cluster embeddings around class prototypes.

Formally: Minimize the negative log-likelihood of the von Mises-Fisher distribution over labeled and pseudo-labeled data.

Purpose: Assign pseudo-labels to unlabeled data while respecting class balance.

Formally: Minimize the transport cost <Q, P> subject to marginal constraints, using the Sinkhorn algorithm.
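
One way to write the two objectives down is given below. The notation is an assumption made for illustration rather than the paper's own: r_i is the unit-norm last-token embedding after steering, mu_c the unit-norm prototype of class c, kappa a concentration shared by both classes, M the number of unlabeled samples, w the target class proportions, and C a cost matrix (taking C = -log P is one common choice and is assumed here).

```latex
% Class posterior under a von Mises-Fisher model with shared concentration
% and equal class priors, followed by the negative log-likelihood.
p(c \mid r_i) =
  \frac{\exp\!\left(\kappa\, \mu_c^{\top} r_i\right)}
       {\sum_{c' \in \{0,1\}} \exp\!\left(\kappa\, \mu_{c'}^{\top} r_i\right)},
\qquad
\mathcal{L}_{\mathrm{NLL}} = -\frac{1}{N} \sum_{i=1}^{N} \log p(y_i \mid r_i)

% Entropy-regularized optimal transport for pseudo-labeling.
\min_{Q \ge 0} \; \langle Q, C \rangle - \epsilon\, H(Q)
\quad \text{s.t.} \quad
Q \mathbf{1}_K = \tfrac{1}{M} \mathbf{1}_M, \;\;
Q^{\top} \mathbf{1}_M = w,
\qquad
H(Q) = -\sum_{i,c} Q_{ic} \log Q_{ic}
```

The row constraint gives every sample equal mass, and the column constraint is what enforces class balance in the resulting pseudo-labels.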

Training Data:

Small labeled exemplar set (e.g., N=32)
Large unlabeled set of LLM generations in the wild

Key Hyperparameters:

lambda: Steering strength (scales the TSV before it is added to the hidden states)
epsilon: 0.05 (entropy regularization for OT)
alpha: Exponential moving average decay rate
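
To show where epsilon and alpha enter, here is a hedged NumPy sketch. It assumes the transport cost is the negative log of the predicted class probabilities, that the target class proportions are supplied by the caller, and that the exponential moving average is applied to the class prototypes; none of these details are confirmed by the summary, and the function names and the default `alpha` are hypothetical.

```python
import numpy as np

def sinkhorn_pseudo_labels(probs, class_prior, epsilon=0.05, n_iters=200):
    """Entropy-regularized OT assignment of N samples to K classes.

    probs: (N, K) predicted class probabilities.
    class_prior: (K,) target class proportions (an assumed input).
    Returns soft pseudo-labels of shape (N, K) whose rows sum to 1 and
    whose column means approximately match class_prior.
    """
    probs = np.asarray(probs, dtype=float)
    class_prior = np.asarray(class_prior, dtype=float)
    n = probs.shape[0]
    # Gibbs kernel exp(-cost / epsilon) with cost = -log(probs).
    kernel = np.exp(np.log(probs + 1e-12) / epsilon)
    row_mass = np.full(n, 1.0 / n)           # each sample carries equal mass
    u = np.ones(n)
    for _ in range(n_iters):
        v = class_prior / (kernel.T @ u)     # match class (column) marginals
        u = row_mass / (kernel @ v)          # match sample (row) marginals
    plan = u[:, None] * kernel * v[None, :]
    return plan * n                          # rescale so rows sum to 1

def ema_update(prototype, batch_mean, alpha=0.99):
    """Exponential moving average of a prototype, renormalized to unit length."""
    new = alpha * prototype + (1.0 - alpha) * batch_mean
    return new / np.linalg.norm(new)

# Toy classifier outputs that over-predict class 1.
rng = np.random.default_rng(0)
logits = rng.normal(size=(400, 2)) + np.array([0.0, 1.5])
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

soft = sinkhorn_pseudo_labels(probs, class_prior=np.array([0.5, 0.5]))
hard = soft.argmax(axis=1)                   # hard pseudo-labels, if needed
```

A smaller epsilon makes the assignments sharper (closer to one-hot) at the cost of slower Sinkhorn convergence, while a larger alpha makes the prototypes change more slowly between updates.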

Compute: Lightweight (training optimizes only a single d-dimensional vector)

Comparison to Prior Work

vs. HaloScope: TSV actively steers embeddings to separate classes, whereas HaloScope analyzes static, un-steered embeddings
vs. ITISP: TSV optimizes for separation (detection) rather than truthfulness induction (generation mitigation) [not cited in paper]
vs. Supervised Probing (SAPLMA): TSV uses semi-supervised learning with minimal labels (32 examples) vs. requiring large labeled datasets

Limitations

Relies on the assumption that truthful and hallucinated data form distinct clusters (von Mises-Fisher distribution)
Performance depends on the quality of the small exemplar set used for initialization
Requires access to model internal states (white-box access), not applicable to API-only black-box models

Reproducibility

The paper text does not state whether code is available. The method relies on standard datasets (TruthfulQA) and open models (Llama-2).

📊 Experiments & Results

Evaluation Setup

Hallucination detection on QA tasks

Benchmarks:

TruthfulQA (QA factuality evaluation)
Fact-checking datasets (Generalization check)

Metrics:

AUROC (Area Under Receiver Operating Characteristic)
Statistical methodology: Not explicitly reported in the paper
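
For readers unfamiliar with the metric, AUROC equals the probability that a randomly chosen truthful sample receives a higher score than a randomly chosen hallucinated one, with ties counted as one half. The sketch below computes it directly from that definition; the function name and the toy scores are illustrative and are not taken from the paper.

```python
import numpy as np

def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels == 1]                # truthful samples
    neg = scores[labels == 0]                # hallucinated samples
    # Compare every positive score with every negative score.
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

# Toy example: 8 of the 9 (truthful, hallucinated) pairs are ordered correctly.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
value = auroc(labels, scores)                # 8/9, about 0.889
```

The pairwise comparison is quadratic in the number of samples, which is fine for illustration; library implementations use a sort-based computation instead.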

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
TSV significantly outperforms baselines on the TruthfulQA benchmark using Llama-2-7B-Chat.
TruthfulQA	AUROC	71.4	84.2	+12.8
TruthfulQA	AUROC	85.5	84.2	-1.3
TSV demonstrates scalability across different model architectures.
TruthfulQA	AUROC	84.2	86.1	+1.9

Experiment Figures

t-SNE visualization of original vs. steered embeddings.

Main Takeaways

TSV consistently outperforms unsupervised and few-shot baselines across datasets
The method closes the gap with fully supervised approaches while using orders of magnitude less labeled data (32 examples)
Optimal transport-based pseudo-labeling is effective for leveraging unlabeled data to improve boundary separation
The approach remains effective across different model families (Llama-2, Llama-3, Vicuna)

📚 Prerequisite Knowledge

Prerequisites

Transformer architecture (hidden states, layers)
Vector embeddings and latent spaces
Maximum Likelihood Estimation (MLE)
Optimal Transport (OT)

Key Terms

TSV: Truthfulness Separator Vector—a learnable vector added to LLM hidden states to push truthful and hallucinated representations apart

Optimal Transport: A mathematical framework used here to assign pseudo-labels to unlabeled data by minimizing the 'cost' of moving data points to class prototypes

AUROC: Area Under the Receiver Operating Characteristic curve—a metric measuring how well a classifier distinguishes between classes (0.5 is random, 1.0 is perfect)

von Mises-Fisher distribution: A probability distribution on a sphere, used here to model normalized embeddings where direction matters more than magnitude

Steering Vector: A vector added to model activations to influence behavior or representation without changing weights

Pseudo-labeling: Assigning approximate labels to unlabeled data based on the model's current confidence, allowing that data to be used for training

Latent space: The internal vector representation of data within the model

Sinkhorn algorithm: An efficient algorithm used to solve optimal transport problems with entropy regularization