Robust Hallucination Detection in LLMs via Adaptive Token Selection

📝 Paper Summary

Hallucination detection, Internal representation analysis, Uncertainty quantification

HaMI reformulates hallucination detection as a multiple instance learning problem to adaptively identify critical hallucinated tokens within free-form generations, augmenting them with uncertainty metrics for robust detection.

Core Problem

Existing hallucination detectors analyze predetermined tokens (e.g., the last token). This approach fails in free-form generation, where hallucinated content is sparse and its position varies from response to response.

Why it matters:

Predetermined token selection (e.g., last token) overlooks the actual location of hallucinated entities, leading to poor detection accuracy in variable-length responses
Hallucinations pose safety risks in high-stakes fields like law and medicine, requiring reliable detection before deployment
Current methods often require expensive external LLMs or multiple sampling steps, increasing computational cost

Concrete Example: In a long response about a historical event, a hallucinated date may appear in the middle of the text. A detector that inspects only the last token's internal state can miss this signal, because the hallucination is encoded in the representation of the token that carries the incorrect date.

Key Novelty

Hallucination detection as Multiple Instance Learning (HaMI)

Treats each response sequence as a 'bag' of token instances, where a hallucinated response is a 'positive bag' containing at least a few hallucinated tokens
Optimizes a detector to assign high scores to the specific tokens most indicative of hallucination, rather than averaging or picking fixed positions
Augments standard internal hidden states with uncertainty metrics (like predictive probability or semantic entropy) directly in the feature space

Architecture

The overall framework of HaMI, illustrating the process from token generation to score prediction and loss calculation.

Evaluation Highlights

Significantly outperforms state-of-the-art methods like SAPLMA and HaloScope across four benchmarks (TriviaQA, SQuAD, NQ, BioASQ)
Achieves an average AUROC improvement of 8.1 to 11.9 points over MARS-SE (a strong uncertainty-based baseline) across three different LLMs
Demonstrates robustness across different model families (LLaMA-3, Mistral) and sizes (8B to 70B)

Breakthrough Assessment

8/10

Novel formulation of hallucination detection as MIL addresses the critical flaw of fixed-token analysis. Strong empirical results across diverse datasets and models confirm its effectiveness over existing baselines.

⚙️ Technical Details

Problem Definition

Setting: Binary classification of generated response sequences as hallucinated or trustworthy based on internal token representations

Inputs: Input question q and generated response sequence y = {y1, ..., yt}

Outputs: Binary label indicating if the sequence contains hallucination

Pipeline Flow

Generation: The LLM generates response tokens, and the hidden state of each token is extracted
Feature Augmentation: Concatenate hidden states with uncertainty metrics
Scoring: Detector assigns hallucination scores to all tokens
Selection (MIL): Select top-k highest scoring tokens (potential hallucinations)
Optimization: Minimize loss to separate selected tokens in positive vs. negative bags
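
The selection step above can be illustrated with a minimal sketch in plain Python. It assumes the per-token hallucination scores are already available; the function names, the floor-based rounding of k, and the use of the mean of the selected scores as a response-level score are illustrative assumptions, not details confirmed by this summary.

```python
def select_top_k(token_scores, top_k_ratio=0.1):
    """Return (indices, scores) of the highest-scoring tokens in one response."""
    if not token_scores:
        raise ValueError("a response must contain at least one token")
    # Assumption: k is the floored fraction of the sequence length,
    # with at least one token kept so short responses still form a bag.
    k = max(1, int(len(token_scores) * top_k_ratio))
    ranked = sorted(range(len(token_scores)),
                    key=lambda i: token_scores[i], reverse=True)
    top = ranked[:k]
    return top, [token_scores[i] for i in top]


def sequence_score(token_scores, top_k_ratio=0.1):
    """Aggregate the selected token scores into one response-level score.

    Assumption: the mean of the top-k scores is used as the bag score.
    """
    _, top_scores = select_top_k(token_scores, top_k_ratio)
    return sum(top_scores) / len(top_scores)
```

With a ratio of 0.1, a 20-token response would contribute its 2 highest-scoring tokens, wherever they occur in the sequence.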

System Modules

LLM Generator

Generate response sequence and provide internal hidden states

Model or implementation: LLaMA-3.1-8B, Mistral-Nemo-Instruct, or LLaMA-3.3-Instruct-70B

Uncertainty Augmenter

Enhance hidden states with uncertainty scores (probability, perplexity, or semantic entropy)

Model or implementation: Mathematical calculation
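
As a rough illustration of the augmentation step, the sketch below appends one uncertainty value to each token's hidden-state vector. The summary only says the two are concatenated; the helper names, the `weight` parameter, and the choice of one uncertainty value per token are assumptions made for this example (a sequence-level metric such as semantic entropy would simply be repeated for every token).

```python
def augment_hidden_state(hidden_state, uncertainty, weight=1.0):
    """Append a weighted uncertainty score to one token's hidden-state vector."""
    # Assumption: the uncertainty enters as one extra feature dimension.
    return list(hidden_state) + [weight * uncertainty]


def augment_sequence(hidden_states, uncertainties, weight=1.0):
    """Augment every token in a response; expects one uncertainty per token."""
    if len(hidden_states) != len(uncertainties):
        raise ValueError("need exactly one uncertainty score per token")
    return [augment_hidden_state(h, u, weight)
            for h, u in zip(hidden_states, uncertainties)]
```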

HaMI Detector

Predict hallucination score for each token and classify the sequence

Model or implementation: 2-layer MLP (Hidden dim 256)
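
To make the detector concrete, here is a small forward-pass sketch in plain Python. It reads "2-layer MLP" as one hidden layer of width 256 followed by a scalar output layer; the ReLU activation, the sigmoid output, and the random initialization are assumptions for illustration and are not specified in this summary.

```python
import math
import random


def init_mlp(input_dim, hidden_dim=256, seed=0):
    """Create randomly initialized weights for a hypothetical 2-layer MLP."""
    rng = random.Random(seed)
    s1 = 1.0 / math.sqrt(input_dim)
    s2 = 1.0 / math.sqrt(hidden_dim)
    w1 = [[rng.uniform(-s1, s1) for _ in range(input_dim)]
          for _ in range(hidden_dim)]
    b1 = [0.0] * hidden_dim
    w2 = [rng.uniform(-s2, s2) for _ in range(hidden_dim)]
    b2 = 0.0
    return w1, b1, w2, b2


def token_score(features, params):
    """Map one token's feature vector to a hallucination score in (0, 1)."""
    w1, b1, w2, b2 = params
    # Hidden layer with ReLU (assumed activation).
    hidden = [max(0.0, sum(w * x for w, x in zip(row, features)) + b)
              for row, b in zip(w1, b1)]
    # Output layer followed by a sigmoid (assumed) to get a probability-like score.
    logit = sum(w * h for w, h in zip(w2, hidden)) + b2
    return 1.0 / (1.0 + math.exp(-logit))
```

Applying `token_score` to every token of a response yields the per-token scores that the selection step operates on.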

Novel Architectural Elements

MIL-based objective function applied to token sequences for hallucination detection
Adaptive token selection mechanism (Top-k selection) within the loss function
Input representation augmentation fusing hidden states with uncertainty metrics

Modeling

Base Model: LLaMA-3.1-8B, Mistral-Nemo-Instruct (12B), LLaMA-3.3-Instruct-70B

Training Method: Supervised learning of a lightweight detector (MLP) on top of frozen LLM states using MIL objective

Objective Functions:

Purpose: Maximize score of top-k tokens in positive bags and minimize score of top-k tokens in negative bags.

Formally: Binary Cross Entropy loss on selected top-k instances.

Purpose: Enforce smoothness of scores between adjacent tokens.

Formally: MSE between scores of token i and token i-1.
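
The two objectives described above can be sketched for a single response as follows. This is a simplified illustration under stated assumptions: the bag label is applied to each selected token, the two terms are added with a hypothetical `smooth_weight` (the summary does not report the weighting), and all function names are invented for this example.

```python
import math


def bce(score, label, eps=1e-7):
    """Binary cross-entropy for one score in [0, 1] and a 0/1 label."""
    s = min(max(score, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(s) + (1 - label) * math.log(1.0 - s))


def top_k_bce_loss(token_scores, bag_label, k):
    """Mean BCE over the k highest-scoring tokens, using the bag label."""
    if k < 1:
        raise ValueError("k must be at least 1")
    top = sorted(token_scores, reverse=True)[:k]
    return sum(bce(s, bag_label) for s in top) / len(top)


def smoothness_loss(token_scores):
    """Mean squared difference between scores of adjacent tokens."""
    if len(token_scores) < 2:
        return 0.0
    diffs = [(token_scores[i] - token_scores[i - 1]) ** 2
             for i in range(1, len(token_scores))]
    return sum(diffs) / len(diffs)


def total_loss(token_scores, bag_label, k, smooth_weight=1.0):
    """Hypothetical combination of the two terms for one response."""
    return (top_k_bce_loss(token_scores, bag_label, k)
            + smooth_weight * smoothness_loss(token_scores))
```

For a positive bag the first term falls as the top-k scores rise, and for a negative bag it falls as they drop, which matches the stated purpose of separating the selected tokens of the two bag types.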

Trainable Parameters: 2-layer MLP detector weights

Training Data:

2,000 QA pairs for training, 800 for testing per dataset (TriviaQA, SQuAD, NQ, BioASQ)
Correctness labels provided by GPT-4.1

Key Hyperparameters:

hidden_dimension: 256
top_k_ratio: 0.1 (10% of sequence length)
lambda_smoothness: Implicit in overall loss formulation (Eq. 5)
uncertainty_weight_lambda: 1.0

Compute: Not reported in the paper

Comparison to Prior Work

vs. SAPLMA/HaloScope: HaMI selects tokens adaptively via MIL rather than using fixed positions (e.g., last token)
vs. SE/MARS: HaMI combines internal states with uncertainty, whereas SE/MARS rely primarily on output probabilities or text consistency
vs. LLM-Check: HaMI trains an end-to-end detector rather than analyzing eigenvalues or simple logits [not cited in paper]

Limitations

Dependency on GPT-4.1 for ground-truth labeling; these labels may contain errors
Computational cost of extracting internal states for long sequences
Performance gains vary across different datasets and models

Reproducibility

Code: https://github.com/mala-lab/HaMI

Datasets are standard benchmarks, and the LLMs used are open-weight. Correctness labeling relies on GPT-4.1, which is closed source.

📊 Experiments & Results

Evaluation Setup

Hallucination detection on QA tasks using open-source LLMs

Benchmarks:

TriviaQA (Confabulation QA)
SQuAD (Reading Comprehension QA)
Natural Questions (NQ) (Open-domain QA)
BioASQ (Biomedical QA)

Metrics:

AUROC
Statistical methodology: Not explicitly reported in the paper
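
For readers unfamiliar with the metric, AUROC can be computed directly from its pairwise-ranking definition. The sketch below is a plain-Python illustration (quadratic in the number of examples, so only suitable for small sets); the function name is chosen for this example and is not taken from the paper's code.

```python
def auroc(labels, scores):
    """Probability that a random positive is scored above a random negative.

    labels: 1 for hallucinated responses, 0 for trustworthy ones.
    scores: detector outputs, higher meaning more likely hallucinated.
    Ties between a positive and a negative count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

The function returns a value between 0 and 1; the results reported in this summary use the same quantity on a 0 to 100 scale.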

Key Results

HaMI outperforms both uncertainty-based and internal representation-based baselines across different LLMs.

Benchmark	Metric	Baseline	This Paper	Δ
Average across 4 datasets (LLaMA-3.1-8B)	AUROC	73.3	83.6	+10.3
Average across 4 datasets (Mistral-Nemo-12B)	AUROC	73.2	85.1	+11.9
Average across 4 datasets (LLaMA-3.3-70B)	AUROC	78.4	86.5	+8.1
Average across 4 datasets (LLaMA-3.1-8B)	AUROC	79.1	83.6	+4.5
TriviaQA (LLaMA-3.1-8B)	AUROC	74.7	84.9	+10.2

Experiment Figures

An illustration of hallucination locations in free-form text, showing that they are sparse and variable.

Main Takeaways

HaMI significantly outperforms state-of-the-art methods, particularly those that rely on fixed token positions or on a single type of signal (internal states alone or uncertainty alone).
The combination of internal representations and uncertainty metrics (Semantic Entropy) yields the best performance.
Adaptive token selection via MIL effectively handles variable-length responses where hallucination locations are unknown.
Improvements are consistent across different model sizes (8B, 12B, 70B) and architectures (LLaMA, Mistral).

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Language Models (LLMs) and token generation
Familiarity with Multiple Instance Learning (MIL) concepts (bags vs. instances)
Knowledge of internal representations (hidden states) in Transformers

Key Terms

MIL: Multiple Instance Learning—a form of weakly supervised learning where labels are assigned to bags of items (sequences) rather than individual items (tokens), and the goal is to predict bag labels by identifying key instances

AUROC: Area Under the Receiver Operating Characteristic curve, a threshold-independent classification metric equal to the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one

hallucination: Unfaithful or incorrect generations produced by an LLM that deviate from facts or the input context

hidden states: The internal vector representations of tokens within the layers of a neural network (LLM) before the final output layer

predictive uncertainty: A measure of how unsure a model is about its prediction, often calculated from token probabilities (derived from the output logits) or from entropy

semantic entropy: A metric that measures uncertainty by clustering generated answers based on meaning and calculating entropy over these clusters

perplexity: A measurement of how well a probability model predicts a sample; in LLMs, it reflects the 'surprise' of the model when generating text

hard negative: Tokens within a trustworthy (negative) response that look most similar to hallucinated tokens, used to train the model to be more discriminative
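
Two of the uncertainty terms above, perplexity and semantic entropy, reduce to short formulas once their inputs are available. The sketch below is a simplified illustration: it assumes per-token probabilities are given for perplexity, and for semantic entropy it assumes the sampled answers have already been grouped into meaning clusters (the hard part in practice) and estimates cluster probabilities from simple counts. Function names are hypothetical.

```python
import math
from collections import Counter


def perplexity(token_probs):
    """Exponential of the average negative log-probability of the tokens."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)


def cluster_entropy(cluster_ids):
    """Entropy (in nats) over meaning clusters of sampled answers.

    cluster_ids: one cluster label per sampled answer. Assumption: each
    sample counts equally, so cluster probability is its relative frequency.
    """
    if not cluster_ids:
        raise ValueError("need at least one sampled answer")
    total = len(cluster_ids)
    counts = Counter(cluster_ids)
    return -sum((c / total) * math.log(c / total) for c in counts.values())
```

When every sampled answer falls into the same cluster the entropy is zero, and it grows as the answers spread across more distinct meanings.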