Hallucination Detection in LLMs: Fast and Memory-Efficient Fine-Tuned Models

📝 Paper Summary

Uncertainty Estimation, Hallucination Detection

A fast, memory-efficient deep ensemble method that fine-tunes pre-trained LLMs using shared weights and rank-one fast weights (via BatchEnsemble and LoRA) to detect hallucinations accurately.

Core Problem

Deep ensembles provide robust uncertainty estimates for hallucination detection but are computationally prohibitive for Large Language Models (LLMs) due to the need to train and store multiple large models.

Why it matters:

LLMs frequently hallucinate (deviate from instructions or facts), posing severe risks in safety-critical fields like healthcare
Sample-based uncertainty methods often capture model uncertainty less reliably than ensembles
Existing ensemble methods for LLMs require massive compute resources, making them impractical for most practitioners

Concrete Example: When an LLM is asked a question not in its context (e.g., an unanswerable SQuAD 2.0 question), a single model might confidently invent an answer. A standard deep ensemble would detect this high uncertainty but requires 4x-10x the memory. The proposed method detects the high uncertainty using only one GPU.

Key Novelty

LoRA-based BatchEnsemble for LLMs

Adapt BatchEnsemble to fine-tune pre-trained LLMs rather than training from scratch, using LoRA to minimize trainable parameters
Represent each ensemble member using a single shared pre-trained weight matrix multiplied elementwise by member-specific rank-one 'fast weights'
Reformulate hallucination detection as a binary classification task using uncertainty metrics derived from this memory-efficient ensemble

Architecture

The proposed BatchEnsemble architecture applied to a pre-trained LLM with LoRA

Evaluation Highlights

Achieves 97.8% accuracy in detecting faithfulness hallucinations on SQuAD 2.0, outperforming sample-based baselines
Attains 68% accuracy in detecting factual hallucinations on MMLU without compromising predictive performance
Reduces memory growth in the number of ensemble members M from linear, O(M) full model copies, to near-constant, since each added member contributes only small fast-weight vectors; this enables training on a single A40 GPU
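The memory claim can be checked with simple arithmetic. The sketch below counts stored weights for one linear layer, assuming a hypothetical 4096×4096 layer (not a figure from the paper) and the ensemble size of 4 listed under the hyperparameters. The function names are illustrative, and LoRA parameters are left out of the count.

```python
def deep_ensemble_params(d_in, d_out, members):
    # Every member stores its own full weight matrix.
    return members * d_in * d_out

def batchensemble_params(d_in, d_out, members):
    # One shared matrix plus two fast-weight vectors (r, s) per member.
    return d_in * d_out + members * (d_in + d_out)

D_IN, D_OUT, M = 4096, 4096, 4  # layer size is a hypothetical example
full = deep_ensemble_params(D_IN, D_OUT, M)
shared = batchensemble_params(D_IN, D_OUT, M)
ratio = full / shared  # how many times larger the deep ensemble is
```

Under these assumptions each extra member adds d_in + d_out values (8,192) instead of a full matrix (about 16.8 million), which is why the growth is close to constant in practice.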

Breakthrough Assessment

7/10

Significant practical contribution by making deep ensembles feasible for LLMs on single GPUs. Performance is strong on faithfulness hallucinations, though factual hallucination detection shows mixed results against heavy-regularization baselines.

⚙️ Technical Details

Problem Definition

Setting: Uncertainty estimation for next-token prediction to classify outputs as hallucinated or not

Inputs: Input context sequence x_<t

Outputs: Predictive entropy H of the output distribution P(x_t | x_<t)
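A worked form of the output quantity may help. It assumes the ensemble predictive distribution is the plain average over the M members (the usual deep-ensemble convention; the paper's exact aggregation may differ) and writes V for the vocabulary, a symbol introduced here for illustration:

```latex
\bar{P}(x_t \mid x_{<t}) = \frac{1}{M} \sum_{m=1}^{M} P_m(x_t \mid x_{<t}),
\qquad
H = - \sum_{v \in V} \bar{P}(v \mid x_{<t}) \, \log \bar{P}(v \mid x_{<t})
```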

Pipeline Flow

Input Processing (Context/Prompt)
Ensemble Forward Pass (Shared Weights modulated by Member-Specific Fast Weights)
Uncertainty Calculation (Entropy Aggregation)
Binary Classifier (Hallucination Detection)

System Modules

Pre-trained LLM Backbone (Ensemble Core)

Provide shared base knowledge and feature extraction

Model or implementation: Mistral-7B-Instruct-v0.2

BatchEnsemble Layer (Ensemble Core)

Generate diverse ensemble member predictions using shared weights and rank-1 fast weights

Model or implementation: Custom BatchEnsemble linear layers with LoRA

Uncertainty Estimator

Calculate predictive entropy and decompose into aleatoric/epistemic uncertainty

Model or implementation: Statistical aggregation
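One common way to realize this aggregation is sketched below. It assumes the standard decomposition for ensembles: total uncertainty is the entropy of the averaged distribution, aleatoric uncertainty is the average of the member entropies, and epistemic uncertainty is their difference. Whether the paper uses exactly this decomposition is an assumption, and the function names are illustrative.

```python
import math

def entropy(p):
    # Shannon entropy in nats; zero-probability entries contribute nothing.
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

def decompose_uncertainty(member_probs):
    """member_probs: one next-token distribution per ensemble member."""
    m = len(member_probs)
    vocab = len(member_probs[0])
    # Average the member distributions token by token.
    mean_p = [sum(p[v] for p in member_probs) / m for v in range(vocab)]
    total = entropy(mean_p)                               # predictive entropy
    aleatoric = sum(entropy(p) for p in member_probs) / m  # expected entropy
    epistemic = total - aleatoric                          # member disagreement
    return total, aleatoric, epistemic
```

When all members agree, the epistemic term is zero; it grows as the members' distributions diverge.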

Hallucination Classifier

Classify prediction as hallucination or correct based on uncertainty metrics

Model or implementation: Binary Classifier (e.g., Logistic Regression, Random Forest)
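As an illustration of this last stage, the sketch below fits a one-feature logistic regression by gradient descent on predictive entropy. The toy entropies and labels are invented for the example and are not data from the paper; a library implementation would normally replace the hand-written training loop.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(xs, ys, lr=0.5, steps=5000):
    # Batch gradient descent on the mean log-loss of a 1-D logistic model.
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b

def predict_hallucination(w, b, entropy_value):
    # True means the output is flagged as a hallucination.
    return sigmoid(w * entropy_value + b) >= 0.5

# Hypothetical training set: low entropy = correct (0), high = hallucinated (1).
entropies = [0.1, 0.2, 0.3, 1.5, 1.8, 2.0]
labels = [0, 0, 0, 1, 1, 1]
w, b = train_logistic(entropies, labels)
```

In practice the feature vector could also include the aleatoric and epistemic terms rather than total entropy alone.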

Novel Architectural Elements

Adaptation of BatchEnsemble for fine-tuning pre-trained transformers: Replaces randomly initialized shared weights with pre-trained weights U = W_pretrained
Integration of LoRA with BatchEnsemble: Updates shared weights via LoRA (W + BA) while maintaining member diversity via fast weights
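The two elements above can be sketched for a single linear layer. The code assumes the fast weights modulate the LoRA-updated matrix, i.e. member m uses (W + BA) ∘ (s_m r_mᵀ); the summary does not state the exact order of composition, so this is an assumption, as are all function names. It also shows why no per-member matrix needs to be stored: scaling the input by r and the output by s gives the same result.

```python
def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(B, A):
    cols = list(zip(*A))
    return [[sum(b * a for b, a in zip(row, col)) for col in cols] for row in B]

def add(W, D):
    return [[w + d for w, d in zip(rw, rd)] for rw, rd in zip(W, D)]

def batchensemble_forward(W, B, A, r, s, x):
    # Shared weights updated by LoRA: W_eff = W + B A (shape d_out x d_in).
    W_eff = add(W, matmul(B, A))
    # Cheap form: scale the input by r, apply shared weights, scale output by s.
    h = matvec(W_eff, [ri * xi for ri, xi in zip(r, x)])
    return [si * hi for si, hi in zip(s, h)]

def explicit_member_forward(W, B, A, r, s, x):
    # Reference form: build the member matrix W_eff ∘ (s rᵀ) explicitly.
    W_eff = add(W, matmul(B, A))
    W_m = [[w * si * rj for w, rj in zip(row, r)] for row, si in zip(W_eff, s)]
    return matvec(W_m, x)
```

Both functions return the same output for any member, but only the second materializes a full matrix per member.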

Modeling

Base Model: Mistral-7B-Instruct-v0.2

Training Method: Supervised Fine-Tuning (SFT) with BatchEnsemble and LoRA

Objective Functions:

Purpose: Minimize prediction error.

Formally: Standard Cross-Entropy Loss on target tokens

Adaptation: LoRA (rank=8, alpha=32) applied to all modules

Trainable Parameters: LoRA matrices (A, B) and BatchEnsemble fast weight vectors (r, s)

Training Data:

SQuAD and SQuAD 2.0 (mixed answerable/unanswerable)
MMLU (multiple choice questions)

Key Hyperparameters:

lora_rank: 8
lora_alpha: 32
ensemble_size: 4
fast_weight_initialization_mean: 1.0
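A mean of 1.0 for the fast weights means each member starts close to the unmodified shared weights, since multiplying by vectors of ones leaves the matrix unchanged. A minimal initialization sketch follows, assuming a Gaussian with a hypothetical standard deviation of 0.1; the summary gives only the mean, so both the distribution and the spread are assumptions.

```python
import random

def init_fast_weights(dim, mean=1.0, std=0.1, seed=None):
    # std=0.1 is a placeholder value, not taken from the paper.
    rng = random.Random(seed)
    return [rng.gauss(mean, std) for _ in range(dim)]

# One (r, s) pair per ensemble member; sizes here are illustrative.
members = [
    (init_fast_weights(8, seed=2 * m), init_fast_weights(4, seed=2 * m + 1))
    for m in range(4)
]
```

The small random spread is what makes the members differ from one another at the start of fine-tuning.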

Compute: Single A40 GPU for training and inference

Comparison to Prior Work

vs. Deep Ensembles: Uses shared weights + rank-1 modulation so memory grows near O(1) in ensemble size M rather than O(M)
vs. LoRA Ensemble: LoRA Ensemble assigns distinct LoRA matrices per member; this method shares LoRA matrices and uses fast weights for diversity
vs. Sample-based: Aggregates predictions from distinct parameterizations rather than just stochastic sampling from one model
vs. MIMO (Multi-Input Multi-Output) [not cited in paper]: MIMO processes multiple inputs in one pass; BatchEnsemble processes one input with multiple effective weights in one pass
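The difference from a LoRA Ensemble can be made concrete by counting trainable parameters for one adapted layer. The sketch assumes a hypothetical 4096×4096 layer, the rank of 8 and ensemble size of 4 listed in this summary, and fast weights on every adapted layer; the function names are illustrative.

```python
def lora_params(d_in, d_out, rank):
    # A is (rank x d_in) and B is (d_out x rank).
    return rank * (d_in + d_out)

def lora_ensemble_trainable(d_in, d_out, rank, members):
    # Each member owns a separate (A, B) pair.
    return members * lora_params(d_in, d_out, rank)

def batchensemble_lora_trainable(d_in, d_out, rank, members):
    # One shared (A, B) pair plus fast-weight vectors (r, s) per member.
    return lora_params(d_in, d_out, rank) + members * (d_in + d_out)

per_member_lora = lora_ensemble_trainable(4096, 4096, 8, 4)
shared_lora = batchensemble_lora_trainable(4096, 4096, 8, 4)
```

Under these assumptions the shared-LoRA variant trains 98,304 parameters for the layer, against 262,144 for the per-member variant.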

Limitations

Factual hallucination detection accuracy (68%) is lower than faithfulness detection
LoRA Ensemble baseline with high weight decay sometimes outperforms BatchEnsemble on uncertainty estimation
OOD (Out-of-Distribution) detection performance drops significantly compared to in-distribution tasks
Requires fine-tuning; cannot be applied zero-shot to frozen LLMs

Reproducibility

Code: https://github.com/Gabriel-Arteaga/LLM-Ensemble

The code is publicly available at the repository above. The base model (Mistral-7B-Instruct-v0.2), the LoRA hyperparameters (r=8, alpha=32), and the ensemble size (4) are specified.

📊 Experiments & Results

Evaluation Setup

Hallucination detection via binary classification on uncertainty features

Benchmarks:

SQuAD 2.0: Faithfulness Hallucination Detection (Unanswerable Questions)
MMLU: Factual Hallucination Detection (Multiple Choice)

Metrics:

Classification Accuracy (Hallucination vs. Correct)
F1 Score (Predictive Performance)
Exact Match (SQuAD)
Accuracy (MMLU Predictive)
Statistical methodology: Not explicitly reported in the paper

Key Results

Hallucination detection accuracy using uncertainty estimates. BatchEnsemble excels at faithfulness but trails LoRA Ensemble on factual tasks.

Benchmark	Metric	Baseline	This Paper	Δ
SQuAD 2.0	Classification Accuracy	95.5	97.8	+2.3
SQuAD 2.0	Classification Accuracy	96.4	97.8	+1.4
MMLU	Classification Accuracy	73.2	68.2	-5.0

Predictive performance on downstream tasks. BatchEnsemble maintains high performance while regularized baselines suffer.

Benchmark	Metric	Baseline	This Paper	Δ
SQuAD	F1	89.2	89.5	+0.3
SQuAD	F1	87.0	89.5	+2.5

Experiment Figures

Comparison of Inference Speed (left) and Parameter Size (right) vs. Ensemble Size

Main Takeaways

BatchEnsemble successfully scales to LLMs using LoRA, requiring only a single GPU for training and inference
Method achieves the highest accuracy among the compared methods (97.8%) in detecting faithfulness hallucinations (deviations from instructions or context)
Uncertainty rises sharply when the model encounters unanswerable questions, consistent with the ensemble capturing its own lack of knowledge
While LoRA Ensembles with heavy regularization offer better uncertainty for factual errors, they significantly degrade predictive performance (lower F1/EM scores), whereas BatchEnsemble maintains high predictive quality
Inference speed scales better than sample-based methods because BatchEnsemble processes all members in a single vectorized forward pass

📚 Prerequisite Knowledge

Prerequisites

Deep Ensembles
Low-Rank Adaptation (LoRA)
Predictive Entropy (Aleatoric vs Epistemic Uncertainty)
BatchEnsemble

Key Terms

BatchEnsemble: A parameter-efficient ensemble method where members share a weight matrix and differ only by rank-one 'fast weights' multiplied elementwise

LoRA: Low-Rank Adaptation—a technique to fine-tune large models by updating only small low-rank matrices added to the frozen weights

Fast weights: In BatchEnsemble, small trainable vectors (rank-1 matrices) unique to each ensemble member that modulate the shared weights

Faithfulness hallucination: When an LLM deviates from the provided instructions or context (e.g., answering a question that the context says is unanswerable)

Factual hallucination: When an LLM generates content that contradicts verifiable real-world facts

Predictive entropy: A measure of uncertainty calculated from the distribution of predicted tokens; high entropy implies high uncertainty

Aleatoric uncertainty: Uncertainty arising from inherent noise or variability in the data (irreducible)

Epistemic uncertainty: Uncertainty arising from the model's lack of knowledge (reducible with more data)

SQuAD: Stanford Question Answering Dataset—a reading comprehension benchmark

MMLU: Massive Multitask Language Understanding—a benchmark evaluating models on factual knowledge across diverse subjects