HaluNet: Multi-Granular Uncertainty Modeling for Efficient Hallucination Detection in LLM Question Answering

📝 Paper Summary

Hallucination suppression, Uncertainty Estimation

HaluNet is a lightweight neural framework that detects hallucinations in single-pass LLM generation by fusing token probabilities, predictive entropy, and hidden semantic embeddings.

Core Problem

Existing hallucination detection methods either rely on expensive sampling-based consistency checks or capture only a single type of uncertainty (probabilistic or semantic), missing the complementary signals needed for robust detection.

Why it matters:

LLM hallucinations in Question Answering compromise reliability in search engines and autonomous agents
Sampling-based methods (like SelfCheckGPT) are too computationally costly for real-time deployment
Human annotation for hallucination labels is expensive and hard to scale, limiting the training of supervised detectors

Concrete Example: A model might generate a confident-sounding but factually wrong answer. A probability-based detector might miss this if token likelihoods are high (due to fluency), while a semantic method might catch it via embedding anomalies. HaluNet combines both signals to detect the error where single-feature methods fail.

Key Novelty

HaluNet: Multi-Granular Uncertainty Fusion

Unifies three distinct uncertainty signals—token log-likelihoods (confidence), predictive entropy (distributional uncertainty), and hidden states (semantic trajectory)—into a single trainable network
Uses a multi-branch architecture where scalar features are processed by MLPs and embeddings by CNNs, fused via attention or concatenation
Trains on 'pseudo-gold' labels generated by an LLM-as-a-Judge, eliminating the need for expensive human annotation while enabling supervised learning

Architecture

The architecture of HaluNet showing how token-level features are processed and fused.

Evaluation Highlights

Outperforms the strongest non-trained baseline by +0.067 AUROC (6.7 points, 0.772 to 0.839) on SQuAD (Llama3-8B)
Achieves 0.893 AUROC on TriviaQA with Llama3-8B, a gain of +0.066 (6.6 points) over the 0.827 baseline
Maintains sub-second inference latency (single-pass), whereas sampling-based baselines like SelfCheckGPT are orders of magnitude slower

Breakthrough Assessment

7/10

Strong practical contribution that unites disparate uncertainty signals in a lightweight, effective framework. The components (entropy, embeddings) are already known, but the fusion architecture and supervision strategy yield a state-of-the-art balance of efficiency and accuracy.

⚙️ Technical Details

Problem Definition

Setting: Answer-level hallucination detection in QA

Inputs: Context c, Question q, Generated Answer a = (x_1, ..., x_L)

Outputs: Hallucination score s in [0, 1]

Pipeline Flow

Feature Extraction: Extract log-likelihoods, entropies, and hidden states from the LLM during generation
Branch Encoding: Process scalars via MLP and embeddings via CNN
Fusion: Combine branch outputs via Attention or Concatenation
Prediction: Output hallucination probability
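
As a rough illustration of the Feature Extraction step, the sketch below derives the two scalar signals (token log-likelihood and predictive entropy) from per-step logits. The function names and the inputs `step_logits` and `token_ids` are hypothetical. In practice the logits would come from the frozen LLM during generation, alongside the layer-20 hidden states, which are not shown here.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def scalar_features(step_logits, token_ids):
    """Return one (log_likelihood, entropy) pair per generated token.

    step_logits: list of logit vectors, one per generation step (assumed input).
    token_ids:   index of the token actually emitted at each step.
    """
    features = []
    for logits, tok in zip(step_logits, token_ids):
        logp = log_softmax(logits)
        log_likelihood = logp[tok]                         # confidence in the chosen token
        entropy = -sum(math.exp(lp) * lp for lp in logp)   # flatness of the distribution
        features.append((log_likelihood, entropy))
    return features

# Toy example: 3-token vocabulary, 2-token answer.
# Step 1 is sharply peaked; step 2 is uniform (maximally uncertain).
toy_logits = [[5.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
toy_tokens = [0, 2]
toy_features = scalar_features(toy_logits, toy_tokens)
```

In the toy input, the uniform second step has a lower log-likelihood and a higher entropy than the peaked first step, which is the kind of contrast the scalar branches are meant to pick up.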

System Modules

Feature Extractor

Extracts raw uncertainty signals from the frozen LLM generation process

Model or implementation: Llama3-8B or Qwen3-14B (Layer 20)

Branch Encoders

Projects raw features into latent vectors

Model or implementation: CNN (kernel=3, filters=64) for embeddings; MLP for scalars

Fusion Module

Integrates latent representations and predicts hallucination score

Model or implementation: Concatenation + 2-layer MLP (hidden dim=64)

Novel Architectural Elements

Multi-branch topology processing heterogeneous inputs (scalars vs. vectors) with distinct encoders (MLP vs. CNN) before fusion
Integration of semantic embeddings specifically via 1D Convolution to capture local semantic trajectory anomalies
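
To make the multi-branch layout concrete, here is a minimal NumPy sketch of a forward pass: two scalar branches, a 1D-convolution branch over hidden states, concatenation, and a 2-layer fusion MLP. It is not the authors' implementation. The kernel size (3), filter count (64), and hidden dimension (64) come from the summary above; the pooling choices (mean over tokens, max over positions), the toy embedding size of 16, the random untrained weights, and all function names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
H, F, K, D = 64, 64, 3, 16  # hidden dim, CNN filters, kernel size, toy embedding dim

def relu(x):
    return np.maximum(x, 0.0)

def mlp_branch(seq, w, b):
    """Encode a per-token scalar sequence (L,) into a latent vector (H,).
    Assumption: project each scalar, then mean-pool over tokens."""
    h = relu(seq[:, None] * w[None, :] + b)   # (L, H)
    return h.mean(axis=0)

def cnn_branch(emb, kernels, bias):
    """1D convolution along the token axis, then max-pool over positions.
    emb: (L, D) with L >= kernel size; kernels: (F, K, D); returns (F,)."""
    k = kernels.shape[1]
    num_windows = emb.shape[0] - k + 1
    windows = np.stack([emb[i:i + k] for i in range(num_windows)])  # (W, K, D)
    conv = np.einsum("wkd,fkd->wf", windows, kernels) + bias        # (W, F)
    return relu(conv).max(axis=0)

def halunet_forward(loglik, entropy, hidden, p):
    """Fuse the three branch outputs by concatenation and score with a 2-layer MLP."""
    z = np.concatenate([
        mlp_branch(loglik, p["w_ll"], p["b_ll"]),
        mlp_branch(entropy, p["w_en"], p["b_en"]),
        cnn_branch(hidden, p["kernels"], p["k_bias"]),
    ])                                           # (2H + F,)
    h = relu(z @ p["w_f1"] + p["b_f1"])          # fusion layer 1
    logit = h @ p["w_f2"] + p["b_f2"]            # fusion layer 2 (scalar)
    return float(1.0 / (1.0 + np.exp(-logit)))   # hallucination score in (0, 1)

# Random, untrained parameters: the output is meaningless until trained.
params = {
    "w_ll": rng.normal(scale=0.1, size=H), "b_ll": np.zeros(H),
    "w_en": rng.normal(scale=0.1, size=H), "b_en": np.zeros(H),
    "kernels": rng.normal(scale=0.1, size=(F, K, D)), "k_bias": np.zeros(F),
    "w_f1": rng.normal(scale=0.1, size=(2 * H + F, H)), "b_f1": np.zeros(H),
    "w_f2": rng.normal(scale=0.1, size=H), "b_f2": 0.0,
}

num_tokens = 5
loglik = -rng.random(num_tokens)            # toy token log-likelihoods
entropy = rng.random(num_tokens)            # toy token entropies
hidden = rng.normal(size=(num_tokens, D))   # toy hidden states
score = halunet_forward(loglik, entropy, hidden, params)
```

Each input type keeps its own encoder, so the scalar sequences and the much higher-dimensional embeddings are mapped to comparable latent sizes before fusion.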

Modeling

Base Model: Llama3-8B and Qwen3-14B

Training Method: Supervised learning on a lightweight detector network (LLM backbone is frozen)

Objective Functions:

Purpose: Minimize prediction error against pseudo-gold labels.

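Formally: Binary Cross Entropy loss L(ϕ) = -(1/N) * sum_{i=1..N} [ y_i * log(p_i) + (1 - y_i) * log(1 - p_i) ], where y_i is the pseudo-gold label and p_i is the predicted hallucination probability for sample i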
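
The loss above can be checked numerically with a short pure-Python sketch. The label convention (1 = hallucinated, 0 = factual) and the clipping constant `eps` are assumptions for this example, not details taken from the paper.

```python
import math

def binary_cross_entropy(labels, probs, eps=1e-12):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)   # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)

labels = [1, 0, 1, 0]                 # assumed convention: 1 = hallucinated
confident = [0.9, 0.1, 0.8, 0.2]      # predictions that agree with the labels
uninformed = [0.5, 0.5, 0.5, 0.5]     # predictions that carry no information

loss_confident = binary_cross_entropy(labels, confident)
loss_uninformed = binary_cross_entropy(labels, uninformed)
```

A detector that always outputs 0.5 incurs a loss of log(2), while predictions that agree with the labels score lower, so minimizing the loss pushes the detector toward the pseudo-gold labels.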

Adaptation: None (The LLM itself is not adapted; HaluNet is an external probe)

Training Data:

SQuAD, TriviaQA, NQ-Open (10K samples each)
Context ratio 0.5 (balanced context-present and context-free)

Key Hyperparameters:

learning_rate: 1e-4
batch_size: 32
epochs: 20
hidden_dim: 64
cnn_filters: 64
extraction_layer: 20
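
For reference, the hyperparameters listed here can be gathered into a single configuration object. The class name `HaluNetConfig` and the field `cnn_kernel_size` are hypothetical; the kernel size of 3 is taken from the Branch Encoders description, and the remaining values are copied from the list above.

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class HaluNetConfig:
    """Illustrative container for the reported hyperparameters (not the authors' code)."""
    learning_rate: float = 1e-4
    batch_size: int = 32
    epochs: int = 20
    hidden_dim: int = 64
    cnn_filters: int = 64
    cnn_kernel_size: int = 3     # from the Branch Encoders description
    extraction_layer: int = 20   # transformer layer whose hidden states are read

config = HaluNetConfig()
config_dict = asdict(config)     # plain dict, convenient for logging
```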

Compute: Single-pass inference (sub-second); training is lightweight

Comparison to Prior Work

vs. PE/T-NLL: HaluNet combines these probabilistic signals with semantic embeddings for richer context
vs. SelfCheckGPT/SE: HaluNet is single-pass (efficient) vs. multi-sample (slow), and trainable vs. heuristic
vs. SAPLMA [not cited in paper]: SAPLMA uses a separate probe on hidden states; HaluNet fuses states with token probabilities

Limitations

Relies on the quality of LLM-as-a-Judge pseudo-labels, which may contain biases or errors
Performance drops in Out-of-Distribution (OOD) settings compared to In-Distribution
Requires access to model internals (logits, hidden states), limiting use with closed-source APIs
Layer selection (Layer 20) is a heuristic that may not transfer to all model architectures

Reproducibility

Code is not provided. Benchmark datasets (SQuAD, TriviaQA, NQ) are public. The method relies on pseudo-labels from an LLM-as-a-Judge, and the exact prompts are not detailed in the main text.

📊 Experiments & Results

Evaluation Setup

Answer-level hallucination detection on open-domain QA

Benchmarks:

SQuAD (Reading Comprehension / QA)
TriviaQA (Open-domain QA)
NQ-Open (Open-domain QA)

Metrics:

AUROC
F1@B
RA@50
AURAC
Statistical methodology: Not explicitly reported in the paper
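
AUROC, the headline metric above, can be read as the probability that a randomly chosen hallucinated answer receives a higher score than a randomly chosen factual one. The sketch below computes it by direct pairwise comparison, counting ties as half. It is a generic formulation rather than the paper's evaluation code, and the toy labels and scores are made up for illustration.

```python
def auroc(labels, scores):
    """AUROC via pairwise comparison of positive (1) and negative (0) scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative sample")
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0      # positive correctly ranked above negative
            elif sp == sn:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))

# Toy data: 1 = hallucinated, 0 = factual (assumed convention).
toy_labels = [1, 1, 0, 0, 1, 0]
toy_scores = [0.9, 0.7, 0.6, 0.2, 0.4, 0.4]
toy_auroc = auroc(toy_labels, toy_scores)
```

The pairwise form is quadratic in the number of samples. It is fine for a sanity check, but a rank-based implementation is preferable on full benchmarks.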

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
SQuAD (Context)	AUROC	0.772	0.839	+0.067
TriviaQA (Context)	AUROC	0.827	0.893	+0.066
SQuAD (No Context)	AUROC	0.687	0.763	+0.076
NQ (No Context)	AUROC	0.761	0.793	+0.032
Average (CR=0)	AUROC	0.765	0.781	+0.016

Ablation study demonstrates that combining all features yields the best performance, with embeddings providing the largest individual contribution.

Experiment Figures

Out-of-Distribution (OOD) generalization performance (AUROC and RA@50) when training on one dataset and testing on others.

Layer-wise ablation study showing detection performance using features from different transformer layers.

Main Takeaways

HaluNet consistently outperforms both single-feature methods (PE, T-NLL) and expensive sampling methods (SelfCheckGPT) across context and no-context settings.
Embeddings (hidden states) are the single most informative feature, but fusing them with probability/entropy signals yields robust gains.
Middle-to-late layers (around layer 20) contain the most useful hallucination signals; early layers lack abstraction, and the final layers are too task-specialized.
The method generalizes reasonably well to OOD datasets, though some performance drop is observed compared to in-distribution training.

📚 Prerequisite Knowledge

Prerequisites

Understanding of LLM generation (logits, probabilities)
Familiarity with aleatoric vs. epistemic uncertainty
Basic knowledge of neural networks (CNNs, MLPs, Attention)

Key Terms

LLM-as-a-Judge: Using a strong LLM to evaluate the correctness of another model's output, acting as a proxy for human labeling

Predictive Entropy (PE): A measure of uncertainty based on the flatness of the output probability distribution; high entropy means the model is unsure which token to pick

Aleatoric Uncertainty: Uncertainty arising from inherent ambiguity or noise in the data/context

Epistemic Uncertainty: Uncertainty arising from a lack of knowledge in the model parameters

AUROC: Area Under the Receiver Operating Characteristic curve—a metric measuring how well a classifier separates positive (hallucinated) and negative (factual) classes

Token Negative Log-Likelihood (T-NLL): The negative logarithm of the probability assigned to the generated token; a direct measure of model confidence

Hidden States: The internal vector representations (embeddings) of tokens within the transformer layers, capturing semantic meaning

SQuAD: Stanford Question Answering Dataset—a reading comprehension benchmark

CNN: Convolutional Neural Network—used here to process the sequence of embeddings and capture local dependencies

MLP: Multilayer Perceptron—a simple feedforward neural network

OOD: Out-of-Distribution—evaluating the model on data types or domains it wasn't trained on