HALT: Hallucination Assessment via Log-probs as Time series

📝 Paper Summary

Hallucination suppression · Hallucination detection · Benchmarks and evaluation

HALT detects hallucinations by treating the sequence of token log-probabilities as a time series to model calibration bias, while HUB provides a unified benchmark across ten diverse LLM capabilities.

Core Problem

Reliably detecting hallucinations in LLMs is difficult because access to internal states is often restricted (proprietary APIs), and aggregate statistics (like mean entropy) fail to capture temporal uncertainty dynamics.

Why it matters:

Hallucinations undermine user trust and limit deployment in high-stakes domains
White-box methods require unrealistic access to model internals, while black-box methods relying on external retrieval or auxiliary LLMs introduce latency, cost, and their own hallucinations
Existing benchmarks focus narrowly on knowledge-intensive tasks, neglecting reasoning failures (logical hallucinations) in code or math

Concrete Example: In an Algorithmic Reasoning task from HUB, a model correctly parses an input but fails an internal arithmetic step (1+1...+4=15). A standard fact-checker might miss this logical error, but the sequence of log-probabilities during that step likely exhibits a distinct uncertainty pattern.

Key Novelty

Hallucination Assessment via Log-probs as Time series (HALT)

Treats the top-k token log-probabilities at each generation step as a multi-dimensional time series rather than collapsing them into scalar summary statistics
Uses a lightweight Gated Recurrent Unit (GRU) to learn temporal patterns of model calibration bias (how confidence fluctuates over time) to distinguish hallucinations from correct outputs
Introduces HUB (Hallucination detection Unified Benchmark), extending detection to 'logical hallucinations' in reasoning tasks (math, code, symbolic) alongside traditional factual errors

Evaluation Highlights

HALT (5M parameters) outperforms Lettuce (a fine-tuned ModernBERT-base encoder, 30x larger) on the HUB benchmark
Achieves a 60x inference speedup over encoder-based methods like Lettuce
HUB benchmark covers 10 diverse capabilities, revealing that hallucination rates vary widely (from ~40% in Chat to ~95% in World Knowledge validation sets)

Breakthrough Assessment

8/10

Significant for proposing a computationally efficient, black-box compatible detection method that works on closed APIs exposing log-probs. The unified benchmark covering logical hallucinations fills a major gap.

⚙️ Technical Details

Problem Definition

Setting: Binary classification of an LLM response as 'hallucinated' or 'faithful' based solely on output token log-probabilities

Inputs: Sequence of top-k log-probability vectors from an LLM generation

Outputs: Binary label (Hallucination vs. Not Hallucination)

Pipeline Flow

Feature Extraction (Log-probs & Statistics)
Sequence Modeling (GRU)
Pooling & Classification

System Modules

Feature Extractor

Extract raw top-20 log-probs and compute derived summary statistics for each token step

Model or implementation: Deterministic algorithmic component
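
The feature extraction step can be sketched as follows. This is a minimal illustration, not the paper's implementation: the summary only states that raw top-k log-probs are combined with five derived entropy/rank statistics, so the five statistics chosen here (entropy, rank, top-1/top-2 margin, chosen-token log-prob, covered probability mass) and the names `step_features` and `sequence_features` are assumptions.

```python
import math
from typing import List, Tuple


def step_features(topk_logprobs: List[float], chosen_logprob: float) -> List[float]:
    """Build one time-step vector: k raw log-probs plus 5 illustrative statistics.

    The five statistics are assumptions made for this sketch; the summary
    does not list the exact features used by HALT.
    """
    # Sort descending so index 0 is always the most likely candidate.
    lp = sorted(topk_logprobs, reverse=True)
    probs = [math.exp(x) for x in lp]
    mass = sum(probs)  # probability mass covered by the top-k candidates
    # Renormalise the truncated distribution so the entropy is well defined.
    norm = [p / mass for p in probs]
    entropy = -sum(p * math.log(p) for p in norm if p > 0.0)
    # Rank of the generated token: number of candidates strictly more likely.
    rank = sum(1 for x in lp if x > chosen_logprob)
    margin = lp[0] - lp[1] if len(lp) > 1 else 0.0
    return lp + [entropy, float(rank), margin, chosen_logprob, mass]


def sequence_features(steps: List[Tuple[List[float], float]]) -> List[List[float]]:
    """Turn a whole generation into a (T x (k + 5)) multi-dimensional time series."""
    return [step_features(topk, chosen) for topk, chosen in steps]
```

With k = 20 this yields 25 values per generated token, matching the k_log_probs and d_stats settings reported for HALT.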

HALT Encoder

Process the sequence of feature vectors to capture temporal uncertainty patterns

Model or implementation: Bidirectional GRU (5M parameters)
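
To make the encoder concrete, here is a small dependency-free sketch of a single-layer bidirectional GRU run over a feature sequence. Everything beyond "bidirectional GRU" is an assumption: the hidden size, the single layer, the random untrained weights, and the names `GRUCell` and `bidirectional_gru` are illustrative only, and a real implementation would use a deep learning framework.

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


class GRUCell:
    """Standard GRU cell with reset (r), update (z) and candidate (n) gates."""

    def __init__(self, input_size: int, hidden_size: int, rng: random.Random):
        def init(rows: int, cols: int) -> Matrix:
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.hidden_size = hidden_size
        self.W = {g: init(hidden_size, input_size) for g in "rzn"}   # input weights
        self.U = {g: init(hidden_size, hidden_size) for g in "rzn"}  # recurrent weights
        self.b = {g: [0.0] * hidden_size for g in "rzn"}

    def step(self, x: Vector, h: Vector) -> Vector:
        wx = {g: matvec(self.W[g], x) for g in "rzn"}
        uh = {g: matvec(self.U[g], h) for g in "rzn"}
        r = [sigmoid(a + c + b) for a, c, b in zip(wx["r"], uh["r"], self.b["r"])]
        z = [sigmoid(a + c + b) for a, c, b in zip(wx["z"], uh["z"], self.b["z"])]
        # The reset gate scales how much of the previous state feeds the candidate.
        n = [math.tanh(a + ri * c + b)
             for a, ri, c, b in zip(wx["n"], r, uh["n"], self.b["n"])]
        # The update gate interpolates between the old state and the candidate.
        return [(1.0 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]


def bidirectional_gru(seq: List[Vector], fwd: GRUCell, bwd: GRUCell) -> List[Vector]:
    """Return one concatenated [forward; backward] state per time step."""
    h = [0.0] * fwd.hidden_size
    forward = []
    for x in seq:
        h = fwd.step(x, h)
        forward.append(h)
    h = [0.0] * bwd.hidden_size
    backward = []
    for x in reversed(seq):
        h = bwd.step(x, h)
        backward.append(h)
    backward.reverse()  # realign with the original time order
    return [f + b for f, b in zip(forward, backward)]
```

Each output state depends on the order of the inputs, which is what lets a recurrent encoder react to where in the generation the uncertainty occurs rather than only to how much there is overall.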

Classifier Head

Aggregate sequence information and predict hallucination probability

Model or implementation: Top-q pooling followed by linear layer
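
The summary does not define top-q pooling precisely. One plausible reading, sketched below purely as an assumption, is to average the largest q-fraction of activations over time for each hidden dimension and pass the pooled vector through a logistic linear layer. The names `top_q_pool` and `hallucination_probability`, the default q, and the weights are all hypothetical.

```python
import math
from typing import List


def top_q_pool(states: List[List[float]], q: float = 0.2) -> List[float]:
    """Per dimension, average the top q-fraction of values across time steps.

    Assumed interpretation: q close to 0 behaves like max pooling and
    q = 1.0 reduces to mean pooling.
    """
    if not states:
        raise ValueError("need at least one time step")
    n_keep = max(1, math.ceil(q * len(states)))
    pooled = []
    for dim in range(len(states[0])):
        column = sorted((s[dim] for s in states), reverse=True)
        pooled.append(sum(column[:n_keep]) / n_keep)
    return pooled


def hallucination_probability(pooled: List[float], weights: List[float], bias: float) -> float:
    """Linear layer followed by a sigmoid, giving P(hallucinated)."""
    logit = sum(w * x for w, x in zip(weights, pooled)) + bias
    return 1.0 / (1.0 + math.exp(-logit))
```

Under this reading, a short burst of strongly activated time steps is not averaged away by a long stretch of unremarkable tokens, which mean pooling would do.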

Novel Architectural Elements

Use of a GRU to model the entire trajectory of top-k log-probabilities as a time series for hallucination detection
Combination of raw top-k vectors with specific derived entropy/rank features in the time-series input

Modeling

Base Model: HALT (5M parameter GRU)

Training Method: Supervised learning (Binary Classification)

Objective Functions:

Purpose: Minimize classification error between predicted hallucination probability and ground truth labels.

Formally: Binary Cross-Entropy Loss.
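
For reference, binary cross-entropy over a batch of predicted probabilities can be written in a few lines. This is the standard definition rather than code from the paper; the function name, the label convention (1 = hallucinated), and the clipping constant `eps` are choices made for this sketch.

```python
import math
from typing import Sequence


def binary_cross_entropy(probs: Sequence[float], labels: Sequence[int],
                         eps: float = 1e-12) -> float:
    """Mean of -[y * log(p) + (1 - y) * log(1 - p)] over the batch."""
    total = 0.0
    for p, y in zip(probs, labels):
        # Clip so that log(0) never occurs for saturated predictions.
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)
```

A confident correct prediction contributes a loss near zero, while a confident wrong one is penalised heavily, which pushes the detector toward well-separated probabilities.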

Training Data:

HUB Benchmark splits: Training on Chat, Data-to-Text, and QA clusters
Validation and Test sets include held-out clusters (Reasoning, World Knowledge)

Key Hyperparameters:

k_log_probs: 20
d_stats: 5
model_parameters: 5 million

Compute: 30x fewer parameters than the ModernBERT-base encoder used by Lettuce; 60x inference speedup on HUB

Comparison to Prior Work

vs. Lettuce: HALT is 30x smaller, 60x faster, and uses only log-probs (no surface text analysis)
vs. SelfCheckGPT: HALT uses a single generation's log-prob trajectory rather than requiring multiple costly samples [not cited in paper]
vs. BSDetector: HALT models the *sequence* of probabilities via GRU rather than collapsing them into mean/max statistics, capturing temporal dynamics
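
The difference between aggregate statistics and trajectory modeling can be illustrated with a toy example. The two log-prob sequences below are invented, and `longest_low_run` with its threshold is a hypothetical order-sensitive feature used only for illustration; it is neither a HALT feature nor part of any baseline.

```python
from statistics import mean
from typing import List, Tuple


def aggregates(logprobs: List[float]) -> Tuple[float, float, float]:
    """Order-insensitive summaries: mean, min and max of the sequence."""
    return (round(mean(logprobs), 6), min(logprobs), max(logprobs))


def longest_low_run(logprobs: List[float], threshold: float = -2.0) -> int:
    """Length of the longest run of consecutive low-confidence steps."""
    best = run = 0
    for lp in logprobs:
        run = run + 1 if lp < threshold else 0
        best = max(best, run)
    return best


# Same values, different order: one sustained dip vs. isolated dips.
sustained = [-0.1, -0.1, -3.0, -3.0, -3.0, -0.1]
scattered = [-3.0, -0.1, -3.0, -0.1, -3.0, -0.1]

print(aggregates(sustained) == aggregates(scattered))          # True
print(longest_low_run(sustained), longest_low_run(scattered))  # 3 1
```

Any permutation-invariant summary treats the two sequences as identical, whereas a sequence model can in principle tell a sustained stretch of uncertainty from isolated uncertain tokens.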

Limitations

Detector trained on one LLM does not transfer reliably to another due to differing calibration biases
Requires access to token-level log-probabilities, which some commercial APIs may not expose
Performance depends on the base LLM's calibration; extremely poor calibration might obscure signals
No analysis provided for very long context or infinite generation scenarios

Reproducibility

Code availability is not explicitly provided in the paper text. Two HALT variants (HALT-L trained on Llama 3.1-8B log-probs and HALT-Q trained on Qwen 2.5-7B log-probs) are mentioned as released. HUB benchmark construction details are provided.

📊 Experiments & Results

Evaluation Setup

Binary classification of hallucinations across diverse tasks

Benchmarks:

HUB, the Hallucination detection Unified Benchmark (varied tasks: Chat, QA, Summarization, Code, Math, Symbolic Reasoning) [New]
HaluEval (QA, Dialogue, Summarization)
FAVA (Knowledge-intensive queries)
RAGTruth (RAG-based generation)

Metrics:

Macro-F1
Inference Speed (Speedup)
Statistical methodology: Not explicitly reported in the paper
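
A short sketch shows why macro-averaging matters when hallucination rates are skewed. It assumes labels of 1 for hallucinated and 0 for faithful, and that Macro-F1 is the unweighted mean of the per-class F1 scores; the summary does not state whether scores are additionally averaged across tasks. The function names are illustrative.

```python
from typing import Sequence


def f1_for_class(y_true: Sequence[int], y_pred: Sequence[int], cls: int) -> float:
    """F1 score treating `cls` as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    if tp == 0:
        return 0.0  # no true positives: F1 is 0 by convention
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int],
             classes: Sequence[int] = (0, 1)) -> float:
    """Unweighted mean of per-class F1, so each class counts equally."""
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# A 95%-hallucinated split scored against a trivial "always hallucinated" detector.
y_true = [1] * 19 + [0]
y_pred = [1] * 20
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                            # 0.95
print(round(macro_f1(y_true, y_pred), 3))  # 0.487
```

The trivial detector looks strong on accuracy but scores below 0.5 on Macro-F1 because it never identifies a faithful response.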

Key Results

Benchmark	Metric	Baseline (Lettuce)	This Paper (HALT)	Δ
HUB	Parameters	150,000,000	5,000,000	-145,000,000 (30x smaller)
HUB	Inference speed (relative)	1x	60x	+59x
HUB	Performance	Not reported in the paper	Not reported in the paper	Not reported in the paper

HALT outperforms the encoder-based baseline on the HUB benchmark, demonstrating that log-prob dynamics are more discriminative than surface text embeddings for this task.

Main Takeaways

Sequence-based modeling of log-probabilities (HALT) outperforms text-encoder approaches (Lettuce) while being significantly smaller and faster.
Modeling the *trajectory* of uncertainty (time series) captures signals that aggregate statistics (mean, entropy) miss.
Detectors are model-specific: A detector trained on Llama 3.1 log-probs does not transfer reliably to Qwen 2.5, supporting the hypothesis that calibration bias is specific to each model's architecture and weights.
HUB reveals that hallucination rates are highly task-dependent, ranging from ~40% in Chat to ~95% in World Knowledge tasks, necessitating macro-averaged evaluation.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Language Models (LLMs) and token generation
Basic knowledge of time-series classification
Familiarity with Recurrent Neural Networks (specifically GRUs)

Key Terms

log-probabilities: The logarithm of the probability assigned by the model to a specific token; a measure of model confidence

calibration bias: The systematic discrepancy between a model's predicted probabilities and the actual correctness of its outputs

GRU: Gated Recurrent Unit, a type of recurrent neural network that uses update and reset gates to process sequential data and retain information across many time steps

logical hallucinations: Errors where the model fails in reasoning or internal logic (e.g., bad math, invalid code logic) rather than just retrieving incorrect facts

factual hallucinations: Errors where the model generates information that contradicts established reality or the provided context

black-box: A setting where the internal weights and states of a model are not accessible, only its inputs and outputs (and sometimes log-probs)

teacher-forcing: A technique in which the model is conditioned at each step on a fixed previous token (the ground-truth token during training, or the already generated token in this context) rather than on a fresh prediction of its own, so that probabilities can be computed for a given sequence