
HALT: Hallucination Assessment via Log-probs as Time series

Ahmad Shapiro, Karan Taneja, Ashok Goel
Georgia Institute of Technology
arXiv (2026)
Factuality, Benchmark, Reasoning

πŸ“ Paper Summary

Hallucination suppression, Hallucination detection, Benchmarks and evaluation
HALT detects hallucinations by treating the sequence of token log-probabilities as a time series to model calibration bias, while HUB provides a unified benchmark across ten diverse LLM capabilities.
Core Problem
Reliably detecting hallucinations in LLMs is difficult because access to internal states is often restricted (proprietary APIs), and aggregate statistics (like mean entropy) fail to capture temporal uncertainty dynamics.
Why it matters:
  • Hallucinations undermine user trust and limit deployment in high-stakes domains
  • White-box methods require unrealistic access to model internals, while black-box methods relying on external retrieval or auxiliary LLMs introduce latency, cost, and their own hallucinations
  • Existing benchmarks focus narrowly on knowledge-intensive tasks, neglecting reasoning failures (logical hallucinations) in code or math
Concrete Example: In an Algorithmic Reasoning task from HUB, a model correctly parses an input but fails an internal arithmetic step (1+1...+4=15). A standard fact-checker might miss this logical error, but the sequence of log-probabilities during that step likely exhibits a distinct uncertainty pattern.
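The point about aggregate statistics can be illustrated with a toy sketch. The two log-probability sequences below are invented for illustration (they are not from the paper or from HUB), and `max_drop` is a hypothetical helper, not a feature the authors describe. Both sequences have the same mean, so a mean-based detector cannot tell them apart, while a simple temporal feature can.

```python
from statistics import mean

# Hypothetical per-token log-probabilities of the sampled token (made-up numbers).
steady = [-0.5, -0.5, -0.5, -0.5, -0.5, -0.5]  # uniformly moderate confidence
spiky = [-0.1, -0.1, -0.1, -2.5, -0.1, -0.1]   # confident, except one sharp dip


def max_drop(logprobs):
    """Largest step-to-step decrease in log-probability (a simple temporal feature)."""
    return max(prev - cur for prev, cur in zip(logprobs, logprobs[1:]))


# The scalar summary is (up to float rounding) identical for both sequences...
print(mean(steady), mean(spiky))
# ...but the temporal feature separates them clearly.
print(max_drop(steady), max_drop(spiky))
```

A hand-crafted feature like this is only a stand-in: the summary describes HALT as learning such temporal patterns from the sequence rather than specifying them by hand.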
Key Novelty
Hallucination Assessment via Log-probs as Time series (HALT)
  • Treats the top-k token log-probabilities at each generation step as a multi-dimensional time series rather than collapsing them into scalar summary statistics
  • Uses a lightweight Gated Recurrent Unit (GRU) to learn temporal patterns of model calibration bias (how confidence fluctuates over time) to distinguish hallucinations from correct outputs
  • Introduces HUB (Hallucination detection Unified Benchmark), extending detection to 'logical hallucinations' in reasoning tasks (math, code, symbolic) alongside traditional factual errors
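As a rough sketch of the first two points above, the code below runs a single-layer GRU over a sequence of top-k log-probability vectors and maps the final hidden state to a score in (0, 1). Everything here is an assumption made for illustration: the class name `TinyGRU`, the hidden size, the random untrained weights, the omission of bias terms, and the toy input values. It is not the authors' architecture or feature pipeline, and a real detector would be trained on labeled responses.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class TinyGRU:
    """Single-layer GRU with a sigmoid read-out (untrained, biases omitted)."""

    def __init__(self, input_size, hidden_size, seed=0):
        rng = random.Random(seed)
        self.hidden_size = hidden_size
        # One input/recurrent weight pair per gate: update z, reset r, candidate n.
        self.W = {g: rand_matrix(hidden_size, input_size, rng) for g in "zrn"}
        self.U = {g: rand_matrix(hidden_size, hidden_size, rng) for g in "zrn"}
        self.w_out = [rng.uniform(-0.1, 0.1) for _ in range(hidden_size)]

    def step(self, x, h):
        """Advance the hidden state h by one time step with input x."""
        z = [sigmoid(a + b)
             for a, b in zip(matvec(self.W["z"], x), matvec(self.U["z"], h))]
        r = [sigmoid(a + b)
             for a, b in zip(matvec(self.W["r"], x), matvec(self.U["r"], h))]
        rh = [ri * hi for ri, hi in zip(r, h)]
        n = [math.tanh(a + b)
             for a, b in zip(matvec(self.W["n"], x), matvec(self.U["n"], rh))]
        # Blend the previous state with the candidate state via the update gate.
        return [(1 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]

    def score(self, series):
        """Map a (T x k) log-prob series to a hallucination score in (0, 1)."""
        h = [0.0] * self.hidden_size
        for x in series:
            h = self.step(x, h)
        return sigmoid(sum(w * hi for w, hi in zip(self.w_out, h)))


# Hypothetical top-3 log-probs for a 4-token response (made-up values).
series = [
    [-0.05, -3.2, -4.1],
    [-0.40, -1.3, -2.9],
    [-1.10, -1.2, -1.4],  # near-tie among candidates: an uncertain step
    [-0.02, -4.5, -5.0],
]
model = TinyGRU(input_size=3, hidden_size=8)
p = model.score(series)
print(p)
```

Because the weights are random, the printed score carries no meaning; the sketch only shows the data flow, in which each generation step contributes a k-dimensional vector and the recurrent state summarizes how confidence evolves across the response.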
Evaluation Highlights
  • HALT (5M parameters) outperforms Lettuce (a fine-tuned ModernBERT-base encoder, 30x larger) on the HUB benchmark
  • Achieves a 60x speedup over encoder-based methods such as Lettuce
  • HUB benchmark covers 10 diverse capabilities, revealing that hallucination rates in the validation sets vary widely (from ~40% in Chat to ~95% in World Knowledge)
Breakthrough Assessment
8/10
Significant for proposing a computationally efficient, black-box-compatible detection method that works with closed APIs that expose log-probs. The unified benchmark, which covers logical hallucinations, fills a major gap.