HD-NDEs: Neural Differential Equations for Hallucination Detection in LLMs

📝 Paper Summary

Hallucination suppression, White-box hallucination detection

HD-NDEs detects hallucinations by modeling the sequence of LLM internal states as continuous dynamic trajectories using Neural Differential Equations, rather than relying solely on the final token's representation.

Core Problem

Existing classification-based methods often rely on the hidden state of the final token to detect hallucinations, failing when non-factual information appears early or mid-sequence.

Why it matters:

Hallucinations in Large Language Models (LLMs) limit real-world deployment by producing inaccurate or non-factual statements
Current methods struggle to capture the reliability of the entire sequence if the error occurs before the final token, reducing detection accuracy
Verification via external retrieval is computationally expensive and slow for high-throughput applications

Concrete Example: When an LLM answers a question incorrectly (e.g., 'The first virus was discovered by...'), the hidden state of the *last* token might look nearly identical to that of a correct answer (as shown in PCA analysis), even if the middle tokens diverged significantly.

Key Novelty

Hallucination Detection via Neural Differential Equations (HD-NDEs)

Treats the sequence of token hidden states as a continuous-time dynamic system rather than discrete, independent points
Uses Neural ODEs, CDEs, and SDEs to model the 'trajectory' of thought within the LLM's latent space, capturing how information evolves over the entire generation process
Maps this full dynamic trajectory to a classification space to determine truthfulness, capturing early-sequence errors that final-token classifiers miss

Evaluation Highlights

Achieves over 14% improvement in AUC-ROC on the True-False dataset compared to state-of-the-art techniques
Consistently outperforms baseline methods (like SAPLMA and ITI) across five datasets (TruthfulQA, SQuAD, etc.) and six LLMs (including LLaMA-2-7B and Vicuna-7B)
Neural CDEs generally yield the highest detection performance among the three differential equation variants (ODE, CDE, SDE) tested

Breakthrough Assessment

7/10

Novel application of Neural Differential Equations to the specific problem of hallucination detection. The theoretical motivation (modeling dynamics) addresses a clear weakness in prior snapshot-based methods, and empirical gains are significant.

⚙️ Technical Details

Problem Definition

Setting: Binary classification of generated text sequences based on internal model states

Inputs: A sequence of token hidden states x derived from the LLM's last layer during generation

Outputs: Probability P(ξ|o), conditioned on the observed hidden-state sequence (denoted o here), where ξ=1 indicates hallucination and ξ=0 indicates factual content

Pipeline Flow

Feature Extraction (LLM Hidden States)
Dimensionality Reduction (PCA)
Dynamic Modeling (Neural DE Solver)
Classification Head

System Modules

Feature Extractor (Input Processing)

Extract internal state representations from the LLM

Model or implementation: Target LLM (e.g., LLaMA-2-7B, Vicuna-7B)

Projection Layer (Input Processing)

Reduce high-dimensional hidden states to a manageable size for the Neural DE solver

Model or implementation: PCA (Principal Component Analysis)
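
A minimal sketch of this projection step, assuming NumPy is available. The helper names `fit_pca` and `pca_project`, the toy data, and the choice of 2 retained components are illustrative assumptions; the summary does not state the dimensionality used in the paper.

```python
import numpy as np

def fit_pca(states, k):
    """Fit a PCA basis on stacked hidden states of shape (n_tokens, d)."""
    mean = states.mean(axis=0)
    # Rows of vt are principal directions, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(states - mean, full_matrices=False)
    return mean, vt[:k]

def pca_project(states, mean, components):
    """Project hidden states onto the fitted basis, giving shape (n_tokens, k)."""
    return (states - mean) @ components.T

rng = np.random.default_rng(0)
hidden = rng.normal(size=(50, 16))  # toy stand-in for LLM hidden states
mean, comps = fit_pca(hidden, k=2)
reduced = pca_project(hidden, mean, comps)
```

In practice the basis would presumably be fitted once on training sequences and reused unchanged at test time, so that all trajectories live in the same reduced space.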

Neural DE Solver

Model the continuous trajectory of the latent state z(t)

Model or implementation: Neural ODE, Neural CDE, or Neural SDE network
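
For reference, the three families are usually written as below. These are the standard textbook forms with illustrative symbols (vector fields $f_\theta$ and $g_\theta$, control path $X$ built from the input sequence, Brownian motion $W$); the paper's exact notation and parameterization may differ.

```latex
\begin{aligned}
\text{Neural ODE:}\quad & \frac{dz(t)}{dt} = f_\theta\bigl(z(t), t\bigr), \\
\text{Neural CDE:}\quad & dz(t) = f_\theta\bigl(z(t)\bigr)\, dX(t), \\
\text{Neural SDE:}\quad & dz(t) = f_\theta\bigl(z(t), t\bigr)\, dt + g_\theta\bigl(z(t), t\bigr)\, dW(t).
\end{aligned}
```

In the CDE case, $f_\theta(z)$ is matrix-valued, so that its product with the increment $dX(t)$ has the same dimension as $z$.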

Classifier

Predict probability of hallucination based on the modeled trajectory

Model or implementation: Linear layer + Sigmoid
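
A small sketch of such a head in plain Python, assuming it reads the terminal latent state z(T) of the solved trajectory (the summary does not specify which representation is fed to the classifier). The weights, bias, and input values are made up for illustration.

```python
import math

def sigmoid(a):
    """Numerically stable logistic function."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)

def classify(z_final, weights, bias):
    """Linear layer followed by a sigmoid: returns P(hallucination)."""
    logit = sum(w * z for w, z in zip(weights, z_final)) + bias
    return sigmoid(logit)

# Toy terminal state and parameters (hypothetical values).
p = classify([0.5, -1.0], weights=[2.0, 1.0], bias=0.0)
```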

Novel Architectural Elements

Integration of Neural Differential Equation solvers (ODE/CDE/SDE) directly on top of projected LLM hidden states to capture temporal dynamics for truthfulness classification
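
To illustrate why integrating over the whole sequence can separate cases that a final-token snapshot cannot, here is a toy controlled-differential-equation update in plain Python. It uses a first-order Euler discretization for brevity (the summary lists RK4 for ODE/CDE), and `toy_field` is a hand-written stand-in for a trained network; all names and values are assumptions.

```python
def cde_euler(z0, path, field):
    """Integrate dz = field(z) dX with one explicit Euler step per token.

    path  : list of projected hidden states, one vector per token
    field : maps z (length m) to an m x d matrix given as a list of rows
    """
    z = list(z0)
    for x_prev, x_next in zip(path, path[1:]):
        dx = [b - a for a, b in zip(x_prev, x_next)]
        matrix = field(z)
        z = [zi + sum(mij * dxj for mij, dxj in zip(row, dx))
             for zi, row in zip(z, matrix)]
    return z

def toy_field(z):
    """State-dependent 1 x 1 vector field (hypothetical, not learned)."""
    return [[1.0 + z[0]]]

# Two paths with identical first and last states but different middles.
path_a = [[0.0], [1.0], [0.0]]
path_b = [[0.0], [0.0], [0.0]]
z_a = cde_euler([0.0], path_a, toy_field)
z_b = cde_euler([0.0], path_b, toy_field)
```

Both paths end at the same hidden state, yet the integrated latent states differ, because the mid-sequence excursion in `path_a` is accumulated along the way.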

Modeling

Base Models (evaluated): LLaMA-2-7B, LLaMA-2-13B, LLaMA-2-7B-Chat, Vicuna-7B-v1.5, Vicuna-13B-v1.5, Mistral-7B-v0.1

Training Method: Supervised training of the Neural DE and classifier components (frozen LLM)

Objective Functions:

Purpose: Minimize binary classification error.

Formally: Standard binary cross-entropy loss L(θ) based on predicted probability and ground truth label.
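
Written out, the standard binary cross-entropy takes the form below. Here $N$ (number of training sequences), $\xi_i$ (ground-truth label of sequence $i$), and $p_i$ (predicted hallucination probability) are notation chosen for this sketch, not necessarily the paper's.

```latex
\mathcal{L}(\theta) = -\frac{1}{N} \sum_{i=1}^{N}
  \Bigl[ \xi_i \log p_i + (1 - \xi_i) \log\bigl(1 - p_i\bigr) \Bigr]
```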

Adaptation: Not applicable (LLM is frozen; detector is trained)

Trainable Parameters: Parameters of the Neural DE functions (f, g, h) and the final classifier

Training Data:

True-False Dataset
TruthfulQA
SQuAD
HaluEval
XSum (summarization)

Key Hyperparameters:

ode_solver: Fourth-order Runge-Kutta (RK4) for ODE/CDE
sde_solver: Euler-Maruyama for SDE
adjoint_method: Used for memory-efficient gradient computation
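
The two solvers named above can be sketched in a few lines of plain Python for scalar states. This is a generic illustration of the numerical schemes, not the paper's implementation; the function names and the toy decay dynamics are assumptions.

```python
import math
import random

def rk4_step(f, z, t, h):
    """One classical fourth-order Runge-Kutta step for dz/dt = f(z, t)."""
    k1 = f(z, t)
    k2 = f(z + 0.5 * h * k1, t + 0.5 * h)
    k3 = f(z + 0.5 * h * k2, t + 0.5 * h)
    k4 = f(z + h * k3, t + h)
    return z + (h / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)

def euler_maruyama_step(f, g, z, t, h, rng):
    """One Euler-Maruyama step for dz = f(z, t) dt + g(z, t) dW."""
    dw = rng.gauss(0.0, math.sqrt(h))  # Brownian increment with variance h
    return z + f(z, t) * h + g(z, t) * dw

def decay(z, t):
    """Toy drift dz/dt = -z, whose exact solution is z(t) = exp(-t)."""
    return -z

def no_noise(z, t):
    """Zero diffusion, which reduces Euler-Maruyama to plain Euler."""
    return 0.0

# Integrate the toy ODE from t = 0 to t = 1 in ten RK4 steps.
z, h = 1.0, 0.1
for step in range(10):
    z = rk4_step(decay, z, step * h, h)

# Single stochastic step with the noise switched off.
z_sde = euler_maruyama_step(decay, no_noise, 1.0, 0.0, h, random.Random(0))
```

RK4 costs four evaluations of the vector field per step, which is part of the computational overhead relative to a single linear probe.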

Compute: Not reported in the paper

Comparison to Prior Work

vs. SAPLMA: SAPLMA uses only the final token's state; HD-NDEs models the entire sequence trajectory.
vs. Logit-based: HD-NDEs uses internal hidden states rather than output probabilities.
vs. Geometry-based (like ITI): HD-NDEs focuses on temporal dynamics (trajectory) rather than static geometric directions [not cited in paper].

Limitations

Computationally more intensive than simple linear classifiers due to ODE solving steps
Requires white-box access to the LLM's internal hidden states, unlike logit-based or consistency-based methods
Performance depends on the quality of the dimensionality reduction (PCA) step

Reproducibility

No code availability statement provided. Mathematical formulations for Neural ODEs/CDEs/SDEs are detailed. Hyperparameters like specific learning rates or batch sizes are not explicitly listed in the main text.

📊 Experiments & Results

Evaluation Setup

Supervised binary classification (Hallucination vs. Factual) on diverse datasets

Benchmarks:

True-False Dataset (Factual QA)
TruthfulQA (QA targeting imitative falsehoods)
SQuAD (Reading Comprehension)
HaluEval (Hallucination Evaluation)
XSum (Abstractive Summarization)

Metrics:

AUC-ROC
Statistical methodology: Not explicitly reported in the paper
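
As a reminder of what the metric measures, AUC-ROC equals the probability that a randomly chosen positive example (here, a hallucinated sequence) receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition; the labels and scores are toy values, not results from the paper.

```python
def auc_roc(labels, scores):
    """AUC-ROC via pairwise comparison; tied scores count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 0, 1, 0]          # 1 = hallucination, 0 = factual
scores = [0.9, 0.2, 0.4, 0.6]  # toy detector outputs
auc = auc_roc(labels, scores)
```

The pairwise form is quadratic in the number of examples; it is meant for clarity, and a rank-based computation is preferable for large evaluation sets.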

Key Results

HD-NDEs consistently outperforms the SAPLMA baseline across multiple datasets.
Benchmark	Metric	Baseline	This Paper	Δ
True-False Dataset	AUC-ROC	Not reported in the paper	Not reported in the paper	Not reported in the paper

Main Takeaways

HD-NDEs achieves over 14% improvement in AUC-ROC on the True-False dataset compared to state-of-the-art techniques.
Neural CDEs generally perform best among the DE variants because they incorporate the controlled path of the underlying data.
The method is effective across varied tasks including QA (TruthfulQA, SQuAD) and summarization (XSum).
Using the full trajectory of hidden states captures hallucinations that occur mid-sequence, which final-token methods miss.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Transformer internal states (hidden layers)
Basic knowledge of Ordinary Differential Equations (ODEs)
Familiarity with binary classification metrics (AUC-ROC)

Key Terms

Neural DEs: Neural Differential Equations—a family of models (ODEs, CDEs, SDEs) that use neural networks to parameterize the derivative of a state, modeling continuous dynamics

Neural CDEs: Neural Controlled Differential Equations—a variant where the system evolves based on a continuous control path derived from the input data stream

Neural SDEs: Neural Stochastic Differential Equations—a variant incorporating random noise (Brownian motion) to model uncertainty in the system dynamics

AUC-ROC: Area Under the Receiver Operating Characteristic Curve—a metric measuring the ability of a classifier to distinguish between classes at various threshold settings

PCA: Principal Component Analysis—a technique for reducing the dimensionality of data while preserving as much variance as possible

logit-based methods: Approaches that use the raw output scores (logits) of the model to estimate uncertainty or probability

Euler method: A basic numerical procedure for solving ordinary differential equations with a given initial value

Runge-Kutta: A family of more advanced iterative methods for approximating solutions to ordinary differential equations

Adjoint sensitivity method: A technique to compute gradients for Neural ODEs by solving a second, backward-time ODE, allowing backpropagation without storing all intermediate steps

Brownian motion: Random motion of particles, used here mathematically to introduce stochastic noise into the Neural SDE model