AgentHallu: Benchmarking Automated Hallucination Attribution of LLM-based Agents

📝 Paper Summary

Agentic AI, Hallucination detection, Reliability and safety

AgentHallu is a benchmark for identifying exactly where and why hallucinations occur in multi-step agent trajectories, revealing that even advanced models struggle to pinpoint error origins, especially in tool use.

Core Problem

Existing hallucination evaluations focus on binary judgments of single-turn responses, failing to identify which specific step in a multi-step agent workflow (planning, tool use, reasoning) causes the initial divergence.

Why it matters:

Agentic hallucinations propagate: a small error in an early planning or tool parameter step can cascade, leading to incorrect final outcomes
Current binary metrics cannot diagnose the root cause of failure in sequential workflows, which is essential for debugging and building reliable autonomous systems
High-stakes applications require granular transparency to trust agent decisions, not just a final correct/incorrect label

Concrete Example: A planning step misdefines 'region X, Y, Z', which propagates into downstream Python tool parameters, eventually leading to a wrong answer. A binary detector just flags the final answer as wrong, but AgentHallu identifies 'Step 1' as the root cause and explains the mismatch.

Key Novelty

Automated Hallucination Attribution for Agents

Shifts focus from 'is this wrong?' (binary detection) to 'which step went wrong and why?' (step localization and causal explanation) in multi-step trajectories
Introduces a grounded taxonomy of agent hallucinations covering 5 categories (Planning, Retrieval, Reasoning, Human-Interaction, Tool-Use) derived from empirical analysis
Provides a dataset of 693 trajectories from 7 agent frameworks with dense annotations including the specific responsible step and natural language explanation

Architecture

The dataset construction pipeline for AgentHallu.

Evaluation Highlights

Gemini-2.5-Pro (best model) achieves only 41.1% accuracy in localizing the hallucination-responsible step
Performance drops significantly on tool-use hallucinations, with the best model achieving just 11.6% localization accuracy
Longer trajectories degrade performance: GPT-5 accuracy drops from 40.3% on short sequences (≤5 steps) to 23.9% on long sequences (≥11 steps)

Breakthrough Assessment

8/10

Establishes a critical new task (attribution) for the growing field of agents. The low performance of SOTA models (GPT-5/Gemini-2.5) demonstrates this is a non-trivial, unsolved problem essential for future reliability.

⚙️ Technical Details

Problem Definition

Setting: Multi-step agent trajectory analysis

Inputs: An agent trajectory τ consisting of a sequence of interaction units (thought, action, observation) and a final result y(τ)

Outputs: The index of the hallucination-responsible step t* (where the error originated) and a natural language explanation
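The input and output of the task can be pictured as a small data model. The following is a minimal sketch, not the benchmark's actual schema: the class and field names (`Step`, `Trajectory`, `Attribution`, `responsible_step`) are hypothetical, 1-based step indexing is assumed from the "Step 1" wording in the concrete example, and `None` is an assumed way to mark a trajectory judged free of hallucination.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    """One interaction unit: what the agent thought, did, and saw."""
    thought: str
    action: str
    observation: str


@dataclass
class Trajectory:
    """A recorded run: the query, the ordered steps, and the final result y(tau)."""
    query: str
    steps: List[Step]
    final_result: str


@dataclass
class Attribution:
    """Task output: the responsible step t* (1-based, assumed) and an explanation."""
    responsible_step: Optional[int]  # None = no hallucination found (assumption)
    explanation: str


def is_valid_attribution(traj: Trajectory, attr: Attribution) -> bool:
    """Check that t* refers to a step that actually exists in the trajectory."""
    if attr.responsible_step is None:
        return True
    return 1 <= attr.responsible_step <= len(traj.steps)


example = Trajectory(
    query="Compare sales across regions X, Y, Z",
    steps=[
        Step("Plan the regions to compare", "plan()", "regions defined"),
        Step("Compute totals", "python(...)", "totals returned"),
    ],
    final_result="Region X leads",
)
guess = Attribution(responsible_step=1, explanation="The plan misdefines the regions.")
print(is_valid_attribution(example, guess))
```

Keeping the explanation next to the step index mirrors the two parts of the task: localization and natural language explanation.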

Pipeline Flow

Query Collection (8 datasets)
Trajectory Generation (7 agent frameworks)
Filtering (Failure/Short/Trivial removal)
Annotation (Oracle reasoning paths + Human verification)

System Modules

Trajectory Generator (Data Construction)

Execute agents to solve queries and record thought-action-observation triplets

Model or implementation: 7 frameworks including SmolAgents, OpenManus, Magentic-One (mostly using GPT-4o/4.1)

Filter (Data Construction)

Remove non-deceptive failures (crashes), overly short trajectories, and trivial cases where all judges agree

Model or implementation: Heuristic scripts + Ensemble of LLM Judges (GPT-5, Gemini, DeepSeek, Qwen)

Annotator

Identify if hallucination occurred, locate the step, and explain why

Model or implementation: Human experts assisted by reasoning paths generated with GPT-5/Gemini

Novel Architectural Elements

Three-stage filtering criterion combining heuristic checks with 'disagreement-based' selection (keeping only trajectories on which the four LLM judges do not all agree) to ensure difficulty
Integration of oracle-guided reasoning paths (generated by LLMs with access to ground truth) to assist human annotators in complex attribution
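The three filtering stages described above can be sketched as one predicate. This is an illustrative sketch under stated assumptions, not the authors' script: the function name `keep_trajectory`, the `min_steps` threshold of 3, and the representation of a judge verdict as any hashable value (for example a predicted step index) are all hypothetical.

```python
from typing import Hashable, Sequence


def keep_trajectory(
    num_steps: int,
    crashed: bool,
    judge_verdicts: Sequence[Hashable],
    min_steps: int = 3,  # hypothetical threshold; the summary gives no value
) -> bool:
    """Return True if a trajectory survives all three filtering stages."""
    # Stage 1: drop non-deceptive failures such as crashes.
    if crashed:
        return False
    # Stage 2: drop overly short trajectories.
    if num_steps < min_steps:
        return False
    # Stage 3: drop trivial cases where every judge gives the same verdict.
    if len(set(judge_verdicts)) <= 1:
        return False
    return True


# Four judges, one dissenting verdict: the trajectory is kept as "hard".
print(keep_trajectory(num_steps=8, crashed=False, judge_verdicts=[2, 2, 5, 2]))
```

Selecting on disagreement keeps the cases that current judges cannot settle among themselves, which is how the pipeline aims to ensure difficulty.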

Modeling

Base Model: Various (13 models evaluated, including GPT-5, Gemini-2.5, Claude-3.5, Llama-3, Qwen-2.5)

Comparison to Prior Work

vs. RAGTruth: AgentHallu focuses on multi-step agent trajectories rather than single-turn RAG responses
vs. ToolBH: AgentHallu covers broader domains (planning, reasoning, etc.) beyond just tool use and requires precise step localization
vs. HaluEval: AgentHallu evaluates sequential error propagation in agents rather than static single-turn generation [not cited in paper]

Limitations

Evaluation relies on proprietary models (GPT-5, Gemini), which may change over time
The definition of 'responsible step' assumes a single primary cause (first error), which might oversimplify complex cascading failures
Manual annotation is labor-intensive, limiting the dataset size to 693 trajectories compared to larger automated benchmarks

Reproducibility

Publicly available (https://liuxuannan.github.io/AgentHallu.github.io/). The dataset of 693 annotated trajectories is released, and code for the evaluation framework is provided. Weights for the proprietary models evaluated (GPT-5, Gemini-2.5) are not available because those models are accessed only through APIs.

📊 Experiments & Results

Evaluation Setup

Automated attribution of hallucinations in pre-recorded agent trajectories using LLM judges.

Benchmarks:

AgentHallu: Hallucination Attribution (Localization & Explanation) [New]

Metrics:

Step Localization Accuracy (identifying the correct step t*)
G-EVAL (measuring quality of the natural language explanation)
Statistical methodology: Not explicitly reported in the paper
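Step Localization Accuracy can be read as exact-match accuracy over predicted step indices. The sketch below assumes that reading; the function name and the use of `None` for "no hallucination" are hypothetical, and the paper's scoring script may handle edge cases differently.

```python
from typing import Optional, Sequence


def step_localization_accuracy(
    predicted: Sequence[Optional[int]],
    gold: Sequence[Optional[int]],
) -> float:
    """Percentage of trajectories whose predicted step t* equals the annotated one."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("at least one trajectory is required")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return 100.0 * correct / len(gold)


# Two of four predictions match the annotated responsible step.
print(step_localization_accuracy([1, 4, 2, None], [1, 3, 2, 5]))
```

Exact match is a strict criterion: a prediction one step away from the annotated origin counts the same as a prediction far from it.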

Key Results

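Step localization accuracy (%) results showing that even the most capable proprietary models struggle with the task, particularly on tool-use errors. For the Tool-Use row, "Baseline" is the best model's overall accuracy; for the GPT-5 row, "Baseline" is accuracy on short trajectories (≤5 steps) and "This Paper" is accuracy on long trajectories (≥11 steps).
Benchmark	Metric	Baseline	This Paper	Δ
AgentHallu	Step Localization Accuracy (overall, Gemini-2.5-Pro)	Not reported in the paper	41.1	Not reported in the paper
AgentHallu	Step Localization Accuracy (Tool-Use, best model)	41.1	11.6	-29.5
AgentHallu	Step Localization Accuracy	36.6	38.5	+1.9
AgentHallu	Step Localization Accuracy (GPT-5, short vs. long trajectories)	40.3	23.9	-16.4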
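The trajectory-length effect reported for GPT-5 can be examined by grouping trajectories into length buckets before scoring. This is a sketch only: the short (≤5 steps) and long (≥11 steps) boundaries come from the summary, while the middle 6 to 10 bucket, the function names, and the record layout are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def length_bucket(num_steps: int) -> str:
    """Map a trajectory length to a bucket label."""
    if num_steps <= 5:
        return "short (<=5)"
    if num_steps >= 11:
        return "long (>=11)"
    return "medium (6-10)"  # assumed middle bucket


def accuracy_by_length(
    records: Iterable[Tuple[int, int, int]],
) -> Dict[str, float]:
    """records holds (num_steps, predicted_step, gold_step) per trajectory."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for num_steps, predicted, gold in records:
        bucket = length_bucket(num_steps)
        totals[bucket] += 1
        hits[bucket] += int(predicted == gold)
    return {b: 100.0 * hits[b] / totals[b] for b in totals}


demo = [(3, 1, 1), (4, 2, 1), (8, 3, 3), (12, 5, 5), (15, 2, 7)]
print(accuracy_by_length(demo))
```

The demo records are made up for illustration and are unrelated to the reported 40.3% and 23.9% figures.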

Experiment Figures

A conceptual comparison between standard hallucination detection and the proposed hallucination attribution task.

Main Takeaways

Attribution is significantly harder than binary detection; models capable of flagging errors often fail to locate the specific responsible step.
Tool-use hallucinations are the most challenging category (11.6% accuracy), likely due to the complexity of diagnosing tool parameters and outputs.
Step-by-step prompting yields marginal gains over standard prompting but at much higher computational cost.
Proprietary models (Gemini-2.5, GPT-5) significantly outperform open-source models (Llama-3, Qwen-2.5) on this reasoning-intensive task.

📚 Prerequisite Knowledge

Prerequisites

Understanding of LLM-based agents (ReAct loops, tool use)
Basic knowledge of hallucination in LLMs
Familiarity with trajectory evaluation metrics

Key Terms

Hallucination Attribution: The task of identifying the specific step in a sequence that caused a factual error and explaining the reason

Counterfactual Trajectory: A hypothetical trajectory generated by correcting a specific step to see if the final outcome becomes correct, used to define causality

Causality-aligned Principle: The rule that if multiple steps are erroneous, the very first error in the sequence is treated as the primary source of hallucination

G-EVAL: An evaluation framework using LLMs (typically GPT-4) to score generated text based on criteria like consistency and coherence

Trajectory: The chronological sequence of an agent's thoughts, actions (tool calls), and observations (tool outputs) leading to a final answer

Responsible Step: The specific interaction unit (thought/action/observation) where the agent first deviated from factual or logical correctness
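The causality-aligned principle and the counterfactual trajectory defined above can be illustrated with two small helpers. This is a sketch under assumptions rather than the paper's procedure: per-step error flags are taken as given, steps are indexed from 1, and `rerun_with_fix` is a hypothetical callable that re-executes the trajectory with one step corrected and reports whether the final outcome becomes correct.

```python
from typing import Callable, Optional, Sequence


def first_erroneous_step(step_is_erroneous: Sequence[bool]) -> Optional[int]:
    """Causality-aligned principle: the earliest erroneous step is the primary source."""
    for index, is_bad in enumerate(step_is_erroneous, start=1):
        if is_bad:
            return index
    return None  # no erroneous step found


def counterfactual_confirms(
    step_index: int,
    rerun_with_fix: Callable[[int], bool],  # hypothetical re-execution hook
) -> bool:
    """Counterfactual check: does correcting this step make the final outcome correct?"""
    return rerun_with_fix(step_index)


flags = [False, True, False, True]  # steps 2 and 4 contain errors
t_star = first_erroneous_step(flags)
print(t_star)

# Stand-in for re-execution: pretend only fixing step 2 repairs the answer.
print(counterfactual_confirms(t_star, lambda t: t == 2))
```

Separating the two helpers keeps the ordering rule (first error wins) apart from the causal test (does the fix change the outcome), which the glossary defines as distinct notions.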