Evaluation Setup
Automated attribution of hallucinations in pre-recorded agent trajectories using LLM judges.
Benchmarks:
- AgentHallu (hallucination attribution: localization and explanation) [New]
Metrics:
- Step Localization Accuracy (identifying the correct step t*)
- G-EVAL (measuring quality of the natural language explanation)
Statistical methodology: Not explicitly reported in the paper
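As a rough illustration of the Step Localization Accuracy metric listed above, the sketch below scores a judge's predicted step against the annotated step t* for each trajectory. It assumes exact-match scoring, and that a judge that fails to name a step is counted as wrong; the function name and the input layout are hypothetical and not taken from the paper.

```python
from typing import Optional, Sequence


def step_localization_accuracy(
    predicted_steps: Sequence[Optional[int]],
    gold_steps: Sequence[int],
) -> float:
    """Percentage of trajectories whose predicted step equals the gold step t*.

    Assumption: exact match on the step index; a prediction of None
    (the judge named no step) is scored as incorrect.
    """
    if len(predicted_steps) != len(gold_steps):
        raise ValueError("predicted_steps and gold_steps must have equal length")
    if not gold_steps:
        raise ValueError("at least one trajectory is required")

    correct = sum(
        1 for pred, gold in zip(predicted_steps, gold_steps) if pred == gold
    )
    return 100.0 * correct / len(gold_steps)


# Four hypothetical trajectories: two are localized correctly.
example_accuracy = step_localization_accuracy([3, 1, None, 5], [3, 2, 4, 5])
```

Here `example_accuracy` is 50.0, since two of the four predictions match the annotated step.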
Key Results
Step localization accuracy results showing that even the most capable proprietary models struggle with the task, particularly on tool-use errors.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| AgentHallu | Step Localization Accuracy | Not reported in the paper | 41.1 | Not reported in the paper |
| AgentHallu | Step Localization Accuracy (Tool-Use) | 41.1 | 11.6 | -29.5 |
| AgentHallu | Step Localization Accuracy | 36.6 | 38.5 | +1.9 |
| AgentHallu | Step Localization Accuracy (GPT-5) | 40.3 | 23.9 | -16.4 |
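The Δ column appears to be the "This Paper" value minus the "Baseline" value. The short check below, written under that assumption, recomputes Δ for the three rows that report both numbers; the variable and function names are illustrative only.

```python
# (metric label, baseline, this paper), copied from the rows that report both values.
rows = [
    ("Step Localization Accuracy (Tool-Use)", 41.1, 11.6),
    ("Step Localization Accuracy", 36.6, 38.5),
    ("Step Localization Accuracy (GPT-5)", 40.3, 23.9),
]


def delta(baseline: float, this_paper: float) -> float:
    """Assumed definition: Δ = This Paper - Baseline, rounded to one decimal."""
    return round(this_paper - baseline, 1)


deltas = [delta(baseline, this_paper) for _, baseline, this_paper in rows]
```

Under this definition `deltas` comes out as -29.5, 1.9, and -16.4, matching the reported column.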
Main Takeaways
- Attribution is significantly harder than binary detection; models capable of flagging errors often fail to locate the specific responsible step.
- Tool-use hallucinations are the most challenging category (11.6% accuracy), likely due to the complexity of diagnosing tool parameters and outputs.
- Step-by-step prompting yields marginal gains over standard prompting but at much higher computational cost.
- Proprietary models (Gemini-2.5, GPT-5) significantly outperform open-source models (Llama-3, Qwen-2.5) on this reasoning-intensive task.