MIRAGE-Bench: LLM Agent is Hallucinating and Where to Find Them

📝 Paper Summary

Agentic AI, Factuality and Hallucination, Benchmarks and Evaluation

MIRAGE-Bench creates a unified testbed for eliciting and evaluating LLM agent hallucinations by isolating risk-prone decision points within deterministic contextual snapshots.

Core Problem

Hallucinations in LLM agents manifest as dangerous actions rather than just text, but existing evaluations are fragmented, hard to reproduce due to stochastic environments, and difficult to verify without ground truth.

Why it matters:

Agentic hallucinations translate directly into real-world risks, such as leaking user credentials or deleting files
Current benchmarks like WebArena or SWE-Bench focus on task success rates, missing the specific diagnosis of when and why agents become unfaithful to their context
Stochastic branching in interactive environments makes reproducing specific hallucination failures unreliable for consistent benchmarking

Concrete Example: In TheAgentCompany, an agent instructed to message 'Mark Johnson' hallucinates that it has already navigated to Mark's page (despite the observation showing 'Mike Chen'). It then sends Mark's password to Mike, causing a data leak.

Key Novelty

Contextual Snapshot Evaluation Strategy

Freezes agent execution at specific 'risk-triggering' steps (e.g., just before a pop-up or ambiguous instruction) to create deterministic test cases
Uses an LLM-as-a-Judge with risk-specific rubrics to verify if the agent's next action is faithful to the frozen history, instruction, and observation
Synthesizes new test cases by editing observations (e.g., injecting diverse out-of-scope queries) within the snapshots to scale evaluation without full environment re-execution

Architecture

The MIRAGE-Bench pipeline for constructing and scaling contextual snapshots from risk-inducing tasks

Evaluation Highlights

Proprietary models like GPT-4o-2024-11-20 still hallucinate actions frequently (Hallucination Rate 0.339), showing limited improvement over open-weight models
Open-source Qwen2.5-32B-Instruct reaches a Utility Score of 0.581, on par with top proprietary models like GPT-4o (Utility Score 0.569)
Stronger models like Claude-3.5-Sonnet show slightly higher susceptibility (0.08 rate) to pop-up distractions than weaker models, likely due to increased perceptual attention to irrelevant cues

Breakthrough Assessment

8/10

Significant methodological advance in evaluating agent reliability by converting stochastic interactions into deterministic snapshots. Crucial for safety, though the static nature limits testing long-term planning divergence.

⚙️ Technical Details

Problem Definition

Setting: Evaluating the faithfulness of an LLM agent's next predicted action $a_{t+1}$ given a static context tuple $(I, H_t, O_t)$

Inputs: Task Instruction ($I$), Interaction History ($H_t$), and Current Observation ($O_t$) containing a specific risk trigger

Outputs: Agent action (code/function call) and subsequent utility score (0, 0.5, or 1)
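The context tuple above can be pictured as a small immutable record. The following is a minimal sketch, not the benchmark's actual data format: the `Snapshot` class, its field names, and the `render_prompt` helper are hypothetical, and the example values only echo the 'Mark Johnson' scenario from the summary.

```python
from dataclasses import dataclass
from typing import Tuple

# Judge scores named in the summary: 0, 0.5, or 1.
VALID_SCORES = (0.0, 0.5, 1.0)


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical frozen context (I, H_t, O_t) at one risk-triggering step."""
    instruction: str           # I: the task instruction
    history: Tuple[str, ...]   # H_t: earlier steps, in order
    observation: str           # O_t: current observation with the risk trigger


def render_prompt(snapshot: Snapshot) -> str:
    """Flatten the snapshot into one prompt that asks for the next action."""
    steps = "\n".join(
        f"{i + 1}. {step}" for i, step in enumerate(snapshot.history)
    )
    return (
        f"Instruction:\n{snapshot.instruction}\n\n"
        f"History:\n{steps}\n\n"
        f"Observation:\n{snapshot.observation}\n\n"
        "Next action:"
    )


# Illustrative values only; they are not taken from the dataset.
snapshot = Snapshot(
    instruction="Message Mark Johnson.",
    history=("open_chat()",),
    observation="Chat window: Mike Chen",
)
prompt = render_prompt(snapshot)
```

Because the record is frozen, every target model sees exactly the same input, which is what makes the test case deterministic.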

Pipeline Flow

Risk Audit: Identify failure patterns in existing benchmarks
Snapshot Extraction: Run agents, freeze at risk points, save context
Synthesis: Scale snapshots by editing observations (e.g., injecting queries)
Evaluation: Query target LLM for next action → Judge with o4-mini
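The evaluation step of this pipeline can be sketched as a loop over frozen contexts. This is an illustration under stated assumptions, not the released code: `evaluate`, `toy_agent`, and `toy_judge` are hypothetical names, and the toy functions are simple stand-ins for the target LLM and the o4-mini judge.

```python
from typing import Callable, List, Sequence, Tuple

# (instruction, history, observation); a simplified stand-in for a snapshot.
Context = Tuple[str, Tuple[str, ...], str]
VALID_SCORES = (0.0, 0.5, 1.0)


def evaluate(
    snapshots: Sequence[Context],
    agent: Callable[[Context], str],
    judge: Callable[[Context, str], float],
) -> List[float]:
    """Query the agent once per snapshot and score each action with the judge."""
    scores = []
    for context in snapshots:
        action = agent(context)          # next action for the frozen context
        score = judge(context, action)   # faithfulness score for that action
        if score not in VALID_SCORES:
            raise ValueError(f"unexpected judge score: {score!r}")
        scores.append(score)
    return scores


def toy_agent(context: Context) -> str:
    """Stand-in agent that always tries to send the message."""
    return "send_message"


def toy_judge(context: Context, action: str) -> float:
    """Stand-in judge: sending is unfaithful if the recipient is not on screen."""
    _, _, observation = context
    if action == "send_message" and "Mark Johnson" not in observation:
        return 0.0
    return 1.0


contexts = [
    ("Message Mark Johnson.", ("open_chat()",), "Chat window: Mike Chen"),
    ("Message Mark Johnson.", ("open_chat()",), "Chat window: Mark Johnson"),
]
scores = evaluate(contexts, toy_agent, toy_judge)
```

No environment simulator appears in the loop: each snapshot is a self-contained input, so a run can be repeated without re-executing the original task.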

System Modules

Snapshot Extractor (Data Construction)

Harvest interaction traces from benchmarks (WebArena, etc.) and freeze them at moments matching risk definitions

Model or implementation: N/A (Process)

Context Synthesizer (Data Construction)

Scale the dataset by programmatically or logically altering observations in snapshots (e.g., changing text in accessibility trees)

Model or implementation: o4-mini
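One way to picture observation editing is plain template substitution. The summary lists o4-mini as the synthesizer, so the sketch below deliberately swaps in a simpler string-replacement approach; the function name, the `{{USER_QUERY}}` slot, and the tree-like observation text are all hypothetical.

```python
from typing import List


def synthesize_observations(
    template: str, slot: str, variants: List[str]
) -> List[str]:
    """Create one edited observation per variant by filling a named slot."""
    if slot not in template:
        raise ValueError(f"slot {slot!r} not found in template")
    return [template.replace(slot, variant) for variant in variants]


# Illustrative accessibility-tree-like text; not the benchmark's real format.
template = (
    "[1] RootWebArea 'Support chat'\n"
    "  [2] StaticText '{{USER_QUERY}}'\n"
    "  [3] textbox 'Reply'"
)
queries = [
    "What is the weather tomorrow?",
    "Who won the game last night?",
]
observations = synthesize_observations(template, "{{USER_QUERY}}", queries)
```

Each edited observation can be paired with the original instruction and history to form a new snapshot without rerunning the environment.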

Target Agent (Evaluation)

The LLM being benchmarked; generates the next action given the snapshot

Model or implementation: Various (Llama-3, GPT-4o, etc.)

LLM Judge (Evaluation)

Verify if the agent's action is faithful to the provided context

Model or implementation: o4-mini

Novel Architectural Elements

Taxonomy-driven snapshot pipeline: Organizes evaluation around 3 faithfulness axes (Instruction, History, Observation) rather than task success
Deterministic replay of stochastic agent states: Decouples evaluation from environment simulators by using frozen contexts

Modeling

Base Model: Benchmarked models include Llama-3.x (70B), Qwen2.5 (7B-72B), GPT-4o, Claude-3.5/3.7, Gemini-2.0/2.5

Training Method: N/A (evaluation-only benchmark)

Compute: Not reported in the paper

Comparison to Prior Work

vs. WebArena/OSWorld: MIRAGE focuses on 'faithfulness' of single steps rather than final 'success rate'; isolates hallucination from planning failure
vs. HaluEval: Evaluates 'actions' (clicks, code) rather than text; accounts for dynamic history and observations
vs. AgentBench: Uses static snapshots for reproducibility instead of full interactive rollouts

Limitations

Static evaluation does not capture long-term compounding errors that arise from multi-step hallucination
Relying on o4-mini as the judge may introduce bias, although its scores were validated against human annotations
Current scope is limited to text/code-based actions; does not fully cover multi-modal visual hallucinations in pixel space
Focuses on 6 specific risk settings, potentially missing other rare or domain-specific triggers

Reproducibility

Code: https://github.com/sunblaze-ucb/mirage-bench.git

The code and dataset are publicly available at the repository above. The benchmark uses o4-mini as the primary judge, so reproduction requires OpenAI API access. Detailed prompts for the judge are provided in the appendix.

📊 Experiments & Results

Evaluation Setup

LLM-as-a-Judge evaluation of single-step actions across 6 diverse risk settings (e.g., unexpected transitions, pop-ups)

Benchmarks:

MIRAGE-Bench (Action Faithfulness Evaluation) [New]

Metrics:

Utility Score (US) - Average faithfulness score (0-1)
Hallucination Rate (HR) - Proportion of actions scored as 0
Statistical methodology: Validated judge agreement with humans (Accuracy > 0.75) and self-consistency checks
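Both metrics follow directly from the per-sample judge scores. The snippet below is a minimal sketch of the definitions as stated above; the function names and the example scores are made up for illustration.

```python
from typing import Sequence


def utility_score(scores: Sequence[float]) -> float:
    """Utility Score (US): mean of the per-sample faithfulness scores."""
    if not scores:
        raise ValueError("scores must not be empty")
    return sum(scores) / len(scores)


def hallucination_rate(scores: Sequence[float]) -> float:
    """Hallucination Rate (HR): share of samples whose score is exactly 0."""
    if not scores:
        raise ValueError("scores must not be empty")
    return sum(1 for s in scores if s == 0) / len(scores)


# Made-up judge scores for four snapshots.
example_scores = [1.0, 0.5, 0.0, 1.0]
us = utility_score(example_scores)       # (1 + 0.5 + 0 + 1) / 4 = 0.625
hr = hallucination_rate(example_scores)  # 1 of 4 samples scored 0 -> 0.25
```

Note that a score of 0.5 lowers the Utility Score but does not count toward the Hallucination Rate, so the two metrics are not simple complements.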

Key Results

Overall performance on MIRAGE-Bench, showing widespread hallucination across all model classes.

Benchmark	Metric	Baseline	This Paper	Δ
MIRAGE-Bench (Overall)	Hallucination Rate (HR)	0.324	0.339	+0.015
MIRAGE-Bench (Overall)	Utility Score (US)	0.589	0.589	0.000
Pop-up Distractions (PUD)	Hallucination Rate (HR)	0.000	0.082	+0.082
Out of Scope Queries (OSQ)	Utility Score (US)	0.560	0.622	+0.062

Experiment Figures

Examples of specific risk settings (Unexpected Environmental Transitions, Out of Scope Queries) with input context and faithful vs. hallucinated actions

Main Takeaways

Hallucination is persistent: No model achieves high utility (>0.7), and hallucination rates remain >30% even for top proprietary models
Open-source models like Qwen2.5-32B are surprisingly competitive with GPT-4o, suggesting scaling alone isn't solving agentic hallucination
Presumptive behavior: Agents often hallucinate 'success' (e.g., assuming a page loaded) or fabricate answers to questions they can't answer, likely due to dialogue-based instruction tuning bias
Inverse scaling on distractions: More capable models can be slightly more susceptible to pop-up distractions because they process the irrelevant context rather than ignoring it like simpler models

📚 Prerequisite Knowledge

Prerequisites

Familiarity with ReAct (Reasoning + Acting) agent frameworks
Understanding of LLM-as-a-Judge evaluation paradigms
Basic knowledge of web/OS agent environments (DOM trees, accessibility trees)

Key Terms

contextual snapshot: A static record of an agent's full state (history, instruction, observation) frozen at a specific decision point, used to test next-step prediction deterministically

accessibility tree: A hierarchical representation of a user interface (like a web page) used by agents to perceive elements (buttons, text) without processing raw pixels

risk setting: A specific scenario pattern (e.g., unachievable goal, pop-up distraction) identified as highly likely to trigger hallucinatory behavior

ReAct: Reasoning + Acting—a prompting paradigm where agents generate a thought trace before executing an action

LLM-as-a-Judge: Using a strong LLM to evaluate the outputs of other models, here used to verify if an agent's action is faithful to its context

unfaithful to task instructions: Agent actions that violate constraints or invent goals not present in the user prompt

unfaithful to execution history: Agent actions that contradict past events, such as repeating a failed action or ignoring a completed step

unfaithful to environment observations: Agent actions that interact with non-existent elements (e.g., clicking a fake button) or ignore visible state changes

DOM: Document Object Model—the structural representation of a web page that agents interact with

ZeroAcc: A metric measuring the judge's accuracy specifically on samples that should receive a score of 0 (hallucinated actions)
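
Read literally, the ZeroAcc definition restricts judge accuracy to samples whose reference label is 0. The sketch below follows that reading; the function name and the example labels are hypothetical, and the paper's exact computation may differ.

```python
from typing import Sequence


def zero_acc(reference: Sequence[float], judged: Sequence[float]) -> float:
    """Judge accuracy restricted to samples whose reference score is 0."""
    if len(reference) != len(judged):
        raise ValueError("reference and judged must have the same length")
    zero_pairs = [(r, j) for r, j in zip(reference, judged) if r == 0]
    if not zero_pairs:
        raise ValueError("no samples with a reference score of 0")
    return sum(1 for _, j in zero_pairs if j == 0) / len(zero_pairs)


# Made-up labels: three samples should score 0, and the judge catches two.
human_labels = [0.0, 0.0, 1.0, 0.5, 0.0]
judge_labels = [0.0, 0.5, 1.0, 0.5, 0.0]
result = zero_acc(human_labels, judge_labels)  # 2 of 3 zero-labelled samples
```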