contextual snapshot: A static record of an agent's full state (history, instruction, observation) frozen at a specific decision point, used to test next-step prediction deterministically
accessibility tree: A hierarchical representation of a user interface (like a web page) used by agents to perceive elements (buttons, text) without processing raw pixels
risk setting: A specific scenario pattern (e.g., unachievable goal, pop-up distraction) identified as highly likely to trigger hallucinatory behavior
ReAct: Reasoning + Acting, a prompting paradigm in which agents generate a thought trace before executing an action
LLM-as-a-Judge: Using a strong LLM to evaluate the outputs of other models, here used to verify whether an agent's action is faithful to its context
unfaithful to task instructions: Agent actions that violate constraints or invent goals not present in the user prompt
unfaithful to execution history: Agent actions that contradict past events, such as repeating a failed action or ignoring a completed step
unfaithful to environment observations: Agent actions that interact with non-existent elements (e.g., clicking a fake button) or ignore visible state changes
DOM: Document Object Model, the structural representation of a web page that agents interact with
ZeroAcc: A metric measuring the judge's accuracy specifically on samples that should receive a score of 0 (hallucinated actions)
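
As a rough illustration of the ZeroAcc definition above, the metric can be sketched as accuracy restricted to the samples whose reference score is 0. This is a minimal sketch, not the authors' implementation: the function name `zero_acc`, the parameter names `gold` and `predicted`, and the assumption that scores are integers with 0 marking a hallucinated action are all hypothetical choices made for the example.

```python
from typing import Sequence


def zero_acc(gold: Sequence[int], predicted: Sequence[int]) -> float:
    """Judge accuracy on the subset of samples whose gold score is 0.

    Assumption (not from the source): scores are integers and a gold
    score of 0 marks a hallucinated action.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    # Keep only the positions where the reference label is 0.
    zero_positions = [i for i, g in enumerate(gold) if g == 0]
    if not zero_positions:
        raise ValueError("no samples with a gold score of 0")

    # Count how many of those the judge also scored as 0.
    correct = sum(1 for i in zero_positions if predicted[i] == 0)
    return correct / len(zero_positions)


# Hypothetical labels: three samples have gold score 0, and the judge
# assigns 0 to two of them.
gold_scores = [0, 0, 1, 0, 1]
judge_scores = [0, 1, 1, 0, 0]
print(zero_acc(gold_scores, judge_scores))
```

Samples whose gold score is nonzero are ignored entirely, so a judge that wrongly flags a faithful action (the last sample here) is not penalized by this metric; that error would have to be captured by a separate measure.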