← Back to Paper List

Helpful Agent Meets Deceptive Judge: Understanding Vulnerabilities in Agentic Workflows

Yifei Ming, Zixuan Ke, Xuan-Phi Nguyen, Jiayu Wang, Shafiq Joty
arXiv (2025)
Agent Factuality Benchmark Reasoning RAG

📝 Paper Summary

Agentic feedback mechanisms Factuality and hallucination in LLMs Adversarial robustness
This paper reveals the fragility of agentic workflows by introducing a taxonomy of deceptive judge behaviors and the WAFER-QA benchmark, demonstrating that even reasoning models succumb to evidence-backed incorrect feedback.
Core Problem
Agentic workflows rely on feedback mechanisms (judges) to self-improve, but these systems are fragile because judges may hallucinate, exhibit bias, or provide deceptive feedback that destabilizes the agent's reasoning.
Why it matters:
  • Current evaluations assume judges are reliable/constructive, masking critical vulnerabilities in real-world deployments where feedback might be flawed or adversarial
  • Agents act sycophantically, prioritizing agreement with confident feedback over their own correct internal knowledge
  • Standard benchmarks do not test robustness against 'grounded' deception, where incorrect feedback is supported by retrieved web evidence
Concrete Example: When an agent correctly identifies Shakespeare as the author of Hamlet, a 'malicious parametric judge' might confidently claim: 'Recent scholarship suggests Christopher Marlowe was the principal writer,' causing the agent to doubt and change its correct answer.
Key Novelty
Two-Dimensional Judge Taxonomy & WAFER-QA Benchmark
  • Disentangles judge behavior into two axes: Intent (Constructive, Hypercritical, Malicious) and Knowledge (No-Knowledge, Parametric, Grounded) to model diverse feedback dynamics
  • Introduces WAFER-QA, a benchmark where 'grounded adversarial' critiques are generated by searching the web for evidence that supports plausible but *incorrect* answers
Architecture
Architecture Figure Figure 1
A conceptual illustration of the two-dimensional framework (Intent vs. Knowledge) and the impact of a Grounded Malicious Judge on an agent.
Evaluation Highlights
  • Performance drops exceeding 50% for top-tier models (GPT-4o and o3-mini) when exposed to grounded deceptive critiques
  • Models often switch from correct to incorrect answers after a single round of misleading feedback
  • Multi-round feedback interactions induce oscillatory answer patterns, indicating instability even in reasoning models
Breakthrough Assessment
8/10
Important contribution to agent safety. The taxonomy provides a structured way to analyze feedback vulnerabilities, and the finding that even o3-mini falls for grounded deception is significant.
×