Agentic workflows: Systems where multiple LLMs interact (e.g., one generates, one evaluates) to solve tasks
WAFER-QA: Web-Augmented Feedback for Evaluating Reasoning, a benchmark of grounded adversarial critiques introduced in this paper
Parametric knowledge: Information stored internally in the model's weights during training, as opposed to information retrieved from external sources
Grounded-knowledge judge: An evaluator that uses external tools (like web search) to find evidence to support its critique
Hypercritical judge: A judge that always views the generator's answer as flawed, regardless of its actual correctness
Malicious judge: A judge that intervenes only when the generator's answer is correct, aiming to mislead the generator
Sycophancy: The tendency of a model to agree with the user's or evaluator's beliefs or intent, even when they are wrong
Oscillatory answer patterns: Behavior where a model repeatedly switches back and forth between answers over multiple rounds of feedback
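
The two adversarial judge types and the oscillation behavior defined above can be made concrete with a small sketch. It is illustrative only and not taken from the paper: the function names are hypothetical, the judges are reduced to their intervention rule (whether they critique, not what they say), and "oscillation" is operationalized here, as an assumption, as returning to an answer that was previously abandoned.

```python
from typing import List


def hypercritical_should_critique(answer_is_correct: bool) -> bool:
    """Hypercritical judge: flags every answer, regardless of correctness."""
    return True


def malicious_should_critique(answer_is_correct: bool) -> bool:
    """Malicious judge: intervenes only when the answer is correct."""
    return answer_is_correct


def count_switches(answers: List[str]) -> int:
    """Count rounds in which the answer differs from the previous round."""
    return sum(1 for prev, cur in zip(answers, answers[1:]) if prev != cur)


def is_oscillating(answers: List[str]) -> bool:
    """Assumed definition: the model returns to an answer it abandoned earlier."""
    abandoned = set()
    for prev, cur in zip(answers, answers[1:]):
        if prev != cur:
            abandoned.add(prev)       # the model moved away from `prev`
            if cur in abandoned:      # ...and has now come back to an old answer
                return True
    return False


if __name__ == "__main__":
    # One answer per feedback round, starting with the initial answer.
    history = ["A", "B", "A", "B"]
    print(count_switches(history), is_oscillating(history))
```

Under this reading, a model that changes its answer once and then holds it (for example A, B, B) is counted as switching but not as oscillating, which keeps a single correction distinct from repeated flip-flopping.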