Agentic Search: Systems where agents autonomously browse the web, synthesize information, and return citation-backed answers (e.g., Deep Research)
Agent-as-a-Judge: Using an autonomous AI agent to evaluate the outputs of another AI system, often by verifying claims against external tools or rubrics
Rubric Tree: A hierarchical evaluation structure where a task is broken down into granular criteria (leaf nodes) aggregated to form a final score
Time-varying tasks: Tasks where the correct answer changes over time (e.g., stock prices, weather, availability), requiring real-time verification
Attribution: The practice of citing sources (URLs) that factually support the statements made in the generated answer
Partial Completion: A metric representing the average root node score (0 to 1) across tasks, reflecting partial satisfaction of criteria
Pass@3: A metric indicating whether at least one of three independent attempts for a task resulted in a full success score of 1
Deep Research systems: Agents optimized for long-horizon information gathering, often capable of running for extended periods (30+ minutes) to synthesize reports
Generation-verification asymmetry: The idea that generating a complex answer is harder, computationally or cognitively, than verifying whether a given answer meets defined criteria
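
To make the Rubric Tree, Partial Completion, and Pass@3 definitions concrete, here is a minimal Python sketch. The definitions above say only that leaf criteria are "aggregated", so the unweighted mean used here is an assumption, not the method of any particular benchmark. The names RubricNode, partial_completion, and pass_at_3 are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class RubricNode:
    """One criterion in a rubric tree; leaf nodes carry a score in [0, 1]."""
    name: str
    score: float = 0.0  # read only when the node has no children
    children: list["RubricNode"] = field(default_factory=list)

    def aggregate(self) -> float:
        # Assumption: a parent's score is the unweighted mean of its
        # children's scores. A real rubric might weight or gate criteria.
        if not self.children:
            return self.score
        return mean(child.aggregate() for child in self.children)


def partial_completion(root_scores: list[float]) -> float:
    """Average root node score across tasks (one score per task)."""
    return mean(root_scores)


def pass_at_3(attempt_scores: list[float]) -> bool:
    """True if at least one of three attempts reached a full score of 1."""
    if len(attempt_scores) != 3:
        raise ValueError("pass_at_3 expects exactly three attempt scores")
    return any(s == 1.0 for s in attempt_scores)


# Hypothetical task: one top-level leaf plus a branch with two leaves.
tree = RubricNode("task", children=[
    RubricNode("cites a source", score=1.0),
    RubricNode("answer content", children=[
        RubricNode("names the product", score=1.0),
        RubricNode("gives the current price", score=0.0),
    ]),
])
root_score = tree.aggregate()
```

Under the mean-aggregation assumption, a task counts as a full success only when every leaf scores 1, which is why Pass@3 can be much lower than Partial Completion on the same set of runs.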