Deep Research (DR): Autonomous LLM-based systems that conduct multi-step web exploration, targeted retrieval, and synthesis to answer open-ended queries
LLM-as-a-judge: Using a strong LLM to evaluate the outputs of other models based on specific criteria
Rubric: A set of specific criteria used to grade subjective or complex work; here, expert-written rules for what a good answer must contain
Ternary Grading: A grading scale with three values (Satisfied, Partially Satisfied, Not Satisfied) rather than just Pass/Fail
Macro F1: A metric that computes the F1 score (harmonic mean of precision and recall) for each class independently and then takes the unweighted mean, so every class counts equally regardless of how often it occurs
Anchoring Bias: A cognitive bias in which an initial piece of information (e.g., an LLM-generated rubric) disproportionately influences subsequent judgments
Conceptual Breadth: One of the paper's complexity axes; the number and diversity of distinct topics or domains involved in a query
Logical Nesting Depth: One of the paper's complexity axes; the number of reasoning steps or sub-questions required to answer the main query
Exploration Level: One of the paper's complexity axes; the degree of open-endedness or underspecification in the user's goal
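
The Ternary Grading and Macro F1 entries above can be illustrated together with a small worked example. The sketch below is not the paper's evaluation code: the function names (per_class_f1, macro_f1) and the toy labels are hypothetical, and scoring a class with no gold and no predicted instances as 0.0 is an assumed convention. It relies on the identity F1 = 2TP / (2TP + FP + FN), which is equivalent to the harmonic mean of precision and recall.

```python
from typing import Sequence

# The three grades named in the Ternary Grading entry.
LABELS = ("Satisfied", "Partially Satisfied", "Not Satisfied")


def per_class_f1(gold: Sequence[str], pred: Sequence[str], label: str) -> float:
    """F1 for one class, treating `label` as positive and everything else as negative."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    denom = 2 * tp + fp + fn
    # Assumption: a class that never appears in gold or pred scores 0.0.
    return 2 * tp / denom if denom else 0.0


def macro_f1(gold: Sequence[str], pred: Sequence[str],
             labels: Sequence[str] = LABELS) -> float:
    """Unweighted mean of per-class F1, so each class counts equally."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    return sum(per_class_f1(gold, pred, label) for label in labels) / len(labels)


# Toy data (hypothetical): human grades vs. LLM-judge grades on four rubric items.
gold = ["Satisfied", "Satisfied", "Partially Satisfied", "Not Satisfied"]
pred = ["Satisfied", "Partially Satisfied", "Partially Satisfied", "Not Satisfied"]

score = macro_f1(gold, pred)
print(f"Macro F1: {score:.4f}")
```

In this toy case the per-class F1 values are 2/3 (Satisfied), 2/3 (Partially Satisfied), and 1 (Not Satisfied), so the macro average is 7/9. Because the mean is unweighted, a rare class such as "Partially Satisfied" influences the result as much as a frequent one.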