← Back to Paper List

ResearchRubrics: A Benchmark of Prompts and Rubrics For Evaluating Deep Research Agents

(Scale AI) Manasi Sharma, Chen Bo, Calvin Zhang, Chaithanya Bandi, Clinton Wang, Ankit Aich, Huy Nghiem, Tahseen Rabbani, Ye Htet, Brian Jang, Sumana Basu, Aishwarya Balwani, Denis Peskoff, Marcos Ayestaran, Sean M. Hendryx, Brad Kenstler, Bing Liu
Scale AI, University of Maryland, University of Chicago, Washington University, St. Louis, McGill University, University of California, Berkeley
arXiv, 11/2025 (2025)
Agent Benchmark Factuality Reasoning

📝 Paper Summary

Deep Research Agent Evaluation Open-Ended QA Benchmarking
ResearchRubrics provides a rigorous benchmark for Deep Research agents by pairing realistic, open-ended queries with 2,500+ expert-written rubrics to evaluate reasoning, factuality, and synthesis beyond simple Q&A.
Core Problem
Evaluating Deep Research agents is difficult because tasks are open-ended, require multi-step reasoning across diverse documents, and lack single correct answers, making standard QA metrics (like exact match) insufficient.
Why it matters:
  • Current benchmarks focus on short, verifiable facts, failing to capture the long-form synthesis required for real-world research tasks
  • Existing deep research benchmarks often rely on circular LLM-generated rubrics or reference reports, lacking expert human oversight
  • Users engage agents for broad topics (business, consumer queries), but current tests often narrow focus to specific technical domains like academic literature review
Concrete Example: A standard QA benchmark might ask 'What is the band gap of GaN?', expecting a short number. A Deep Research query asks 'Analyze the market viability of GaN semiconductors for EVs over the next decade', requiring synthesis of technical specs, market reports, and supply chain analysis—a task current metrics cannot grade effectively.
Key Novelty
Expert-Authored Fine-Grained Rubrics for Open-Ended Research
  • Pairs 101 diverse research prompts with over 2,500 human-written rubric criteria, avoiding the bias of LLM-generated evaluation standards
  • Introduces a 3-axis complexity framework (Conceptual Breadth, Logical Nesting Depth, Exploration Level) to categorize how 'deep' a research task truly is
  • Implements a ternary grading system (Satisfied, Partially Satisfied, Not Satisfied) for LLM-judges to better capture nuance in long-form answers compared to binary pass/fail
Architecture
Architecture Figure Figure 1 & 2
The data collection pipeline involving three expert participants and the rubric design process.
Evaluation Highlights
  • State-of-the-art agents (OpenAI Deep Research, Gemini Deep Research) achieve under 68% average compliance with expert rubrics, highlighting significant room for improvement
  • Ternary grading (adding 'Partially Satisfied') improves alignment with human experts compared to binary grading
  • Agents struggle most with 'implicit context' and 'inadequate reasoning' rather than just retrieving facts
Breakthrough Assessment
9/10
Addresses a critical gap in agent evaluation by moving beyond simple QA to complex, human-verified rubrics. The manual effort (2,800+ hours) provides a high-quality ground truth that automated benchmarks lack.
×