ResearchRubrics: A Benchmark of Prompts and Rubrics For Evaluating Deep Research Agents

📝 Paper Summary

Deep Research Agent Evaluation · Open-Ended QA Benchmarking

ResearchRubrics provides a rigorous benchmark for Deep Research agents by pairing realistic, open-ended queries with 2,500+ expert-written rubrics to evaluate reasoning, factuality, and synthesis beyond simple Q&A.

Core Problem

Evaluating Deep Research agents is difficult because tasks are open-ended, require multi-step reasoning across diverse documents, and lack single correct answers, making standard QA metrics (like exact match) insufficient.

Why it matters:

Current benchmarks focus on short, verifiable facts, failing to capture the long-form synthesis required for real-world research tasks
Existing deep research benchmarks often rely on circular LLM-generated rubrics or reference reports, lacking expert human oversight
Users engage agents for broad topics (business, consumer queries), but current tests often narrow focus to specific technical domains like academic literature review

Concrete Example: A standard QA benchmark might ask 'What is the band gap of GaN?', expecting a short number. A Deep Research query asks 'Analyze the market viability of GaN semiconductors for EVs over the next decade', requiring synthesis of technical specs, market reports, and supply chain analysis—a task current metrics cannot grade effectively.

Key Novelty

Expert-Authored Fine-Grained Rubrics for Open-Ended Research

Pairs 101 diverse research prompts with over 2,500 human-written rubric criteria, avoiding the bias of LLM-generated evaluation standards
Introduces a 3-axis complexity framework (Conceptual Breadth, Logical Nesting Depth, Exploration Level) to categorize how 'deep' a research task truly is
Implements a ternary grading system (Satisfied, Partially Satisfied, Not Satisfied) for LLM-judges to better capture nuance in long-form answers compared to binary pass/fail

Architecture

The data collection pipeline involving three expert participants and the rubric design process.

Evaluation Highlights

State-of-the-art agents (OpenAI Deep Research, Gemini Deep Research) achieve under 68% average compliance with expert rubrics, highlighting significant room for improvement
Ternary grading (adding 'Partially Satisfied') improves alignment with human experts compared to binary grading
Agents struggle most with 'implicit context' and 'inadequate reasoning' rather than just retrieving facts

Breakthrough Assessment

9/10

Addresses a critical gap in agent evaluation by moving beyond simple QA to complex, human-verified rubrics. The manual effort (2,800+ hours) provides a high-quality ground truth that automated benchmarks lack.

⚙️ Technical Details

Problem Definition

Setting: Evaluation of autonomous agents on open-ended, multi-step research tasks

Inputs: Natural language research query q (e.g., business planning, historical analysis)

Outputs: Long-form research report or answer synthesized from web/document retrieval

Pipeline Flow

Prompt Selection (101 tasks across 9 domains)
Rubric Creation (Experts draft criteria)
Rubric Review (Iterative human review pipeline)
Agent Execution (Target agents generate responses)
Evaluation (LLM-as-a-judge grades responses against rubrics)
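
The final Evaluation step can be sketched roughly as follows. This is a minimal illustration, not the paper's released evaluation code: `grade_response`, `keyword_judge`, and the verdict label strings are hypothetical names, and the keyword matcher is only a toy stand-in for an LLM judge so that the example runs without any model access.

```python
from typing import Callable, Dict, List

# Assumed label strings for the ternary scheme (Satisfied / Partially Satisfied / Not Satisfied).
VERDICTS = ("satisfied", "partially_satisfied", "not_satisfied")


def grade_response(
    response: str,
    rubric: List[str],
    judge: Callable[[str, str], str],
) -> Dict[str, str]:
    """Ask the judge for one ternary verdict per rubric criterion."""
    verdicts: Dict[str, str] = {}
    for criterion in rubric:
        verdict = judge(response, criterion)
        if verdict not in VERDICTS:
            raise ValueError(f"unexpected verdict: {verdict!r}")
        verdicts[criterion] = verdict
    return verdicts


def keyword_judge(response: str, criterion: str) -> str:
    """Toy stand-in for an LLM judge: checks which criterion words appear in the response."""
    text = response.lower()
    words = criterion.lower().split()
    hits = sum(word in text for word in words)
    if hits == len(words):
        return "satisfied"
    if hits > 0:
        return "partially_satisfied"
    return "not_satisfied"


if __name__ == "__main__":
    report = "GaN supply chain costs are falling"
    rubric = ["supply chain", "supply risk", "regulatory"]
    for criterion, verdict in grade_response(report, rubric, keyword_judge).items():
        print(f"{criterion}: {verdict}")
```

In a real setup the `judge` callable would wrap a call to a strong LLM with the response and one criterion in its prompt; keeping it as a plain callable makes the grading loop independent of any particular model API.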

System Modules

Task Complexity Framework

Categorizes queries to ensure benchmark diversity

Rubric Criteria

Defines grading logic for each prompt

Model-Based Grader

Scores agent responses against rubric criteria

Model or implementation: A strong LLM judge (presumably GPT-4 class; the exact judge model is not named in the material summarized here)

Novel Architectural Elements

Tri-axial complexity framework (Conceptual Breadth, Logical Nesting Depth, Exploration Level) explicitly integrated into dataset construction to ensure coverage of reasoning types
Integration of negative rubric weights (penalties) alongside positive weights in a structured grading formula
Human-in-the-loop rubric generation pipeline with three distinct expert roles (Proposer, Reviewer, Final Approver) to prevent automation bias
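
One way to combine positive and negative rubric weights with ternary verdicts is sketched below. The summary does not reproduce the paper's exact formula, so everything here is an assumption made for illustration: the partial-credit value of 0.5, the normalization by the sum of positive weights, and the names `CREDIT` and `compliance_score`. A negative-weight criterion is read as describing an undesirable behavior, so a "satisfied" verdict on it subtracts from the score.

```python
from typing import List, Tuple

# Assumed credit per verdict; the 0.5 for partial satisfaction is illustrative.
CREDIT = {"satisfied": 1.0, "partially_satisfied": 0.5, "not_satisfied": 0.0}


def compliance_score(graded: List[Tuple[float, str]]) -> float:
    """Weighted compliance for one response.

    `graded` holds (weight, verdict) pairs. Positive weights reward desired
    content; negative weights penalize undesired content when triggered.
    The score is normalized by the total positive weight, so a response that
    fully meets every positive criterion and triggers no penalty scores 1.0.
    """
    max_positive = sum(weight for weight, _ in graded if weight > 0)
    if max_positive == 0:
        raise ValueError("rubric needs at least one positive-weight criterion")
    earned = sum(weight * CREDIT[verdict] for weight, verdict in graded)
    return earned / max_positive


if __name__ == "__main__":
    graded = [
        (5.0, "satisfied"),            # key requirement met
        (3.0, "partially_satisfied"),  # half credit under the assumed scheme
        (2.0, "not_satisfied"),        # missed entirely
        (-4.0, "satisfied"),           # penalty criterion triggered
    ]
    print(compliance_score(graded))  # (5 + 1.5 + 0 - 4) / 10 = 0.25
```

Under this normalization a heavily penalized response can score below zero; whether to clip at zero is a design choice the summary does not settle.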

Comparison to Prior Work

vs. HLE: Focuses on long-form, open-ended synthesis rather than short-answer factual correctness
vs. DeepResearch Bench: Uses fully human-written and reviewed rubrics to avoid circularity/anchoring bias inherent in LLM-generated rubrics
vs. DeepScholar-Bench: Covers 9 diverse domains (consumer, business, history) rather than just academic technical writing
vs. ReportBench: Evaluates reasoning process via fine-grained criteria rather than just textual overlap with a reference document
vs. ExpertLongBench [not cited in paper]: ExpertLongBench relies on existing reference material, which limits its scope to academic and professional tasks, whereas ResearchRubrics also includes general consumer queries and has higher rubric density

Limitations

Evaluation relies on an LLM-as-a-judge proxy, which may still have some misalignment with human experts despite calibration
The benchmark size (101 prompts) is relatively small compared to automated datasets, limited by the high cost of human annotation
Rubrics are specific to the prompts provided; extending the benchmark requires significant manual effort to write new rubrics

Reproducibility

Code: publicly available at https://scale.com/research/researchrubrics

The release includes all 101 prompts, the full set of 2,593 expert-written rubrics, and the evaluation code. The prompts and rubrics are the core artifact.

📊 Experiments & Results

Evaluation Setup

LLM-as-a-judge scoring of agent outputs against human-written rubrics

Benchmarks:

ResearchRubrics (Deep Research: open-ended, multi-step web research) [New]

Metrics:

Rubric Compliance Score (weighted sum of satisfied criteria)
Macro F1 (alignment between human and model graders)
Statistical methodology: Not explicitly reported in the paper
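
To make the Macro F1 alignment metric concrete, the sketch below computes it from scratch for ternary verdicts. The label abbreviations (`S`, `P`, `N`) and the sample verdict lists are invented for illustration and are not data from the paper; the function name `macro_f1` is likewise hypothetical.

```python
from typing import List, Sequence


def macro_f1(human: Sequence[str], model: Sequence[str], labels: List[str]) -> float:
    """Per-class F1 between human and model verdicts, averaged with equal class weight."""
    if len(human) != len(model):
        raise ValueError("verdict lists must have the same length")
    f1_scores = []
    for label in labels:
        pairs = list(zip(human, model))
        tp = sum(h == label and m == label for h, m in pairs)
        fp = sum(h != label and m == label for h, m in pairs)
        fn = sum(h == label and m != label for h, m in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1_scores.append(2 * precision * recall / (precision + recall))
        else:
            f1_scores.append(0.0)
    return sum(f1_scores) / len(f1_scores)


if __name__ == "__main__":
    # S = Satisfied, P = Partially Satisfied, N = Not Satisfied (made-up verdicts).
    human = ["S", "S", "P", "N", "N", "P"]
    model = ["S", "P", "P", "N", "S", "P"]
    # Per-class F1: S = 0.5, P = 0.8, N = 2/3, so the macro average is about 0.656.
    print(round(macro_f1(human, model, ["S", "P", "N"]), 3))
```

Because every class counts equally, a judge that handles the rarer "Partially Satisfied" class poorly is penalized as much as one that mishandles the common classes.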

Key Results

Benchmark	Metric	Ceiling (full compliance)	Leading agents	Δ
ResearchRubrics	Average Compliance (%)	100	<68	< -32

Experiment Figures

Distribution of the 101 prompts across 9 domain categories (e.g., Business, History, AI/ML)

Main Takeaways

Current SOTA agents (OpenAI, Gemini, Perplexity) fail to meet over 30% of expert criteria on average, indicating deep research is far from solved
Failures are primarily driven by missed implicit context and inadequate reasoning, rather than just failure to retrieve documents
Ternary grading schemes allow for more nuanced evaluation of partial success in complex research tasks compared to binary metrics
The complexity framework reveals that models may perform differently depending on whether a task requires high breadth (many sources) vs. high depth (complex reasoning)

📚 Prerequisite Knowledge

Prerequisites

Understanding of Retrieval-Augmented Generation (RAG) and autonomous agents
Familiarity with LLM-as-a-judge evaluation paradigms
Basic knowledge of precision/recall and F1 scoring

Key Terms

Deep Research (DR): Autonomous LLM-based systems that conduct multi-step web exploration, targeted retrieval, and synthesis to answer open-ended queries

LLM-as-a-judge: Using a strong LLM to evaluate the outputs of other models based on specific criteria

Rubric: A set of specific criteria used to grade subjective or complex work; here, expert-written rules for what a good answer must contain

Ternary Grading: A grading scale with three values (Satisfied, Partially Satisfied, Not Satisfied) rather than just Pass/Fail

Macro F1: A metric that calculates the F1 score (harmonic mean of precision and recall) for each class independently and then takes the average, treating all classes equally

Anchoring Bias: Cognitive bias where reliance on an initial piece of information (e.g., an LLM-generated rubric) heavily influences subsequent judgments

Conceptual Breadth: One of the paper's complexity axes; the number and diversity of distinct topics or domains involved in a query

Logical Nesting Depth: One of the paper's complexity axes; the number of reasoning steps or sub-questions required to answer the main query

Exploration Level: One of the paper's complexity axes; the degree of open-endedness or underspecification in the user's goal
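
The three complexity axes defined above could be recorded per prompt and tallied to check coverage, roughly as sketched here. The three-level scale (`LOW`, `MEDIUM`, `HIGH`) is an assumption, since this summary does not state how the paper discretizes each axis, and `ComplexityProfile` and `axis_coverage` are hypothetical names.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Iterable


class Level(Enum):
    # Assumed three-point scale; the paper's actual levels may differ.
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass(frozen=True)
class ComplexityProfile:
    """Complexity tags for one research prompt, one level per axis."""
    conceptual_breadth: Level
    logical_nesting_depth: Level
    exploration_level: Level


def axis_coverage(profiles: Iterable[ComplexityProfile], axis: str) -> Counter:
    """Count how many prompts fall at each level of the named axis."""
    return Counter(getattr(profile, axis) for profile in profiles)


if __name__ == "__main__":
    profiles = [
        ComplexityProfile(Level.HIGH, Level.LOW, Level.MEDIUM),
        ComplexityProfile(Level.HIGH, Level.HIGH, Level.LOW),
        ComplexityProfile(Level.MEDIUM, Level.MEDIUM, Level.HIGH),
    ]
    for axis in ("conceptual_breadth", "logical_nesting_depth", "exploration_level"):
        print(axis, dict(axis_coverage(profiles, axis)))
```

A tally like this makes gaps visible, for example a prompt set with no high-depth tasks, which is the kind of diversity check the framework is meant to support.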