Super Research: Answering Highly Complex Questions with Large Language Models through Super Deep and Super Wide Research

📝 Paper Summary

Deep Research, Agentic Information Retrieval

Super Research is a rigorous benchmark for evaluating agentic deep research capabilities, requiring long-horizon planning and massive evidence synthesis to answer super-complex questions.

Core Problem

Current LLM evaluation paradigms fail to measure proficiency in solving highly complex, long-horizon research tasks that require synthesizing massive, conflicting evidence across hundreds of web pages.

Why it matters:

Existing benchmarks prioritize atomic fact recall but neglect the sophisticated synthesis required for professional intelligence or scientific discovery
Standard LLM-as-a-judge metrics often align poorly with deep reasoning quality, rewarding false confidence over necessary uncertainty expression
Deep Research agents need a 'ceiling protocol' to stress-test limitations in reasoning consistency and context management that simpler tasks don't trigger

Concrete Example: A question like 'optimizing immunopharmacological mechanisms where T cell activation must be balanced against tumor microenvironment immune escape' requires 100+ retrieval steps and synthesis across 1000+ pages, far exceeding the 10-20 steps of standard Wide/Deep Search.

Key Novelty

Super Research Benchmark & Graph-Anchored Auditing

Defines a new tier of 'Super Research' requiring structured decomposition, super wide retrieval for diverse perspectives, and super deep investigation for uncertainty resolution
Constructs a benchmark of 300 expert-written questions using a 'Cognitive-Rank Constrained Expert Simulation' followed by human verification
Introduces a graph-anchored evaluation protocol where generated reports are projected onto expert-curated Knowledge Graphs to measure depth, logic, and objectivity

Architecture

The expert-driven construction pipeline and the interactions between Planner, Researcher, Summarizer, and Writer agents

Evaluation Highlights

The SOTA system (Gemini Deep Research) achieves an Overall Score of only 28.62 out of 100, confirming the high difficulty ceiling of the benchmark
Native Search-Integrated Agents like Kimi-k2 (26.16) outperform some specialized Deep Research systems and search-augmented baselines
Standard agentic baselines (e.g., DeepSeek-r1 with Tavily) lag significantly, clustering in the 16-23 score range

Breakthrough Assessment

9/10

Establishes a critical 'ceiling' benchmark for the emerging field of Deep Research agents. The graph-anchored evaluation methodology is a significant advancement over standard LLM-as-a-judge approaches.

⚙️ Technical Details

Problem Definition

Setting: Open-ended, long-horizon research question answering

Inputs: Super-complex research query q

Outputs: Comprehensive research report with verifiable citations and intermediate artifacts (outlines, tables)

Pipeline Flow

Planner (Decomposes query into Research Graph)
Researcher (Executes sub-tasks, retrieves info)
Summarizer (Synthesizes results into Dynamic Memory)
Writer (Constructs final report based on Research Graph)
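
The four stages above can be read as a simple loop over planned tasks. The following is a minimal Python sketch under stated assumptions, not the paper's implementation: the names `plan`, `research`, `summarize`, `write`, `run_pipeline`, and `DynamicMemory` are hypothetical, and the stubbed bodies stand in for LLM and search calls.

```python
from dataclasses import dataclass, field


@dataclass
class DynamicMemory:
    """Running global context, updated after every sub-task (hypothetical structure)."""
    notes: list[str] = field(default_factory=list)

    def context(self) -> str:
        return "\n".join(self.notes)


def plan(query: str) -> list[str]:
    # Placeholder for the Planner: a real system would ask an LLM for a task graph.
    return [f"{query}: background", f"{query}: conflicting evidence"]


def research(task: str, context: str) -> str:
    # Placeholder for the Researcher: a real system would call search tools here.
    return f"findings for '{task}' (given {len(context)} chars of context)"


def summarize(findings: str, memory: DynamicMemory) -> None:
    # Placeholder for the Summarizer: a real system would condense before storing.
    memory.notes.append(findings)


def write(query: str, memory: DynamicMemory) -> str:
    # Placeholder for the Writer: assemble the report from the accumulated notes.
    return f"# Report: {query}\n" + memory.context()


def run_pipeline(query: str) -> str:
    memory = DynamicMemory()
    for task in plan(query):
        # Each sub-task sees the context produced by all earlier sub-tasks.
        summarize(research(task, memory.context()), memory)
    return write(query, memory)
```

The point of the sketch is the data flow: the Summarizer writes into a shared memory that later Researcher calls read from, so context evolves as the run proceeds.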

System Modules

Planner

Decomposes root topic into a structured DAG of research tasks organized into phases and chapters

Model or implementation: LLM-based agent (e.g., GPT-4o)

Researcher (Execution & Retrieval)

Executes sub-tasks in dependency-aware sequence, retrieving information

Model or implementation: LLM-based agent
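
A dependency-aware sequence over a task DAG amounts to a topological order. The sketch below uses Python's standard-library `graphlib.TopologicalSorter`; the task names and the `research_dag` structure are invented for illustration and do not come from the paper.

```python
from graphlib import TopologicalSorter

# Hypothetical research DAG: each task maps to the set of tasks it depends on.
research_dag = {
    "define_scope": set(),
    "collect_mechanism_evidence": {"define_scope"},
    "collect_clinical_evidence": {"define_scope"},
    "resolve_conflicts": {"collect_mechanism_evidence", "collect_clinical_evidence"},
    "draft_conclusions": {"resolve_conflicts"},
}


def execution_order(dag: dict[str, set[str]]) -> list[str]:
    """Return one valid order in which every task follows its dependencies."""
    # static_order() raises graphlib.CycleError if the graph is not acyclic.
    return list(TopologicalSorter(dag).static_order())


order = execution_order(research_dag)
```

The two evidence-collection tasks have no dependency on each other, so either may come first; only their position relative to `define_scope` and `resolve_conflicts` is fixed.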

Summarizer (Execution & Retrieval)

Synthesizes researcher results into Dynamic Memory to evolve global context

Model or implementation: LLM-based agent

Writer

Iteratively constructs the manuscript section by section using the structured research graph

Model or implementation: LLM-based agent

Novel Architectural Elements

Graph-anchored auditing protocol using projection of generated reports onto expert-curated knowledge graphs
Hierarchical task decomposition coupled with dynamic memory injection for progressive context evolution
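
The summary does not say how a report is projected onto a knowledge graph. As a rough illustration only, the sketch below marks a graph node as covered when some report sentence overlaps it strongly enough. Jaccard token overlap, the `threshold` value, and the names `tokens` and `project_report` are all assumptions made for this example; a real auditor would more plausibly use an LLM or embedding-based matcher.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def project_report(
    report_sentences: list[str],
    kg_nodes: dict[str, str],
    threshold: float = 0.5,
) -> set[str]:
    """Return ids of knowledge-graph nodes covered by at least one sentence.

    Matching uses Jaccard token overlap, a stand-in chosen for this sketch only.
    """
    covered = set()
    for node_id, node_text in kg_nodes.items():
        node_tokens = tokens(node_text)
        for sentence in report_sentences:
            sentence_tokens = tokens(sentence)
            union = node_tokens | sentence_tokens
            if union and len(node_tokens & sentence_tokens) / len(union) >= threshold:
                covered.add(node_id)
                break  # one matching sentence is enough for this node
    return covered


# Hypothetical expert-curated nodes and a one-sentence report.
kg = {
    "n1": "T cell activation drives response",
    "n2": "tumor microenvironment enables immune escape",
}
hits = project_report(["T cell activation drives the response"], kg)
```

The resulting set of covered node ids is the kind of intermediate artifact that graph-level metrics could then be computed over.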

Modeling

Base Model: Various (GPT-4o, Gemini, Sonar, and others were evaluated)

Comparison to Prior Work

vs. Deep Research: Super Research targets 100+ retrieval steps and 1000+ pages, versus the typical 10-20 iterations and 100 pages
vs. GPQA/GAIA: Focuses on long-horizon synthesis and report generation rather than short-answer or multiple-choice accuracy
vs. Standard RAG: Requires resolving conflicting evidence and synthesizing across heterogeneous sources rather than simple fact retrieval

Limitations

Super-complex questions are infrequent in routine consumer applications
Evaluation relies heavily on expert-curated graphs which are costly to scale
Benchmark construction involved human-in-the-loop verification, which may introduce annotator bias

Reproducibility

Leaderboard and benchmark details available at https://cnsdqd-dyb.github.io/Super-Research-Benchmark/. Specific prompt templates or model weights for the dataset construction pipeline are not explicitly linked in the provided text.

📊 Experiments & Results

Evaluation Setup

Complex research report generation evaluated against expert-curated Research Graphs

Benchmarks:

Super Research Benchmark (Long-horizon autonomous research) [New]

Metrics:

Overall Score
Coverage and Comprehension (Depth-Weighted Recall)
Logical Consistency
Report Utility
Objectivity Score
Citation Health

Statistical methodology: Not explicitly reported in the paper
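
Two of the metrics listed above can be illustrated with small functions. The exact formulas are not given in this summary, so both are assumptions: `depth_weighted_recall` weights covered graph nodes by level using invented `LEVEL_WEIGHTS`, and `dominant_source_share` is a hypothetical proxy for Citation Health that reports how much of the citation list comes from a single host.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical level weights: deeper synthesis counts for more than isolated facts.
LEVEL_WEIGHTS = {"atomic_fact": 1.0, "key_insight": 2.0, "global_insight": 3.0}


def depth_weighted_recall(reference_nodes: dict[str, str], covered: set[str]) -> float:
    """reference_nodes maps node id -> level; covered holds ids found in the report."""
    total = sum(LEVEL_WEIGHTS[level] for level in reference_nodes.values())
    if total == 0:
        return 0.0
    hit = sum(
        LEVEL_WEIGHTS[level]
        for node_id, level in reference_nodes.items()
        if node_id in covered
    )
    return hit / total


def dominant_source_share(cited_urls: list[str]) -> float:
    """Fraction of citations that come from the single most-cited host."""
    hosts = Counter(urlparse(url).netloc.lower() for url in cited_urls)
    if not hosts:
        return 0.0
    return hosts.most_common(1)[0][1] / len(cited_urls)
```

Under these assumed weights, a report that recovers one atomic fact and one key insight out of two facts, one key insight, and one global insight scores 3/7. A `dominant_source_share` close to 1.0 would flag single-source dependency.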

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Evaluation of Deep Research Systems shows they generally lead but struggle with the absolute ceiling of the benchmark.
Super Research Benchmark	Overall Score	27.04	28.62	+1.58
Super Research Benchmark	Overall Score	25.74	28.62	+2.88
Comparison between Native Search-Integrated Agents and Search-Augmented Baselines.
Super Research Benchmark	Overall Score	Not reported in the paper	26.16	Not reported in the paper
Super Research Benchmark	Overall Score	23.00	26.16	+3.16
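
The Δ column is the "This Paper" score minus the "Baseline" score. A short check of the three rows that report both values, using `Decimal` to avoid floating-point rounding noise (the row with an unreported baseline is skipped):

```python
from decimal import Decimal

# (baseline, this_paper) Overall Score pairs copied from the rows above.
rows = [("27.04", "28.62"), ("25.74", "28.62"), ("23.00", "26.16")]

# Delta = this_paper - baseline for each row.
deltas = [Decimal(best) - Decimal(base) for base, best in rows]

for (base, best), delta in zip(rows, deltas):
    print(f"{best} - {base} = +{delta}")
```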

Main Takeaways

Current SOTA systems score below 30 out of 100 on Overall Score, confirming that super-complex queries remain an unsolved frontier
Retrieval breadth positively correlates with reasoning depth
Native Search-Integrated Agents (e.g., Kimi-k2) can outperform framework-assembled agents (e.g., LangGraph + LLM), highlighting the value of native integration
Standard LLM-as-a-judge metrics are insufficient for this complexity; the graph-anchored protocol provides necessary granularity

📚 Prerequisite Knowledge

Prerequisites

Familiarity with Retrieval-Augmented Generation (RAG)
Understanding of Agentic workflows (planning, tool use)
Knowledge of Knowledge Graph structures

Key Terms

Deep Research: An agentic capability prioritizing vertical exploration, iteratively following chains of evidence to resolve nuanced questions

Wide Search: A paradigm prioritizing horizontal information coverage to capture an exhaustive array of information nodes

Super Research: A proposed task tier coupling structured decomposition, super wide retrieval (diverse perspectives), and super deep investigation (iterative uncertainty resolution)

Research Graph: A structured DAG (Directed Acyclic Graph) representation of the research task, linking atomic facts, key insights, and global insights

Atomic Facts Level: Graph nodes representing specific data points anchored to URLs

Citation Health: A diagnostic metric assessing source diversity to flag single-source dependency or narrative monopolization

Objectivity Score: A metric quantifying a model's ability to maintain multi-perspective balance and calibrate stance against inherent ambiguity

Logical Consistency: A metric assessing whether global conclusions are algorithmically grounded in atomic facts via unbroken citation chains
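
The Research Graph and Logical Consistency definitions above suggest a simple structural check: a conclusion is grounded only if every chain beneath it ends in a URL-anchored atomic fact. The sketch below is one possible reading, not the paper's scoring procedure; the `Node` fields, the `is_grounded` function, and the example URL are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    level: str                         # "atomic_fact", "key_insight", or "global_insight"
    url: Optional[str] = None          # only atomic facts carry a source URL
    supports: list[str] = field(default_factory=list)  # ids of lower-level nodes cited


def is_grounded(node_id: str, graph: dict[str, Node]) -> bool:
    """True if every chain below node_id ends in a URL-anchored atomic fact.

    Assumes the graph is acyclic, as a DAG is by definition.
    """
    node = graph.get(node_id)
    if node is None:
        return False                   # a dangling reference breaks the chain
    if node.level == "atomic_fact":
        return bool(node.url)
    # An insight with no support, or with any ungrounded support, fails.
    return bool(node.supports) and all(is_grounded(s, graph) for s in node.supports)


# Hypothetical graph: f2 lacks a URL, so anything relying on it is ungrounded.
graph = {
    "f1": Node("atomic_fact", url="https://example.org/a"),
    "f2": Node("atomic_fact"),
    "k1": Node("key_insight", supports=["f1"]),
    "k2": Node("key_insight", supports=["f1", "f2"]),
    "g1": Node("global_insight", supports=["k1"]),
    "g2": Node("global_insight", supports=["k1", "k2"]),
}
```

Here `is_grounded("g1", graph)` holds because its only chain reaches `f1`, while `g2` fails because one of its chains passes through the unsourced fact `f2`.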