Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections

📝 Paper Summary

Agentic RAG pipeline, Benchmark datasets, Document understanding

MADQA introduces a benchmark of 2,250 human-authored questions over heterogeneous PDFs to reveal that while current agents can match human accuracy, they rely on inefficient brute-force search rather than strategic planning.

Core Problem

Existing document QA benchmarks either lack visual complexity (text-only), are restricted to single domains (finance), or suffer from data integrity issues (LLM-generated questions, recycled documents), failing to test true agentic planning.

Why it matters:

Businesses need agents to handle complex workflows over real-world PDFs (reports, contracts), not just plain text or single images
Current evaluations conflate retrieval quality with generation skill and don't measure the 'cost' of reasoning
There is a critical need to distinguish agents that genuinely plan from those that search at random until they stumble on the answer

Concrete Example: A user asks: 'Which lesson plan suggests a lower instructor ratio: Firearms or New Mexico Justice System?' An agent must retrieve two distinct documents, extract the ratios from potentially different page layouts, and compare them. Current agents often fail to stop searching after finding the answer, or get stuck in loops, wasting compute.

Key Novelty

MADQA: Multimodal Agentic Document QA Benchmark

A dataset of 2,250 questions over 800 fresh, heterogeneous PDFs, designed using Classical Test Theory to maximize discriminative power between model abilities
A novel 'accuracy-effort trade-off' evaluation protocol using the Kuiper statistic to measure whether an agent's computational spend correlates with its success rate
Strict 'agentic' design where questions require multi-hop reasoning across pages/docs that cannot be solved by a single retrieval query or external knowledge
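
To make the accuracy-effort protocol above more concrete, here is a minimal sketch of the standard two-sample Kuiper statistic: the largest positive deviation plus the largest negative deviation between two empirical cumulative distributions. Which two distributions the benchmark compares, and how it scales the result, is not specified in this summary; the function name and the pairing of step counts for solved versus unsolved questions are illustrative assumptions.

```python
from bisect import bisect_right


def kuiper_statistic(sample_a, sample_b):
    """Two-sample Kuiper statistic V = D+ + D- between empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    # Evaluate both empirical CDFs at every observed point.
    diffs = [
        bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b)
        for x in a + b
    ]
    # At the largest point both CDFs equal 1, so max >= 0 and min <= 0.
    return max(diffs) - min(diffs)


# Hypothetical step counts, for illustration only (not data from the paper).
steps_when_correct = [1, 2, 2, 3]
steps_when_wrong = [4, 5, 5, 6]
v = kuiper_statistic(steps_when_correct, steps_when_wrong)
```

Unlike the Kolmogorov-Smirnov statistic, which takes only the single largest deviation, this sum is sensitive to deviations in both directions.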

Architecture

Conceptual workflow of the Agentic Document Collection QA task

Evaluation Highlights

Gemini 3 Pro BM25 Agent achieves 82.2% accuracy, matching human search performance (82.2%) but lagging behind human oracle performance (99.4%)
Humans achieve 50% accuracy on their very first query (zero-shot calibration), while Gemini 3 Pro starts at ~12% and relies on expensive iterative recovery
Simple agents with constrained search tools (BM25) outperform expensive, unconstrained Recursive Language Models (RLMs) while avoiding catastrophic cost overheads ($850 for one RLM run)

Breakthrough Assessment

9/10

Sets a new standard for agentic document benchmarking with rigorous human annotation and a novel focus on efficiency/calibration rather than just raw accuracy.

⚙️ Technical Details

Problem Definition

Setting: Agentic Document Collection Visual Question Answering

Inputs: Corpus C of multi-page documents and a natural language query q

Outputs: Answer a and a minimal evidence set E ⊆ C

Pipeline Flow

User Query -> Agent Loop (Thought -> Action -> Observation) -> Final Answer + Evidence
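
The loop above can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's harness: `Decision`, `run_agent`, `scripted_plan`, `toy_search`, the page identifiers, and the step budget are all hypothetical names and values, and the scripted planner stands in for a vision-language model.

```python
from dataclasses import dataclass, field


@dataclass
class Decision:
    thought: str
    action: str       # "search" or "answer"
    argument: str     # a search query, or the final answer text
    evidence: list = field(default_factory=list)  # pages cited with an answer


def run_agent(question, plan, search, max_steps=5):
    """Iterate Thought -> Action -> Observation until an answer or the budget ends."""
    trace = []  # list of (decision, observation) pairs
    for _ in range(max_steps):
        decision = plan(question, trace)
        if decision.action == "answer":
            return decision.argument, decision.evidence, trace
        observation = search(decision.argument)
        trace.append((decision, observation))
    return None, [], trace  # budget exhausted without an answer


# Hypothetical two-document index keyed by exact query string.
TOY_INDEX = {
    "firearms instructor ratio": ["doc_a:p3"],
    "justice system instructor ratio": ["doc_b:p7"],
}


def toy_search(query):
    return TOY_INDEX.get(query, [])


def scripted_plan(question, trace):
    queries = list(TOY_INDEX)
    if len(trace) < len(queries):
        return Decision("Need another ratio.", "search", queries[len(trace)])
    pages = [page for _, obs in trace for page in obs]
    return Decision("Both ratios found; stop searching.", "answer",
                    "<comparison result>", pages)


answer, evidence, trace = run_agent(
    "Which lesson plan suggests a lower instructor ratio?",
    scripted_plan, toy_search)
```

The explicit stop condition and step budget correspond to the two failure modes noted earlier: agents that keep searching after finding the answer, and agents that loop.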

System Modules

BM25 Search Tool

Provides text-based search capabilities over the document corpus

Model or implementation: BM25 (algorithm)
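
For readers unfamiliar with BM25, the following is a minimal sketch of Okapi BM25 scoring over whitespace-tokenized text. The parameter values (`k1=1.5`, `b=0.75`), the smoothed IDF variant, and the tokenizer are common defaults chosen here as assumptions; the summary does not say which implementation or settings the benchmark's search tool uses.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document in `docs` against `query` with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Smoothed IDF: rarer terms contribute more.
            idf = math.log(
                1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical page texts, for illustration only.
pages = [
    "instructor ratio for firearms training",
    "annual budget report",
    "firearms safety",
]
scores = bm25_scores("firearms ratio", pages)
```

Because scoring is purely lexical, the agent must guess wording that actually appears in the documents, which is one reason query reformulation matters in this setting.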

Vision-Language Planner

Formulates search queries, analyzes visual page content, and aggregates findings

Model or implementation: Evaluated Models (e.g., Gemini 3 Pro, GPT-5, Claude Sonnet 4.5)

Novel Architectural Elements

Evaluation Harness: A specialized environment that logs every search query, page view, and timestamp to calculate 'effort' alongside accuracy
Sentinel Pool Design: Explicit reservation of items that no current model can solve (20% of the test set) to maintain discriminatory power over time

Modeling

Base Model: Various (Gemini 3 Pro, GPT-5, Claude Sonnet 4.5, etc.)

Comparison to Prior Work

vs. DocVQA: MADQA involves 800 multi-page documents requiring retrieval and navigation, not just reading a single given image
vs. ViDoRE: MADQA uses 100% human-authored questions and fresh documents to avoid data contamination and synthetic bias
vs. RAG services: MADQA evaluates the 'agentic' ability to plan and refine queries, not just static retrieval performance
vs. GAIA [not cited in paper]: GAIA tests general tool use; MADQA focuses specifically on the visual/textual document navigation and evidence attribution domain

Limitations

Corpus limited to English-language and predominantly U.S.-centric documents
Public documents may have been seen during pre-training (though 'guessability' analysis bounds this at ~11%)
Page-level evidence granularity may be too coarse for some fine-grained extraction tasks (e.g., specific table cells)
Evaluation relies on LLM-as-a-judge (though calibrated to 0.88 Cohen's kappa with humans)
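
As context for the judge-agreement figure in the last limitation, Cohen's kappa corrects raw agreement for the agreement expected by chance. A minimal sketch follows; the function name and the example labels are hypothetical and are not data from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters over the same items."""
    n = len(labels_a)
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if both raters labeled independently at their own rates.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / n ** 2
    if expected == 1:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical correct/incorrect verdicts (1 = correct), for illustration only.
judge_verdicts = [1, 1, 0, 0, 1, 0, 1, 1]
human_verdicts = [1, 1, 0, 0, 1, 0, 0, 1]
kappa = cohens_kappa(judge_verdicts, human_verdicts)
```

In this toy case the raters agree on 7 of 8 items (0.875) while chance agreement is 0.5, so kappa is 0.75.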

Reproducibility

Code: https://github.com/Snowflake-Labs/MADQA

Publicly available: Dataset (MADQA), evaluation harness code. Reproducible baselines: BM25 MLLM Agent, Claude Agent with Semtools. Missing: Weights for the proprietary models (GPT-5, Gemini 3 Pro) are not available.

📊 Experiments & Results

Evaluation Setup

Agentic retrieval and QA over a corpus of 800 PDF documents

Benchmarks:

MADQA (Agentic Document Collection VQA) [New]

Metrics:

Accuracy (LLM-judge)
Page F1 (Evidence Retrieval)
Doc F1 (Document Retrieval)
Kuiper Statistic (Effort Calibration)
Statistical methodology: Classical Test Theory for split creation; Cohen's kappa for judge agreement; Confidence intervals reported for accuracy
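
The two retrieval metrics listed above can be sketched as set-overlap F1 scores. This is a minimal illustration assuming pages are identified by hypothetical `(doc_id, page_number)` pairs and that two empty sets count as a perfect match; the paper's exact conventions may differ.

```python
def set_f1(predicted, gold):
    """F1 between a predicted set and a gold set."""
    predicted, gold = set(predicted), set(gold)
    if not predicted and not gold:
        return 1.0  # assumed convention: nothing to cite, nothing cited
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def page_f1(predicted_pages, gold_pages):
    """Overlap between cited pages and the annotated evidence pages."""
    return set_f1(predicted_pages, gold_pages)


def doc_f1(predicted_pages, gold_pages):
    """Same comparison after projecting each page onto its document."""
    return set_f1({doc for doc, _ in predicted_pages},
                  {doc for doc, _ in gold_pages})


# Hypothetical citations, for illustration only.
cited = {("doc_a", 1), ("doc_a", 2)}
gold = {("doc_a", 1), ("doc_b", 3)}
p_f1 = page_f1(cited, gold)
d_f1 = doc_f1(cited, gold)
```

Here Page F1 is 0.5 (one of two cited pages is correct, and one of two gold pages is found), while Doc F1 is about 0.67 because the single cited document is correct but one gold document is missed.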

Key Results

Top-performing agentic systems outperform static RAG baselines, but significant gaps remain compared to human performance.

Benchmark	Metric	Baseline	This Paper	Δ
MADQA	Accuracy	78.6	82.2	+3.6
MADQA	Accuracy	99.4	82.2	-17.2
MADQA	Kuiper Statistic (lower is better)	14.6	25.8	+11.2
MADQA	Page F1	70.1	78.5	+8.4
MADQA	Total Cost (USD) for test set	849	25.2	-823.8

Experiment Figures

The Classical Test Theory selection process for the benchmark splits

Accuracy as a function of step limit (N)

Main Takeaways

Agentic systems (iterative search) consistently outperform static RAG (single-step), validating the need for planning in complex document QA.
Models and humans fail on different questions: humans struggle with attention fatigue (extraction errors), while models struggle with retrieval planning.
The 'Oracle Gap' of ~18% indicates that retrieval, not reasoning, is the primary bottleneck; giving models perfect context would solve most remaining errors.
Agents suffer from poor 'Cold Start' performance: they need many steps to reach accuracy levels that humans achieve in just 1-2 queries.

📚 Prerequisite Knowledge

Prerequisites

Understanding of RAG (Retrieval-Augmented Generation)
Familiarity with agentic workflows (tool use, planning)
Basic statistics (correlation, distribution analysis)

Key Terms

Classical Test Theory: A psychometric framework used here to select benchmark questions that best discriminate between strong and weak models while reserving hard items to prevent saturation

Kuiper statistic: A metric that sums the largest positive and largest negative deviations between two cumulative distributions; used here to quantify how well an agent's effort (step count) aligns with its probability of success

Sentinel Pool: A reserved 20% subset of the test set containing items no current model can solve, ensuring the benchmark remains relevant as models improve

Page F1: A metric measuring the overlap between the set of pages cited by the agent and the human-annotated minimal evidence set

RLM: Recursive Language Model, a system in which an LLM recursively writes code to process document collections, often unconstrained

BM25: A standard probabilistic information retrieval function that ranks documents by query term frequency, weighted by inverse document frequency and normalized for document length

Cold Start: The phenomenon where agents have very low accuracy on their initial attempt/query compared to humans, requiring many iterations to recover

Agentic property: The condition where no single retrieval query exists that can surface all necessary evidence, necessitating iterative planning

Multi-hop: Questions requiring information aggregation from multiple disjoint pages or documents

Closed-World: Constraints where the answer must be derived solely from the provided corpus, excluding external parametric knowledge
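
To illustrate the Classical Test Theory entry above, the sketch below computes the two standard item statistics: difficulty (the proportion of models answering an item correctly, so higher means easier) and discrimination (the correlation between an item's score and the total score on the remaining items). The selection thresholds and exact procedure used to build the benchmark splits are not given in this summary, so `item_statistics` and the response matrix are hypothetical.

```python
from statistics import mean, pstdev


def item_statistics(responses):
    """responses[m][i] is 1 if model m answered item i correctly, else 0.

    Returns one (difficulty, discrimination) pair per item.
    """
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]
    stats = []
    for i in range(n_items):
        item = [row[i] for row in responses]
        # Rest score excludes the item itself to avoid inflating the correlation.
        rest = [total - x for total, x in zip(totals, item)]
        difficulty = mean(item)
        sd_item, sd_rest = pstdev(item), pstdev(rest)
        if sd_item == 0 or sd_rest == 0:
            discrimination = 0.0  # no variance, so the item separates nothing
        else:
            cov = mean([x * r for x, r in zip(item, rest)]) - mean(item) * mean(rest)
            discrimination = cov / (sd_item * sd_rest)
        stats.append((difficulty, discrimination))
    return stats


# Hypothetical results for 4 models on 3 items, for illustration only.
responses = [
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
stats = item_statistics(responses)
```

The third item, which no model solves, has zero discrimination today; items like this are the kind a sentinel pool would hold in reserve so the benchmark can still separate stronger future models.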