UIS-Digger: Towards Comprehensive Research Agent Systems for Real-world Unindexed Information Seeking

📝 Paper Summary

Web agents, Multi-agent systems

UIS-Digger addresses the inability of current agents to access unindexed web content by employing a multi-agent framework with dual-mode browsing and fine-tuning on a mix of real and simulated interaction data.

Core Problem

Current information-seeking agents rely heavily on search engine indices (Indexed Information Seeking), failing to retrieve vital information hidden in unindexed pages, files, or dynamic web elements.

Why it matters:

Search engines cannot index deep web content, dynamic forms, or obscure files, leaving a critical blind spot for AI agents
Existing benchmarks (GAIA, BrowseComp) do not distinguish between indexed and unindexed tasks, masking the severity of agent failure in real-world exploration
Sole reliance on search APIs limits agents to surface-level information, preventing them from solving complex tasks like verifying flight prices or parsing specific corporate reports

Concrete Example: For a UIS question requiring specific historical data, a standard agent uses Google Search and fails because the answer is inside a downloadable Excel file or behind a date-picker widget, whereas UIS-Digger navigates the site, interacts with the widget, and parses the file.

Key Novelty

UIS-Digger Multi-Agent Framework

Formalizes 'Unindexed Information Seeking' (UIS) as a distinct problem where answers exist on the web but are not retrievable via search engine snippets
Introduces a dual-mode web surfer that shares memory between textual (HTML) and visual (screenshot) modes to handle both efficient reading and complex visual interactions
Utilizes a training pipeline mixing real-world deep web exploration and 'virtual websites' (simulated environments like booking systems) to bootstrap interactive capabilities

Architecture

The UIS-Digger architecture, detailing the multi-agent collaboration between Planner, Searcher, Surfer, and Reader

Evaluation Highlights

Achieves state-of-the-art performance of 27.27% on the newly introduced UIS-QA benchmark, surpassing baselines integrated with GPT-4.1 and O3
Demonstrates that even top agents suffer massive performance drops on UIS tasks (e.g., from 70.90% on GAIA to ~25% on UIS-QA), highlighting the benchmark's difficulty
Outperforms the strongest baseline by +1.82 percentage points using a significantly smaller ~30B parameter backbone model compared to proprietary giants

Breakthrough Assessment

8/10

Identifies a critical, under-explored gap in agentic web search (UIS) and provides both a rigorous benchmark and a novel architectural solution that beats larger models.

⚙️ Technical Details

Problem Definition

Setting: Information seeking where the target information resides in the set of unindexed webpages or files, not accessible via search engine snippets

Inputs: Natural language user query Q

Outputs: Deterministic short-form answer z

Pipeline Flow

Planner: Decomposes query → sub-tasks
Execution Group: Web Searcher / Web Surfer / File Reader (parallel execution)
Planner: Synthesizes final answer
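
The flow above can be sketched as a small orchestration loop. This is a minimal illustration, not the paper's implementation: the function names (`plan`, `web_searcher`, `web_surfer`, `file_reader`, `run_pipeline`) are hypothetical, the agents are string-returning stubs rather than LLM calls, and the "parallel execution" note is taken at face value by using a thread pool.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-ins for the three executor agents. A real system
# would back each one with an LLM and browsing/file tools.
def web_searcher(task: str) -> str:
    return f"search:{task}"

def web_surfer(task: str) -> str:
    return f"surf:{task}"

def file_reader(task: str) -> str:
    return f"read:{task}"

EXECUTORS: Dict[str, Callable[[str], str]] = {
    "search": web_searcher,
    "surf": web_surfer,
    "read": file_reader,
}

def plan(query: str) -> List[Tuple[str, str]]:
    """Toy planner: map the query to (agent, sub-task) pairs."""
    return [
        ("search", f"find entry page for: {query}"),
        ("surf", f"navigate site for: {query}"),
        ("read", f"parse downloaded file for: {query}"),
    ]

def run_pipeline(query: str) -> str:
    """Planner decomposes, executors run concurrently, planner synthesizes."""
    subtasks = plan(query)
    with ThreadPoolExecutor(max_workers=len(subtasks)) as pool:
        futures = [pool.submit(EXECUTORS[agent], task) for agent, task in subtasks]
        # Collect results in submission order so synthesis is deterministic.
        observations = [future.result() for future in futures]
    # Toy synthesis step: a real planner would reason over the observations.
    return " | ".join(observations)
```

In practice the planner would likely run several rounds, issuing new sub-tasks based on earlier observations; the single round here only shows the decompose, execute, synthesize shape.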

System Modules

Planner

Decompose user query, coordinate subordinate agents, and deliver final answer

Model or implementation: Fine-tuned ~30B LLM

Web Searcher

Retrieve indexed information using search engines and crawl tools

Model or implementation: Fine-tuned ~30B LLM

Web Surfer

Navigate unindexed pages via dual-mode (text/visual) interactions

Model or implementation: Fine-tuned ~30B LLM

File Reader

Parse and extract content from downloaded files (PDF, XLSX, DOCX)

Model or implementation: Fine-tuned ~30B LLM

Novel Architectural Elements

Dual-mode memory-shared browsing: The Web Surfer maintains a single consistent state history while dynamically switching between HTML-text mode (efficiency) and Visual-screenshot mode (complex layout understanding)
Integration of virtual website simulators into the training pipeline to bootstrap interactive capabilities (e.g., date pickers) before real-world deployment
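
A minimal sketch of the shared-memory idea behind dual-mode browsing, assuming a single history list that both modes append to. The class and method names (`SurferState`, `observe`, `switch_mode`) are hypothetical and the observations are plain strings; the paper's actual state representation is not described in this summary.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SurferState:
    """Hypothetical surfer state: one history shared by both modes."""
    url: str
    mode: str = "text"  # "text" (HTML) or "visual" (screenshot)
    history: List[str] = field(default_factory=list)

    def observe(self, html_text: str, screenshot_desc: str) -> str:
        # Pick the observation that matches the current mode.
        obs = html_text if self.mode == "text" else screenshot_desc
        self.history.append(f"[{self.mode}] observed: {obs}")
        return obs

    def switch_mode(self, mode: str) -> None:
        if mode not in ("text", "visual"):
            raise ValueError(f"unknown mode: {mode}")
        # The switch itself is logged, so the other mode sees why it was entered.
        self.history.append(f"[{self.mode}] switch -> {mode}")
        self.mode = mode
```

Because `history` is never reset on a switch, whatever the agent learned in text mode remains available when it moves to visual mode for a widget such as a date picker, and vice versa.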

Modeling

Base Model: ~30B parameter LLM

Training Method: Two-stage tuning: Supervised Fine-Tuning (SFT) followed by Rejection Sampling Fine-Tuning (RFT)

Adaptation: Full fine-tuning (implied by context of SFT/RFT on backbone)

Training Data:

Real-world data: an agent explores 100+ base websites and extracts information, from which an LLM generates QA pairs
Virtual data: 3 types of simulated websites (e.g., flight booking, statistics) backed by JSON databases, designed to force the agent to learn widget interactions
Judge model filters subjective/ambiguous questions
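
To illustrate how a JSON-backed virtual website can yield QA pairs with verifiable answers, here is a minimal sketch. The records, field names, and the helpers `query_flights` and `make_qa` are all invented for illustration; the paper's actual schemas and generation prompts are not given in this summary.

```python
import json
from typing import Dict, List, Optional

# Hypothetical backing database for a simulated flight-booking site.
FLIGHT_DB_JSON = """
[
  {"origin": "PEK", "dest": "SHA", "date": "2024-05-01", "airline": "A", "price": 640},
  {"origin": "PEK", "dest": "SHA", "date": "2024-05-01", "airline": "B", "price": 520},
  {"origin": "PEK", "dest": "SHA", "date": "2024-05-02", "airline": "A", "price": 480}
]
"""
FLIGHT_DB: List[Dict] = json.loads(FLIGHT_DB_JSON)

def query_flights(db: List[Dict], origin: str, dest: str, date: str) -> List[Dict]:
    """What the site would return after the agent fills the form and date picker."""
    return [r for r in db
            if r["origin"] == origin and r["dest"] == dest and r["date"] == date]

def make_qa(db: List[Dict], origin: str, dest: str, date: str) -> Optional[Dict[str, str]]:
    """Build one QA pair whose answer is only reachable by querying the site."""
    matches = query_flights(db, origin, dest, date)
    if not matches:
        return None  # no flights, so no well-defined question
    cheapest = min(matches, key=lambda r: r["price"])
    question = f"What is the cheapest fare from {origin} to {dest} on {date}?"
    return {"question": question, "answer": str(cheapest["price"])}
```

Since the answer is computed from the database rather than written by hand, it is deterministic and can be checked automatically, while the agent itself can only reach it by interacting with the form.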

Key Hyperparameters:

sft_temperature: 0
rft_temperature: 0.4
rft_sampling_group_size: 4
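
How the two RFT hyperparameters might be used can be sketched as a rejection-sampling loop: draw a group of 4 samples per question at temperature 0.4 and keep only those judged correct. The function `rejection_sample` and the `sample_fn` / `is_correct` callables are hypothetical placeholders for the policy model and the answer verifier; this is not the authors' training code.

```python
from typing import Callable, Dict, List

RFT_TEMPERATURE = 0.4  # from the reported hyperparameters
RFT_GROUP_SIZE = 4     # samples drawn per question

def rejection_sample(
    questions: List[Dict[str, str]],
    sample_fn: Callable[[str, float], str],
    is_correct: Callable[[str, str], bool],
    group_size: int = RFT_GROUP_SIZE,
    temperature: float = RFT_TEMPERATURE,
) -> List[Dict[str, str]]:
    """Keep only sampled trajectories whose final answer is judged correct."""
    kept: List[Dict[str, str]] = []
    for item in questions:
        for _ in range(group_size):
            trajectory = sample_fn(item["question"], temperature)
            if is_correct(trajectory, item["answer"]):
                kept.append({"question": item["question"], "trajectory": trajectory})
    return kept

# Usage with a deterministic stub in place of a model:
_canned = iter(["4", "5", "4", "3"])
demo = rejection_sample(
    [{"question": "2+2?", "answer": "4"}],
    sample_fn=lambda question, temperature: next(_canned),
    is_correct=lambda trajectory, gold: trajectory == gold,
)
# demo holds the two samples that matched the gold answer.
```

The kept trajectories would then serve as the fine-tuning set for the RFT stage; whether duplicates or multiple correct samples per question are retained is a design choice this summary does not specify.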

Compute: Not reported in the paper

Comparison to Prior Work

vs. Generalist Agents: UIS-Digger uses specialized sub-agents and dual-mode browsing specifically for unindexed content, whereas generalists often rely on search snippets
vs. WebArena: UIS-Digger focuses on information seeking (finding answers) rather than just successful task execution/navigation state
vs. RAG-based search: Explicitly handles 'unindexed' content that RAG systems (dependent on pre-indexed corpora) cannot see [not cited in paper]

Limitations

Relies on external search engines (Google Serper) as an entry point, inheriting their biases or downtime risks
Evaluation set is relatively small (110 QA pairs) compared to general benchmarks
Performance is still low (27.27%), indicating UIS remains a largely unsolved problem

Reproducibility

The paper introduces the UIS-QA benchmark, comprising 110 expert-annotated QA pairs. The availability of code and model weights is not explicitly stated (no URL is given). The method relies on a proprietary search service (Google Serper) and on potentially closed judge models (DeepSeek-R1 is mentioned as an offline filter, z.ai as a verifier).

📊 Experiments & Results

Evaluation Setup

Open-ended web information seeking starting from Google Search to answer factual questions

Benchmarks:

UIS-QA (Unindexed Information Seeking) [New]
GAIA (General AI Assistant tasks)
BrowseComp-zh (Chinese-language web browsing benchmark)

Metrics:

Accuracy (Exact match or rule-based verification)
Statistical methodology: Not explicitly reported in the paper
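
A minimal sketch of an exact-match accuracy metric of the kind named above. The normalization rules (lowercasing, dropping thousands separators and most punctuation) are assumptions chosen for illustration; the benchmark's actual verification rules are not specified in this summary.

```python
import re
from typing import List

def normalize(answer: str) -> str:
    """Assumed normalization: case, thousands separators, punctuation, spacing."""
    text = answer.strip().lower()
    text = re.sub(r"(?<=\d),(?=\d)", "", text)  # 1,234 -> 1234
    text = re.sub(r"[^\w\s.%-]", "", text)      # drop other punctuation
    text = re.sub(r"\s+", " ", text)            # collapse whitespace
    return text.rstrip(".")                     # ignore a trailing period

def exact_match(prediction: str, gold: str) -> bool:
    return normalize(prediction) == normalize(gold)

def accuracy(predictions: List[str], golds: List[str]) -> float:
    """Percentage of predictions that exactly match after normalization."""
    if not golds or len(predictions) != len(golds):
        raise ValueError("predictions and golds must be non-empty and aligned")
    correct = sum(exact_match(p, g) for p, g in zip(predictions, golds))
    return 100.0 * correct / len(golds)
```

Short, deterministic answers make this kind of string-level check workable; free-form answers would need a rule-based or model-based verifier instead.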

Key Results

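Performance comparisons on the new UIS-QA benchmark show UIS-Digger outperforming strong baselines, despite all models struggling compared to standard benchmarks. In the second row, the two values are a top baseline agent's accuracy on GAIA and on UIS-QA respectively, not a baseline-versus-UIS-Digger comparison.
Benchmark	Metric	Baseline	This Paper	Δ
UIS-QA	Accuracy (%)	25.45	27.27	+1.82
GAIA vs UIS-QA	Accuracy (%)	70.90	24.55	-46.35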
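
The reported percentages can be sanity-checked against the benchmark size. The sketch below assumes every UIS-QA accuracy is computed over all 110 questions; the implied question counts are inferred under that assumption, not reported values.

```python
N_QUESTIONS = 110  # size of UIS-QA as stated in this summary

def implied_correct(acc_percent: float, n: int = N_QUESTIONS) -> int:
    """Number of correct answers implied by an accuracy over n questions."""
    return round(acc_percent / 100 * n)

uis_digger_correct = implied_correct(27.27)  # UIS-Digger on UIS-QA
baseline_correct = implied_correct(25.45)    # strongest baseline on UIS-QA
top_agent_correct = implied_correct(24.55)   # GAIA-leading agent on UIS-QA

# Deltas from the table, in percentage points.
delta_points = round(27.27 - 25.45, 2)
gaia_drop_points = round(70.90 - 24.55, 2)

# The same GAIA-to-UIS-QA drop expressed relative to the GAIA score.
gaia_drop_relative = round((70.90 - 24.55) / 70.90 * 100, 1)
```

Under this assumption the +1.82 point margin corresponds to two questions out of 110, which is worth keeping in mind given the small evaluation set noted under Limitations.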

Experiment Figures

Contrast between Indexed Information Seeking (IIS) and Unindexed Information Seeking (UIS)

Main Takeaways

Current state-of-the-art agents suffer a massive performance drop (about 46 percentage points, from 70.90% to 24.55%) when shifting from indexed (GAIA) to unindexed (UIS-QA) tasks
Specialized training with simulated 'virtual websites' (for widget interactions) and RFT is crucial for handling complex web elements
Dual-mode browsing (text + visual) allows the agent to navigate deeper and more effectively than single-mode baselines
The low ceiling (<30%) on UIS-QA suggests that unindexed information seeking is a significantly harder and distinct problem from standard search

📚 Prerequisite Knowledge

Prerequisites

Understanding of agentic workflows (ReAct)
Web browsing technologies (DOM, screenshots)
Basic LLM fine-tuning concepts (SFT, RFT)

Key Terms

UIS: Unindexed Information Seeking—finding info not captured by search engine crawlers (e.g., in files, dynamic pages)

IIS: Indexed Information Seeking—retrieving info directly from search engine indices/snippets

SFT: Supervised Fine-Tuning—training the model on expert demonstrations

RFT: Rejection Sampling Fine-Tuning—generating multiple outputs, keeping only correct ones, and retraining the model on them

ReAct: Reason+Act—a paradigm where agents generate a reasoning trace before executing an action

Dual-mode browsing: A strategy allowing the agent to switch between reading HTML text and analyzing visual screenshots while maintaining shared memory