SimpleDoc: Multi-Modal Document Understanding with Dual-Cue Page Retrieval and Iterative Refinement

📝 Paper Summary

Document Visual Question Answering (DocVQA), Retrieval-Augmented Generation (RAG)

SimpleDoc replaces complex multi-agent swarms with a streamlined iterative loop that combines visual page embeddings and semantic summaries to retrieve fewer, highly relevant pages for a single reasoning agent.

Core Problem

Existing multi-modal RAG systems for documents often rely on overly complex multi-agent frameworks that retrieve too many irrelevant pages, overwhelming the generation model.

Why it matters:

Processing long multi-modal documents (reports, manuals) requires accurate cross-referencing between text, tables, and images across distant pages.
Retrieving too many pages increases token costs and introduces noise that confuses Vision Language Models (VLMs), leading to hallucinations.
Current state-of-the-art methods like MDocAgent use up to 5 specialized agents, making the pipeline brittle and computationally expensive.

Concrete Example: In a 50-page financial report, a question asks to compare a chart on page 5 with a footnote on page 48. Standard retrieval might fetch pages 5-15 based on visual similarity, missing page 48 entirely. A simple reasoner would fail, while MDocAgent might retrieve 20+ pages to compensate, diluting the context window with noise.

Key Novelty

Dual-Cue Retrieval with Iterative Refinement

**Dual-Cue Indexing:** Indexes every page in two ways: as a visual embedding (like an image snapshot) and as a concise textual summary generated by a VLM.
**Summary-Based Re-ranking:** Uses the text summaries to filter and re-rank visually retrieved pages before showing them to the reasoner, drastically reducing noise.
**Iterative Memory:** A single reasoner agent maintains a working memory; if it cannot answer, it updates the query and memory to retrieve only the missing information in the next loop.

Architecture

The overall pipeline of SimpleDoc, illustrating the offline processing (indexing) and the online iterative QA process.

Evaluation Highlights

+10.4 percentage-point accuracy improvement on the LongDocURL benchmark compared to the MDocAgent (top-20) baseline.
+3.2 percentage-point average accuracy gain across 4 datasets (MMLongBench, LongDocURL, PaperTab, FetaTab) while retrieving only ~3.5 pages per query versus 12-20 pages for baselines.
Achieves 60.58% accuracy on MMLongBench, outperforming M3DocRAG (41.8%) and MDocAgent (55.3%).

Breakthrough Assessment

7/10

Provides a significant simplification over existing complex multi-agent RAG systems while improving performance. The dual-cue (embedding + summary) approach is a practical, effective engineering contribution.

⚙️ Technical Details

Problem Definition

Setting: Multi-page, multi-modal Document Visual Question Answering (DocVQA)

Inputs: A multi-page document D (PDF containing text, images, tables) and a natural language query q

Outputs: A natural language answer or 'Not Answerable'

Pipeline Flow

Offline Processing: Group 1: Page Indexing
Online Inference: Group 2: Retrieval & Re-ranking → Group 3: Reasoning & Iteration

System Modules

Visual Encoder (Page Indexing, Offline)

Generate dense visual embeddings for every page in the document

Model or implementation: ColQwen-2.5 (based on ColPali strategy)

Page Summarizer (Page Indexing, Offline)

Generate a concise semantic summary (3-5 sentences) for every page

Model or implementation: Qwen2.5-VL-32B-Instruct
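The two offline modules above produce one record per page holding both cues. A minimal sketch of such an index follows. The names `PageEntry`, `build_index`, `embed_page`, and `summarize_page` are hypothetical and not taken from the SimpleDoc code; the stand-in functions only mimic the shape of a multi-vector encoder and a VLM summarizer so the example runs without any model.

```python
from dataclasses import dataclass
from typing import Callable, List

# Multi-vector embedding: one vector per visual patch of the page (assumed shape).
MultiVector = List[List[float]]


@dataclass
class PageEntry:
    page_number: int        # 1-based page number within the document
    embedding: MultiVector  # visual cue, e.g. from a ColQwen-style encoder
    summary: str            # semantic cue, a short VLM-written summary


def build_index(
    pages: List[bytes],
    embed_page: Callable[[bytes], MultiVector],  # hypothetical encoder wrapper
    summarize_page: Callable[[bytes], str],      # hypothetical VLM wrapper
) -> List[PageEntry]:
    """Run both offline passes once per page and store the two cues together."""
    return [
        PageEntry(page_number=i, embedding=embed_page(img), summary=summarize_page(img))
        for i, img in enumerate(pages, start=1)
    ]


# Stand-in models (assumptions for illustration only).
def fake_embed(img: bytes) -> MultiVector:
    return [[float(len(img)), 1.0]]


def fake_summarize(img: bytes) -> str:
    return f"Page with {len(img)} bytes of content."


index = build_index([b"abc", b"defgh"], fake_embed, fake_summarize)
```

Keeping both cues in one record means the online stage can look up a page's summary directly from the page numbers returned by embedding search.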

Dual-Cue Retriever

Retrieve top-k pages via embeddings, then re-rank and filter using summaries

Model or implementation: LLM (can be text-only)
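The two-stage retrieval described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: `dual_cue_retrieve` and the `rerank` callable are hypothetical names, and `keyword_rerank` is a toy word-overlap stand-in for the LLM re-ranker so the example runs offline.

```python
from typing import Callable, Dict, List


def dual_cue_retrieve(
    query: str,
    visual_scores: Dict[int, float],  # page number -> embedding similarity
    summaries: Dict[int, str],        # page number -> offline summary
    rerank: Callable[[str, Dict[int, str]], List[int]],  # hypothetical LLM call
    k: int = 10,
) -> List[int]:
    # Stage 1: keep the k pages with the highest visual similarity.
    top_k = sorted(visual_scores, key=visual_scores.get, reverse=True)[:k]
    # Stage 2: show only the candidates' summaries to the re-ranker,
    # which may reorder them and drop irrelevant ones.
    candidates = {page: summaries[page] for page in top_k}
    return rerank(query, candidates)


def keyword_rerank(query: str, candidates: Dict[int, str]) -> List[int]:
    """Toy stand-in for the LLM: rank by word overlap, drop pages with none."""
    query_words = set(query.lower().split())
    scored = [
        (len(query_words & set(text.lower().split())), page)
        for page, text in candidates.items()
    ]
    kept = [(overlap, page) for overlap, page in scored if overlap > 0]
    return [page for overlap, page in sorted(kept, key=lambda t: (-t[0], t[1]))]


visual_scores = {1: 0.2, 5: 0.9, 6: 0.8, 48: 0.7}
summaries = {
    1: "Cover page and title",
    5: "Revenue chart by quarter",
    6: "Photo of headquarters",
    48: "Footnote explaining revenue adjustments",
}
selected = dual_cue_retrieve(
    "revenue chart footnote", visual_scores, summaries, keyword_rerank, k=3
)
print(selected)  # [5, 48]
```

In this toy run the visually similar but irrelevant page 6 is filtered out at the second stage, which is the noise-reduction effect the summaries are meant to provide.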

Reasoner Agent

Read selected pages and memory to answer, give up, or request new info

Model or implementation: Qwen2.5-VL-32B-Instruct
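The reasoner's three possible outcomes (answer, give up, request new information) suggest a loop like the one below. It is a sketch only: `Step`, `answer_question`, the `retrieve` and `reason` callables, the default of 3 iterations, and the choice to skip already-read pages are all assumptions made for illustration, and the fake functions replace real VLM calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Step:
    answer: Optional[str] = None     # set when the reasoner commits to an answer
    new_query: Optional[str] = None  # set when more information is needed
    memory_note: str = ""            # what to remember from the pages just read


def answer_question(
    question: str,
    retrieve: Callable[[str], List[int]],
    reason: Callable[[str, List[int], List[str]], Step],
    max_iterations: int = 3,  # assumption: the summary does not report this value
) -> Tuple[str, List[int]]:
    memory: List[str] = []  # working memory carried across iterations
    seen: List[int] = []    # pages already shown to the reasoner
    query = question
    for _ in range(max_iterations):
        # Assumption: pages already read are not fetched again.
        pages = [p for p in retrieve(query) if p not in seen]
        seen.extend(pages)
        step = reason(question, pages, memory)
        if step.memory_note:
            memory.append(step.memory_note)
        if step.answer is not None:
            return step.answer, seen
        query = step.new_query or question
    return "Not Answerable", seen


# Stand-ins for the retriever and the VLM reasoner.
def fake_retrieve(query: str) -> List[int]:
    return [48] if "footnote" in query else [5]


def fake_reason(question: str, pages: List[int], memory: List[str]) -> Step:
    if 48 in pages and memory:
        return Step(answer="Chart and footnote agree.")
    return Step(
        new_query="footnote on revenue adjustments",
        memory_note="Page 5: revenue chart values noted.",
    )


answer, pages_read = answer_question(
    "Does the revenue chart match the adjustment note?", fake_retrieve, fake_reason
)
print(answer, pages_read)  # Chart and footnote agree. [5, 48]
```

Only two pages are read in this run: the first pass finds the chart, and the refined query fetches the single missing page.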

Novel Architectural Elements

Dual-cue retrieval mechanism: Combining dense visual embeddings (MaxSim) with LLM-based semantic re-ranking over pre-generated summaries.
Lightweight iterative loop: A single reasoner agent that explicitly manages a 'working memory' and generates follow-up queries, replacing complex multi-agent swarms.
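For readers unfamiliar with MaxSim, the ColBERT-style late-interaction score sums, over the query vectors, each one's best match among the page vectors. The sketch below uses plain Python lists and dot-product similarity; the function names and the toy vectors are illustrative and not drawn from the paper.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_vecs: List[Vector], page_vecs: List[Vector]) -> float:
    """Sum, over query vectors, of the maximum dot product with any page vector."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[1.0, 0.0], [0.5, 0.5]]
page_b = [[0.0, 1.0], [0.0, 0.5]]

print(maxsim(query, page_a))  # 1.5
print(maxsim(query, page_b))  # 1.0
```

Page A scores higher because it has a good match for both query vectors, while page B matches only the second one.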

Modeling

Base Model: Qwen2.5-VL-32B-Instruct (used for summarization and reasoning). ColQwen-2.5 is a separate retrieval model from the Qwen2.5-VL family, not the 32B checkpoint.

Training Method: Inference-only framework using pre-trained models

Key Hyperparameters:

max_iterations_L: Not explicitly reported in the paper (implied small, e.g., 3-5)
retrieval_k: Top-k retrieved by embeddings (e.g., 10 or 30 in experiments)
temperature: 0 (for evaluation)

Compute: Requires VLM inference (Qwen2.5-VL-32B). Offline indexing requires one pass per document page.
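The settings listed above could be gathered in a small configuration object. This is a hypothetical sketch: `SimpleDocConfig` is not a class from the repository, and the `max_iterations` default of 3 is a placeholder assumption because the value is not reported.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimpleDocConfig:
    retrieval_k: int = 10      # top-k by embeddings; 10 or 30 appear in experiments
    temperature: float = 0.0   # deterministic decoding for evaluation
    max_iterations: int = 3    # assumption: not reported in the paper

    def __post_init__(self) -> None:
        # Reject values that would make the retrieval loop meaningless.
        if self.retrieval_k < 1:
            raise ValueError("retrieval_k must be at least 1")
        if self.max_iterations < 1:
            raise ValueError("max_iterations must be at least 1")
        if self.temperature < 0:
            raise ValueError("temperature must be non-negative")


default_cfg = SimpleDocConfig()
wide_cfg = SimpleDocConfig(retrieval_k=30)
```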

Comparison to Prior Work

vs. MDocAgent: SimpleDoc uses a single iterative loop instead of 5 specialized agents, reducing complexity and retrieved page count while improving accuracy.
vs. M3DocRAG: SimpleDoc adds a semantic re-ranking step (using summaries) and an iterative refinement loop, whereas M3DocRAG is a single-pass retrieve-and-generate pipeline.
vs. VisRAG [not cited in paper]: SimpleDoc explicitly incorporates semantic text summaries for re-ranking, whereas VisRAG relies primarily on visual embeddings.

Limitations

Dependency on VLM quality: The framework relies heavily on Qwen2.5-VL's ability to summarize and reason accurately.
Offline Indexing Cost: Generating a summary for every page in a large corpus using a 32B VLM can be computationally expensive upfront.
Latency: The iterative nature means difficult questions may require multiple sequential VLM calls, increasing inference time compared to single-pass RAG.

Reproducibility

Code: https://github.com/ag2ai/SimpleDoc

The method relies on open-weights models (Qwen2.5-VL, ColQwen-2.5). Exact prompt templates for the summarizer and reasoner are not detailed in the main text but are likely available in the repository.

📊 Experiments & Results

Evaluation Setup

Zero-shot document visual question answering across four datasets.

Benchmarks:

MMLongBench (Long-context multi-modal document reasoning)
LongDocURL (Large-scale document retrieval and reasoning)
PaperTab (Tabular data extraction from academic papers)
FetaTab (Table-based QA from Wikipedia)

Metrics:

Binary Correctness (Accuracy)
Number of Retrieved Pages
Statistical methodology: Not explicitly reported in the paper

Key Results

Comparative analysis against state-of-the-art multi-agent and RAG baselines. SimpleDoc leads on three of the four benchmarks and trails on FetaTab.
Benchmark	Metric	Baseline	This Paper	Δ
MMLongBench	Accuracy	54.8	60.58	+5.78
LongDocURL	Accuracy	61.9	72.30	+10.40
PaperTab	Accuracy	63.1	65.39	+2.29
FetaTab	Accuracy	84.1	82.19	-1.91
Average across 4 datasets	Accuracy	66.9	70.12	+3.22
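The per-benchmark Δ column is the reported accuracy minus the baseline accuracy, in percentage points. The short check below recomputes it from the four benchmark rows; the dictionary layout and variable names are illustrative.

```python
# (baseline accuracy, SimpleDoc accuracy) per benchmark, copied from the table.
results = {
    "MMLongBench": (54.8, 60.58),
    "LongDocURL": (61.9, 72.30),
    "PaperTab": (63.1, 65.39),
    "FetaTab": (84.1, 82.19),
}

# Delta in percentage points, rounded to two decimals as in the table.
deltas = {name: round(ours - base, 2) for name, (base, ours) in results.items()}

# Benchmarks where SimpleDoc is ahead of the baseline.
wins = [name for name, delta in deltas.items() if delta > 0]

for name, delta in deltas.items():
    print(f"{name}: {delta:+.2f}")
print("Ahead on:", wins)
```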

Main Takeaways

Simplicity can outperform complexity: A single-agent iterative loop beats a 5-agent swarm (MDocAgent) on most benchmarks by focusing on high-quality retrieval rather than quantity, although the baseline stays ahead on FetaTab.
Dual-Cue retrieval is highly effective: Combining visual embeddings (good for layout/figures) with semantic summaries (good for dense text) significantly boosts retrieval precision.
Iterative refinement allows the model to 'read' the document more naturally, pulling in new pages only when necessary, which keeps the context window clean.
The method works best on long, complex documents (LongDocURL, MMLongBench) where standard retrieval fails to capture cross-page dependencies.

📚 Prerequisite Knowledge

Prerequisites

Retrieval-Augmented Generation (RAG)
Vision Language Models (VLMs)
Vector Embeddings
Document Layout Analysis

Key Terms

DocVQA: Document Visual Question Answering—answering questions based on visual documents like PDFs

ColPali: A model architecture that treats document pages as images to generate multi-vector embeddings for retrieval

VLM: Vision Language Model—an AI model capable of processing and reasoning over both images and text

RAG: Retrieval-Augmented Generation—fetching relevant data from external sources to ground LLM responses

MaxSim: A late-interaction scoring function (from ColBERT) that takes, for each query vector, its maximum similarity over all page vectors and sums these maxima to score the page

Zero-shot: Using a pre-trained model to perform a task without any specific training examples for that task