
SimpleDoc: Multi-Modal Document Understanding with Dual-Cue Page Retrieval and Iterative Refinement

Chelsi Jain, Yiran Wu, Yifan Zeng, Jiale Liu, S. Dai, Zhenwen Shao, Qingyun Wu, Huazheng Wang
Oregon State University, Pennsylvania State University, AG2AI, Inc., Johnson & Johnson
Conference on Empirical Methods in Natural Language Processing (2025)
MM RAG Agent Memory QA

📝 Paper Summary

Document Visual Question Answering (DocVQA), Retrieval-Augmented Generation (RAG)
SimpleDoc replaces complex multi-agent swarms with a streamlined iterative loop that combines visual page embeddings and semantic summaries to retrieve fewer, highly relevant pages for a single reasoning agent.
Core Problem
Existing multi-modal RAG systems for documents often rely on overly complex multi-agent frameworks that retrieve many irrelevant pages, overwhelming the generation model.
Why it matters:
  • Processing long multi-modal documents (reports, manuals) requires accurate cross-referencing between text, tables, and images across distant pages.
  • Retrieving too many pages increases token costs and introduces noise that confuses Vision Language Models (VLMs), leading to hallucinations.
  • Current state-of-the-art methods like MDocAgent use up to 5 specialized agents, making the pipeline brittle and computationally expensive.
Concrete Example: In a 50-page financial report, a question asks to compare a chart on page 5 with a footnote on page 48. Standard retrieval might fetch pages 5-15 based on visual similarity, missing page 48 entirely. A simple reasoner would fail, while MDocAgent might retrieve 20+ pages to compensate, diluting the context window with noise.
Key Novelty
Dual-Cue Retrieval with Iterative Refinement
  • **Dual-Cue Indexing:** Indexes every page in two ways: as a visual embedding (like an image snapshot) and as a concise textual summary generated by a VLM.
  • **Summary-Based Re-ranking:** Uses the text summaries to filter and re-rank visually retrieved pages before showing them to the reasoner, drastically reducing noise.
  • **Iterative Memory:** A single reasoner agent maintains a working memory; if it cannot answer, it updates the query and memory to retrieve only the missing information in the next loop.
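The three ideas above can be sketched as a small retrieval loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Page`, `retrieve_by_embedding`, `rerank_by_summary`, and `answer_with_refinement` are hypothetical, plain float vectors stand in for visual page embeddings, a keyword-overlap score stands in for the summary-based re-ranking, and the reasoner and query embedder are passed in as callables rather than being real VLM calls.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple


@dataclass
class Page:
    number: int
    embedding: Sequence[float]  # assumption: stand-in for a visual page embedding
    summary: str                # assumption: stand-in for a VLM-generated summary


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_by_embedding(pages: List[Page], query_vec: Sequence[float], k: int) -> List[Page]:
    """Cue 1: take the top-k pages by embedding similarity to the query."""
    ranked = sorted(pages, key=lambda p: cosine(p.embedding, query_vec), reverse=True)
    return ranked[:k]


def rerank_by_summary(candidates: List[Page], query: str, keep: int) -> List[Page]:
    """Cue 2: toy keyword-overlap filter standing in for summary-based re-ranking."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(p.summary.lower().split())), p) for p in candidates]
    scored.sort(key=lambda sp: sp[0], reverse=True)  # stable: ties keep cue-1 order
    return [p for score, p in scored if score > 0][:keep]


# Hypothetical reasoner signature: (question, pages, memory) -> (answer, note, next_query).
# A None answer means "cannot answer yet"; note is appended to memory.
Reasoner = Callable[[str, List[Page], list], Tuple[Optional[str], object, str]]


def answer_with_refinement(
    pages: List[Page],
    question: str,
    embed_query: Callable[[str], Sequence[float]],
    reasoner: Reasoner,
    k: int = 4,
    keep: int = 2,
    max_rounds: int = 3,
) -> Tuple[Optional[str], list]:
    """Retrieve, re-rank, and reason; refine the query until answered or out of rounds."""
    memory: list = []
    query = question
    for _ in range(max_rounds):
        candidates = retrieve_by_embedding(pages, embed_query(query), k)
        selected = rerank_by_summary(candidates, query, keep)
        answer, note, next_query = reasoner(question, selected, memory)
        if answer is not None:
            return answer, memory
        memory.append(note)   # keep what was learned this round
        query = next_query    # ask only for what is still missing
    return None, memory
```

In this sketch the reasoner only ever sees the pages that survive both cues, so the number of pages per round is bounded by `keep`, and the working memory is what lets a later round target the missing evidence instead of repeating the first retrieval.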
Architecture
Figure 2: The overall pipeline of SimpleDoc, illustrating the offline processing (indexing) and the online iterative QA process.
Evaluation Highlights
  • +10.4% accuracy improvement on the LongDocURL benchmark compared to the MDocAgent (top-20) baseline.
  • +3.2% average accuracy gain across 4 datasets (MMLongBench, LongDocURL, PaperTab, FetaTab) while retrieving only ~3.5 pages per query versus 12-20 pages for baselines.
  • Achieves 60.58% accuracy on MMLongBench, outperforming M3DocRAG (41.8%) and MDocAgent (55.3%).
Breakthrough Assessment
7/10
Provides a significant simplification over existing complex multi-agent RAG systems while improving performance. The dual-cue (embedding + summary) approach is a practical, effective engineering contribution.