VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents

📝 Paper Summary

Modularized RAG pipeline, Multi-modal RAG

VisRAG replaces text-based RAG components with vision-language models that process document pages directly as images, eliminating parsing errors and preserving layout information.

Core Problem

Traditional RAG requires parsing multi-modal documents (PDFs) into text, a process that loses visual layout information and introduces OCR errors, degrading retrieval and generation quality.

Why it matters:

Real-world knowledge often exists in complex documents (textbooks, manuals) where text and figures are interleaved, making text-only extraction insufficient
Parsing pipelines involving layout recognition and OCR are prone to cascading errors that cannot be recovered in later stages
Current multi-modal approaches typically rely on pre-extracted image-caption pairs, failing to handle raw document pages where modalities are mixed

Concrete Example: When answering a question about a chart in a PDF, a text-based RAG system might fail to extract the chart's data or caption correctly during parsing, leading the retriever to miss the page entirely or the generator to hallucinate an answer. VisRAG sees the chart pixels directly.

Key Novelty

Dual-Stage Vision-Based RAG (VisRAG)

Treats the document page image as the fundamental unit for both retrieval and generation, bypassing OCR/parsing completely
Uses a VLM (Vision-Language Model) as a dense retriever: each page image is encoded into a single embedding by weighted mean pooling over the model's hidden states
Generates answers using a VLM that reads retrieved page images, employing concatenation or weighted selection to handle multiple pages
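
Weighted selection is not spelled out in this summary. One plausible reading, sketched below as an assumption rather than the authors' implementation, is to generate one candidate answer per retrieved page, weight each answer's generation confidence by the softmax-normalized retrieval score of its page, and keep the best-scoring answer. The function names, confidence values, and retrieval scores are all hypothetical.

```python
import math

def softmax(scores):
    """Normalize raw scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_selection(candidates):
    """Pick one answer from per-page candidates.

    candidates: list of (answer, generation_confidence, retrieval_score),
    one tuple per retrieved page (all values here are illustrative).
    """
    weights = softmax([retrieval for _, _, retrieval in candidates])
    best_answer, best_score = None, float("-inf")
    for (answer, confidence, _), weight in zip(candidates, weights):
        score = confidence * weight
        if score > best_score:
            best_answer, best_score = answer, score
    return best_answer

# Hypothetical candidates: one answer generated from each of three pages.
candidates = [
    ("42%", 0.90, 0.8),
    ("17%", 0.60, 0.7),
    ("n/a", 0.95, 0.1),
]
best = weighted_selection(candidates)
print(best)
```

The third candidate has the highest generation confidence but comes from a poorly ranked page, so its weighted score is lowest. Page concatenation, by contrast, would simply merge the retrieved images into one input and generate once.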

Architecture

Comparison of TextRAG vs. VisRAG pipelines. TextRAG involves PDF parsing, text encoding, and LLM generation. VisRAG encodes document images directly for retrieval and feeds images to a VLM for generation.

Evaluation Highlights

+39.7% relative improvement in average accuracy over the TextRAG baseline on multimodal document QA when using MiniCPM-V 2.6 as the generator
+20% relative improvement in average accuracy over the TextRAG baseline when using GPT-4o as the generator, demonstrating benefits even with powerful closed-source models
VisRAG-Ret (vision retriever) outperforms state-of-the-art text retrievers (BGE, GTE) and vision retrievers (SigLIP) on diverse benchmarks like InfographicsVQA and SlideVQA

Breakthrough Assessment

8/10

Strong conceptual shift from text parsing to pure-vision processing for RAG. The significant performance gains justify the approach, though the computational cost of processing images is a potential hurdle.

⚙️ Technical Details

Problem Definition

Setting: Retrieval-Augmented Generation over a corpus of multi-modal document pages

Inputs: Natural language query q and a corpus of document images D

Outputs: Generated textual answer a

Pipeline Flow

Group: Retrieval
VisRAG-Ret (Encodes Query Text & Document Images) → Top-k Pages
Group: Generation
VisRAG-Gen (Takes Query + Top-k Images) → Answer
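
The retrieval half of this flow can be sketched as a nearest-neighbor search over precomputed page embeddings. The example below is a minimal illustration, not the authors' code: it assumes cosine similarity as the scoring function, and `retrieve_top_k`, the page identifiers, and the toy 3-dimensional embeddings are hypothetical stand-ins for retriever outputs.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def retrieve_top_k(query_emb, page_embs, k):
    """Return the ids of the k pages most similar to the query, best first."""
    ranked = sorted(
        page_embs,
        key=lambda page_id: cosine(query_emb, page_embs[page_id]),
        reverse=True,
    )
    return ranked[:k]

# Toy embeddings standing in for encoded page images (hypothetical values).
page_embs = {
    "page_1": [0.9, 0.1, 0.0],
    "page_2": [0.0, 1.0, 0.0],
    "page_3": [0.7, 0.7, 0.0],
}
query_emb = [1.0, 0.0, 0.0]  # stands in for the encoded query text

top_pages = retrieve_top_k(query_emb, page_embs, k=2)
print(top_pages)
```

In the full pipeline, the images behind `top_pages` would then be passed to the generator together with the query text.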

System Modules

VisRAG-Ret

Retrieve relevant document images based on text query

Model or implementation: MiniCPM-V 2.0 (initialized from SigLIP-400M and MiniCPM-2.4B)

VisRAG-Gen

Generate answer from query and retrieved images

Model or implementation: MiniCPM-V 2.6 (supports multi-image input)

Novel Architectural Elements

Replacement of text encoder with VLM-based image encoder for document representation in dense retrieval
Direct injection of raw document images into the generator context, bypassing intermediate text representation

Modeling

Base Model: MiniCPM-V 2.0 (Retriever), MiniCPM-V 2.6 (Generator)

Training Method: Contrastive Learning (InfoNCE) for Retriever

Objective Functions:

Purpose: Optimize retriever to align query text embeddings with relevant document image embeddings.

Formally: InfoNCE loss minimizing negative log-likelihood of positive pairs relative to in-batch negatives.
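
One common way to write this objective is shown below. The notation is introduced here for illustration and may differ from the paper's: s(q, d) is the similarity between query and page embeddings, τ is the temperature, d⁺ is the positive page for query q, and D⁻ is the set of in-batch negatives.

```latex
\mathcal{L}(q, d^{+}, D^{-}) = -\log
\frac{\exp\left(s(q, d^{+}) / \tau\right)}
     {\exp\left(s(q, d^{+}) / \tau\right) + \sum_{d^{-} \in D^{-}} \exp\left(s(q, d^{-}) / \tau\right)}
```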

Training Data:

Synthesized query-document pairs from web-crawled PDFs (367k pairs)
Open-source VQA datasets (MP-DocVQA, ArXivQA, ChartQA, etc.)
Filtered context-dependent queries using GPT-4o

Key Hyperparameters:

learning_rate: 2e-5 (Retriever)
batch_size: 128
epochs: 3
max_length: 512
temperature: 0.02
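
To make the role of the temperature concrete, the sketch below computes an in-batch InfoNCE loss in plain Python. It is an illustration under assumptions, not the training code: the similarity matrix is made up, the diagonal is assumed to hold the positive pairs, and `info_nce` is a hypothetical helper.

```python
import math

def info_nce(sim_matrix, temperature):
    """Mean in-batch InfoNCE loss.

    sim_matrix[i][j] is the similarity between query i and page j;
    the diagonal entries are the positive pairs, and every other
    entry in a row is an in-batch negative for that query.
    """
    losses = []
    for i, row in enumerate(sim_matrix):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for numerical stability
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

# Made-up similarities for a batch of three query-page pairs.
sims = [
    [0.80, 0.30, 0.10],
    [0.20, 0.75, 0.25],
    [0.15, 0.35, 0.70],
]

loss_sharp = info_nce(sims, temperature=0.02)
loss_soft = info_nce(sims, temperature=1.0)
print(loss_sharp, loss_soft)
```

A small temperature such as 0.02 scales the similarity gaps up before the softmax, so pairs that are already well separated contribute almost no loss and the gradient concentrates on hard negatives.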

Compute: Retriever training used 8x H800 GPUs for ~7 hours. Inference cost is not reported in detail, but the paper notes higher latency than text-only pipelines due to vision encoding.

Comparison to Prior Work

vs. TextRAG: VisRAG eliminates parsing; processes images directly
vs. ColPali [not cited in paper]: ColPali uses late interaction (ColBERT-style) on patches; VisRAG uses dense embedding (bi-encoder) on full images. VisRAG also evaluates end-to-end generation.
vs. UniIR: VisRAG focuses on document pages with interleaved text/images rather than clean image-caption pairs
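
The difference between dense (bi-encoder) scoring and ColBERT-style late interaction can be shown with a small sketch. The vectors below are toy values and the function names are hypothetical; neither function is taken from the VisRAG or ColPali codebases.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def dense_score(query_vec, page_vec):
    """Bi-encoder scoring: one vector per query and one per page."""
    return dot(query_vec, page_vec)

def late_interaction_score(query_token_vecs, page_patch_vecs):
    """ColBERT-style MaxSim: each query token keeps the score of its
    best-matching page patch, and those maxima are summed."""
    return sum(
        max(dot(q, p) for p in page_patch_vecs)
        for q in query_token_vecs
    )

# Toy inputs: two query-token vectors and three page-patch vectors.
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
page_patches = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]

single = dense_score([0.5, 0.5], [0.6, 0.4])
multi = late_interaction_score(query_tokens, page_patches)
print(single, multi)
```

The dense variant stores a single vector per page, which keeps the index small; late interaction stores one vector per patch and trades index size for finer-grained matching.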

Limitations

High computational cost for encoding and processing images compared to text embeddings
Performance depends on the resolution and visual encoding capabilities of the VLM
Currently limited by the context window of VLMs when handling many retrieved pages

Reproducibility

Code: https://github.com/openbmb/visrag

Code and data are available at the repository above. The pipeline uses open-source MiniCPM-V models; synthetic data generation relied on GPT-4o (closed source).

📊 Experiments & Results

Evaluation Setup

End-to-end RAG on multimodal document datasets

Benchmarks:

MP-DocVQA (Industrial document QA)
InfographicsVQA (Infographic QA)
ArXivQA (Scientific paper figure QA)
ChartQA, PlotQA (Chart and Plot QA)
SlideVQA (Multi-hop presentation slide QA)

Metrics:

Recall@10
MRR@10
Accuracy (relaxed exact match)
Statistical methodology: Not explicitly reported in the paper
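
The two retrieval metrics can be computed as in the sketch below. The ranked lists and relevance sets are invented for illustration, and the helper names are hypothetical.

```python
def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of relevant items that appear in the top-k results."""
    hits = sum(1 for page_id in ranked_ids[:k] if page_id in relevant_ids)
    return hits / len(relevant_ids)

def reciprocal_rank_at_k(ranked_ids, relevant_ids, k=10):
    """1/rank of the first relevant item in the top-k, or 0.0 if none."""
    for rank, page_id in enumerate(ranked_ids[:k], start=1):
        if page_id in relevant_ids:
            return 1.0 / rank
    return 0.0

# Each entry: (ranked retrieval output, set of relevant page ids).
queries = [
    (["p3", "p1", "p7"], {"p1"}),  # relevant page found at rank 2
    (["p2", "p9"], {"p4"}),        # relevant page not retrieved
]

mean_recall = sum(recall_at_k(r, rel) for r, rel in queries) / len(queries)
mrr = sum(reciprocal_rank_at_k(r, rel) for r, rel in queries) / len(queries)
print(mean_recall, mrr)
```

Both values are averaged over queries: Recall@10 rewards finding the relevant page anywhere in the top 10, while MRR@10 also rewards ranking it near the top.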

Key Results

Retrieval: VisRAG-Ret outperforms text-only and other vision-based baselines across most datasets.

Benchmark	Metric	Baseline	This Paper	Δ
Average across all datasets	Recall@10	46.2	56.4	+10.2
InfographicsVQA	Recall@10	26.3	48.2	+21.9

Generation: the VisRAG pipeline outperforms the TextRAG pipeline regardless of the generator model used.

Benchmark	Metric	Baseline	This Paper	Δ
Average across all datasets (MiniCPM-V 2.6 generator)	Accuracy	34.5	48.2	+13.7
Average across all datasets (GPT-4o generator)	Accuracy	45.0	54.0	+9.0
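
The percentage gains quoted in the evaluation highlights appear to be relative improvements computed from the two average-accuracy rows. This reading is an inference from the numbers rather than something the table states:

```latex
\frac{48.2 - 34.5}{34.5} \approx 0.397 \;\; (+39.7\%), \qquad
\frac{54.0 - 45.0}{45.0} = 0.200 \;\; (+20\%)
```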

Experiment Figures

Bar chart comparing TextRAG and VisRAG end-to-end accuracy using different generators (MiniCPM-V 2.6 and GPT-4o).

Performance curves scaling with the number of retrieved documents (k) for TextRAG vs VisRAG.

Main Takeaways

VisRAG consistently outperforms TextRAG, with larger gains on visually intensive datasets (InfographicsVQA) compared to text-heavy ones (MP-DocVQA).
The 'cascade effect' of parsing errors in TextRAG is eliminated; VisRAG preserves information by keeping documents as images.
Multi-image generation capability (MiniCPM-V 2.6) allows VisRAG to benefit from retrieving more pages (top-k=3 vs top-k=1), unlike single-image models which saturate or degrade.
VisRAG exhibits strong data efficiency, outperforming text baselines even when trained on significantly less data (e.g., vs BGE-m3 trained on massive corpora).

📚 Prerequisite Knowledge

Prerequisites

Understanding of Retrieval-Augmented Generation (RAG) pipelines
Familiarity with Vision-Language Models (VLMs) like MiniCPM-V or CLIP
Basic knowledge of dense retrieval and contrastive learning (InfoNCE loss)

Key Terms

VLM: Vision-Language Model—a model capable of processing both image and text inputs to generate text or embeddings

VisRAG-Ret: The retriever component of VisRAG that encodes queries (text) and documents (images) into a shared embedding space

VisRAG-Gen: The generator component of VisRAG that takes the query and retrieved document images as input to generate an answer

InfoNCE loss: A contrastive loss function used to train the retriever to pull positive query-document pairs closer and push negatives apart

OCR: Optical Character Recognition—technology to convert images of text into machine-encoded text

TextRAG: The traditional RAG pipeline that relies on parsing documents into text segments for retrieval and generation

weighted mean pooling: A pooling strategy that averages a variable-length sequence of token hidden states into a single embedding, assigning higher weights to later tokens; in a causal model, later positions have attended to more of the input
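
A minimal sketch of this pooling follows, assuming weights that grow linearly with token position (token i of S gets weight i / (1 + 2 + ... + S)). The exact weighting is an assumption here, and the function name and toy hidden states are hypothetical.

```python
def position_weighted_mean_pool(hidden_states):
    """Pool token vectors into one vector.

    Token i (1-indexed) gets weight i / (1 + 2 + ... + S), so later
    tokens contribute more and the weights sum to 1.
    """
    seq_len = len(hidden_states)
    total = seq_len * (seq_len + 1) / 2
    dim = len(hidden_states[0])
    pooled = [0.0] * dim
    for position, vec in enumerate(hidden_states, start=1):
        weight = position / total
        for d in range(dim):
            pooled[d] += weight * vec[d]
    return pooled

# Three toy token vectors of dimension 2; weights are 1/6, 2/6, 3/6.
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
embedding = position_weighted_mean_pool(hidden)
print(embedding)
```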

MRR@10: Mean Reciprocal Rank at 10—a measure of retrieval quality based on the rank of the first relevant document

Recall@10: The proportion of relevant documents found in the top-10 retrieved results