MRAMG-Bench: A BeyondText Benchmark for Multimodal Retrieval-Augmented Multimodal Generation

📝 Paper Summary

Multimodal RAG, Benchmark datasets, Evaluation methodology

MRAMG-Bench establishes a rigorous evaluation framework for Multimodal Retrieval-Augmented Multimodal Generation, requiring models to autonomously select, order, and interleave images with text to answer complex queries.

Core Problem

Existing RAG methods typically retrieve multimodal data but generate text-only answers, failing to leverage visual information directly in the response.

Why it matters:

Users often prefer visual answers ('show, don't tell') for tasks like recipe instructions or identifying objects, where text alone is insufficient
Current benchmarks lack support for evaluating integrated text-image generation, relying on subjective or inconsistent metrics
MLLMs frequently hallucinate when describing visual content, whereas presenting the original image avoids description errors

Concrete Example: When asked 'What does a cat look like?', a text description is less effective than a photograph. Similarly, a step-by-step recipe is much clearer when text instructions are interleaved with images of the preparation steps, which current text-only RAG systems cannot produce.

Key Novelty

MRAMG-Bench: A benchmark for generating interleaved text-and-image answers

Defines the MRAMG task: generating answers that seamlessly integrate text and images retrieved from a multimodal corpus
Constructs a high-quality human-annotated dataset where models must decide which images to use, how many to use, and where to place them in the text
Introduces a statistically grounded evaluation framework that assesses both retrieval accuracy and the quality of multimodal generation (ordering, relevance, coherence)

Architecture

The data construction pipeline for MRAMG-Bench, showing the flow from raw data to the final benchmark

Evaluation Highlights

Benchmarked 11 popular generative models, revealing significant gaps in current MLLM capabilities for interleaved generation
Introduced a new dataset comprising 4,800 QA pairs and 14,190 images across diverse domains (Web, Academia, Lifestyle)
Proposed a unified generation framework allowing both LLMs and MLLMs to perform the MRAMG task via rule-based or model-based image insertion

Breakthrough Assessment

9/10

First comprehensive benchmark specifically for RAG systems that generate multimodal outputs (text + image), addressing a critical gap in current evaluation methodologies.

⚙️ Technical Details

Problem Definition

Setting: Multimodal Retrieval-Augmented Multimodal Generation (MRAMG)

Inputs: Text query q and a multimodal knowledge base D = {d1, ..., dn} where documents contain interleaved text and images

Outputs: Multimodal answer A containing interleaved generated text and selected images from the retrieved documents
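The inputs and outputs above can be pictured as plain data structures. The following is a minimal sketch only: the class names, the `image_id` scheme, and the `images_are_grounded` check are illustrative assumptions, not types defined by the paper or its repository.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass(frozen=True)
class TextSpan:
    """A run of text, either taken from a document or generated."""
    text: str


@dataclass(frozen=True)
class ImageRef:
    """Reference to an image in the knowledge base (hypothetical ID scheme)."""
    image_id: str


# An interleaved sequence mixes text and images in a fixed order.
Segment = Union[TextSpan, ImageRef]


@dataclass
class Document:
    """One element d_i of the knowledge base D."""
    doc_id: str
    segments: List[Segment] = field(default_factory=list)

    def image_ids(self) -> List[str]:
        return [s.image_id for s in self.segments if isinstance(s, ImageRef)]


@dataclass
class Answer:
    """Multimodal answer A: generated text interleaved with selected images."""
    segments: List[Segment] = field(default_factory=list)

    def image_ids(self) -> List[str]:
        return [s.image_id for s in self.segments if isinstance(s, ImageRef)]


def images_are_grounded(answer: Answer, retrieved: List[Document]) -> bool:
    """True if every image in the answer comes from a retrieved document."""
    available = {img for doc in retrieved for img in doc.image_ids()}
    return all(img in available for img in answer.image_ids())


doc = Document("recipe-1", [TextSpan("Whisk the eggs."), ImageRef("img_a"),
                            TextSpan("Fold in the flour."), ImageRef("img_b")])
answer = Answer([TextSpan("First, whisk the eggs:"), ImageRef("img_a")])
```

Keeping text and images in one ordered list makes image placement part of the answer itself, which is what the task asks models to decide.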

Pipeline Flow

Multimodal Retrieval (retrieve top-k multimodal documents)
Multimodal Answer Generation (generate text and select/insert images)
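The two stages above can be sketched as a retrieve-then-generate function. This is a toy illustration under stated assumptions: the token-overlap scorer stands in for whatever retriever a real system uses, and `stub_generate` is a hypothetical placeholder for an LLM or MLLM call. Neither is the paper's implementation.

```python
from typing import Any, Callable, Dict, List, Tuple

Doc = Dict[str, Any]          # {"id": str, "text": str, "images": List[str]}
Segment = Tuple[str, str]     # ("text", ...) or ("image", ...)


def overlap_score(query: str, doc: Doc) -> int:
    """Toy relevance score: distinct query tokens that appear in the document text."""
    q_tokens = set(query.lower().split())
    d_tokens = set(doc["text"].lower().split())
    return len(q_tokens & d_tokens)


def retrieve_top_k(query: str, corpus: List[Doc], k: int) -> List[Doc]:
    """Stage 1: return the k highest-scoring documents (ties keep corpus order)."""
    ranked = sorted(corpus, key=lambda d: overlap_score(query, d), reverse=True)
    return ranked[:k]


def run_pipeline(query: str, corpus: List[Doc], k: int,
                 generate: Callable[[str, List[Doc]], List[Segment]]) -> List[Segment]:
    """Stage 2 is delegated to `generate`, a stand-in for the generator model."""
    context = retrieve_top_k(query, corpus, k)
    return generate(query, context)


def stub_generate(query: str, context: List[Doc]) -> List[Segment]:
    """Hypothetical generator: cites the top document and attaches its first image."""
    top = context[0]
    return [("text", f"Answer based on {top['id']}."), ("image", top["images"][0])]


corpus = [
    {"id": "d1", "text": "how to bake sourdough bread at home", "images": ["bread.jpg"]},
    {"id": "d2", "text": "history of the bicycle", "images": ["bike.jpg"]},
    {"id": "d3", "text": "bread knife sharpening guide", "images": ["knife.jpg"]},
]
result = run_pipeline("how to bake bread", corpus, k=2, generate=stub_generate)
```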

System Modules

Retriever

Retrieve relevant multimodal documents based on the query

Model or implementation: Not explicitly specified. The evaluation targets the generative models and assumes retrieval is either provided or handled as a separate part of the system.

Generator

Generate the answer text and select appropriate images to insert at specific positions

Model or implementation: Evaluated on 11 models (e.g., GPT-4o, diverse MLLMs)

Novel Architectural Elements

Proposed generation framework: Integration of rule-based and model-based approaches to allow text-only LLMs to perform multimodal generation by outputting image placeholders/IDs
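A rule-based resolution step for image placeholders might look like the sketch below. The `<imgN>` placeholder syntax, the `resolve_placeholders` helper, and the policy of dropping unknown indices are assumptions made for illustration; the summary does not state the exact format the framework uses.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical placeholder syntax emitted by a text-only LLM, e.g. "<img2>".
PLACEHOLDER = re.compile(r"<img(\d+)>")


def resolve_placeholders(output: str, images: Dict[int, str]) -> List[Tuple[str, str]]:
    """Split model output into ordered ("text", ...) and ("image", ...) segments."""
    segments: List[Tuple[str, str]] = []
    cursor = 0
    for match in PLACEHOLDER.finditer(output):
        text = output[cursor:match.start()].strip()
        if text:
            segments.append(("text", text))
        index = int(match.group(1))
        if index in images:
            # Placeholders that point at no candidate image are silently dropped.
            segments.append(("image", images[index]))
        cursor = match.end()
    tail = output[cursor:].strip()
    if tail:
        segments.append(("text", tail))
    return segments


llm_output = ("Whisk the eggs. <img1> Then fold in the flour. <img3> "
              "Bake for 20 minutes. <img2>")
candidates = {1: "whisk.jpg", 2: "oven.jpg"}
segments = resolve_placeholders(llm_output, candidates)
```

Because the model only emits identifiers, a text-only LLM can still control which images appear and where, while the original image files are attached afterwards.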

Modeling

Base Model: Benchmarked 11 generative models (including GPT-4o)

Datasets (the six MRAMG-Bench subsets, used for evaluation rather than training):

MRAMG-Wit (Web domain)
MRAMG-Wiki (Web domain)
MRAMG-Web (Web domain)
MRAMG-Arxiv (Academia domain)
MRAMG-Recipe (Lifestyle domain)
MRAMG-Manual (Lifestyle domain)

Compute: Not reported in the paper

Comparison to Prior Work

vs. MuRAG: MRAMG generates interleaved text-image answers, not just text
vs. WebQA: MRAMG requires the model to output images as part of the answer, whereas WebQA is text-output only
vs. M2RAG: MRAMG-Bench is significantly larger (4,800 vs. 200 QA pairs), covers additional domains (Academia, Lifestyle), and uses a single-pass generation framework rather than multi-stage calls

Limitations

Evaluation relies partially on LLM-based metrics which may have inherent biases
Heavy reliance on GPT-4o for parts of the data construction pipeline, although the outputs are human-verified
Retriever performance is not the focus of the paper, so retrieval quality may act as an unexamined bottleneck in the end-to-end performance analysis

Reproducibility

Code: https://github.com/MRAMG-Bench/MRAMG

Datasets and evaluation code are publicly available at https://github.com/MRAMG-Bench/MRAMG. The benchmark includes 4,800 QA pairs, 4,346 documents, and 14,190 images.

📊 Experiments & Results

Evaluation Setup

Multimodal generation given retrieved context

Benchmarks:

MRAMG-Bench (Multimodal Retrieval-Augmented Multimodal Generation) [New]

Metrics:

Statistical metrics (Recall, Precision, F1 for image selection)
LLM-based metrics (Relevance, Coherence, Image-Text Alignment)
Statistical methodology: Not explicitly reported in the paper
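The statistical metrics for image selection can be sketched as set-based precision, recall, and F1 over image identifiers. This is a generic formulation, not necessarily the benchmark's exact definition; in particular, returning 0.0 when a set is empty is an assumption.

```python
from typing import Iterable, Tuple


def image_selection_scores(predicted: Iterable[str],
                           gold: Iterable[str]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for predicted vs. ground-truth image IDs."""
    pred_set, gold_set = set(predicted), set(gold)
    hits = len(pred_set & gold_set)
    # Assumed convention: an empty denominator yields 0.0 rather than an error.
    precision = hits / len(pred_set) if pred_set else 0.0
    recall = hits / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


p, r, f = image_selection_scores(["a", "b", "c"], ["a", "c", "d", "e"])
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
# prints: precision=0.667 recall=0.500 f1=0.571
```

Set-based scores ignore where an image is placed in the answer, which is why the summary lists ordering and coherence as separate evaluation aspects.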

Key Results

No per-model results table is available in the source text. The paper introduces the benchmark and reports evaluating 11 models, but the available text focuses on dataset construction and task formulation.

Main Takeaways

Constructed a diverse benchmark with three difficulty levels: Easy (Web), Medium (Academia), and Difficult (Lifestyle)
Established that existing text-only RAG benchmarks are insufficient for evaluating the placement and relevance of images in generated answers
Positioned 'show, don't tell' as a critical capability for next-generation AI assistants

📚 Prerequisite Knowledge

Prerequisites

Retrieval-Augmented Generation (RAG)
Multimodal Large Language Models (MLLMs)
Basic information retrieval metrics (Recall, Precision)

Key Terms

MRAMG: Multimodal Retrieval-Augmented Multimodal Generation, the task of generating answers that seamlessly integrate both text and retrieved images

Interleaved content: Data format where text paragraphs and images appear in a specific sequence (e.g., Text1, Image1, Text2...)

CoT: Chain-of-Thought, a prompting strategy where the model generates intermediate reasoning steps before the final answer

MinHash: A technique for quickly estimating the similarity between two sets, used here for deduplicating images

MinerU: A tool used to parse PDF documents into markdown format while preserving the structure of text and images

MRAMG-Bench: The proposed benchmark consisting of datasets across Web, Academia, and Lifestyle domains

RAG: Retrieval-Augmented Generation, an approach in which a system first retrieves relevant documents and then generates an answer conditioned on them
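To make the MinHash entry above concrete, the sketch below estimates Jaccard similarity between two sets from fixed-length signatures. It is a generic textbook version built on salted SHA-1 hashes. How the benchmark turns images into sets before hashing is not described in this summary, so the string-token inputs here are an assumption.

```python
import hashlib
from typing import Iterable, List


def _hash(seed: int, item: str) -> int:
    """64-bit hash of `item`, salted with `seed` to simulate independent hash functions."""
    digest = hashlib.sha1(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(items: Iterable[str], num_hashes: int = 128) -> List[int]:
    """For each salted hash function, keep the minimum hash value over the set."""
    elements = list(items)
    if not elements:
        raise ValueError("cannot build a signature for an empty set")
    return [min(_hash(seed, e) for e in elements) for seed in range(num_hashes)]


def estimate_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of signature positions that agree; this estimates Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def exact_jaccard(a: Iterable[str], b: Iterable[str]) -> float:
    """Exact Jaccard similarity, for comparison with the estimate."""
    set_a, set_b = set(a), set(b)
    return len(set_a & set_b) / len(set_a | set_b)


set_a = {"red", "car", "street", "night"}
set_b = {"red", "car", "street", "day"}
estimate = estimate_jaccard(minhash_signature(set_a), minhash_signature(set_b))
print(f"exact={exact_jaccard(set_a, set_b):.2f} estimate={estimate:.2f}")
```

In a deduplication setting, items whose estimated similarity exceeds a chosen threshold would be treated as near-duplicates. The threshold value is a design choice not given here.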