CRAG-MM: Multi-modal Multi-turn Comprehensive RAG Benchmark

📝 Paper Summary

Visual Question Answering (VQA), Multi-modal RAG (MM-RAG), Wearable AI

CRAG-MM is a large-scale multi-modal benchmark specifically designed for wearable AI, featuring egocentric images, multi-turn conversations, and realistic retrieval challenges to evaluate MM-RAG systems.

Core Problem

Existing VQA benchmarks rely on high-quality images and common knowledge, failing to capture the low-quality egocentric imagery and dynamic, factual information needs of modern wearable AI users.

Why it matters:

Wearable devices like smart glasses capture egocentric images that are often blurred, occluded, or poorly lit, unlike standard VQA datasets
Users ask factual questions (prices, history) that cannot be answered from images alone, requiring robust external retrieval
Current benchmarks lack multi-turn conversations with domain shifts, which are essential for natural human-AI interaction

Concrete Example: A user wearing smart glasses looks at a sofa and asks, 'What is the price of this sofa on Amazon?' Without RAG, a model can only guess and may hallucinate a price. With RAG, it must retrieve the correct information even when the image is blurry or the sofa is partially occluded.

Key Novelty

Realistic Wearable MM-RAG Benchmark

Incorporates 7.9K images with 79% being egocentric (first-person view), deliberately including low-quality captures (blur, low-light) to mimic real wearable hardware constraints
Provides 2K multi-turn conversations where ~38% involve domain shifts, simulating natural topic drift in user dialogue
Includes a realistic mock retrieval environment with both Image-KG APIs (structured data) and Web Search APIs (800K webpages) containing noise to test system robustness

Architecture

The three tasks defined in CRAG-MM and the flow of information for each. It illustrates the different retrieval sources available for each task.

Evaluation Highlights

State-of-the-art industry solution (GPT-5) achieves only 63% accuracy with 31% hallucinations on single-turn QA, underscoring significant room for improvement
Low-quality egocentric images degrade truthfulness by up to 46% compared to normal images across evaluated systems
Straightforward RAG approaches improve accuracy over MM-LLM-only baselines (37% to 50%) but still fall short of industry solutions (63%)

Breakthrough Assessment

9/10

First comprehensive benchmark for Multi-Modal RAG specifically targeting wearable use cases. The inclusion of egocentric, low-quality images and multi-turn retrieval tasks fills a critical gap in VQA evaluation.

⚙️ Technical Details

Problem Definition

Setting: Multi-modal Retrieval-Augmented Generation (MM-RAG) where systems answer questions based on an input image and external knowledge

Inputs: Image I, Question Q, optionally conversation history H (for multi-turn)

Outputs: Answer A generated using retrieved information

Pipeline Flow

Input Processing (Image + Question)
Retrieval (Image-KG Search / Web Search)
Generation (Answer Synthesis)
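The three stages above can be sketched as a small Python skeleton. This is a minimal illustration, not the benchmark's actual interface: the `Turn` type, the `Retriever` and `Generator` signatures, and the toy functions are all hypothetical names introduced here.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Turn:
    """One question about an image, with optional earlier turns as history."""
    image_id: str                                      # stand-in for the image I
    question: str                                      # the question Q
    history: List[str] = field(default_factory=list)   # H; empty for single-turn


# Hypothetical signatures (assumptions, not the benchmark's real API).
Retriever = Callable[[Turn], List[str]]
Generator = Callable[[Turn, List[str]], str]


def answer(turn: Turn, retrievers: List[Retriever], generate: Generator) -> str:
    """Run input processing -> retrieval -> generation for a single turn."""
    context: List[str] = []
    for retrieve in retrievers:
        context.extend(retrieve(turn))   # collect evidence from each source
    return generate(turn, context)       # synthesize the answer A


# Toy stand-ins so the sketch runs end to end.
def toy_image_kg(turn: Turn) -> List[str]:
    return [f"kg: entity for {turn.image_id}"]


def toy_web(turn: Turn) -> List[str]:
    return [f"web: page about '{turn.question}'"]


def toy_generate(turn: Turn, context: List[str]) -> str:
    if not context:
        return "I don't know"            # abstain rather than guess
    return f"Answer based on {len(context)} snippets"


turn = Turn(image_id="img1", question="What is the price of this sofa?")
print(answer(turn, [toy_image_kg, toy_web], toy_generate))
```

Passing the retrievers as a list mirrors the task split described in this summary: an image-KG-only setup would pass one retriever, and a setup with web search would pass two.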

System Modules

Image Search API (Retrieval)

Retrieve similar images and structured metadata from the mock Knowledge Graph

Model or implementation: CLIP ViT-L/14@336px (for embedding)

Web Search API (Retrieval)

Retrieve relevant webpages based on text queries

Model or implementation: BGE (for embedding chunks)

MM-LLM

Synthesize final answer using image, question, and retrieved context

Model or implementation: Evaluated on various models (e.g., Llama-3.2-90B-Vision-Instruct, GPT-5)
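Both retrieval modules are described as embedding-based, so a plausible reading is nearest-neighbour search over precomputed vectors. The sketch below assumes cosine similarity and uses hand-written toy vectors in place of real CLIP or BGE embeddings; the function names and the index layout are hypothetical, and the paper's actual similarity measure and indexing are not stated in this summary.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: List[float], index: Dict[str, List[float]], k: int = 2) -> List[Tuple[str, float]]:
    """Return the k index entries most similar to the query, best first."""
    scored = [(key, cosine(query, vec)) for key, vec in index.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Toy 3-d "embeddings"; a real system would obtain these from an encoder.
toy_index = {
    "sofa_a": [1.0, 0.0, 0.0],
    "sofa_b": [0.9, 0.1, 0.0],
    "lamp": [0.0, 1.0, 0.0],
}

results = top_k([1.0, 0.05, 0.0], toy_index, k=2)
print([name for name, _ in results])
```

With a blurred or occluded query image, the query vector would drift away from the correct entry, which is one way low-quality captures can hurt retrieval before the MM-LLM ever sees the context.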

Novel Architectural Elements

Benchmark design: Integration of specific 'wearable' constraints (egocentric, low-quality) directly into the retrieval and evaluation loop

Modeling

Models evaluated: Llama-3.2-90B-Vision-Instruct, GPT-5-mini, Gemini-2.5-Flash

Comparison to Prior Work

vs. CRAG: Adds visual modality and egocentric images [cited in paper]
vs. VQA v2.0: Focuses on factual retrieval (prices, specs) rather than just visual recognition [cited in paper]
vs. OK-VQA: Questions derived from real wearable use cases, includes multi-turn conversations and mock APIs for realistic retrieval simulation [cited in paper]
vs. FreshQA [not cited in paper]: FreshQA focuses on changing world knowledge for text; CRAG-MM adds the visual dimension and wearable context
vs. Infoseek [not cited in paper]: Infoseek also does visual information seeking but lacks the specific focus on low-quality egocentric images and multi-turn dialogue found in CRAG-MM

Limitations

Retrieval corpus is a 'mock' static set (68K images, 800K pages), smaller than real web scale
Evaluation relies heavily on LLM-as-a-judge (though validated with 99.1% accuracy)
Image search limited to KG-search API only; does not support reverse image search on the full web
Multi-turn evaluation stops a session early after two consecutive errors, which may penalize systems that could recover context in later turns
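The early-stopping limitation is easier to see with a small scoring sketch. It assumes that a session stops once two consecutive turns are wrong, that all remaining turns are then scored as missing, and that a missing answer resets the streak. These details are assumptions for illustration, and `score_session` is a hypothetical helper, not the benchmark's evaluation code.

```python
from typing import List


def score_session(turn_labels: List[str], max_consecutive_wrong: int = 2) -> List[int]:
    """Score each turn as +1 (correct), 0 (missing), or -1 (wrong).

    Once `max_consecutive_wrong` wrong answers occur in a row, every later
    turn is scored 0, regardless of what the system would have answered.
    """
    scores: List[int] = []
    streak = 0
    stopped = False
    for label in turn_labels:
        if stopped:
            scores.append(0)          # turns after the stop count as missing
            continue
        if label == "correct":
            scores.append(1)
            streak = 0
        elif label == "wrong":
            scores.append(-1)
            streak += 1
        else:                         # "missing": assumed to reset the streak
            scores.append(0)
            streak = 0
        if streak >= max_consecutive_wrong:
            stopped = True
    return scores


# A system that recovers on turns 4 and 5 gets no credit for them.
print(score_session(["correct", "wrong", "wrong", "correct", "correct"]))
```

Under these assumptions, the two correct answers at the end of the example are never counted, which is the penalty the limitation describes.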

Reproducibility

Code: https://huggingface.co/crag-mm-2025

📊 Experiments & Results

Evaluation Setup

Single-turn and Multi-turn QA on wearable-centric images using provided mock retrieval APIs

Benchmarks:

CRAG-MM Task 1: Single-turn QA, Image-KG retrieval only [New]
CRAG-MM Task 2: Single-turn QA, Image-KG + Web retrieval [New]
CRAG-MM Task 3: Multi-turn QA [New]

Metrics:

Truthfulness (Mean per-answer score, where correct = 1, missing = 0, wrong = -1)
Accuracy (Percentage of correct answers)
Hallucination Rate (Percentage of incorrect answers)
Statistical methodology: Not explicitly reported in the paper
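A minimal sketch of how the three metrics relate, assuming each answer has already been labeled `correct`, `missing`, or `wrong` (for example by an LLM judge). The `summarize` function and the label strings are hypothetical names chosen for this illustration.

```python
from collections import Counter
from typing import Dict, Iterable

# Per-answer scores as given in the metric definition.
SCORES = {"correct": 1, "missing": 0, "wrong": -1}


def summarize(labels: Iterable[str]) -> Dict[str, float]:
    """Return accuracy, hallucination rate, and truthfulness as percentages."""
    counts = Counter(labels)
    unknown = set(counts) - set(SCORES)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no labels to score")
    return {
        "accuracy": 100 * counts["correct"] / total,
        "hallucination": 100 * counts["wrong"] / total,
        # Mean score; equals accuracy minus hallucination rate.
        "truthfulness": 100 * sum(SCORES[l] * n for l, n in counts.items()) / total,
    }


labels = ["correct"] * 5 + ["missing"] * 3 + ["wrong"] * 2
print(summarize(labels))
```

Because a missing answer scores 0 and a wrong answer scores -1, a system can raise truthfulness by abstaining when unsure, even though doing so leaves accuracy unchanged.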

Key Results

Performance on Single-turn QA tasks shows that while RAG helps, even SOTA industry models struggle with hallucinations and accuracy on this benchmark.
Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (points)
CRAG-MM Task 2	Accuracy	50	63	+13
CRAG-MM Task 2	Truthfulness	32	32	0
CRAG-MM Task 2	Hallucination Rate	18	31	+13
Multi-turn QA results highlight the difficulty of maintaining context and accuracy over conversational turns.
Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (points)
CRAG-MM Task 3	Accuracy	61	70	+9
CRAG-MM Task 3	Truthfulness	43	45	+2
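The Task 2 rows can be cross-checked with a short worked calculation. It assumes truthfulness T is the mean of per-answer scores (+1 correct, 0 missing, -1 wrong), so that T equals accuracy A minus hallucination rate H when all three are percentages over the same question set. The symbols T, A, and H are introduced here for illustration.

```latex
\begin{align*}
T &= A - H \\
T_{\text{Task 2, Baseline}} &= 50 - 18 = 32 \\
T_{\text{Task 2, This Paper}} &= 63 - 31 = 32
\end{align*}
```

Read this way, the 13-point accuracy gain on Task 2 is offset exactly by a 13-point rise in hallucinations, which is why truthfulness does not change. If the same identity holds for Task 3, the reported values would imply hallucination rates of 61 - 43 = 18 and 70 - 45 = 25, though the table does not report those figures directly.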

Experiment Figures

Analysis of SOTA and Winning solutions across different data slices: Image Quality, Entity Popularity, Question Type, and Multi-turn capability.

Main Takeaways

Low-quality images significantly degrade performance: truthfulness drops by up to 46% on blurred/occluded inputs compared to normal images.
Torso-to-tail entities are much harder to recognize visually than popular entities, leading to lower retrieval performance.
Multi-turn conversation is a major bottleneck: even the best industry system has 27% of sessions early-stopped due to consecutive failures.
Straightforward RAG improves accuracy over no-RAG baselines but often increases hallucinations, suggesting naive retrieval integration is insufficient.

📚 Prerequisite Knowledge

Prerequisites

Visual Question Answering (VQA)
Retrieval-Augmented Generation (RAG)
Knowledge Graphs (KG)
Egocentric vision

Key Terms

MM-RAG: Multi-Modal Retrieval-Augmented Generation; systems that answer questions about images by retrieving external text or structured data

Egocentric images: First-person perspective images captured by wearable devices like smart glasses

Hallucination: When a model generates factually incorrect information not supported by the retrieved context or image

Mock API: Simulated search interfaces provided by the benchmark to access the knowledge graph and webpage corpus

Simple-recognition: Questions answerable directly from the image (e.g., brand name visible on product)

Simple-knowledge: Questions requiring external facts (e.g., price of a product shown in image)

Torso-to-tail entities: Entities with medium (torso) to low (tail) popularity, which are harder for models to recognize than popular (head) entities

LLM-as-a-judge: Using a strong Language Model to evaluate the correctness of answers generated by other models