MsRAG: Knowledge Augumented Image Captioning with Object-level Multi-source RAG

📝 Paper Summary

Knowledge-Augmented Image Captioning · Modularized RAG pipeline · Multimodal RAG

MsRAG enables Language-Visual Large Models to generate knowledge-rich image captions without user queries by combining object-level visual search, offline domain databases, and a visual-text alignment mechanism.

Core Problem

Existing RAG methods for vision-language models rely heavily on user queries to retrieve relevant information, making them ineffective for image captioning tasks where no explicit text query exists.

Why it matters:

Standard pre-trained models suffer from hallucinations or lack specific knowledge about rare objects (e.g., specific cultural relics or products)
Current RAG systems cannot infer user intent without a query, leading to retrieval of irrelevant or noisy information
Automatic generation of high-quality, knowledge-rich captions is crucial for training better multimodal models and real-world applications

Concrete Example: When captioning an image of four football players without a user query, a standard model just says 'four German football players.' MsRAG automatically identifies each face, retrieves their names (e.g., 'Bastian Schweinsteiger'), aligns them to their positions (left-to-right), and generates a caption naming each specific player.

Key Novelty

Object-Level Multi-Source Retrieval without Queries

Retrieves information based on detected visual objects rather than text queries, using both online search engines (for timeliness) and offline databases (for domain depth like cultural relics)
Visual-RAG Alignment: Explicitly maps text retrieval results to specific image regions using visual markers (bounding boxes/numbers) in the prompt, preventing the model from attributing facts to the wrong object

Architecture

The complete MsRAG pipeline processing an input image to generate a caption.

Evaluation Highlights

Outperforms standard mRAG by +21.9% on CIDEr score using GPT-4o on the KAC-dataset
Achieves higher user preference scores in human evaluation across all domains, particularly in cultural relics (+3.24) and products (+3.18) compared to mRAG
Boosts Qwen2-VL performance on the Kale dataset by +3.1 BLEU-4 over the mRAG baseline

Breakthrough Assessment

7/10

Addresses a specific but critical gap (query-free RAG) with a practical pipeline. The Visual-RAG alignment is a clever adaptation of Set-of-Marks for RAG. While architectural novelty is moderate (pipeline of existing tools), practical utility for captioning is high.

⚙️ Technical Details

Problem Definition

Setting: Knowledge-augmented image captioning without user queries

Inputs: Input Image I (without any text prompt/query)

Outputs: Knowledge-rich caption C describing entities and context in I

Pipeline Flow

Pre-process (Object Detection & Cropping)
Parallel Visual Search (Online + Offline Retrieval)
RAG Content Summary (LLM summarization)
Prompt Selection (Template matching)
Visual-RAG Alignment (Visual prompting)
Caption Generation (LVLM inference)
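
To make the data flow between the six stages concrete, here is a minimal sketch in Python. It is not the authors' implementation: every function passed in (`detect`, `search`, `summarize`, `select_template`, `generate`) is a hypothetical stand-in for a component the paper describes only at a high level, and the box convention and the `{facts}` placeholder are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Assumed box convention: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[int, int, int, int]


@dataclass
class DetectedObject:
    marker: int                                        # numeric marker tied to this object
    box: Box                                           # location of the object in the image
    raw_hits: List[str] = field(default_factory=list)  # unsummarized retrieval text
    summary: str = ""                                  # condensed knowledge snippet


def run_pipeline(
    image: object,
    detect: Callable[[object], List[Box]],
    search: Callable[[object, Box], List[str]],
    summarize: Callable[[List[str]], str],
    select_template: Callable[[List[DetectedObject]], str],
    generate: Callable[[object, str], str],
) -> str:
    # 1. Pre-process: detect objects and give each one a marker.
    objects = [DetectedObject(marker=i + 1, box=b) for i, b in enumerate(detect(image))]
    # 2. Visual search per object (sequential here; the paper runs sources in parallel).
    for obj in objects:
        obj.raw_hits = search(image, obj.box)
    # 3. RAG content summary: condense raw hits into one snippet per object.
    for obj in objects:
        obj.summary = summarize(obj.raw_hits)
    # 4. Prompt selection: pick a template that contains a "{facts}" placeholder.
    template = select_template(objects)
    # 5. Visual-RAG alignment: key every snippet to its object's marker.
    facts = "\n".join(f"[{o.marker}] {o.summary}" for o in objects)
    prompt = template.format(facts=facts)
    # 6. Caption generation: hand image and aligned prompt to the LVLM.
    return generate(image, prompt)
```

Passing the stages in as callables keeps the sketch independent of any particular detector, search API, or LVLM, which mirrors the modular, tuning-free framing of the pipeline.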

System Modules

Pre-process Module

Detect objects and crop sub-images for individual entity search

Model or implementation: Not explicitly specified (likely a standard object detector like YOLO or GroundingDINO)
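
The cropping half of this step can be sketched without committing to a detector. The example below assumes boxes arrive as `(x_min, y_min, x_max, y_max)` with exclusive maxima and represents the image as a plain list of pixel rows; both choices are illustrative assumptions, not details from the paper.

```python
from typing import List, Sequence, Tuple

# Assumed convention: (x_min, y_min, x_max, y_max), max coordinates exclusive.
Box = Tuple[int, int, int, int]


def crop(image: Sequence[Sequence[int]], box: Box) -> List[List[int]]:
    """Return the sub-image inside `box`, clamping the box to the image bounds."""
    height = len(image)
    width = len(image[0]) if height else 0
    x0, y0, x1, y1 = box
    # Detectors can emit boxes that spill past the border, so clamp first.
    x0, x1 = max(0, x0), min(width, x1)
    y0, y1 = max(0, y0), min(height, y1)
    return [list(row[x0:x1]) for row in image[y0:y1]]


def crop_all(image: Sequence[Sequence[int]], boxes: Sequence[Box]) -> List[List[List[int]]]:
    """Produce one sub-image per detected box, in detection order."""
    return [crop(image, b) for b in boxes]
```

Each cropped sub-image would then serve as the query for that object's visual search.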

Parallel Visual Search (Retrieval & Selection)

Retrieve knowledge for each object from web search engines and domain-specific databases

Model or implementation: Google/Baidu/Bing APIs (Online) + Custom Vector Database (Offline)
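
One way to picture the parallel retrieval is a thread pool that fans every object out to every source and tolerates individual failures. This is a sketch under stated assumptions: the retriever callables are hypothetical placeholders (no real search-engine or vector-database API is called), and objects are identified by strings where the actual system would send cropped images.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# A retriever maps an object identifier to a list of text hits (hypothetical interface).
Retriever = Callable[[str], List[str]]


def parallel_search(
    object_ids: List[str],
    sources: Dict[str, Retriever],
    max_workers: int = 8,
) -> Dict[str, Dict[str, List[str]]]:
    """Query every source for every object concurrently."""
    results: Dict[str, Dict[str, List[str]]] = {oid: {} for oid in object_ids}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit one task per (object, source) pair.
        futures = {
            (oid, name): pool.submit(fn, oid)
            for oid in object_ids
            for name, fn in sources.items()
        }
        for (oid, name), future in futures.items():
            try:
                results[oid][name] = future.result()
            except Exception:
                # A failing source yields no hits instead of aborting the other lookups.
                results[oid][name] = []
    return results
```

Treating a failed lookup as an empty result lets the offline database still contribute when online search returns nothing for an object.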

RAG Content Summary (Retrieval & Selection)

Condense retrieved raw text into concise knowledge snippets

Model or implementation: Light-weight LLM (specific model not named)

Visual-RAG Alignment Module (VRAM)

Bridge modality gap by associating text knowledge with specific visual locations via markers

Model or implementation: Algorithm (Set-of-Marks style marking)
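
The text side of the alignment can be illustrated with a small prompt builder. The sketch assumes markers are numbered left to right (matching the football example in the summary) and that each retrieved snippet is already keyed by its bounding box; the prompt wording and function names are hypothetical, and drawing the numbered boxes onto the image itself is not shown.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # assumed (x_min, y_min, x_max, y_max)


def assign_markers(boxes: List[Box]) -> List[Tuple[int, Box]]:
    """Number boxes 1..N from left to right (ties broken top to bottom)."""
    ordered = sorted(boxes, key=lambda b: (b[0], b[1]))
    return [(i + 1, b) for i, b in enumerate(ordered)]


def build_aligned_prompt(facts_by_box: Dict[Box, str]) -> str:
    """Tie each retrieved snippet to the marker of the region it describes."""
    lines = ["Each numbered box in the image corresponds to one entry below."]
    for marker, box in assign_markers(list(facts_by_box)):
        fact = facts_by_box[box] or "no external knowledge retrieved"
        lines.append(f"[{marker}] box={box}: {fact}")
    lines.append("Write one caption and attribute each fact only to its numbered object.")
    return "\n".join(lines)
```

Keying every fact to a visible marker is what lets the caption model avoid attaching a retrieved name or attribute to the wrong region.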

Caption Generator

Generate final descriptive caption integrating visual and textual knowledge

Model or implementation: LVLM (GPT-4o, Claude-3.5, Qwen2-VL, or InternVL2)

Novel Architectural Elements

Parallel retrieval architecture combining generic online search with specialized offline domain databases (Cultural Relics, Products) triggered by object detection
VRAM (Visual-RAG Alignment Module) that dynamically constructs prompts mapping retrieved text facts to visual bounding box markers

Modeling

Base Model: Evaluated on GPT-4o, Claude-3.5-Sonnet, Qwen2-VL, InternVL2

Training Method: Tuning-free framework (Inference-only RAG)

Compute: Run on two Nvidia A100s

Comparison to Prior Work

vs. mRAG: MsRAG uses object-level retrieval and explicit visual-text alignment (VRAM) rather than global retrieval
vs. CapFusion: MsRAG is an inference-time framework for open-domain images, whereas CapFusion focuses on training data generation
vs. SnapNTell [not cited in paper]: Similar entity-centric retrieval, but MsRAG integrates offline domain-specific databases to handle commercial/cultural entities where web search fails

Limitations

Dependency on the performance of the underlying object detection model
Offline database coverage is limited to specific domains (Culture, Products, etc.)
Inference latency likely higher due to multi-object search and summarization steps (though not explicitly quantified)
Performance gains in 'Character' domain are marginal due to strong pre-training of base models

Reproducibility

Code: Not reported in the paper

The KAC-dataset is introduced, but no URL is provided in the text. Offline database details (size, tracks) are provided in tables.

📊 Experiments & Results

Evaluation Setup

Knowledge-based image captioning on open-domain images

Benchmarks:

KAC-dataset (Knowledge-Augmented Captioning, Multidomain) [New]
CapFusion (Image Captioning)
Kale (Dense Captioning)

Metrics:

BLEU-1, BLEU-4
METEOR
CIDEr
User Preference Score (1-10)
Statistical methodology: Not explicitly reported in the paper

Key Results

Quantitative performance on public benchmarks shows MsRAG consistently improving caption quality over the mRAG baseline across different LVLMs.

Benchmark	Metric	Baseline	This Paper	Δ
Kale	CIDEr	5.4	16.7	+11.3
Kale	BLEU-4	2.9	9.4	+6.5
CapFusion	CIDEr	2.0	5.2	+3.2
Kale	CIDEr	16.8	13.3	-3.5

Ablation studies demonstrate the critical role of the Visual-RAG Alignment Module (VRAM) and Offline Databases.

Benchmark	Metric	Baseline	This Paper	Δ
KAC-dataset (Human Eval)	Knowledge Score (0-1)	0.41	0.77	+0.36
KAC-dataset (Human Eval)	Knowledge Score (0-1)	0.21	0.77	+0.56
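
The Δ column is an absolute difference (This Paper minus Baseline), whereas the Evaluation Highlights also quote a gain as a percentage. The short sketch below recomputes both views from the public-benchmark rows in the table. The `deltas` helper is introduced here purely for illustration, and the rows carry no model labels because the table does not provide them.

```python
from typing import List, Tuple

# (benchmark, metric, baseline, this_paper) copied from the table rows above.
ROWS: List[Tuple[str, str, float, float]] = [
    ("Kale", "CIDEr", 5.4, 16.7),
    ("Kale", "BLEU-4", 2.9, 9.4),
    ("CapFusion", "CIDEr", 2.0, 5.2),
    ("Kale", "CIDEr", 16.8, 13.3),
]


def deltas(baseline: float, ours: float) -> Tuple[float, float]:
    """Return (absolute difference, relative change in percent), rounded to one decimal."""
    absolute = round(ours - baseline, 1)
    relative_pct = round(100.0 * (ours - baseline) / baseline, 1)
    return absolute, relative_pct


for benchmark, metric, baseline, ours in ROWS:
    absolute, relative_pct = deltas(baseline, ours)
    print(f"{benchmark:10s} {metric:7s} abs={absolute:+.1f} rel={relative_pct:+.1f}%")
```

Reading absolute and relative changes side by side matters with low baselines: a few points of CIDEr can correspond to a relative change of well over 100%.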

Experiment Figures

Ablation study qualitative and quantitative results.

Main Takeaways

MsRAG significantly improves captioning for knowledge-intensive domains (Cultural Relics, Products) where standard models hallucinate or lack detail.
The Visual-RAG Alignment Module (VRAM) prevents 'hallucination transfer' where the model attributes retrieved facts to the wrong object in the image.
Commercial models (GPT-4o, Claude) show larger relative gains from MsRAG than open-source models, suggesting better instruction-following capabilities for complex RAG prompts.
Offline domain databases provide critical redundancy; online search alone failed in ~62% of Cultural Relics cases in their dataset.

📚 Prerequisite Knowledge

Prerequisites

Retrieval-Augmented Generation (RAG)
Language-Visual Large Models (LVLMs)
Object Detection

Key Terms

RAG: Retrieval-Augmented Generation—systems that fetch external data to improve AI responses

LVLM: Language-Visual Large Model—AI models capable of understanding both text and images (e.g., GPT-4o, Qwen-VL)

Set of Marks (SoM): A prompting technique where objects in an image are overlaid with visible numeric markers to help the model reference specific regions

CIDEr: Consensus-based Image Description Evaluation—a metric for image captioning that measures consensus between a candidate caption and reference captions

BLEU: Bilingual Evaluation Understudy—a metric measuring text overlap between generated and reference text, often used for translation and captioning

METEOR: Metric for Evaluation of Translation with Explicit ORdering—a text evaluation metric that correlates better with human judgment than BLEU by using synonyms and stemming

Visual-RAG Alignment: The paper's proposed method of mapping retrieved text segments to specific visual regions using spatial prompts and markers

mRAG: A baseline multimodal RAG approach that typically uses image-to-text retrieval or simple query-based retrieval

KAC-dataset: Knowledge-Augmented Captioning dataset—a new benchmark introduced in this paper covering diverse domains like cultural relics and products
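
To ground the BLEU entry above, here is a minimal sketch of its core ingredient, the clipped (modified) n-gram precision. It is deliberately incomplete: full BLEU combines these precisions over several n-gram orders with a geometric mean and a brevity penalty, and the whitespace tokenization here is a simplifying assumption rather than the tokenizer used in the paper's evaluation.

```python
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate: str, references: List[str], n: int) -> float:
    """Fraction of candidate n-grams found in the references, with clipped counts."""
    cand = ngrams(candidate.lower().split(), n)
    if not cand:
        return 0.0
    # For each n-gram, the most times it appears in any single reference.
    max_ref: Counter = Counter()
    for ref in references:
        for gram, count in ngrams(ref.lower().split(), n).items():
            max_ref[gram] = max(max_ref[gram], count)
    # Clipping stops a candidate from being rewarded for repeating a matching word.
    matched = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return matched / sum(cand.values())
```

BLEU-1 and BLEU-4 in the metrics list correspond to using n-gram orders up to 1 and up to 4, respectively.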