MA-RAG: Multi-Agent Retrieval-Augmented Generation via Collaborative Chain-of-Thought Reasoning

📝 Paper Summary

Agentic RAG pipeline
Modularized RAG pipeline

MA-RAG is a training-free multi-agent framework that orchestrates specialized agents (Planner, Step Definer, Extractor, QA) via chain-of-thought prompting to solve complex retrieval tasks without fine-tuning.

Core Problem

Standard RAG systems treat retrieval and generation as isolated components, failing to resolve ambiguities and reasoning gaps in complex, multi-hop queries.

Why it matters:

Existing methods struggle with vague queries or scattered evidence, leading to retrieval mismatch and hallucinations.
End-to-end fine-tuning approaches are computationally expensive and lack interpretability in their intermediate reasoning steps.
Naively appending retrieved documents often introduces noise and context overflow, degrading performance.

Concrete Example: For a multi-hop question like 'What conference was the Vermont Catamounts men's soccer team's conference formerly known as...', standard RAG might just retrieve football documents or miss the historical context. MA-RAG decomposes this into structured sub-tasks: finding the team's conference affiliation, then finding that conference's former name.

Key Novelty

Modular Multi-Agent RAG (MA-RAG)

Decomposes RAG into four distinct agent roles (Planner, Step Definer, Extractor, QA) that communicate via structured state updates.
Uses an on-demand strategy where agents are invoked only when necessary based on the Planner's decomposition, rather than a fixed linear pipeline.
Achieves state-of-the-art performance purely through chain-of-thought prompting and agent collaboration, requiring zero model fine-tuning.

Architecture

Overview of the MA-RAG framework, illustrating the interaction between the Planner, Step Definer, Retrieval Tool, Extractor, and QA Agent.

Evaluation Highlights

Surpasses 70B-scale baselines (ChatQA-1.5, RankRAG) using only LLaMA3-8B on NQ (52.5 EM), HotpotQA (51.1 EM), and 2WikimQA (46.4 EM).
Achieves 59.5 EM on Natural Questions with GPT-4o-mini, outperforming standalone GPT-4 (40.3 EM) by 19.2 points.
Generalizes to the medical domain (MedMCQA) without fine-tuning, outperforming the domain-specific Meditron-70B and PMC-LLaMA-13B.

Breakthrough Assessment

8/10

Strong evidence that modular, training-free agents can outperform fine-tuned RAG models. The high performance of small 8B models against larger baselines is particularly impressive.

⚙️ Technical Details

Problem Definition

Setting: Open-domain Question Answering and Fact Verification using external knowledge sources.

Inputs: Natural language query q and a document corpus C.

Outputs: Final answer A derived from retrieved evidence.

Pipeline Flow

Planning: Planner Agent
Execution Loop: Step Definer → Retrieval Tool → Extractor Agent → QA Agent (repeated for each sub-step)
Final Synthesis: QA Agent
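The flow above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `PipelineState`, `run_ma_rag`, and the callable signatures are hypothetical names, and each agent is passed in as an ordinary function so that an LLM call could be substituted. The toy agents and the "Team T" / "Conference X" data are invented purely to make the control flow visible.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]  # (plan step, intermediate answer) pairs


@dataclass
class PipelineState:
    """Shared state passed between agents (hypothetical structure)."""
    query: str
    plan: List[str] = field(default_factory=list)
    history: History = field(default_factory=list)


def run_ma_rag(
    query: str,
    planner: Callable[[str], List[str]],
    step_definer: Callable[[str, History], str],
    retrieve: Callable[[str], List[str]],
    extractor: Callable[[str, List[str]], str],
    qa: Callable[[str, str], str],
) -> Tuple[str, PipelineState]:
    state = PipelineState(query=query)
    # Planning: decompose the query into reasoning steps.
    state.plan = planner(query)
    # Execution loop: one pass per plan step.
    for step in state.plan:
        subquery = step_definer(step, state.history)
        passages = retrieve(subquery)
        evidence = extractor(subquery, passages)
        state.history.append((step, qa(subquery, evidence)))
    # Final synthesis: answer the original query from the step answers.
    notes = " ".join(answer for _, answer in state.history)
    return qa(query, notes), state


# Toy stand-ins for the LLM agents and the retriever (illustration only).
calls: List[str] = []


def toy_planner(query: str) -> List[str]:
    calls.append("planner")
    return ["find Team T's conference", "find that conference's former name"]


def toy_step_definer(step: str, history: History) -> str:
    calls.append("step_definer")
    known = history[-1][1] if history else "nothing yet"
    return f"{step} (known so far: {known})"


def toy_retrieve(subquery: str) -> List[str]:
    calls.append("retrieve")
    return ["Team T plays in Conference X.",
            "Conference X was once named Conference Y."]


def toy_extractor(subquery: str, passages: List[str]) -> str:
    calls.append("extractor")
    return passages[1] if "former" in subquery else passages[0]


def toy_qa(question: str, evidence: str) -> str:
    calls.append("qa")
    return evidence  # a real QA agent would generate an answer here


answer, state = run_ma_rag(
    "What was Team T's conference formerly called?",
    toy_planner, toy_step_definer, toy_retrieve, toy_extractor, toy_qa,
)
print(calls)
```

The recorded call order shows one Planner call, one Step Definer / Retrieval / Extractor / QA pass per plan step, and a closing QA call for synthesis.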

System Modules

Planner Agent

Analyzes input query for ambiguity and decomposes it into a structured plan of reasoning subtasks.

Model or implementation: LLM (e.g., LLaMA3-8B, GPT-4o-mini)

Step Definer Agent (Execution)

Converts abstract plan steps into executable, detailed subqueries for retrieval.

Model or implementation: LLM (e.g., LLaMA3-8B)

Retrieval Tool

Retrieves top-k relevant passages from the corpus based on the subquery.

Model or implementation: Dense Retriever (gte-multilingual) + FAISS
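Top-k dense retrieval amounts to a nearest-neighbor search over embedding vectors. The sketch below is a pure-Python stand-in that scores every passage by cosine similarity, which is the kind of exhaustive search a flat inner-product FAISS index performs in optimized form over normalized vectors. The function names and the three-dimensional toy vectors are assumptions for illustration; they are not gte-multilingual embeddings.

```python
import math
from typing import List, Sequence, Tuple


def normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_k(
    query_vec: Sequence[float],
    passage_vecs: Sequence[Sequence[float]],
    k: int,
) -> List[Tuple[int, float]]:
    """Return (passage index, cosine similarity) for the k best passages."""
    q = normalize(query_vec)
    scored = []
    for idx, vec in enumerate(passage_vecs):
        p = normalize(vec)
        scored.append((idx, sum(a * b for a, b in zip(q, p))))
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings: passages 0 and 1 point roughly the same way as the query.
passages = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
hits = top_k([1.0, 0.05, 0.0], passages, k=2)
print([idx for idx, _ in hits])
```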

Extractor Agent (Execution)

Filters noise from retrieved chunks and aggregates only relevant sentences.

Model or implementation: LLM (e.g., LLaMA3-8B)
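The Extractor's input/output contract (a subquery and retrieved chunks in, only the relevant sentences out) can be illustrated with a small sketch. In MA-RAG this role is filled by a prompted LLM; the token-overlap heuristic below is only a stand-in chosen so the example runs without a model, and `extract_evidence`, `min_overlap`, and the stopword list are hypothetical.

```python
import re
from typing import List, Set

STOPWORDS = {"the", "a", "an", "of", "in", "is", "was", "what", "which", "to"}


def tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def extract_evidence(subquery: str, passages: List[str], min_overlap: int = 2) -> str:
    """Keep sentences sharing at least `min_overlap` content words with the subquery."""
    query_terms = tokens(subquery) - STOPWORDS
    kept = []
    for passage in passages:
        # Naive sentence split: break after ., ! or ? followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", passage.strip()):
            if len(tokens(sentence) & query_terms) >= min_overlap:
                kept.append(sentence)
    return " ".join(kept)


# Invented toy passages: the second chunk is pure noise for this subquery.
chunks = [
    "Conference X was founded long ago. It was formerly named Conference Y.",
    "Team T also has a ski program. The campus is near a lake.",
]
evidence = extract_evidence("What was Conference X formerly named?", chunks)
print(evidence)
```

Only the sentences from the first chunk survive, so the QA Agent receives a shorter, less noisy context than the raw retrieved passages.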

QA Agent

Synthesizes an answer for the current step or the final answer using gathered evidence.

Model or implementation: LLM (e.g., LLaMA3-8B, LLaMA3-70B)

Novel Architectural Elements

On-demand agent invocation strategy driven by a dynamic reasoning plan rather than a fixed sequence.
Separation of 'Step Definer' (query formulation) and 'Extractor' (noise filtering) into distinct agent roles within the loop.
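One way to read on-demand invocation is as a routing function over the shared state: at each turn, the state alone determines which agent runs next. The sketch below is a hypothetical illustration of that idea in plain Python; the state keys, agent names, and `next_agent` are assumptions rather than the paper's code, and the toy updates replace real agent outputs.

```python
from typing import Dict, List, Optional


def next_agent(state: Dict) -> Optional[str]:
    """Return the name of the next agent to invoke, or None when finished."""
    if state.get("plan") is None:
        return "planner"
    if state.get("final_answer") is not None:
        return None
    if state["step_index"] >= len(state["plan"]):
        return "final_qa"
    if state.get("subquery") is None:
        return "step_definer"
    if state.get("passages") is None:
        return "retrieval"
    if state.get("evidence") is None:
        return "extractor"
    return "step_qa"


def apply_toy_update(state: Dict, agent: str) -> None:
    """Mimic the state update each agent would write (placeholder values)."""
    if agent == "planner":
        state.update(plan=list(state["toy_plan"]), step_index=0)
    elif agent == "step_definer":
        state["subquery"] = state["plan"][state["step_index"]]
    elif agent == "retrieval":
        state["passages"] = ["toy passage"]
    elif agent == "extractor":
        state["evidence"] = "toy evidence"
    elif agent == "step_qa":
        # Step finished: clear per-step fields and advance to the next step.
        state.update(subquery=None, passages=None, evidence=None)
        state["step_index"] += 1
    elif agent == "final_qa":
        state["final_answer"] = "toy answer"


def trace(toy_plan: List[str]) -> List[str]:
    """Run the router until completion and record the invocation order."""
    state: Dict = {"toy_plan": toy_plan}
    order = []
    while (agent := next_agent(state)) is not None:
        order.append(agent)
        apply_toy_update(state, agent)
    return order


print(trace(["single step"]))
print(len(trace(["step 1", "step 2"])))
```

Under these assumptions a one-step plan triggers six invocations and a two-step plan ten, so the number of agent calls scales with the Planner's decomposition rather than being fixed in advance.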

Modeling

Base Model: LLaMA3-8B, LLaMA3-70B, GPT-4o-mini (used as backbones for agents)

Training Method: Prompt Engineering / In-Context Learning only

Compute: Inference only. Experiments used 8 NVIDIA A6000 GPUs. Average response time with GPT-4o-mini: ~2.2s (single-hop), ~4.1s (multi-hop).

Comparison to Prior Work

vs. Self-RAG: MA-RAG is training-free and uses explicit multi-agent planning rather than internal self-reflection tokens.
vs. ChatQA-1.5 / RankRAG: MA-RAG achieves comparable or better performance without any fine-tuning, relying instead on its modular architecture.
vs. ReAct: MA-RAG separates planning from execution across dedicated agents rather than interleaving reasoning and acting in a single-agent loop.

Limitations

Higher latency and token cost due to multiple agent invocations per query compared to single-pass RAG.
Performance depends heavily on the capability of the underlying LLM (e.g., Planner requires strong reasoning).
Response time increases for multi-hop questions (approx. 4.1s vs 2.2s for single-hop).

Reproducibility

Code: https://github.com/thangylvp/MA-RAG

Implemented using LangChain and LangGraph. Uses public datasets (KILT versions of NQ, HotpotQA, etc.) and public models (LLaMA3, gte-multilingual).

📊 Experiments & Results

Evaluation Setup

Open-domain QA and Fact Verification using Wikipedia corpus (KILT split).

Benchmarks:

Natural Questions (NQ): open-domain QA, mostly single-hop
HotpotQA: multi-hop QA
2WikimQA: multi-hop QA
TriviaQA: open-domain QA
MedMCQA: medical-domain QA

Metrics:

Exact Match (EM)
Accuracy (Acc)
Statistical methodology: Not explicitly reported in the paper
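For readers unfamiliar with Exact Match, the sketch below shows how the metric is commonly computed. The summary does not state which answer normalization the paper uses, so the SQuAD-style normalization here (lowercasing, stripping punctuation and English articles, collapsing whitespace) is an assumption, and the function names are illustrative.

```python
import re
import string
from typing import Iterable, List, Sequence


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: Iterable[str]) -> int:
    """1 if the normalized prediction equals any normalized gold answer, else 0."""
    pred = normalize_answer(prediction)
    return int(any(pred == normalize_answer(gold) for gold in gold_answers))


def em_score(predictions: Sequence[str], references: Sequence[List[str]]) -> float:
    """Corpus-level EM as a percentage."""
    hits = [exact_match(p, refs) for p, refs in zip(predictions, references)]
    return 100.0 * sum(hits) / len(hits)


print(exact_match("The Eiffel Tower.", ["Eiffel Tower"]))
print(exact_match("Paris, France", ["Paris"]))
print(em_score(["a cat", "dog"], [["cat"], ["bird"]]))
```

EM gives no partial credit: a prediction that contains the gold answer plus extra words scores zero, which is why noise filtering before the QA step can matter for this metric.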

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Main results demonstrating MA-RAG (8B) outperforming same-size and larger baselines on Open-Domain QA benchmarks.
Natural Questions (NQ)	EM	48.2	52.5	+4.3
HotpotQA	EM	41.6	51.1	+9.5
2WikimQA	EM	43.3	46.4	+3.1
State-of-the-art results using larger/stronger backbone models (LLaMA3-70B and GPT-4o-mini).
Natural Questions (NQ)	EM	53.6	59.5	+5.9
HotpotQA	EM	50.3	52.1	+1.8
Generalization to Medical Domain without fine-tuning.
MedMCQA	Accuracy	54.6	60.2	+5.6
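The Δ column is the absolute difference between the two score columns. As a quick sanity check, the snippet below recomputes it from the rows above and also derives the relative gain over the baseline, which the table does not report. The group labels are shorthand for the table's caption rows, and `delta` and `relative_gain` are illustrative helper names.

```python
# (group, benchmark, metric, baseline, this_paper, reported_delta), copied from the table
rows = [
    ("8B", "Natural Questions (NQ)", "EM", 48.2, 52.5, 4.3),
    ("8B", "HotpotQA", "EM", 41.6, 51.1, 9.5),
    ("8B", "2WikimQA", "EM", 43.3, 46.4, 3.1),
    ("larger backbones", "Natural Questions (NQ)", "EM", 53.6, 59.5, 5.9),
    ("larger backbones", "HotpotQA", "EM", 50.3, 52.1, 1.8),
    ("medical", "MedMCQA", "Accuracy", 54.6, 60.2, 5.6),
]


def delta(baseline: float, ours: float) -> float:
    """Absolute gain in points, rounded to one decimal like the table."""
    return round(ours - baseline, 1)


def relative_gain(baseline: float, ours: float) -> float:
    """Gain as a percentage of the baseline score, rounded to one decimal."""
    return round(100.0 * (ours - baseline) / baseline, 1)


# Rows whose reported delta disagrees with the recomputed one (expected: none).
mismatches = [row for row in rows if delta(row[3], row[4]) != row[5]]
print(mismatches)

for group, name, metric, base, ours, _ in rows:
    print(f"{group:17s} {name:24s} {metric:8s} +{delta(base, ours)} pts "
          f"({relative_gain(base, ours)}% relative)")
```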

Main Takeaways

MA-RAG consistently outperforms standalone LLMs and fine-tuned RAG baselines (like ChatQA-1.5 and RankRAG) across model scales, particularly on multi-hop tasks.
Ablation studies show the Planner is essential for multi-hop reasoning (performance drops significantly without it), and the Extractor is crucial for filtering noise.
Model capacity analysis reveals that the QA Agent and Planner benefit most from larger models, while the Step Definer can use smaller models with minimal performance loss.
The framework generalizes well to specialized domains like medicine (MedMCQA) without any domain-specific training.

📚 Prerequisite Knowledge

Prerequisites

Retrieval-Augmented Generation (RAG) principles
Chain-of-Thought (CoT) prompting
Large Language Model (LLM) inference
Multi-agent system architectures

Key Terms

RAG: Retrieval-Augmented Generation—AI systems that enhance generation by retrieving external documents.

Chain-of-Thought: A prompting technique where models generate intermediate reasoning steps before the final answer.

Zero-shot/Training-free: The system relies on pre-trained models without updating their weights via gradient descent.

Multi-hop QA: Question answering tasks requiring reasoning across multiple documents or pieces of evidence.

EM: Exact Match—a metric measuring if the generated answer exactly matches the ground truth.

Dense Retrieval: Retrieval based on semantic vector embeddings rather than keyword matching.

FAISS: Facebook AI Similarity Search—a library for efficient similarity search of dense vectors.

LangGraph: A library for building stateful, multi-agent applications with LLMs using graph structures.