Rethinking Retrieval-Augmented Generation as a Cooperative Decision-Making Problem

📝 Paper Summary

Agentic RAG pipeline

CoRAG jointly optimizes a reranker and generator as cooperative agents using a shared task-oriented reward to eliminate the generator's asymmetric dependency on perfect ranking.

Core Problem

Existing RAG systems use a ranking-centric, asymmetric pipeline where the generator is highly sensitive to reranking errors, requiring the reranker to learn difficult fine-grained orderings.

Why it matters:

Suboptimal rerankers that misplace relevant documents (even if in the top-N set) can cause generator failure due to strict dependency
Learning exact total ordering of documents is harder than relaxed ordering, creating an optimization mismatch between reranker difficulty and generator sensitivity

Concrete Example: If a reranker places a less relevant document at position 1 and the optimal document at position 3, a standard generator might hallucinate an answer based on the first document, even though the correct information is available in the context window.

Key Novelty

Cooperative Retrieval-Augmented Generation (CoRAG)

Reformulate RAG as a multi-agent problem where reranker and generator are peer decision-makers optimized for a shared final outcome rather than separate metrics
Transform delayed task rewards (did the answer match?) into document-level stochastic preference signals to train the reranker without explicit relevance labels
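As a rough illustration of how a delayed answer-level reward could become a document-level signal, the sketch below applies the smoothed success-rate estimate listed under Objective Functions and then samples a preferred/non-preferred label per document. The function names, the default α = 1.0, and the rule of sampling each label with probability p_i are assumptions made for illustration; the summary does not specify how the paper draws its preference labels.

```python
import random


def smoothed_success_rate(successes: int, trials: int, alpha: float = 1.0) -> float:
    """Smoothed Bernoulli estimate p_i = (successes + alpha) / (trials + 2 * alpha)."""
    return (successes + alpha) / (trials + 2 * alpha)


def sample_preference_sets(stats, alpha=1.0, rng=None):
    """Split documents into preferred (D+) and non-preferred (D-) sets.

    stats maps doc_id -> (successes, trials), where a "success" means the
    generator answered correctly when given that document.
    ASSUMPTION: each document is labelled preferred with probability p_i.
    """
    rng = rng or random.Random(0)
    positives, negatives = [], []
    for doc_id, (successes, trials) in stats.items():
        p_i = smoothed_success_rate(successes, trials, alpha)
        if rng.random() < p_i:
            positives.append(doc_id)
        else:
            negatives.append(doc_id)
    return positives, negatives
```

With no trials at all the estimate falls back to 0.5, so an unseen document is treated as equally likely to help or not until evidence accumulates.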

Architecture

Overview of CoRAG framework showing the interaction between Reranker and Generator and the shared reward mechanism.

Evaluation Highlights

Achieves 71.2% accuracy on PopQA (trained on only ~10K PopQA samples), outperforming RetRobust and InstructRAG-FT
Demonstrates strong generalization to unseen datasets: 81.0% accuracy on TriviaQA and 72.4% on Natural Questions without training on them
Outperforms baselines in code generation (HumanEval pass@1) and table QA (WikiTable Questions), showing cross-domain robustness

Breakthrough Assessment

8/10

Strong performance with limited training data (10K samples) and good zero-shot generalization to other datasets suggest the cooperative formulation substantially reduces the reranker-generator misalignment in RAG, although the weaker ASQA results show the gains are not universal.

⚙️ Technical Details

Problem Definition

Setting: Open-domain question answering with retrieval and reranking

Inputs: Query q and candidate document set D

Outputs: Generated response a_hat

Pipeline Flow

Retrieval & Reranking: Retriever → Reranker
Generation: Generator

System Modules

Retriever (Retrieval & Reranking)

Retrieve candidate documents from external corpus

Model or implementation: Not explicitly specified (implied standard dense retriever)

Reranker (Retrieval & Reranking)

Select and order top-K documents based on relevance to query

Model or implementation: BGE-Reranker-v2-m3

Generator

Synthesize final response based on query and selected documents

Model or implementation: Llama-3-Instruct-8B
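The three modules above compose into a simple retrieve, rerank, generate chain. The sketch below shows only that control flow; `rag_answer` and the toy stand-ins are hypothetical and replace the actual retriever, BGE-Reranker-v2-m3, and Llama-3-Instruct-8B with trivial functions so the example runs on its own.

```python
from typing import Callable, List

# Toy corpus and stand-in modules (illustrative only, not the paper's models).
CORPUS = [
    "Paris is the capital of France",
    "Berlin is the capital of Germany",
    "Bananas are yellow",
]


def toy_retrieve(query: str) -> List[str]:
    """Stand-in retriever: returns every document as a candidate."""
    return list(CORPUS)


def toy_score(query: str, doc: str) -> float:
    """Stand-in reranker score: number of words shared by query and document."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def toy_generate(query: str, docs: List[str]) -> str:
    """Stand-in generator: echoes the first document it is given."""
    return docs[0] if docs else ""


def rag_answer(
    query: str,
    retrieve: Callable[[str], List[str]],
    score: Callable[[str, str], float],
    generate: Callable[[str, List[str]], str],
    top_k: int = 3,
) -> str:
    """Retriever -> Reranker (keep top_k by score) -> Generator."""
    candidates = retrieve(query)
    ranked = sorted(candidates, key=lambda d: score(query, d), reverse=True)
    return generate(query, ranked[:top_k])
```

Calling `rag_answer("capital of France", toy_retrieve, toy_score, toy_generate)` passes the three highest-scoring documents to the generator; the default `top_k=3` mirrors the inference setting reported for most benchmarks.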

Novel Architectural Elements

Joint optimization loop where reranker receives feedback from generator's task success (answer correctness) rather than fixed relevance labels

Modeling

Base Model: Llama-3-Instruct-8B (Generator) and BGE-Reranker-v2-m3 (Reranker)

Training Method: Cooperative Multi-Agent Reinforcement Learning (via GRPO and Pairwise Ranking)

Objective Functions:

Purpose: Optimize reranker using pairwise preferences derived from estimated task success.

Formally: Margin-based pairwise ranking loss L_rank(θ) = sum(max(0, γ - (S_θ(q, d_i) - S_θ(q, d_j)))) for d_i in D+, d_j in D-

Purpose: Optimize generator to maximize task reward using group relative advantage.

Formally: GRPO objective L_gen(ϕ) = -E[min(ratio * A_hat, clip(ratio, 1-ε, 1+ε) * A_hat)]

Purpose: Estimate document contribution to task success.

Formally: Smoothed Bernoulli parameter p_i = (count(success w/ d_i) + α) / (count(trials w/ d_i) + 2α)
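To make the two loss formulas concrete, here is a minimal pure-Python sketch of the margin-based ranking loss and the clipped GRPO surrogate as written above. The function names are hypothetical, `clip_eps=0.2` is an assumed value (the summary does not report ε), and standardizing rewards by the group mean and standard deviation is the common GRPO formulation rather than something this summary confirms for CoRAG. Real training would compute these on tensors with gradients.

```python
from statistics import mean, pstdev


def margin_ranking_loss(pos_scores, neg_scores, gamma=1.0):
    """Sum of max(0, gamma - (s_pos - s_neg)) over all (D+, D-) score pairs."""
    return sum(
        max(0.0, gamma - (s_pos - s_neg))
        for s_pos in pos_scores
        for s_neg in neg_scores
    )


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled responses.

    ASSUMPTION: advantage = (reward - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def grpo_loss(ratios, advantages, clip_eps=0.2):
    """Negative mean of min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    terms = []
    for ratio, adv in zip(ratios, advantages):
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        terms.append(min(ratio * adv, clipped * adv))
    return -mean(terms)
```

The ranking loss is zero once every preferred document outscores every non-preferred one by at least γ, which is why it asks only for a relaxed ordering rather than an exact total ordering.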

Adaptation: LoRA (Low-Rank Adaptation)

Trainable Parameters: LoRA parameters for both Reranker and Generator

Training Data:

Trained only on ~10K samples from PopQA dataset
Coarse annotations from Llama-3 used to warm-start reranker preference labels

Key Hyperparameters:

reranker_learning_rate: 5e-5
generator_learning_rate: 1e-5
margin_gamma: 1
training_top_k: 1 (to attribute impact)
inference_top_k: 3 (PopQA/TriviaQA/NQ/ASQA), 7 (2WikiMultiHopQA)
generator_temperature: 0.7

Compute: Not reported in the paper

Comparison to Prior Work

vs. RetRobust: Jointly optimizes reranker and generator rather than just making generator robust to noise
vs. InstructRAG: Uses shared task reward to align components rather than independent fine-tuning; trained on single dataset (PopQA) vs dataset-specific training
vs. RAFT [not cited in paper]: CoRAG optimizes the retrieval side (reranker) cooperatively, whereas RAFT focuses on generator adaptation to provided context

Limitations

Underperforms on ASQA dataset due to task discrepancy (multi-answer synthesis vs. factoid training on PopQA)
Success relies on generator capability; if generator ignores context completely, reranker cannot learn
Trade-off: Generator becomes robust to ranking, potentially reducing the gradient signal/necessity for the reranker to be perfect

Reproducibility

Code: https://anonymous.4open.science/r/CoRAG-D63F

The code is publicly available at the anonymized repository linked above. The experiments use standard datasets (PopQA, TriviaQA, NQ). Llama-3 annotations are used for warm-up, but the specific prompt templates for this annotation are not detailed in the main text.

📊 Experiments & Results

Evaluation Setup

Open-domain QA trained on PopQA and evaluated zero-shot on other benchmarks

Benchmarks:

PopQA (Long-tail entity QA)
TriviaQA (Factoid QA)
Natural Questions (NQ) (Open-domain QA)
2WikiMultiHopQA (Multi-hop QA)
ASQA (Ambiguous QA)
HumanEval (Code Generation)
WikiTable Questions (Table QA)

Metrics:

Accuracy (Exact Match)
Pass@1
Pass@10

Statistical methodology: Not explicitly reported in the paper

Key Results

Main comparison showing CoRAG's performance against baselines across multiple QA datasets. Note that CoRAG is trained only on PopQA.

Benchmark	Metric	Baseline	This Paper	Δ
PopQA	Accuracy	66.2	71.2	+5.0
TriviaQA	Accuracy	78.4	81.0	+2.6
Natural Questions (NQ)	Accuracy	59.3	72.4	+13.1
2WikiMultiHopQA	Accuracy	42.7	58.2	+15.5

Ablation studies isolating the impact of joint training vs. individual component training.

Benchmark	Metric	Baseline	This Paper	Δ
PopQA	Accuracy	63.5	71.2	+7.7
PopQA	Accuracy	51.8	71.2	+19.4

Cross-domain generalization results on code and table tasks.

Benchmark	Metric	Baseline	This Paper	Δ
HumanEval+	Pass@1	56.1	62.2	+6.1

Experiment Figures

Performance trends (Accuracy) on PopQA and NQ as the number of retrieved documents (Top-1, Top-3, Top-5) increases.

Main Takeaways

CoRAG achieves state-of-the-art results on 4/5 QA datasets while being trained on only one (PopQA), demonstrating exceptional generalization.
Joint optimization is crucial: CoRAG outperforms independently trained reranker (Rtrain) and generator (Gtrain) variants.
The generator component contributes more to performance gains than the reranker, as shown by swap experiments (RGReplace vs GReplace).
Performance remains robust as the number of retrieved documents increases (Top-5), unlike baselines like InstructRAG which degrade due to noise.
Generalizes effectively to non-QA tasks like Code Generation (HumanEval) and Table QA, suggesting the generator learns robust information extraction.

📚 Prerequisite Knowledge

Prerequisites

Retrieval-Augmented Generation (RAG) architecture
Reinforcement Learning (specifically GRPO/PPO concepts)
Learning to Rank (pairwise preferences)

Key Terms

GRPO: Group Relative Policy Optimization—an RL algorithm that optimizes policies based on the relative advantage of an output compared to others in a group

Reranker: A module that re-orders retrieved documents to place the most relevant ones at the top for the generator

Asymmetric dependency: The standard RAG paradigm where generation quality depends strictly on the reranker's output, but the reranker is not optimized for generation success

Stochastic preference labels: Probabilistic labels generated from task success rates used to train the reranker when ground-truth ranking is unavailable

LoRA: Low-Rank Adaptation, a parameter-efficient fine-tuning technique that freezes the pretrained weights and trains small low-rank matrices added alongside them

PopQA: A question answering dataset focusing on long-tail entities, used here as the sole training source

Pass@k: A code generation metric measuring the probability that at least one of k generated code samples passes unit tests
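For reference, Pass@k is commonly computed with the unbiased estimator 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them pass. The sketch below implements that estimator; whether the paper uses exactly this estimator is not stated in this summary, and the function name is hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k samples passes) from n samples with c passing."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For k = 1 the estimator reduces to the plain pass fraction c / n; the per-problem values are then averaged over the benchmark.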