RAG-DDR: Optimizing Retrieval-Augmented Generation Using Differentiable Data Rewards

📝 Paper Summary

Modularized RAG pipeline, Agentic RAG pipeline

DDR (Differentiable Data Rewards) optimizes RAG systems end-to-end by propagating rewards from the final output back to each module, aligning their data preferences to reduce knowledge conflicts.

Core Problem

Current RAG optimization methods (like SFT) train modules independently or overfit to training signals, failing to align the data preferences between the retrieval/refinement module and the generation module.

Why it matters:

Misalignment leads to the generator ignoring relevant retrieved context or being misled by noise.
Independent optimization overlooks that the generator often faces knowledge conflicts between parametric memory and external evidence.
SFT approaches can cause catastrophic forgetting and do not account for how downstream agents actually utilize the data provided by upstream agents.

Concrete Example: In a standard pipeline, a retriever might provide documents that look relevant but contain subtle conflicts with the generator's internal knowledge. An SFT-trained generator might hallucinate or ignore these documents. DDR trains the generator to signal which documents actually help it answer correctly, then updates the refinement module to prioritize those specific documents.

Key Novelty

Differentiable Data Rewards (DDR) for End-to-End RAG Alignment

Uses a rollout method to collect rewards from the entire RAG system's final output and back-propagates them to optimize individual agents (modules).
Employs Direct Preference Optimization (DPO) to align the data preferences of the Knowledge Refinement module with the Generation module, ensuring retrieved data is actually useful for generation.
Iteratively optimizes agents: first the generator learns to use data effectively, then the refinement module learns to select data that maximizes the generator's performance.

Architecture

Overview of the RAG-DDR training process involving data propagation and differentiable data rewards.

Evaluation Highlights

Outperforms RA-DIT (SFT-based method) by +3.54 EM on Natural Questions and +2.85 EM on TriviaQA using Llama-2-7B.
Achieves higher performance with a smaller model (Llama-2-7B) than larger baselines (Llama-2-13B) on knowledge-intensive tasks, indicating better parameter efficiency.
Reduces the average response length on PubHealth by ~35% compared to vanilla RAG while maintaining higher accuracy, indicating more concise and precise generation.

Breakthrough Assessment

7/10

Solid methodological improvement for multi-agent RAG alignment using DPO. While the core idea of end-to-end training exists, applying DPO via rollout for module alignment is a strong, practical contribution.

⚙️ Technical Details

Problem Definition

Setting: Open-domain Question Answering (QA) and Fact Verification

Inputs: Query q and a set of retrieved documents D

Outputs: Final response y_T

Pipeline Flow

Input Query -> Retrieval (Contriever) -> Knowledge Refinement Module (Filter Documents) -> Generation Module -> Final Answer

System Modules

Retriever

Retrieve top-k candidate documents from a corpus based on the query

Model or implementation: Contriever-MSMARCO

Knowledge Refinement Module (V_KR)

Classify each retrieved document as relevant ('YES') or irrelevant ('NO') to construct a refined context

Model or implementation: Llama-2-7B / Llama-2-13B (Optimized via DDR)

Generation Module (V_Gen)

Generate the final answer using the query and the refined documents

Model or implementation: Llama-2-7B / Llama-2-13B (Optimized via DDR)
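
The modules above compose into a simple three-stage pipeline. The following is a minimal sketch of that flow, not the authors' implementation: `run_pipeline` and the three toy callables are hypothetical stand-ins for Contriever, the refinement model, and the generator, and the fallback answer for an empty context is an assumption.

```python
from typing import Callable, List


def run_pipeline(
    query: str,
    retrieve: Callable[[str], List[str]],
    judge: Callable[[str, str], str],
    generate: Callable[[str, List[str]], str],
) -> str:
    """Retrieval -> knowledge refinement -> generation."""
    candidates = retrieve(query)
    # Knowledge refinement: keep only documents the judge labels 'YES'.
    refined = [doc for doc in candidates if judge(query, doc) == "YES"]
    return generate(query, refined)


# Toy stand-ins (hypothetical; real modules would be neural models).
corpus = [
    "Paris is the capital of France.",
    "Bananas are rich in potassium.",
]


def toy_retrieve(query: str) -> List[str]:
    return corpus


def toy_judge(query: str, doc: str) -> str:
    # Crude relevance test: at least two shared lowercase tokens.
    q_tokens = set(query.lower().rstrip("?").split())
    d_tokens = set(doc.lower().rstrip(".").split())
    return "YES" if len(q_tokens & d_tokens) >= 2 else "NO"


def toy_generate(query: str, docs: List[str]) -> str:
    # Assumed fallback when refinement removes every document.
    return docs[0] if docs else "unknown"


answer = run_pipeline(
    "What is the capital of France?", toy_retrieve, toy_judge, toy_generate
)
```

Passing the modules as callables keeps the sketch independent of any particular model library, so each stage can be swapped without touching the flow.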

Novel Architectural Elements

Iterative bi-level optimization where the Generation Module is trained first to establish preference signals, followed by the Knowledge Refinement Module trained via rollout feedback from the updated Generator.
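
One way to picture the rollout feedback is to score each candidate output of a module by running the rest of the system and comparing the final answer with the gold answer. The sketch below turns those rewards into a DPO preference pair. It is illustrative only: `build_preference_pair`, the toy generator, and the choices to take the highest and lowest reward and to discard ties are assumptions, not details confirmed by the summary.

```python
from typing import Callable, List, Optional, Tuple


def exact_match(pred: str, gold: str) -> float:
    return float(pred.strip().lower() == gold.strip().lower())


def build_preference_pair(
    candidates: List[str],
    rollout_reward: Callable[[str], float],
) -> Optional[Tuple[str, str]]:
    """Return (chosen, rejected), or None when all rewards are equal."""
    scored = [(rollout_reward(c), c) for c in candidates]
    best = max(scored, key=lambda s: s[0])
    worst = min(scored, key=lambda s: s[0])
    if best[0] == worst[0]:
        return None  # assumed: ties carry no preference signal
    return best[1], worst[1]


# Toy rollout for the refinement module's YES/NO decision on one document.
doc = "Paris is the capital of France."
gold = "Paris"


def toy_generator(context: List[str]) -> str:
    # Hypothetical generator: correct only when the evidence is in context.
    return "Paris" if any("Paris" in d for d in context) else "Lyon"


def reward_for_decision(decision: str) -> float:
    context = [doc] if decision == "YES" else []
    return exact_match(toy_generator(context), gold)


pair = build_preference_pair(["YES", "NO"], reward_for_decision)
```

Here keeping the document lets the toy generator answer correctly, so "YES" becomes the chosen decision and "NO" the rejected one.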

Modeling

Base Model: Llama-2-7B and Llama-2-13B

Training Method: Differentiable Data Rewards (DDR) using DPO

Objective Functions:

Purpose: Optimize the agent to align with system preferences.

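Formally: $\mathcal{L}_{\mathrm{DPO}} = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]$

Here y_w and y_l are the preferred and dispreferred outputs for input x, π_θ is the policy being trained, π_ref is the frozen reference policy, σ is the logistic sigmoid, and β scales the preference margin. The preference dataset $\mathcal{D}$ is distinct from the retrieved document set D in the problem definition.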
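
For a single preference pair, the DPO objective reduces to a softplus of the scaled log-ratio margin. The sketch below evaluates it from sequence log-probabilities. The function and argument names are hypothetical, and the default `beta=0.1` is taken from the hyperparameters listed in this summary.

```python
import math


def dpo_loss(
    policy_logp_w: float,
    policy_logp_l: float,
    ref_logp_w: float,
    ref_logp_l: float,
    beta: float = 0.1,
) -> float:
    """-log sigmoid(beta * (log-ratio of chosen - log-ratio of rejected))."""
    margin = beta * (
        (policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l)
    )
    # -log(sigmoid(m)) = log(1 + exp(-m)); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: margin 0, loss log(2).
neutral = dpo_loss(-5.0, -5.0, -5.0, -5.0)
# Policy raises the chosen output and lowers the rejected one: loss drops.
improved = dpo_loss(-4.0, -6.0, -5.0, -5.0)
```

The loss starts at log 2 when the policy equals the reference and falls as the policy assigns relatively more probability to the preferred output.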

Trainable Parameters: LoRA adapters (r=8, alpha=16)
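
To give a sense of scale for these LoRA settings, the arithmetic below counts the adapter parameters for one weight matrix and computes the standard LoRA scaling factor alpha / r. It is a back-of-the-envelope sketch: the summary does not say which matrices receive adapters, so a single square 4096 x 4096 projection (the Llama-2-7B hidden size) is used as an illustrative assumption, and the helper names are hypothetical.

```python
def lora_trainable_params(d_in: int, d_out: int, r: int) -> int:
    # LoRA adds two low-rank factors: A (r x d_in) and B (d_out x r).
    return r * (d_in + d_out)


def lora_scaling(alpha: float, r: int) -> float:
    # The low-rank update B @ A is multiplied by alpha / r.
    return alpha / r


hidden = 4096  # Llama-2-7B hidden size; square projection assumed
adapter_params = lora_trainable_params(hidden, hidden, r=8)  # 65536
full_params = hidden * hidden                                # 16777216
fraction = adapter_params / full_params                      # 0.00390625
scaling = lora_scaling(alpha=16, r=8)                        # 2.0
```

Under these assumptions each adapted matrix trains about 0.4% of the parameters it would take to update the full weight.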

Training Data:

Natural Questions (NQ): 79k train / 8k dev / 3k test
TriviaQA: 78k train / 8k dev / 11k test
PubHealth: 9k train / 1k dev / 1k test
Arc-C: 1k train / 200 dev / 1k test

Key Hyperparameters:

learning_rate: 1e-4 (Generation), 1e-5 (Refinement)
batch_size: 64 (Generation), 32 (Refinement)
beta (DPO): 0.1
LoRA_r: 8
LoRA_alpha: 16

Compute: 8 NVIDIA A800-80G GPUs

Comparison to Prior Work

vs. RA-DIT: DDR uses DPO and rollout rewards for end-to-end alignment rather than separate SFT objectives.
vs. Self-RAG: DDR aligns modules via latent rewards without requiring extensive manual annotation of reflection tokens or special instruction data.
vs. REPLUG [not cited in paper]: REPLUG treats retrieval as a latent variable optimized via perplexity; DDR optimizes a refinement module via discrete rewards (EM/F1) and DPO.

Limitations

Computational cost of the rollout method is high due to repeated inference during training.
The method currently focuses on a two-agent system (Refinement + Generation) and fixes the initial dense retriever.
Experiments are limited to Llama-2 models; effectiveness on very large or proprietary models is not tested.

Reproducibility

Code: https://github.com/OpenMatch/RAG-DDR

The paper documents datasets and hyperparameters in detail.

📊 Experiments & Results

Evaluation Setup

Open-domain QA and Fact Verification using Wikipedia dumps (Dec 2018)

Benchmarks:

Natural Questions (NQ) (Open-domain QA)
TriviaQA (Open-domain QA)
PubHealth (Fact Verification)
Arc-C (Multiple Choice QA)

Metrics:

Exact Match (EM)
Accuracy
F1 score
Statistical methodology: Not explicitly reported in the paper
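
Exact Match and token-level F1 are commonly computed after light answer normalization. The sketch below follows the widely used SQuAD-style convention (lowercase, strip punctuation and English articles). Whether the paper applies exactly this normalization is an assumption, and the function names are illustrative.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(pred: str, gold: str) -> float:
    return float(normalize(pred) == normalize(gold))


def token_f1(pred: str, gold: str) -> float:
    pred_tokens = normalize(pred).split()
    gold_tokens = normalize(gold).split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


em = exact_match("The Eiffel Tower.", "eiffel tower")      # 1.0
f1 = token_f1("Eiffel Tower in Paris", "Eiffel Tower")     # 2/3
```

EM rewards only a full normalized match, while F1 gives partial credit: the second prediction contains the gold answer plus two extra tokens, so recall is 1 and precision is 0.5.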

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Main comparison on Open-Domain QA tasks showing improvements over SFT and baselines.
Natural Questions	EM	46.22	49.76	+3.54
TriviaQA	EM	69.09	71.94	+2.85
Natural Questions	EM	48.23	49.76	+1.53
Results on Fact Verification tasks demonstrating robustness.
PubHealth	Accuracy	65.35	73.18	+7.83
Ablation study analyzing the contribution of optimizing each module.
Natural Questions	EM	45.04	49.76	+4.72
Natural Questions	EM	49.56	49.76	+0.20

Experiment Figures

Impact of different document quantities (top-k) on Natural Questions performance.

Performance comparison when injecting noise documents (Golden vs. Golden + Noise).

Main Takeaways

DDR consistently outperforms SFT-based methods (RA-DIT) and specialized architectures (Self-RAG) across multiple benchmarks.
The generation module's optimization contributes most to the performance gains, indicating that teaching the LLM to use retrieved data effectively is more critical than just refining the data.
DDR handles noisy retrieval contexts effectively: in experiments that add the documents ranked top-20 to top-30 as noise, DDR maintains performance better than the baselines.
Qualitative analysis shows DDR generates more concise answers by effectively filtering irrelevant information.

📚 Prerequisite Knowledge

Prerequisites

Retrieval-Augmented Generation (RAG)
Reinforcement Learning (RL) concepts (Reward, Rollout)
Direct Preference Optimization (DPO)

Key Terms

DPO: Direct Preference Optimization—a method to fine-tune language models to human or system preferences without an explicit reward model, using a binary cross-entropy objective on preference pairs.

Rollout: A technique where an agent simulates future steps (interactions with subsequent agents) to estimate the long-term reward of a current action.

SFT: Supervised Fine-Tuning—training a model on labeled examples using standard maximum likelihood estimation.

Knowledge Refinement Module: A module that filters or selects a subset of retrieved documents to remove noise before passing them to the generator.

Parametric Memory: Knowledge stored within the weights of the LLM itself, acquired during pre-training.

Knowledge Conflict: A situation where the information in retrieved documents contradicts the LLM's internal parametric memory.

EM: Exact Match—an evaluation metric that measures the percentage of predictions that match the ground truth answer exactly.