RPO: Retrieval Preference Optimization for Robust Retrieval-Augmented Generation

📝 Paper Summary

Modularized RAG pipeline

RPO aligns language models to implicitly evaluate retrieval relevance during generation by incorporating retrieval quality into the preference optimization reward function, avoiding expensive external evaluation steps.

Core Problem

RAG systems often over-rely on retrieved context even when it is irrelevant or incorrect, leading to hallucinations and knowledge conflicts with the model's internal memory.

Why it matters:

Standard DPO (Direct Preference Optimization) forces models to prefer either parametric or non-parametric knowledge globally, rather than adapting based on specific retrieval quality
Existing 'adaptive RAG' solutions require computationally expensive pre-evaluation or post-evaluation steps (multiple LLM calls) to assess retrieval quality
Fabricating parametric answers for standard DPO training creates a distribution shift that hinders model convergence and performance

Concrete Example: If a user asks a question where the model knows the correct answer but the retriever fetches an incorrect document, standard RAG might blindly trust the document. Conversely, if the model hallucinates but the retrieval is correct, it might still ignore the retrieval. RPO teaches the model to discern which source (internal vs. external) is correct for the specific instance.

Key Novelty

Retrieval Preference Optimization (RPO)

Modifies the preference optimization objective to include a retrieval-awareness term in the reward model, rather than treating retrieval as a fixed input
Constructs training pairs by intentionally inducing knowledge conflicts: generating answers with and without retrieval, then labeling the preferred answer based on factual accuracy
Integrates retrieval evaluation directly into the generation process (implicit evaluation), eliminating the need for separate classifier modules or multi-step inference

Architecture

Comparison of RPO against standard RAG and Adaptive RAG pipelines

Evaluation Highlights

Outperforms standard RAG by 6.4 to 10.6 accuracy points on PopQA, NQ, and TriviaQA using LLaMA3-8B-Instruct
Surpasses adaptive RAG baselines (Self-RAG, CRAG) while maintaining the inference speed of standard RAG (single generation pass)
Achieves a 23.3-point accuracy improvement over RAG on Natural Questions (NQ) when using LLaMA2-hf-7b

Breakthrough Assessment

7/10

Strong performance gains and theoretical grounding for why DPO fails in RAG. The method offers a significant efficiency advantage over existing adaptive RAG methods by removing inference-time overhead.

⚙️ Technical Details

Problem Definition

Setting: Open-domain question answering with potential knowledge conflicts between parametric memory and retrieved context

Inputs: Natural language question x and retrieved documents D_r

Outputs: Generated answer y

Pipeline Flow

Retriever (fetches documents)
Generator (LLM aligned with RPO produces answer)
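
The two-stage flow above can be sketched as a single retrieval call followed by exactly one generator call, which is what keeps inference cost equal to standard RAG. This is a minimal illustration, not the authors' code: the function name answer_with_rag, the retrieve and generate callables, the top_k parameter, and the prompt template are all hypothetical.

```python
from typing import Callable, List


def answer_with_rag(
    question: str,
    retrieve: Callable[[str], List[str]],   # hypothetical retriever wrapper
    generate: Callable[[str], str],         # hypothetical wrapper around the aligned LLM
    top_k: int = 5,                         # assumed cutoff, not taken from the paper
) -> str:
    """Single-pass RAG: one retrieval call, then exactly one generator call."""
    docs = retrieve(question)[:top_k]
    # Number the documents so the prompt makes the retrieved context explicit.
    context = "\n".join(f"[{i + 1}] {doc}" for i, doc in enumerate(docs))
    prompt = f"Documents:\n{context}\n\nQuestion: {question}\nAnswer:"
    # No separate relevance classifier or second LLM call: any weighing of
    # internal vs. external knowledge is left to the generator's weights.
    return generate(prompt)
```

Because the relevance judgment is meant to be implicit in the generator, the sketch deliberately contains no evaluator step between retrieval and generation.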

System Modules

Retriever

Fetch relevant documents for the query

Model or implementation: Contriever-MSMARCO

Generator

Generate answer by implicitly weighing internal vs. external knowledge

Model or implementation: LLaMA2-7B or LLaMA3-8B-Instruct (fine-tuned via RPO)

Novel Architectural Elements

Implicit retrieval evaluation mechanism embedded within the generator's weights via a modified reward function, removing the need for explicit evaluator modules during inference

Modeling

Base Model: LLaMA2-7B and LLaMA3-8B-Instruct

Training Method: Retrieval Preference Optimization (RPO)

Objective Functions:

Purpose: Maximize the margin between preferred and dispreferred answers while accounting for retrieval relevance.

Formally: Modified DPO loss including a retrieval-aware reward term derived from the probability ratio of answers with vs. without retrieval.
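
For reference, the standard DPO loss that RPO starts from can be written in a few lines. The sketch below implements only the baseline DPO objective for one preference pair; the retrieval-aware reward term that RPO adds is not spelled out in this summary, so it is intentionally not reproduced. The function name dpo_loss and its argument names are illustrative, and the inputs are assumed to be summed sequence log-probabilities.

```python
import math


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for a single (chosen, rejected) pair.

    This is the baseline objective only, not the RPO loss.
    """
    # Implicit rewards: beta times the log-ratio of policy to reference model.
    reward_chosen = beta * (policy_logp_chosen - ref_logp_chosen)
    reward_rejected = beta * (policy_logp_rejected - ref_logp_rejected)
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)) == softplus(-margin), computed in a stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the policy equals the reference model the margin is zero and the loss is log 2; raising the chosen answer's log-probability relative to the rejected one drives the loss toward zero.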

Training Data:

Subset 1: Instances where model fails without retrieval but succeeds with it (promotes reading retrieval)
Subset 2: Instances where model succeeds without retrieval but fails with it (promotes ignoring bad retrieval)
Total training set combines these to simulate knowledge conflict
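
The two subsets can be sketched as a simple labeling rule over a pair of answers generated with and without retrieval. This is a rough illustration under stated assumptions, not the paper's Algorithm 1: the names PreferencePair, is_correct, and build_pair are hypothetical, correctness is approximated by case-insensitive containment of a gold answer, and discarding instances where both answers are right or both are wrong is an assumption made here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PreferencePair:
    question: str
    chosen: str      # factually correct answer
    rejected: str    # factually incorrect answer
    subset: int      # 1 = retrieval helped, 2 = retrieval hurt


def is_correct(answer: str, gold_answers: List[str]) -> bool:
    """Assumed correctness check: any gold answer appears in the output."""
    lowered = answer.lower()
    return any(gold.lower() in lowered for gold in gold_answers)


def build_pair(
    question: str,
    answer_with_retrieval: str,
    answer_without_retrieval: str,
    gold_answers: List[str],
) -> Optional[PreferencePair]:
    """Label one instance; return None when there is no knowledge conflict."""
    with_ok = is_correct(answer_with_retrieval, gold_answers)
    without_ok = is_correct(answer_without_retrieval, gold_answers)
    if with_ok and not without_ok:
        # Subset 1: the model needs the retrieved context.
        return PreferencePair(question, answer_with_retrieval, answer_without_retrieval, 1)
    if without_ok and not with_ok:
        # Subset 2: the retrieved context misleads the model.
        return PreferencePair(question, answer_without_retrieval, answer_with_retrieval, 2)
    # Both correct or both wrong: no preference signal (assumption of this sketch).
    return None
```

Using the model's own outputs for both sides of each pair, rather than fabricated answers, is consistent with the distribution-shift concern raised in the Core Problem section.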

Key Hyperparameters:

learning_rate: 5e-7 (RPO phase)
beta: 0.1 (DPO/RPO temperature)
batch_size: 64 (RPO phase)
epochs: 2

Compute: Training performed on 4 NVIDIA A800-80G GPUs

Comparison to Prior Work

vs. Self-RAG/CRAG: RPO requires no extra inference steps (evaluator calls/web search) or special tokens, resulting in lower latency
vs. Standard DPO: RPO modifies the reward formulation to account for retrieval explicitly, whereas standard DPO assumes a static context
vs. TOG [not cited in paper]: TOG optimizes reasoning paths on knowledge graphs, whereas RPO focuses on the preference alignment of the generator in text-based RAG

Limitations

Depends on the availability of a labeled dataset where ground truth is known to construct preference pairs
Relies on the initial SFT model having some capability to distinguish correct answers to generate the training data
Only evaluated on short-form QA tasks (PopQA, NQ, TriviaQA, RGB), not long-form generation

Reproducibility

The paper text does not explicitly state whether code is available. It describes the data collection and filtering logic in detail (Algorithm 1). Baselines such as Self-RAG and CRAG are cited and appear to use their standard implementations.

📊 Experiments & Results

Evaluation Setup

Open-domain QA using standard benchmarks

Benchmarks:

PopQA (Long-tail Entity QA)
Natural Questions (NQ) (Open-domain QA)
TriviaQA (Complex QA)
RGB (Robustness analysis: Noise / Negative Rejection)

Metrics:

Accuracy (Exact Match or containment of ground truth)
Statistical methodology: Not explicitly reported in the paper
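
The two accuracy variants named above (exact match and containment of the ground truth) can be illustrated as follows. The normalization steps (lowercasing, stripping punctuation and English articles) follow common open-domain QA practice and are an assumption here; the summary does not state the paper's exact normalization. All function names are illustrative.

```python
import re
import string
from typing import Callable, List


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, golds: List[str]) -> bool:
    """True if the normalized prediction equals any normalized gold answer."""
    pred = normalize(prediction)
    return any(pred == normalize(g) for g in golds)


def contains_answer(prediction: str, golds: List[str]) -> bool:
    """True if any non-empty normalized gold answer occurs in the prediction."""
    pred = normalize(prediction)
    return any(normalize(g) and normalize(g) in pred for g in golds)


def accuracy(
    predictions: List[str],
    golds_per_question: List[List[str]],
    scorer: Callable[[str, List[str]], bool] = contains_answer,
) -> float:
    """Percentage of questions for which the scorer accepts the prediction."""
    if not predictions or len(predictions) != len(golds_per_question):
        raise ValueError("need equal-length, non-empty predictions and golds")
    hits = sum(scorer(p, g) for p, g in zip(predictions, golds_per_question))
    return 100.0 * hits / len(predictions)
```

Containment is the more lenient of the two: a full-sentence answer that mentions the gold string counts as correct, whereas exact match requires the normalized strings to be identical.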

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Performance comparison using LLaMA3-8B-Instruct as the base model.
PopQA	Accuracy	57.4	63.8	+6.4
Natural Questions (NQ)	Accuracy	47.7	58.3	+10.6
TriviaQA	Accuracy	78.4	87.0	+8.6
Performance comparison using LLaMA2-hf-7b as the base model.
Natural Questions (NQ)	Accuracy	29.3	52.6	+23.3
RGB	Accuracy	48.2	53.9	+5.7
Efficiency comparison showing RPO matches standard RAG speed.
Inference Efficiency	LLM Calls	Multiple (Adaptive)	1	Reduced

Main Takeaways

RPO consistently outperforms standard RAG and adaptive RAG baselines (Self-RAG, CRAG) across multiple datasets and model sizes
The method is significantly more efficient than previous adaptive approaches because it integrates evaluation into the generation weights rather than requiring separate inference steps
Ablation studies confirm that both the specific data construction strategy (simulating conflict) and the modified RPO loss function contribute to the performance gains

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning from Human Feedback (RLHF)
Direct Preference Optimization (DPO)
Retrieval-Augmented Generation (RAG)

Key Terms

RAG: Retrieval-Augmented Generation—systems that fetch external documents to help an LLM answer questions

Parametric Knowledge: Knowledge stored within the LLM's pre-trained weights (internal memory)

Non-parametric Knowledge: Knowledge provided explicitly in the input context (retrieved documents)

DPO: Direct Preference Optimization—an alignment algorithm that optimizes a policy to satisfy preferences without an explicit reward model

SFT: Supervised Fine-Tuning—training a model on labeled examples before applying preference optimization

PPO: Proximal Policy Optimization—an RL algorithm used in RLHF, often replaced by DPO for efficiency

Knowledge Conflict: Situations where the model's internal knowledge contradicts the information provided in the retrieved documents

Partition Function: A normalizing constant in probability distributions (often intractable to calculate) that appears in the DPO derivation
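
To make the last term concrete, the step of the standard DPO derivation in which the partition function appears and then cancels is sketched below. This is the textbook DPO argument, not the RPO-specific derivation; the symbols \pi^{*}, \pi_{\mathrm{ref}}, r, \beta, Z(x), y_w (preferred answer), and y_l (dispreferred answer) follow the usual DPO notation and are assumed rather than taken from this summary.

```latex
% Optimal policy of the KL-regularized reward maximization problem
\pi^{*}(y \mid x) = \frac{1}{Z(x)}\,\pi_{\mathrm{ref}}(y \mid x)\,
    \exp\!\left(\frac{r(x, y)}{\beta}\right),
\qquad
Z(x) = \sum_{y} \pi_{\mathrm{ref}}(y \mid x)\,
    \exp\!\left(\frac{r(x, y)}{\beta}\right)

% Solving for the reward introduces the intractable term \beta \log Z(x)
r(x, y) = \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
    + \beta \log Z(x)

% In the Bradley-Terry reward difference, Z(x) depends only on x and cancels
r(x, y_w) - r(x, y_l)
  = \beta \log \frac{\pi^{*}(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
  - \beta \log \frac{\pi^{*}(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
```

The cancellation relies on both answers sharing the same conditioning input x, which is the static-context assumption that the comparison with standard DPO above refers to.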