Context-DPO: Aligning Language Models for Context-Faithfulness

📝 Paper Summary

Modularized RAG pipeline Answer generation

Context-DPO improves how strictly LLMs adhere to retrieved context by training them on preference pairs that favor responses reasoned from the provided (potentially counterfactual) context over responses drawn from the model's internal memory.

Core Problem

LLMs often ignore retrieved context when it conflicts with their internal parametric knowledge, or they fail to follow retrieval instructions strictly.

Why it matters:

RAG systems fail when models 'stubbornly' rely on outdated or incorrect internal memory instead of new retrieved evidence
Existing solutions rely on fragile prompt engineering or decoding hacks rather than fundamentally aligning the model's behavior
As models become larger and more capable, they paradoxically become less faithful to context due to higher confidence in their own training data

Concrete Example: If a user provides a retrieved document stating 'The capital of France is Mars' (a counterfactual scenario), a standard LLM will often ignore this and answer 'Paris' based on its internal knowledge, failing the user's implicit instruction to use the provided source.

Key Novelty

Context-Faithful Direct Preference Optimization (Context-DPO)

Constructs a dataset of 'faithful' responses (derived from counterfactual context) and 'stubborn' responses (derived from internal factual memory)
Uses Direct Preference Optimization (DPO) to explicitly align the model to prefer the 'faithful' response over the 'stubborn' one given the context
Introduces ConFiQA, a benchmark for measuring this faithfulness using controlled knowledge conflicts

Architecture

The Context-DPO framework showing data construction and the DPO training process.

Evaluation Highlights

+280% relative improvement in context-faithfulness (Pc) for Qwen2-7B-instruct on ConFiQA compared to the base model
+151% relative improvement in Pc for Mistral-7B-instruct-v0.2 on ConFiQA compared to the base model
Maintains factual generation capabilities (TruthfulQA) within 1% of the original model, showing that alignment does not degrade general knowledge
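
The percentage figures above read most naturally as relative gains over the base model's Pc rather than absolute point differences. A minimal sketch of that arithmetic, assuming the 18.6 → 70.7 and 24.9 → 62.7 rows of the ConFiQA results table belong to Qwen2-7B-instruct and Mistral-7B-instruct-v0.2 respectively (the table does not label its rows, and `relative_gain_percent` is a hypothetical helper name):

```python
def relative_gain_percent(baseline: float, improved: float) -> float:
    """Relative improvement of `improved` over `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline * 100.0


# Pc values come from the results table; the row-to-model mapping is an assumption.
qwen_gain = relative_gain_percent(18.6, 70.7)
mistral_gain = relative_gain_percent(24.9, 62.7)

print(f"Qwen2-7B-instruct (assumed row): {qwen_gain:.1f}%")
print(f"Mistral-7B-instruct-v0.2 (assumed row): {mistral_gain:.1f}%")
```

Under that assumption the computed gains land close to the quoted +280% and +151%; any small gap presumably reflects rounding of the tabulated scores.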

Breakthrough Assessment

7/10

Strong empirical results on a specific, critical RAG failure mode. The method is straightforward and effective, though its scope is largely limited to conflicting-context scenarios.

⚙️ Technical Details

Problem Definition

Setting: Retrieval-Augmented Generation where retrieved context C may conflict with parametric knowledge within the model

Inputs: Context C (potentially counterfactual) and Question Q

Outputs: Response A that is faithful to C

Pipeline Flow

Input: Question + Context
Context-DPO Model (Fine-tuned via DPO)
Output: Faithful Response

System Modules

Context-DPO Model

Generate answer faithful to context

Model or implementation: Llama-2/3, Mistral, Qwen2 (various sizes)

Modeling

Base Model: Llama2-7B-chat, Llama2-13B-chat, Mistral-7B-instruct-v0.2, Qwen2-7B-instruct, Llama3-8B

Training Method: Direct Preference Optimization (DPO)

Objective Functions:

Purpose: Optimize policy to assign higher probability to faithful responses over stubborn ones.

Formally: L_DPO(π_θ; π_ref) = -E_{(x, y_w, y_l) ~ D} [log σ( β * log(π_θ(y_w|x)/π_ref(y_w|x)) - β * log(π_θ(y_l|x)/π_ref(y_l|x)) )]
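
To make the objective concrete, here is a minimal pure-Python sketch of the per-example DPO loss, operating on summed sequence log-probabilities. It is an illustration only, not the authors' training code: the function names are hypothetical, and `beta=0.1` is a placeholder because the paper does not report its value.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (x, y_w, y_l) triple.

    Each argument is log pi(y|x) summed over the response tokens.
    beta=0.1 is an assumed placeholder, not a value from the paper.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp        # log pi_theta(y_w|x) / pi_ref(y_w|x)
    rejected_logratio = policy_rejected_logp - ref_rejected_logp  # log pi_theta(y_l|x) / pi_ref(y_l|x)
    margin = beta * chosen_logratio - beta * rejected_logratio
    return -log_sigmoid(margin)


def mean_dpo_loss(batch, beta: float = 0.1) -> float:
    """Average the loss over a batch of 4-tuples of log-probabilities."""
    return sum(dpo_loss(*row, beta=beta) for row in batch) / len(batch)
```

When the policy equals the reference model the margin is zero and the loss is log 2; it falls as the policy raises the faithful response's likelihood relative to the stubborn one.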

Adaptation: Full fine-tuning (implied by lack of LoRA mention, but not explicitly specified)

Training Data:

Uses ConFiQA dataset for preference pairs
Faithful response (y_w): Generated via reasoning chain based on counterfactual context
Stubborn response (y_l): Generated via reasoning chain based on factual reality (internal memory)
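
A preference pair of this kind can be sketched as a simple record. The prompt template, the `PreferencePair` class, and the `build_pair` helper below are hypothetical illustrations; the paper's actual responses are reasoning chains rather than the short answers used here.

```python
from dataclasses import dataclass


@dataclass
class PreferencePair:
    prompt: str    # x: context plus question
    chosen: str    # y_w: faithful response, follows the (counterfactual) context
    rejected: str  # y_l: stubborn response, follows real-world memory


def build_pair(context: str, question: str,
               faithful_answer: str, stubborn_answer: str) -> PreferencePair:
    """Assemble one DPO training example (prompt template is an assumption)."""
    if faithful_answer.strip().lower() == stubborn_answer.strip().lower():
        # Without a knowledge conflict there is nothing for the model to prefer.
        raise ValueError("faithful and stubborn answers must differ")
    prompt = f"Context: {context}\nQuestion: {question}\nAnswer:"
    return PreferencePair(prompt=prompt, chosen=faithful_answer, rejected=stubborn_answer)


pair = build_pair(
    context="The capital of France is Mars.",
    question="What is the capital of France?",
    faithful_answer="According to the context, the capital of France is Mars.",
    stubborn_answer="The capital of France is Paris.",
)
```

The `prompt`, `chosen`, and `rejected` field names mirror a common convention for preference datasets, but nothing in the summary confirms the format the authors used.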

Key Hyperparameters:

beta: Not explicitly reported in the paper
learning_rate: Not explicitly reported in the paper

Compute: Not reported in the paper

Comparison to Prior Work

vs. Attr/O&I: Context-DPO modifies model weights via alignment rather than relying on inference-time prompting
vs. SFT: Context-DPO uses negative samples (stubborn responses) to explicitly penalize ignoring context, whereas SFT only reinforces positive examples
vs. CAD (Context-Aware Decoding) [not cited in paper]: Context-DPO is a training method, whereas CAD modifies the decoding objective at inference time to amplify context influence
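
For contrast with the training-time approach, the inference-time adjustment made by CAD can be sketched as follows. This assumes the commonly described formulation in which next-token logits computed with the context are amplified and logits computed without it are subtracted; the toy vocabulary, logit values, and `alpha=0.5` are illustrative assumptions.

```python
import math


def softmax(logits):
    """Convert logits to probabilities (max-shifted for stability)."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cad_distribution(logits_with_context, logits_without_context, alpha=0.5):
    """Contrastive adjustment: (1 + alpha) * with_context - alpha * without_context."""
    if len(logits_with_context) != len(logits_without_context):
        raise ValueError("logit vectors must cover the same vocabulary")
    adjusted = [(1 + alpha) * wc - alpha * nc
                for wc, nc in zip(logits_with_context, logits_without_context)]
    return softmax(adjusted)


# Toy two-token vocabulary: index 0 = "Paris", index 1 = "Mars".
with_ctx = [1.0, 1.2]     # the context nudges the model toward "Mars"
without_ctx = [3.0, 0.0]  # parametric memory strongly prefers "Paris"

plain = softmax(with_ctx)
contrastive = cad_distribution(with_ctx, without_ctx, alpha=0.5)
```

The model weights are untouched here; only the decoding distribution changes, whereas Context-DPO bakes the preference into the weights during training.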

Limitations

Evaluation primarily focuses on counterfactual scenarios, which may not perfectly reflect all real-world RAG nuances (e.g., subtle ambiguity)
Trade-off between context-faithfulness and hallucination risk on clean inputs is evaluated but remains a delicate balance
Requires constructing specific preference pairs (faithful vs. stubborn) which relies on having ground truth knowledge to generate the 'stubborn' negative

Reproducibility

Code: https://github.com/byronBBL/Context-DPO

Code and data are released at the repository above. Hyperparameters such as the learning rate and beta are not reported in the paper text.

📊 Experiments & Results

Evaluation Setup

Question Answering with retrieved context containing knowledge conflicts (counterfactuals)

Benchmarks:

ConFiQA [New]: counterfactual QA (single-hop and multi-hop)
Natural Questions (NQ): open-domain QA, modified for counterfactuals
MQuAKE: multi-hop QA with in-context editing
TruthfulQA: factuality evaluation

Metrics:

Pc (Context-faithful accuracy)
Po (Original/Stubborn accuracy)
MR (Memorization Ratio)
EM (Exact Match)
Statistical methodology: Not explicitly reported in the paper
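
The summary does not define these metrics, so the following sketch rests on assumptions drawn from common usage in knowledge-conflict evaluations: Pc is the fraction of responses containing the context answer, Po the fraction containing the original (memorized) answer, MR = Po / (Po + Pc), and EM is exact match against the context answer after normalization. The `normalize` and `evaluate` helpers are hypothetical.

```python
import string


def normalize(text: str) -> str:
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return " ".join(no_punct.split())


def evaluate(predictions, context_answers, original_answers):
    """Compute Pc, Po, MR, and EM under the assumed definitions above."""
    n = len(predictions)
    if n == 0 or n != len(context_answers) or n != len(original_answers):
        raise ValueError("inputs must be non-empty and of equal length")
    faithful = stubborn = exact = 0
    for pred, ctx_ans, orig_ans in zip(predictions, context_answers, original_answers):
        p = normalize(pred)
        faithful += normalize(ctx_ans) in p    # simple substring match
        stubborn += normalize(orig_ans) in p
        exact += p == normalize(ctx_ans)
    pc, po = faithful / n, stubborn / n
    mr = po / (po + pc) if (po + pc) > 0 else 0.0  # lower MR means less reliance on memory
    return {"Pc": pc, "Po": po, "MR": mr, "EM": exact / n}


scores = evaluate(
    predictions=["Mars", "The answer is Paris.", "mars!", "I don't know"],
    context_answers=["Mars"] * 4,
    original_answers=["Paris"] * 4,
)
```

A faithful model should push Pc and EM up while driving Po and MR down.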

Key Results

Context-DPO significantly improves context-faithfulness (Pc) across all tested models on the ConFiQA benchmark compared to base models and prompt-based baselines.
Benchmark	Metric	Baseline	This Paper	Δ (points)
ConFiQA	Pc	48.2	65.2	+17.0
ConFiQA	Pc	39.5	70.4	+30.9
ConFiQA	Pc	24.9	62.7	+37.8
ConFiQA	Pc	18.6	70.7	+52.1
Ablation against SFT and prompting strategies shows DPO is more effective than simple fine-tuning or prompting.
Benchmark	Metric	Baseline	This Paper	Δ (points)
ConFiQA	Pc	52.7	65.2	+12.5
Safety check on TruthfulQA ensures the model hasn't lost its general factuality.
Benchmark	Metric	Baseline	This Paper	Δ (points)
TruthfulQA	MC1	29.9	29.9	0.0

Experiment Figures

An illustration of the 'Context-Faithfulness' problem where an LLM ignores retrieved info (e.g., 'The capital of France is Mars') and answers based on memory ('Paris').

Main Takeaways

Context-faithfulness degrades as models become larger and more capable (inverse scaling), likely due to stronger parametric memory.
Context-DPO consistently outperforms prompt-based interventions (like 'Attr' and 'O&I') and standard Supervised Fine-Tuning (SFT).
The alignment process separates the model's reliance on context vs. internal memory without damaging its general generative or factual capabilities on standard benchmarks.

📚 Prerequisite Knowledge

Prerequisites

Understanding of RAG (Retrieval-Augmented Generation)
Knowledge of DPO (Direct Preference Optimization)
Familiarity with 'knowledge conflicts' in LLMs

Key Terms

context-faithfulness: The degree to which an LLM's response adheres to the provided retrieved information, especially when it contradicts the model's internal training data

counterfactual context: Artificial context created to contradict real-world facts (e.g., saying 'Paris is in Germany') to test if a model follows context or memory

stubborn response: A response where the model ignores the provided context and answers based on its pre-trained internal knowledge

faithful response: A response that correctly incorporates and reasons over the provided context, even if that context is factually false in the real world

DPO: Direct Preference Optimization, an alignment method that optimizes a policy to match preference data (winner/loser pairs) without training an explicit reward model

ConFiQA: Context Faithfulness Question Answering, a new benchmark proposed in this paper containing single-hop and multi-hop questions with counterfactual contexts