Evaluation Setup
Task: question answering over retrieved context that contains knowledge conflicts (counterfactuals)
Benchmarks:
- ConFiQA (counterfactual QA, single-hop and multi-hop) [New]
- Natural Questions (NQ) (open-domain QA, modified for counterfactuals)
- MQuAKE (Multi-hop QA with in-context editing)
- TruthfulQA (Factuality evaluation)
Metrics:
- Pc (Context-faithful accuracy)
- Po (original-answer accuracy, i.e. the rate of "stubborn" answers that follow the model's memory instead of the context)
- MR (Memorization Ratio)
- EM (Exact Match)
- Statistical methodology: Not explicitly reported in the paper
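The metrics above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's evaluation code: the field names (`prediction`, `context_answer`, `original_answer`), the SQuAD-style normalization, and the substring matching rule are all assumptions. The formula MR = Po / (Po + Pc) is the usual definition of the memorization ratio in the knowledge-conflict literature and is assumed here, since this summary does not spell it out. Values are returned as fractions; multiply by 100 to compare with the percentages in the results table.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, answer: str) -> bool:
    """EM: the normalized prediction equals the normalized answer."""
    return normalize(prediction) == normalize(answer)


def contains(prediction: str, answer: str) -> bool:
    """Lenient match (assumed rule): the answer appears inside the prediction.

    Plain substring matching can over-count (e.g. 'rome' in 'prometheus').
    """
    return normalize(answer) in normalize(prediction)


def score(examples: list[dict]) -> dict:
    """Compute Pc, Po, MR and EM over a list of examples.

    Each example is a dict with hypothetical keys:
      prediction       model output
      context_answer   the counterfactual answer stated in the context
      original_answer  the real-world answer the model may have memorized
    """
    n = len(examples)
    pc = sum(contains(e["prediction"], e["context_answer"]) for e in examples) / n
    po = sum(contains(e["prediction"], e["original_answer"]) for e in examples) / n
    em = sum(exact_match(e["prediction"], e["context_answer"]) for e in examples) / n
    # Assumed definition: share of answered cases that follow memory, not context.
    mr = po / (po + pc) if (po + pc) > 0 else 0.0
    return {"Pc": pc, "Po": po, "MR": mr, "EM": em}
```

Under these assumptions, a higher Pc together with a lower Po and MR indicates a more context-faithful model.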
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
| Context-DPO significantly improves context-faithfulness (Pc) across all tested models on the ConFiQA benchmark compared to base models and prompt-based baselines. |
| ConFiQA | Pc | 48.2 | 65.2 | +17.0 |
| ConFiQA | Pc | 39.5 | 70.4 | +30.9 |
| ConFiQA | Pc | 24.9 | 62.7 | +37.8 |
| ConFiQA | Pc | 18.6 | 70.7 | +52.1 |
| Ablation against SFT and prompting strategies shows DPO is more effective than simple fine-tuning or prompting. |
| ConFiQA | Pc | 52.7 | 65.2 | +12.5 |
| Safety check on TruthfulQA ensures the model hasn't lost its general factuality. |
| TruthfulQA | MC1 | 29.9 | 29.9 | 0.0 |
Main Takeaways
- Context-faithfulness degrades as models become larger and more capable (inverse scaling), likely due to stronger parametric memory.
- Context-DPO consistently outperforms prompt-based interventions (like 'Attr' and 'O&I') and standard Supervised Fine-Tuning (SFT).
- The alignment process decouples the model's reliance on context from its reliance on internal memory, without damaging its general generative or factual capabilities on standard benchmarks.
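For readers unfamiliar with the objective behind these takeaways, the sketch below shows the standard DPO loss for a single preference pair. It assumes that Context-DPO uses the usual DPO formulation, with the context-faithful answer as the "chosen" response and the stubborn (memory-based) answer as the "rejected" one. This summary does not give the training details, so the pairing, the function name `dpo_loss`, and the default `beta=0.1` are illustrative assumptions.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are summed token log-probabilities of each response under the
    trained policy and under the frozen reference model.
    Assumed pairing: chosen = context-faithful answer, rejected = stubborn answer.
    """
    # How much more the policy favors each response than the reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss equals log 2 when the policy matches the reference, and it falls as the policy raises the likelihood of the context-faithful answer relative to the stubborn one.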