
Systematic Knowledge Injection into Large Language Models via Diverse Augmentation for Domain-Specific RAG

K Bhushan, Y Nandwani, D Khandelwal, S Gupta…
Indian Institute of Technology (Indian School of Mines), Dhanbad, IBM Research
arXiv, 2/2025
RAG QA Factuality

📝 Paper Summary

Modularized RAG pipeline Knowledge Internalization
PA-RAG improves domain-specific RAG by fine-tuning LLMs with paraphrased answer augmentation and mixed retrieval contexts to mitigate memorization bias and canonical answer overfitting.
Core Problem
Fine-tuning LLMs for domain-specific RAG suffers from 'conditional memorization bias' (inconsistent reliance on context vs. memory) and 'canonical answer overfitting' (memorizing fixed answer patterns rather than knowledge).
Why it matters:
  • Retrieval errors (failure to fetch relevant docs) cause hallucinations if the model cannot fall back on parametric knowledge
  • Current fine-tuning methods (like RAFT) force models to either ignore context or over-rely on it based on static training assignments
  • Singular answers in training data cause models to learn spurious stylistic patterns rather than internalizing the actual domain facts
Concrete Example: If a model is trained on a specific document only in 'retrieval success' scenarios, it fails to answer questions from that document when retrieval fails during testing. Conversely, if trained only with distractors, it ignores valid retrieved context later.
Key Novelty
PA-RAG (Paraphrase Augmentation for RAG)
  • Synthesizes multiple paraphrased answers for every training question to prevent the model from overfitting to a single 'canonical' response string
  • Simulates both retrieval success (relevant context) and failure (irrelevant context) for every question during training to teach the model exactly when to use context vs. memory
  • Uses 'self-selective' replay buffers (using the model's own predictions on old data) to prevent catastrophic forgetting of general capabilities
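The three ideas above can be sketched as a data-construction step. This is a minimal illustration, not the authors' pipeline: the names `TrainingExample`, `build_examples`, and `build_replay`, the choice to pad the success context with distractors, and the `generate` callable standing in for the model are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class TrainingExample:
    question: str
    context: List[str]   # documents placed in the prompt
    answer: str          # fine-tuning target (one paraphrase)
    retrieval_ok: bool   # True if the context contains the relevant document


def build_examples(
    question: str,
    paraphrased_answers: List[str],
    relevant_doc: str,
    distractor_docs: List[str],
) -> List[TrainingExample]:
    """Pair every paraphrased answer with both a success and a failure context."""
    examples = []
    for answer in paraphrased_answers:
        # Simulated retrieval success: the relevant document is in the context.
        examples.append(
            TrainingExample(question, [relevant_doc] + distractor_docs, answer, True)
        )
        # Simulated retrieval failure: only irrelevant documents are present,
        # so the target must come from what the model has internalized.
        examples.append(
            TrainingExample(question, list(distractor_docs), answer, False)
        )
    return examples


def build_replay(
    old_prompts: List[str], generate: Callable[[str], str]
) -> List[Tuple[str, str]]:
    """Replay set: general-domain prompts paired with the model's own outputs."""
    # `generate` is a hypothetical stand-in for the pre-fine-tuning model.
    return [(prompt, generate(prompt)) for prompt in old_prompts]
```

With three paraphrases, `build_examples` yields six examples per question, so no question is seen only under one retrieval condition or with only one answer string.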
Evaluation Highlights
  • Achieves up to 77.6% token-level recall on domain-specific datasets, outperforming RAFT (70.6%) and standard RAG (68.9%)
  • Largely preserves general capabilities on GSM8K and MMLU, with a small regression (1.2% drop vs. 5.4% for RAFT)
  • Scores 95.3% on the Mixtral-Judge correctness metric, compared with 82.8% for RAFT
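The summary does not define token-level recall. A common reading, assumed here, is the fraction of reference-answer tokens that also appear in the model's answer, counted with multiplicity. The lowercase word tokenizer and the function names below are choices made for this sketch and may differ from the paper's evaluation code.

```python
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase word tokenizer (an assumption; the paper may use another)."""
    return re.findall(r"\w+", text.lower())


def token_recall(prediction: str, reference: str) -> float:
    """Share of reference tokens that are covered by the prediction."""
    ref_counts = Counter(tokenize(reference))
    if not ref_counts:
        return 0.0
    pred_counts = Counter(tokenize(prediction))
    # Counter intersection keeps the minimum count per token,
    # so a repeated reference token must be matched as many times.
    overlap = sum((ref_counts & pred_counts).values())
    return overlap / sum(ref_counts.values())
```

Because recall only checks coverage of the reference, a paraphrased or longer answer is not penalized for extra wording, which suits a setup where the model is trained on several phrasings of the same answer.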
Breakthrough Assessment
7/10
Strong practical improvements for domain adaptation in RAG. Effectively addresses specific failure modes of prior SOTA (RAFT) using clever data augmentation, though the core architecture remains standard.