Systematic Knowledge Injection into Large Language Models via Diverse Augmentation for Domain-Specific RAG

📝 Paper Summary

Modularized RAG pipeline, Knowledge Internalization

PA-RAG improves domain-specific RAG by fine-tuning LLMs with paraphrased answer augmentation and mixed retrieval contexts to mitigate memorization bias and canonical answer overfitting.

Core Problem

Fine-tuning LLMs for domain-specific RAG suffers from 'conditional memorization bias' (inconsistent reliance on context vs. memory) and 'canonical answer overfitting' (memorizing fixed answer patterns rather than knowledge).

Why it matters:

Retrieval errors (failure to fetch relevant docs) cause hallucinations if the model cannot fall back on parametric knowledge
Current fine-tuning methods (like RAFT) force models to either ignore context or over-rely on it based on static training assignments
Singular answers in training data cause models to learn spurious stylistic patterns rather than internalizing the actual domain facts

Concrete Example: If a model is trained on a specific document only in 'retrieval success' scenarios, it fails to answer questions from that document when retrieval fails during testing. Conversely, if trained only with distractors, it ignores valid retrieved context later.

Key Novelty

PA-RAG (Paraphrase Augmentation for RAG)

Synthesizes multiple paraphrased answers for every training question to prevent the model from overfitting to a single 'canonical' response string
Simulates both retrieval success (relevant context) and failure (irrelevant context) for every question during training to teach the model exactly when to use context vs. memory
Uses 'self-selective' replay buffers (using the model's own predictions on old data) to prevent catastrophic forgetting of general capabilities
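
The replay idea in the last point can be sketched as follows. This is only one reading of "using the model's own predictions on old data": each old-task prompt is paired with the answer the model itself produces, and those pairs are mixed into the domain training set. The `generate` callable, the function names, and the dictionary layout are all hypothetical, and the selection step implied by "self-selective" is not described in this summary, so it is left out.

```python
import random
from typing import Callable, Dict, List


def build_self_replay_buffer(
    old_prompts: List[str],
    generate: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Pair each old-task prompt with the model's own answer to it.

    `generate` is a hypothetical stand-in for the base model's inference call.
    """
    return [{"prompt": p, "target": generate(p)} for p in old_prompts]


def mix_with_replay(
    domain_examples: List[Dict[str, str]],
    replay_examples: List[Dict[str, str]],
    rng: random.Random,
) -> List[Dict[str, str]]:
    """Shuffle domain and replay examples into one training list."""
    mixed = list(domain_examples) + list(replay_examples)
    rng.shuffle(mixed)
    return mixed
```

Because the replay targets come from the model being fine-tuned, training on them pulls the weights toward behaviour the model already has, which is the intuition behind using replay against forgetting.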

Evaluation Highlights

Achieves up to 77.6% token-level recall on domain-specific datasets, outperforming RAFT (70.6%) and standard RAG (68.9%)
Maintains general reasoning capabilities (GSM8k, MMLU) with negligible regression (a 1.2% drop vs. 5.4% for RAFT)
Scores 95.3% on the Mixtral-Judge correctness metric, significantly higher than RAFT's 82.8%

Breakthrough Assessment

7/10

Strong practical improvements for domain adaptation in RAG. Effectively addresses specific failure modes of prior SOTA (RAFT) using clever data augmentation, though the core architecture remains standard.

⚙️ Technical Details

Problem Definition

Setting: Domain-specific Question Answering where a fixed corpus D is available for both fine-tuning and retrieval.

Inputs: Domain-specific corpus D, question q, retrieved documents

Outputs: Answer a generated either from context (if relevant) or parametric memory (if retrieval fails)

Pipeline Flow

Synthetic Data Generation (LLM creates QA pairs from chunks)
Answer Paraphrasing (LLM generates multiple answer variations per question)
Data Construction (Assign QA pairs to 'retriever success' or 'failure' buckets)
Fine-tuning (Train LLM on augmented data + Replay Buffer + Domain Identifier)
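
The data construction step above can be sketched as follows. This is a minimal illustration, not the authors' code: the prompt template, the `DOMAIN_TAG` string, and the function name are hypothetical, and alternating buckets after a shuffle is an assumption made here so that every question with at least two paraphrases appears in both the retrieval-success and retrieval-failure settings.

```python
import random
from typing import Dict, List

DOMAIN_TAG = "[DOMAIN: book1]"  # hypothetical domain identifier string


def build_examples(
    question: str,
    paraphrases: List[str],
    relevant_docs: List[str],
    distractor_docs: List[str],
    rng: random.Random,
) -> List[Dict[str, str]]:
    """Create one training example per paraphrased answer.

    Paraphrases are shuffled, then alternately assigned to the 'success'
    bucket (relevant context) and the 'failure' bucket (irrelevant context).
    """
    shuffled = list(paraphrases)
    rng.shuffle(shuffled)
    examples = []
    for i, answer in enumerate(shuffled):
        bucket = "success" if i % 2 == 0 else "failure"
        context = relevant_docs if bucket == "success" else distractor_docs
        prompt = (
            f"{DOMAIN_TAG}\n"
            + "\n".join(context)
            + f"\nQuestion: {question}"
        )
        examples.append({"bucket": bucket, "prompt": prompt, "target": answer})
    return examples
```

With four paraphrases, a question yields two examples where the answer is supported by the context and two where it has to come from parametric memory, each with a differently worded target.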

System Modules

Synthetic QA Generator (Data Preparation)

Generate initial QA pairs from document chunks

Model or implementation: Mixtral-8x22B-Instruct-v0.1

Paraphraser (Data Preparation)

Generate multiple answer variations for each question

Model or implementation: Mixtral-8x22B-Instruct-v0.1

Fine-Tuned LLM

Generate final answer using retrieved context or memory

Model or implementation: Mistral-7B-Instruct-v0.2 (Base)

Modeling

Base Model: Mistral-7B-Instruct-v0.2

Training Method: Supervised Fine-Tuning (SFT) with LoRA

Adaptation: LoRA (rank=16 for Book1, rank=32 for Book2)

Trainable Parameters: LoRA adapters

Training Data:

Book 1: 5 chapters, 18,986 augmented QA pairs
Book 2: 6 chapters, 126,213 augmented QA pairs

Key Hyperparameters:

learning_rate: 1e-4
batch_size: 256 (effective)
lora_rank: 16 or 32
max_steps: 400 (Book1) / 1200 (Book2)

Compute: 4 A100 80GB GPUs; Training time < 5 hours (Book1) and < 15 hours (Book2)
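
As a rough check on the training budget, the step counts, effective batch size, and dataset sizes above can be combined. The sketch assumes every step consumes 256 examples drawn only from the augmented QA pairs; any replay-buffer examples mixed into the batches would lower the figures. The per-device batch size of 8 is a hypothetical value chosen for illustration, since the summary reports only the effective batch size.

```python
def passes_over_data(max_steps: int, effective_batch: int, num_examples: int) -> float:
    """Approximate number of passes over the dataset for a given step budget."""
    return max_steps * effective_batch / num_examples


def grad_accum_steps(effective_batch: int, num_gpus: int, per_device_batch: int) -> int:
    """Gradient accumulation steps needed to reach the effective batch size."""
    steps, remainder = divmod(effective_batch, num_gpus * per_device_batch)
    if remainder:
        raise ValueError("effective batch is not divisible by num_gpus * per_device_batch")
    return steps


book1 = passes_over_data(max_steps=400, effective_batch=256, num_examples=18_986)
book2 = passes_over_data(max_steps=1200, effective_batch=256, num_examples=126_213)
print(f"Book 1: ~{book1:.2f} passes")  # ~5.39
print(f"Book 2: ~{book2:.2f} passes")  # ~2.43

# Hypothetical per-device batch of 8 on the 4 GPUs listed above.
print(grad_accum_steps(effective_batch=256, num_gpus=4, per_device_batch=8))  # 8
```

Under these assumptions the smaller Book 1 set is seen roughly twice as often as the Book 2 set.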

Comparison to Prior Work

vs. RAFT: PA-RAG uses multiple paraphrased answers per question (vs. one) and implicitly augments context via random bucket assignment of paraphrases
vs. DSF: PA-RAG trains with retrieval context (both relevant and irrelevant), enabling the model to decide when to use context
vs. CA-RAFT (Context-Augmented RAFT): PA-RAG adds answer multiplicity and domain identifiers on top of context augmentation logic

Limitations

Does not completely eliminate catastrophic forgetting, only reduces it
Relies on a strong teacher LLM (Mixtral-8x22B) to generate high-quality synthetic QA pairs
Paraphrased answer augmentation increases training data size and computational cost

Reproducibility

Code: https://github.com/kushagrabhushan/Systematic-Knowledge-Injection

Code and datasets are publicly available. Prompts for data generation and evaluation are provided in the appendix.

📊 Experiments & Results

Evaluation Setup

RAG QA on two domain-specific books (Redbooks published in 2024, unseen by the base model)

Benchmarks:

Book 1 (IBM Storage FlashSystem) (Domain-specific QA) [New]
Book 2 (Red Hat OpenShift) (Domain-specific QA) [New]
MMLU / GSM8k / Hellaswag / TruthfulQA (General capabilities, regression testing)

Metrics:

Token-level Recall
LLM Judge (Mixtral-8x22B, LLaMA-3.3-70B)
Regression Score (avg of general benchmarks)
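
Token-level recall can be illustrated with a short sketch: the fraction of reference-answer tokens that also appear in the model's answer. The lowercasing and `\w+` tokenization are assumptions made here; the summary does not specify how the paper tokenizes or normalizes text.

```python
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase and split on word characters (an assumed normalization)."""
    return re.findall(r"\w+", text.lower())


def token_recall(prediction: str, reference: str) -> float:
    """Share of reference tokens recovered by the prediction, counting duplicates."""
    ref_counts = Counter(tokenize(reference))
    if not ref_counts:
        return 0.0
    pred_counts = Counter(tokenize(prediction))
    overlap = sum((ref_counts & pred_counts).values())  # multiset intersection
    return overlap / sum(ref_counts.values())
```

Because recall ignores extra tokens in the prediction, a verbose but correct paraphrase is not penalized, which suits an evaluation where answers are expected to vary in wording.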

Key Results

Main results on Book 1 showing PA-RAG superiority in both knowledge injection (Recall) and preserving general skills (Regression Score).
Benchmark	Metric	RAFT	PA-RAG	Δ
Book 1	Token-level Recall	70.6	77.6	+7.0
Book 1	Mixtral-Judge	82.8	95.3	+12.5
Book 1	Regression Score Drop	-5.4	-1.2	+4.2

Performance breakdown when Retriever Fails ('No Overlap' subset) vs. Succeeds ('Some Overlap'). PA-RAG excels in both.
Benchmark	Metric	Baseline	PA-RAG	Δ
Book 1 (No Overlap)	Token-level Recall	57.7	73.4	+15.7
Book 1 (Some Overlap)	Token-level Recall	73.0	80.3	+7.3

Ablation study demonstrating the impact of removing components from PA-RAG.
Benchmark	Metric	Full PA-RAG	Ablated	Δ
Book 1	Token-level Recall	77.6	72.0	-5.6
Book 1	Regression Score Drop	-1.2	-1.9	-0.7

Main Takeaways

PA-RAG successfully injects new domain knowledge while preserving general capabilities better than baselines.
Answer multiplicity is critical: ablating it causes the largest drop in recall, which suggests that standard fine-tuning overfits to specific answer strings.
The method is robust to 'conditional memorization bias': it learns to ignore irrelevant context and use relevant context effectively, unlike RAFT which can struggle with inconsistent retrieval scenarios.
Generalizes across model families: Improvements hold for both Mistral-7B and Llama-2 (7B/13B).

📚 Prerequisite Knowledge

Prerequisites

Retrieval-Augmented Generation (RAG) concepts
Supervised Fine-Tuning (SFT) of LLMs
Catastrophic forgetting in continual learning

Key Terms

RAG: Retrieval-Augmented Generation, an approach in which a system first retrieves relevant documents and then conditions the model's answer on them

RAFT: Retrieval Augmented Fine-Tuning—a method that fine-tunes LLMs to ignore distractor documents

conditional memorization bias: A failure mode where an LLM learns to rely on context or memory based on static training data assignments rather than the actual relevance of the text

canonical answer overfitting: When an LLM memorizes the specific phrasing of a single ground-truth answer instead of the underlying semantic knowledge

replay buffer: A collection of examples from previous tasks used during training to prevent the model from forgetting earlier skills

LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique

nucleus sampling: A decoding strategy that samples from the smallest set of highest-probability tokens whose cumulative probability exceeds a threshold p

catastrophic forgetting: The tendency of neural networks to abruptly forget previously learned information upon learning new information

domain identifier: A specific token or phrase prepended to inputs to signal the model to switch to a specific domain context
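
The nucleus sampling entry in the key terms above can be made concrete with a toy sketch over an explicit token distribution. The dictionary input and function names are illustrative assumptions; real decoders apply the same idea to the model's logits at each step.

```python
import random
from typing import Dict, List, Tuple


def nucleus(probs: Dict[str, float], p: float) -> List[Tuple[str, float]]:
    """Smallest set of highest-probability tokens whose mass reaches p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    return kept


def nucleus_sample(probs: Dict[str, float], p: float, rng: random.Random) -> str:
    """Sample one token from the nucleus; weights are renormalized by choices()."""
    tokens, weights = zip(*nucleus(probs, p))
    return rng.choices(tokens, weights=weights, k=1)[0]
```

Lowering p shrinks the candidate set toward greedy decoding, while raising it admits more of the low-probability tail and so more varied outputs.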