Evaluating the External and Parametric Knowledge Fusion of Large Language Models

📝 Paper Summary

RAG analysis, Knowledge Fusion

The paper constructs a controlled evaluation pipeline to analyze how LLMs fuse internal parametric memory with external retrieved evidence across four distinct scenarios of knowledge sufficiency and conflict.

Core Problem

Current RAG evaluations often assume external knowledge is perfect or irrelevant, neglecting complex scenarios where external evidence is partial, noisy, or complementary to the LLM's internal memory.

Why it matters:

External retrieval in real-world applications is often incomplete or noisy, requiring models to intelligently fill gaps with internal knowledge
Over-reliance on external knowledge (a common RAG behavior) can suppress valuable internal knowledge, leading to failures when retrieval is sub-optimal
Existing studies struggle to fairly evaluate 'parametric knowledge' because it varies widely between models and is hard to measure precisely

Concrete Example: If a user asks about a phone's camera specs, and the retrieval provides the pixel count but misses the sensor type (which the model knows from training), a standard RAG model might ignore its internal knowledge and give an incomplete answer based only on the retrieved text.

Key Novelty

Systematic Knowledge Fusion Evaluation Pipeline

Defines four explicit fusion scenarios (S1-S4) based on the sufficiency of external vs. internal knowledge (e.g., partial external info requires internal supplementation)
Constructs a controlled dataset by injecting specific 'outdated' knowledge into models via fine-tuning (simulating parametric memory) and using 'latest' data as external evidence
Isolates the variable of 'parametric knowledge' by training models on the specific facts needed for evaluation, ensuring a fair baseline for fusion capability

Architecture

The systematic pipeline for data construction and knowledge infusion.

Evaluation Highlights

Integrating external and parametric knowledge significantly boosts performance: Accuracy improves from ~37% (parametric only) to ~93% (fusion) in optimal scenarios (S1) for ChatGLM3-6B
LLMs struggle with partial information (S2): When external evidence is incomplete and must be supplemented from internal knowledge, accuracy drops significantly relative to the full-evidence case (e.g., to ~45% for Qwen-7B)
Knowledge retention heavily impacts fusion: Models with poor fine-tuning retention of parametric knowledge show drastic performance drops in fusion tasks, with ChatGLM3-6B showing higher retention and better fusion than Qwen-7B

Breakthrough Assessment

7/10

Provides a rigorous, much-needed framework for evaluating RAG beyond simple 'retrieval accuracy.' The methodology of injecting parametric knowledge to control variables is clever, though the scope is limited to specific domains.

⚙️ Technical Details

Problem Definition

Setting: Question Answering where the answer depends on the interplay between retrieved external evidence (Ke) and internal parametric memory (Kp)

Inputs: A question (q) and retrieved evidence (Ke) which may be complete, partial, irrelevant, or noisy

Outputs: An answer (a) derived from the fusion of Ke and Kp

Pipeline Flow

Data Collection (Electronics Domain)
Data Partitioning (Latest vs. Outdated)
Parametric Injection (Fine-tuning LLM on Outdated Data)
Scenario Construction (Creating QA pairs for S1-S4)
Inference & Evaluation

System Modules

Knowledge Injector

Inject 'outdated' knowledge into the LLM to establish a controlled Kp (Parametric Knowledge)

Model or implementation: ChatGLM3-6B / Qwen-7B (Fine-tuned)

Scenario Generator

Create evaluation samples corresponding to S1-S4 by mixing 'latest' (Ke) and 'outdated' (Kp) data snippets

Model or implementation: GPT-4 (as data generator)

Fusion Evaluator

Query the fine-tuned LLM with constructed prompts containing varying quality of external evidence

Model or implementation: Fine-tuned LLM
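
The Fusion Evaluator step above amounts to pairing each question with whatever evidence the scenario supplies. The following is a minimal sketch of that prompt assembly, not the paper's actual template: the function name `build_fusion_prompt`, the instruction wording, and the example question and snippet are all hypothetical.

```python
from __future__ import annotations


def build_fusion_prompt(question: str, evidence: list[str]) -> str:
    """Assemble a prompt from a question and zero or more evidence snippets.

    The template text is an illustrative assumption, not the paper's prompt.
    """
    if evidence:
        # Number the snippets so the model can tell them apart.
        numbered = "\n".join(
            f"[{i}] {snippet}" for i, snippet in enumerate(evidence, start=1)
        )
    else:
        numbered = "(no external evidence provided)"
    return (
        "Answer the question using the evidence below together with what "
        "you already know. If neither is sufficient, say that you cannot "
        "answer.\n\n"
        f"Evidence:\n{numbered}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Made-up example in the spirit of the camera-spec case (partial evidence).
prompt = build_fusion_prompt(
    "What are the camera specs of Phone X?",
    ["Phone X has a 50-megapixel main camera."],
)
print(prompt)
```

Varying only the `evidence` list (complete, partial, irrelevant, or empty) while holding the question fixed is one way to keep the comparison across scenarios controlled.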

Novel Architectural Elements

Controlled Knowledge Injection Pipeline: Splits the data into 'latest' (Ke) and 'outdated' (Kp) portions and explicitly trains the model on Kp, standardizing internal memory for evaluation purposes

Modeling

Base Model: ChatGLM3-6B and Qwen-7B

Training Method: Supervised Fine-Tuning (SFT) with knowledge augmentation

Adaptation: LoRA (Low-Rank Adaptation) for efficient fine-tuning

Trainable Parameters: Not reported in the paper

Training Data:

800 paragraphs of 'outdated' electronics data for parametric injection
Augmented with 8 GPT-4 generated QA pairs per snippet
Total training set: 630 samples (QA pairs derived from injection data)

Key Hyperparameters:

learning_rate: 1e-4
batch_size: 1 (with gradient accumulation steps = 4)
lora_rank: 8
lora_alpha: 32
lora_dropout: 0.1
max_source_length: 256
max_target_length: 256
epochs: Not explicitly reported in the paper

Compute: Single NVIDIA A800 GPU
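
To make the reported hyperparameters easier to read together, the sketch below collects them in a plain dataclass and derives two quantities that follow from them. The class name `InjectionConfig` is hypothetical. The derived values assume a single device (matching the reported compute) and the standard LoRA scaling of alpha divided by rank; the summary does not say which LoRA implementation was used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InjectionConfig:
    """Reported fine-tuning settings; epochs are not reported, so omitted."""

    learning_rate: float = 1e-4
    batch_size: int = 1
    gradient_accumulation_steps: int = 4
    lora_rank: int = 8
    lora_alpha: int = 32
    lora_dropout: float = 0.1
    max_source_length: int = 256
    max_target_length: int = 256

    @property
    def effective_batch_size(self) -> int:
        # Samples contributing to each optimizer step on one device.
        return self.batch_size * self.gradient_accumulation_steps

    @property
    def lora_scaling(self) -> float:
        # Standard LoRA scales the low-rank update by alpha / rank (assumed).
        return self.lora_alpha / self.lora_rank


cfg = InjectionConfig()
print(cfg.effective_batch_size)  # 4
print(cfg.lora_scaling)          # 4.0
```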

Comparison to Prior Work

vs. Standard RAG evaluations (e.g., RGB, RGB++): This paper explicitly injects parametric knowledge to control the Kp variable, whereas others assume pre-existing knowledge [not cited in paper]
vs. Yoran et al. (2023): This paper considers 'partial' knowledge scenarios (S2), whereas prior work mostly focuses on binary relevant/irrelevant external knowledge
vs. Self-RAG: This paper evaluates fusion capability across scenarios rather than proposing a new architectural solution for fusion

Limitations

Evaluation is limited to the electronics product domain, which may not generalize to abstract reasoning tasks
Relies on 'knowledge injection' via fine-tuning, which might differ from knowledge acquired during massive pre-training
The distinction between 'latest' and 'outdated' is binary, whereas real-world knowledge decay is continuous
Sample sizes for evaluation are relatively small (approximately 300 test samples)

Reproducibility

Code: https://github.com/RUC-NLPIR/KnowledgeFusion

Code and data construction pipeline are publicly available. The dataset includes training/validation/test splits (630/300/300 samples). GPT-4-0613 is used for data generation and as a black-box baseline.

📊 Experiments & Results

Evaluation Setup

Controlled QA task with injected parametric knowledge and variable external evidence

Benchmarks:

Custom Electronics QA Dataset (Knowledge Fusion QA) [New]

Metrics:

Accuracy (R_acc)
Information Coverage (R_cover)
Statistical methodology: Not explicitly reported in the paper
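
The summary names the two metrics but does not give their formulas. The sketch below shows one plausible reading, stated as an assumption: accuracy as the fraction of answers judged correct, and coverage as the fraction of required facts that appear in an answer. The function names and the substring-matching rule are hypothetical stand-ins and may differ from the paper's R_acc and R_cover.

```python
from __future__ import annotations


def accuracy(judgements: list[bool]) -> float:
    """Fraction of answers judged correct (assumed reading of R_acc)."""
    if not judgements:
        return 0.0
    return sum(judgements) / len(judgements)


def coverage(answer: str, required_facts: list[str]) -> float:
    """Fraction of required facts found in the answer (assumed R_cover).

    Case-insensitive substring matching is a crude stand-in for whatever
    judging procedure the paper actually uses.
    """
    if not required_facts:
        return 0.0
    text = answer.lower()
    hits = sum(1 for fact in required_facts if fact.lower() in text)
    return hits / len(required_facts)


# Made-up example: two of three required facts are mentioned.
score = coverage(
    "The camera has 50 MP and a CMOS sensor.",
    ["50 MP", "CMOS", "f/1.8"],
)
print(round(score, 2))
```

Under this reading, a partial-evidence (S2) answer that repeats only the retrieved facts would score well below 1.0 on coverage even when nothing it says is wrong.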

Key Results

Performance analysis of ChatGLM3-6B across four knowledge fusion scenarios (S1-S4) after knowledge injection.
Benchmark	Metric	Baseline	This Paper	Δ
Custom Electronics QA	Accuracy (R_acc)	37.14	93.33	+56.19
Custom Electronics QA	Accuracy (R_acc)	37.14	51.33	+14.19
Custom Electronics QA	Accuracy (R_acc)	37.14	41.67	+4.53
Custom Electronics QA	Knowledge Retention (Accuracy on Training Data)	0.18	0.56	+0.38
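
As a quick consistency check on the table above, the Δ column can be recomputed from the Baseline and This Paper columns. The snippet copies the four rows verbatim; the helper name `delta_matches` and the 0.005 tolerance (half a unit in the second decimal place) are illustrative choices.

```python
# (metric, baseline, this_paper, reported_delta), copied from the table.
rows = [
    ("Accuracy (R_acc)", 37.14, 93.33, 56.19),
    ("Accuracy (R_acc)", 37.14, 51.33, 14.19),
    ("Accuracy (R_acc)", 37.14, 41.67, 4.53),
    ("Knowledge Retention", 0.18, 0.56, 0.38),
]


def delta_matches(baseline: float, result: float, reported: float) -> bool:
    """True if result - baseline agrees with the reported delta to 2 d.p."""
    return abs((result - baseline) - reported) < 0.005


mismatches = [row for row in rows if not delta_matches(row[1], row[2], row[3])]
print(mismatches)  # an empty list means every reported delta is consistent
```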

Experiment Figures

Conceptual diagram of the four knowledge fusion scenarios (S1-S4).

Distribution of evidence pieces per sample in the dataset.

Main Takeaways

LLMs exhibit a 'recency bias' or over-reliance on external context: In Scenario S3 (useless external context), accuracy often drops compared to using no external context at all, as models are misled by noise.
Knowledge Retention correlates with Fusion capability: ChatGLM3-6B retained more injected knowledge (56% vs. 18% for Qwen-7B) and consequently performed better in the S2 (Partial) and S3 (Internal Only) scenarios.
Scenario S2 (Partial Evidence) is the most challenging: Models fail to seamlessly stitch partial external clues with internal facts, often underperforming compared to simple retrieval (S1).

📚 Prerequisite Knowledge

Prerequisites

Understanding of RAG (Retrieval-Augmented Generation)
Concept of Parametric Memory (knowledge stored in model weights)
Fine-tuning techniques (LoRA/SFT) for knowledge injection

Key Terms

Parametric Knowledge (Kp): Information stored within the LLM's weights acquired during pre-training or fine-tuning

External Knowledge (Ke): Information provided to the LLM via the input context (e.g., from retrieval)

Knowledge Fusion: The process where an LLM integrates retrieved information with its internal memory to generate a complete answer

S1 (Scenario 1): External knowledge alone is sufficient to answer; internal knowledge is not strictly needed

S2 (Scenario 2): External knowledge is partial; internal knowledge must fill the gaps

S3 (Scenario 3): External knowledge is useless/noisy; answer depends solely on internal knowledge

S4 (Scenario 4): Neither external nor internal knowledge is sufficient; model should refuse to answer

SFT: Supervised Fine-Tuning, i.e., training a model on labeled examples to adapt it to specific tasks or inject knowledge
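
The S1-S4 definitions above can be read as a decision rule over which of the facts needed for an answer each knowledge source covers. The following is a minimal sketch under that reading. Representing knowledge as sets of fact labels, and the names `Scenario` and `classify_scenario`, are assumptions made for illustration and not the paper's construction procedure.

```python
from __future__ import annotations

from enum import Enum


class Scenario(Enum):
    S1 = "external knowledge alone is sufficient"
    S2 = "external is partial; internal fills the gaps"
    S3 = "external is useless; internal only"
    S4 = "neither is sufficient; model should refuse"


def classify_scenario(
    needed: set[str], external: set[str], internal: set[str]
) -> Scenario:
    """Map fact coverage to S1-S4 (assumes `needed` is non-empty)."""
    external_hits = needed & external
    if external_hits == needed:
        return Scenario.S1
    if needed <= (external | internal):
        # Together the sources suffice: S2 if the external evidence
        # contributes at least one needed fact, otherwise S3.
        return Scenario.S2 if external_hits else Scenario.S3
    return Scenario.S4


# The camera example: retrieval gives the pixel count, the model knows
# the sensor type, so the case falls under S2.
case = classify_scenario(
    needed={"pixel_count", "sensor_type"},
    external={"pixel_count"},
    internal={"sensor_type"},
)
print(case.name)  # S2
```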