Evaluation Setup
Controlled QA task with injected parametric knowledge and variable external evidence
Benchmarks:
- Custom Electronics QA Dataset (Knowledge Fusion QA) [New]
Metrics:
- Accuracy (R_acc)
- Information Coverage (R_cover)
Statistical methodology: Not explicitly reported in the paper
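The summary names the two metrics but does not define them. A minimal sketch of how they might be computed follows, assuming R_acc is the percentage of answers judged correct and R_cover is the percentage of reference key facts that appear in the model's answer. Both definitions, the function names, and the substring-matching rule are assumptions for illustration and are not taken from the paper.

```python
from typing import Sequence


def accuracy(judgments: Sequence[bool]) -> float:
    """Percentage of answers judged correct (assumed stand-in for R_acc)."""
    if not judgments:
        raise ValueError("at least one judgment is required")
    return 100.0 * sum(judgments) / len(judgments)


def information_coverage(answer: str, key_facts: Sequence[str]) -> float:
    """Percentage of reference key facts found in the answer.

    Assumed stand-in for R_cover: a fact counts as covered when its text
    appears in the answer, ignoring case. The paper may use another rule.
    """
    if not key_facts:
        raise ValueError("at least one key fact is required")
    text = answer.lower()
    hits = sum(1 for fact in key_facts if fact.lower() in text)
    return 100.0 * hits / len(key_facts)


# Hypothetical example data, not drawn from the dataset.
example_judgments = [True, False, True, True]
example_answer = "The module runs at 3.3 V with a 48 MHz clock."
example_facts = ["3.3 V", "48 MHz", "I2C"]

print(accuracy(example_judgments))
print(information_coverage(example_answer, example_facts))
```

Substring matching is the simplest possible coverage rule. A real evaluation would more likely rely on a human or model-based judge to decide whether a fact is present.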
Key Results
Performance analysis of ChatGLM3-6B across four knowledge fusion scenarios (S1-S4) after knowledge injection.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Custom Electronics QA | Accuracy (R_acc) | 37.14 | 93.33 | +56.19 |
| Custom Electronics QA | Accuracy (R_acc) | 37.14 | 51.33 | +14.19 |
| Custom Electronics QA | Accuracy (R_acc) | 37.14 | 41.67 | +4.53 |
| Custom Electronics QA | Knowledge Retention (Accuracy on Training Data) | 0.18 | 0.56 | +0.38 |
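The Δ column in the results above is the reported value minus the baseline. The short check below recomputes it from the listed numbers. The row labels are shortened for readability, and the variable and function names are illustrative only.

```python
# (metric label, baseline, reported value) copied from the Key Results rows.
rows = [
    ("Accuracy (R_acc), row 1", 37.14, 93.33),
    ("Accuracy (R_acc), row 2", 37.14, 51.33),
    ("Accuracy (R_acc), row 3", 37.14, 41.67),
    ("Knowledge Retention", 0.18, 0.56),
]


def delta(baseline: float, value: float) -> float:
    """Difference between the reported value and the baseline, to 2 decimals."""
    return round(value - baseline, 2)


deltas = [delta(baseline, value) for _, baseline, value in rows]

for (label, _, _), d in zip(rows, deltas):
    print(f"{label}: {d:+.2f}")
```

Each recomputed difference matches the reported Δ. Note that the accuracy rows are on a percentage scale while the retention row is a fraction, so the deltas are not directly comparable across the two metrics.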
Main Takeaways
- LLMs exhibit a 'recency bias' or over-reliance on external context: In Scenario S3 (useless external context), accuracy often drops compared to using no external context at all, as models are misled by noise.
- Knowledge Retention correlates with Fusion capability: ChatGLM3-6B retained more injected knowledge (56% vs 18% for Qwen) and consequently performed better in S2 (Partial) and S3 (Internal Only) scenarios.
- Scenario S2 (Partial Evidence) is the most challenging: Models fail to seamlessly stitch partial external clues with internal facts, often underperforming compared to simple retrieval (S1).