Does fine-tuning LLMs on new knowledge encourage hallucinations?

(Technion, Google) Zorik Gekhman, Gal Yona, Roee Aharoni, Matan Eyal, Amir Feder, Roi Reichart, Jonathan Herzig
arXiv, May 2024
Factuality QA

📝 Paper Summary

Knowledge Internalization, Fine-tuning dynamics
This study investigates how fine-tuning an LLM on new factual knowledge affects its performance on facts it already knew, facts it didn't know, and facts it held incorrect beliefs about.
Core Problem
It is unclear how fine-tuning on new data interacts with the knowledge an LLM acquired during pre-training: whether it reinforces known facts, teaches unknown ones, or induces hallucinations.
Why it matters:
  • Fine-tuning is standard for adapting models, but if it degrades performance on known facts or fails to reliably teach unknown ones, its utility for knowledge updates is limited.
  • Understanding 'knowledge conflicts' (where fine-tuning data contradicts pre-training beliefs) is crucial for building reliable, up-to-date AI systems.
  • Fine-tuning on large corpora without checking whether the model learns the underlying facts or merely mimics surface patterns makes deployed behavior unreliable.
Concrete Example: A model might already know that 'Paris is in France' (Known) but not that 'Benedict is in Hubbard County' (Unknown). If it is fine-tuned on a dataset containing both kinds of fact, does it actually learn where Benedict is, or does it just overfit to the training examples? Does it forget Paris? The paper categorizes these scenarios so that each outcome can be measured separately.
Key Novelty
Controlled Knowledge Categorization Framework
  • Classifies training examples into four categories based on what the pre-trained model already answers correctly: Known (always correct under greedy decoding), MaybeKnown (sometimes correct under greedy decoding), WeaklyKnown (never correct under greedy decoding, but sometimes correct when sampling with temperature > 0), and Unknown (never correct).
  • Constructs controlled fine-tuning datasets with varying proportions of these categories to isolate the specific effects of 'teaching' new knowledge versus 'reinforcing' old knowledge.
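The categorization and dataset construction above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: the names `Category`, `categorize`, and `build_mix` are hypothetical, and it assumes the correctness of each greedy and temperature-sampled answer has already been computed elsewhere.

```python
import random
from enum import Enum
from typing import List, Sequence


class Category(Enum):
    KNOWN = "Known"
    MAYBE_KNOWN = "MaybeKnown"
    WEAKLY_KNOWN = "WeaklyKnown"
    UNKNOWN = "Unknown"


def categorize(greedy_correct: Sequence[bool],
               sampled_correct: Sequence[bool]) -> Category:
    """Map per-attempt correctness flags to one of the four categories.

    greedy_correct: one flag per greedy-decoded attempt (assumed input).
    sampled_correct: one flag per answer sampled with temperature > 0.
    """
    if not greedy_correct:
        raise ValueError("need at least one greedy-decoding result")
    if all(greedy_correct):
        return Category.KNOWN          # always correct
    if any(greedy_correct):
        return Category.MAYBE_KNOWN    # sometimes correct
    if any(sampled_correct):
        return Category.WEAKLY_KNOWN   # correct only when sampling
    return Category.UNKNOWN            # never correct


def build_mix(known: List, unknown: List, unknown_fraction: float,
              size: int, seed: int = 0) -> List:
    """Draw a fine-tuning set of `size` examples with a fixed share of Unknown."""
    rng = random.Random(seed)
    n_unknown = round(size * unknown_fraction)
    n_known = size - n_unknown
    # random.sample raises ValueError if a pool is too small for the request.
    mix = rng.sample(unknown, n_unknown) + rng.sample(known, n_known)
    rng.shuffle(mix)
    return mix
```

Sweeping `unknown_fraction` while holding `size` fixed is one way to separate the effect of the Unknown share from the effect of dataset size.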
Evaluation Highlights
  • Establishes a four-category taxonomy (Known, MaybeKnown, WeaklyKnown, Unknown) for predicting fine-tuning outcomes from the pre-trained model's behavior.
  • Demonstrates that the model's ability to learn from fine-tuning depends heavily on whether the knowledge was already present (Known/WeaklyKnown) or entirely absent (Unknown) in pre-training.
  • Uses the Exact Match (EM) metric on PaLM 2-S to quantify these shifts, and validates that EM correlates strongly with F1 in this factual setting.
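For readers unfamiliar with the metric, Exact Match can be sketched as follows. The normalization used here (lowercasing and collapsing whitespace) is an assumption for illustration; the summary does not state which normalization the paper applies, and the function names are hypothetical.

```python
from typing import List, Sequence


def normalize(text: str) -> str:
    # Assumed normalization: lowercase and collapse runs of whitespace.
    return " ".join(text.lower().split())


def exact_match(prediction: str, gold_answers: Sequence[str]) -> bool:
    """True if the prediction equals any gold answer after normalization."""
    pred = normalize(prediction)
    return any(pred == normalize(gold) for gold in gold_answers)


def em_score(predictions: List[str], references: List[Sequence[str]]) -> float:
    """Fraction of predictions that exactly match one of their gold answers."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not predictions:
        return 0.0
    hits = sum(exact_match(p, r) for p, r in zip(predictions, references))
    return hits / len(predictions)
```

Because EM gives no partial credit, it suits short factual answers such as entity names, where an answer is either the right entity or it is not.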
Breakthrough Assessment
7/10
Provides a valuable, granular taxonomy for understanding fine-tuning dynamics on factual knowledge. While not a new architecture, the experimental design offers significant insights into the 'black box' of knowledge updates.