Does fine-tuning LLMs on new knowledge encourage hallucinations?

📝 Paper Summary

Knowledge Internalization
Fine-tuning dynamics

This study investigates how fine-tuning an LLM on new factual knowledge affects its performance on facts it already knew, facts it didn't know, and facts it held incorrect beliefs about.

Core Problem

It is unclear how fine-tuning on new datasets interacts with an LLM's pre-training knowledge: specifically, whether it reinforces known facts, teaches unknown ones, or induces hallucinations.

Why it matters:

Fine-tuning is standard for adapting models, but if it degrades performance on known facts or fails to reliably teach unknown ones, its utility for knowledge updates is limited.
Understanding 'knowledge conflicts' (where fine-tuning data contradicts pre-training beliefs) is crucial for building reliable, up-to-date AI systems.
Fine-tuning on large corpora without checking whether the model learns the underlying facts or merely mimics surface patterns can make deployed models unreliable.

Concrete Example: A model might know 'Paris is in France' (Known) but not 'Benedict is in Hubbard County' (Unknown). If fine-tuned on a dataset containing both types, does it actually learn the location of Benedict, or does it just overfit? Does it forget Paris? The paper categorizes these scenarios to measure exact outcomes.

Key Novelty

Controlled Knowledge Categorization Framework

Classifies training examples into four categories based on the pre-trained model's prior knowledge: Known (always correct), MaybeKnown (sometimes correct), WeaklyKnown (correct only with temperature > 0), and Unknown (never correct).
Constructs controlled fine-tuning datasets with varying proportions of these categories to isolate the specific effects of 'teaching' new knowledge versus 'reinforcing' old knowledge.
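
The categorization rule described above can be sketched in a few lines. This is a minimal illustration rather than the authors' code: the `Category` enum, the `categorize` function, and the representation of each trial as a boolean correctness flag are assumptions made for the example.

```python
from enum import Enum
from typing import Sequence


class Category(Enum):
    KNOWN = "Known"
    MAYBE_KNOWN = "MaybeKnown"
    WEAKLY_KNOWN = "WeaklyKnown"
    UNKNOWN = "Unknown"


def categorize(greedy_correct: Sequence[bool],
               sampled_correct: Sequence[bool]) -> Category:
    """Assign a knowledge category from per-trial correctness flags.

    greedy_correct: one flag per greedy (T=0) trial of the pre-trained model.
    sampled_correct: one flag per sampled (T>0) trial.
    (Hypothetical interface; how trials are produced is left to the caller.)
    """
    if not greedy_correct:
        raise ValueError("need at least one greedy trial")
    greedy_rate = sum(greedy_correct) / len(greedy_correct)
    if greedy_rate == 1.0:
        return Category.KNOWN          # greedy decoding is always correct
    if greedy_rate > 0.0:
        return Category.MAYBE_KNOWN    # greedy decoding is sometimes correct
    if any(sampled_correct):
        return Category.WEAKLY_KNOWN   # only sampling ever finds the answer
    return Category.UNKNOWN            # never correct, even with sampling
```

The checks run from the strongest evidence of knowledge to the weakest, so each example falls into exactly one category.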

Evaluation Highlights

Establishes a 4-category taxonomy (Known, MaybeKnown, WeaklyKnown, Unknown) for predicting fine-tuning outcomes from the pre-trained model's behavior.
Demonstrates that the model's ability to learn from fine-tuning is heavily dependent on whether the knowledge was already present (Known/WeaklyKnown) or entirely absent (Unknown) in pre-training.
Uses the Exact Match (EM) metric on PaLM 2-S to quantify these shifts; the authors report that EM correlates strongly with F1 in this factual setting.

Breakthrough Assessment

7/10

Provides a valuable, granular taxonomy for understanding fine-tuning dynamics on factual knowledge. While not a new architecture, the experimental design offers significant insights into the 'black box' of knowledge updates.

⚙️ Technical Details

Problem Definition

Setting: Closed-book Question Answering (QA) where knowledge must be retrieved from model weights

Inputs: Knowledge-seeking question q (e.g., 'Where is Paris located?')

Outputs: Exact ground-truth answer a (e.g., 'France')

Pipeline Flow

Knowledge Categorization (Pre-experiment)
Dataset Construction
Fine-tuning
Evaluation

System Modules

Knowledge Categorizer

Classify potential training/test examples into 4 distinct groups based on the pre-trained model's behavior

Model or implementation: PaLM 2-S (Pre-trained)

Fine-tuner

Update model weights on constructed datasets D containing specific mixes of knowledge categories

Model or implementation: PaLM 2-S

Evaluator

Assess performance of M_D on test sets

Model or implementation: Fine-tuned PaLM 2-S

Novel Architectural Elements

Taxonomy-based dataset construction: Creating training sets explicitly balanced by the model's *prior* knowledge state (Known/Unknown/etc.) rather than just random sampling.

Modeling

Base Model: PaLM 2-S (base model)

Training Method: Supervised Fine-Tuning (SFT)

Trainable Parameters: Full fine-tuning (inferred from the phrasing 'fine-tuning M on D'; not stated explicitly)

Training Data:

Source: Entity_Questions dataset (derived from Wikidata)
Converted (subject, relation, object) triplets into QA pairs
Sub-sampled train split to create variants with specific proportions of Unknown vs. Known examples
12 relations used for training/ID testing; 7 reserved for OOD testing
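
The data preparation steps above (triplets to QA pairs, then sub-sampling to a target Unknown proportion) could look roughly like the sketch below. The templates, relation names, and the `triplet_to_qa` and `build_mix` helpers are hypothetical and chosen only for illustration; the paper's actual templates and sampling procedure are not given in this summary.

```python
import random
from typing import Dict, List, Tuple

Triplet = Tuple[str, str, str]   # (subject, relation, object)
QAPair = Tuple[str, str]         # (question, answer)

# Hypothetical question templates, one per relation (illustrative only).
TEMPLATES: Dict[str, str] = {
    "located_in": "Where is {subject} located?",
    "born_in": "Where was {subject} born?",
}


def triplet_to_qa(triplet: Triplet) -> QAPair:
    """Turn a (subject, relation, object) triplet into a (question, answer) pair."""
    subject, relation, obj = triplet
    return TEMPLATES[relation].format(subject=subject), obj


def build_mix(known: List[QAPair], unknown: List[QAPair], size: int,
              unknown_fraction: float, seed: int = 0) -> List[QAPair]:
    """Sample a training set of `size` examples with a fixed share of Unknown ones."""
    if not 0.0 <= unknown_fraction <= 1.0:
        raise ValueError("unknown_fraction must be in [0, 1]")
    n_unknown = round(size * unknown_fraction)
    n_known = size - n_unknown
    if n_unknown > len(unknown) or n_known > len(known):
        raise ValueError("not enough examples for the requested mix")
    rng = random.Random(seed)  # seeded so the variant is reproducible
    mix = rng.sample(unknown, n_unknown) + rng.sample(known, n_known)
    rng.shuffle(mix)
    return mix
```

Keeping `size` fixed while varying `unknown_fraction` means the dataset variants differ only in their category mix, not in their total number of examples.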

Key Hyperparameters:

model_size: PaLM 2-S
decoding_temperature_for_categorization: T=0 (greedy) and T>0 (sampling)

Compute: Not reported in the paper

Comparison to Prior Work

vs. Standard Fine-tuning papers: Segments data by *prior model knowledge* rather than just domain or task type.
vs. RAG: Focuses entirely on parametric knowledge (internalization) rather than non-parametric retrieval.

Limitations

Relies on Exact Match (EM), which may penalize correct answers that are phrased differently from the reference (though the authors report that EM correlates with F1).
Categorization depends on the specific prompting and temperature settings used during the 'diagnosis' phase.
Study uses PaLM 2-S; results might vary for significantly larger or smaller models with different memorization capacities.

Reproducibility

Artifacts used: Entity_Questions dataset (public) and PaLM 2-S (proprietary; typically API access only). Specific fine-tuning hyperparameters (learning rate, batch size) and a code URL are not given in the source excerpt.

📊 Experiments & Results

Evaluation Setup

Closed-book Question Answering on factual triplets

Benchmarks:

Entity_Questions (Custom Split) (Single-hop Factual QA) [New]

Metrics:

Exact Match (EM)
Statistical methodology: Not explicitly reported in the paper

Key Results

No numeric results are available in the source excerpt: it describes the experimental setup only, and the categorization table it includes defines inputs rather than outputs. Per-benchmark baselines and deltas therefore cannot be reported here.

Main Takeaways

Defining knowledge not just as 'known/unknown' but with intermediate states (WeaklyKnown, MaybeKnown) allows for finer-grained analysis of fine-tuning effects.
The 'WeaklyKnown' category (correct only with sampling) represents a latent knowledge state that might be surfaced by fine-tuning more easily than 'Unknown' facts.
The experimental design explicitly separates In-Distribution (12 relations) from Out-Of-Distribution (7 relations) to test generalization capabilities.
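
The in-distribution versus out-of-distribution separation amounts to partitioning examples by relation. A minimal sketch, assuming each example is a dict with a `"relation"` key (a hypothetical layout, not the dataset's actual schema):

```python
from typing import Dict, Iterable, List, Set, Tuple

Example = Dict[str, str]


def split_by_relation(examples: Iterable[Example],
                      ood_relations: Set[str]) -> Tuple[List[Example], List[Example]]:
    """Partition examples into (in-distribution, out-of-distribution) by relation."""
    in_dist: List[Example] = []
    out_dist: List[Example] = []
    for example in examples:
        # A relation reserved for OOD testing never appears in the ID split.
        if example["relation"] in ood_relations:
            out_dist.append(example)
        else:
            in_dist.append(example)
    return in_dist, out_dist
```

Splitting on the relation rather than on individual questions ensures that OOD test questions ask about relation types the fine-tuned model never saw during training.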

📚 Prerequisite Knowledge

Prerequisites

Language Model Fine-tuning
Greedy Decoding vs. Sampling
Knowledge Injection/Editing

Key Terms

Known (category): Questions where the pre-trained model consistently predicts the correct answer with greedy decoding (Temperature=0).

MaybeKnown (category): Questions where greedy decoding (Temperature=0) predicts the correct answer in some trials but not all, i.e. the fraction of correct greedy predictions is strictly between 0 and 1.

WeaklyKnown (category): Questions where greedy decoding fails (score=0), but random sampling (Temperature > 0) sometimes finds the correct answer.

Unknown (category): Questions where the model never predicts the correct answer, even with sampling; suggests total lack of knowledge.

Greedy decoding: A generation strategy where the model always picks the single most likely next token.

Temperature sampling: A generation strategy where the model picks the next token randomly based on probabilities, allowing for more diverse (and potentially correct) outputs if the 'correct' token wasn't the absolute top choice.
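
The difference between the two decoding strategies can be shown on a single next-token step. The sketch below works on a plain list of logits and uses only the standard library; the function names are assumptions for illustration and do not correspond to any particular model API.

```python
import math
import random
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use greedy_pick for T=0")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                      # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(logits: List[float]) -> int:
    """Greedy decoding: always return the index of the highest logit."""
    return max(range(len(logits)), key=logits.__getitem__)


def sample_pick(logits: List[float], temperature: float, rng: random.Random) -> int:
    """Temperature sampling: draw an index in proportion to its probability."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]
```

With logits such as `[1.0, 3.0, 2.0]`, `greedy_pick` always returns index 1, while `sample_pick` can also return the other indices. This is why a 'WeaklyKnown' answer can surface under sampling even though greedy decoding never produces it.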

Exact Match (EM): Evaluation metric checking if the generated text is identical to the ground truth.
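
A minimal sketch of EM scoring follows. The strict string comparison matches the definition above; the optional `normalize` flag (lowercasing and whitespace collapsing) is an assumption added for illustration and is not described in the source.

```python
from typing import Sequence


def exact_match(prediction: str, reference: str, normalize: bool = False) -> bool:
    """Return True if the prediction is identical to the reference answer."""
    if normalize:
        # Optional, assumed normalization: case-insensitive, whitespace-collapsed.
        prediction = " ".join(prediction.lower().split())
        reference = " ".join(reference.lower().split())
    return prediction == reference


def em_score(predictions: Sequence[str], references: Sequence[str],
             normalize: bool = False) -> float:
    """Fraction of predictions that exactly match their reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        raise ValueError("need at least one example")
    hits = sum(exact_match(p, r, normalize) for p, r in zip(predictions, references))
    return hits / len(predictions)
```

Under the strict setting, an answer like 'france ' does not match 'France', which illustrates how EM can penalize answers that are correct but formatted differently.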

Entity_Questions: The specific dataset used, derived from Wikidata triples transformed into QA pairs.

PaLM 2-S: The specific size of the PaLM 2 model used as the base model in experiments.